Finished data migration from old version (0.13) to 0.17: how to merge data from different segments?


We have migrated our deep storage data from S3 to GCS (Google Cloud Storage), and our new Druid cluster works well, but we have a problem: how do we merge different segments into one? This is our scenario:

Old Druid cluster (0.13.0): AWS S3

New Druid cluster (0.17.0): GCP GCS

Because the cluster migration spanned several hours, segment data for the same periods is distributed across both S3 and GCS. We have migrated the S3 data to GCS, but we don't know how to merge these segments (files) and roll them up over the same time buckets. (E.g., A is our old cluster and B is our new cluster; for 2020-03-13 13:00 ~ 2020-03-13 16:00 we have

2020-03-13T13:05:20.285Z/ on both A and B, and the data belongs to the same datasource.)

We only use Druid with Kafka stream ingestion, and I couldn't find a way to read the segments and merge them, so could you please give some suggestions?


Do you think you have any duplicate data in your segment files, or do you think it's clean?

If your data is clean and you simply have a portion of the data that arrived in system A and a portion that arrived in system B, then I believe that as long as you alter the rows in the Druid segments table so your new system knows how to retrieve the data from deep storage, you should not need to merge anything; just let Druid handle how it accesses the data in the segments.
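A minimal sketch of that metadata rewrite, assuming the segment files were copied to GCS under the same object keys they had in S3. The loadSpec field names used here ("s3_zip", "google", "bucket", "key", "path") are what Druid's S3 and Google deep-storage extensions write, but verify them against a real `payload` value from your own druid_segments table before running anything like this:

```python
import json


def rewrite_load_spec(payload_json: str, gcs_bucket: str) -> str:
    """Rewrite an S3 loadSpec in a druid_segments payload to point at GCS.

    Assumes the segment zip was copied to `gcs_bucket` under the same
    object key it had in S3. Payloads that are already non-S3 are
    returned unchanged.
    """
    payload = json.loads(payload_json)
    load_spec = payload.get("loadSpec", {})
    if load_spec.get("type") == "s3_zip":
        payload["loadSpec"] = {
            "type": "google",
            "bucket": gcs_bucket,
            "path": load_spec["key"],  # same object key after the copy
        }
    return json.dumps(payload)
```

You would apply this to each old-cluster row's `payload` column before inserting the row into the new cluster's metadata store (and take a backup of the table first).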

Yes, it's clean; the two clusters received separate traffic, so the data shouldn't be duplicated. But if we migrate the metadata rows in the druid_segments table, the primary keys may be duplicated, because the primary id is generated from the date and time, so we couldn't migrate all the segments that way.
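For what it's worth, the collision described above can be checked before touching the table, by comparing the id columns exported from each cluster's druid_segments table. A toy sketch (the ids below are made up, loosely following Druid's datasource + interval + version id format):

```python
def colliding_ids(old_ids, new_ids):
    """Return segment ids present in both clusters, sorted for readability."""
    return sorted(set(old_ids) & set(new_ids))


# Hypothetical id lists, e.g. from `SELECT id FROM druid_segments` on each side.
old_cluster = [
    "wiki_2020-03-13T13:00:00.000Z_2020-03-13T14:00:00.000Z_2020-03-13T13:05:20.285Z",
    "wiki_2020-03-13T14:00:00.000Z_2020-03-13T15:00:00.000Z_2020-03-13T14:01:02.000Z",
]
new_cluster = [
    "wiki_2020-03-13T13:00:00.000Z_2020-03-13T14:00:00.000Z_2020-03-13T13:05:20.285Z",
]

# Only the overlapping interval collides; those intervals need reingestion
# or compaction rather than a straight metadata copy.
print(colliding_ids(old_cluster, new_cluster))
```

Rows whose ids do not collide can be migrated as-is; only the overlapping intervals need special handling.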

On Mon, Mar 16, 2020 at 8:26 PM, Chris Goll wrote: