Is there a good way to check data integrity after migrating data from one Druid cluster to another?

Hi guys,

We recently needed to migrate part of our old Druid data into another Druid cluster. Intuitively, we use select queries to do this. We split the whole time range into small unit time intervals (a couple of hours each) and use a tail-recursive Druid select query wrapper function to pull data from the old Druid and index it into the new Druid for each interval.
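To make it concrete, each pull is roughly like the sketch below (simplified; the broker URL, datasource name, and paging threshold are placeholders rather than our real values, and the loop over pagingIdentifiers is omitted):

```python
import requests

OLD_BROKER = "http://old-broker:8082/druid/v2/"  # placeholder broker URL

def pull_interval(datasource, interval, threshold=10000):
    """Pull raw rows from the old Druid for one unit time interval
    using a (legacy) select query. Real code would keep feeding the
    returned pagingIdentifiers back in until the interval is exhausted."""
    query = {
        "queryType": "select",
        "dataSource": datasource,
        "granularity": "all",
        "intervals": [interval],  # e.g. "2016-01-01T00:00/2016-01-01T06:00"
        "dimensions": [],         # empty list = all dimensions
        "metrics": [],            # empty list = all metrics
        "pagingSpec": {"pagingIdentifiers": {}, "threshold": threshold},
    }
    resp = requests.post(OLD_BROKER, json=query)
    resp.raise_for_status()
    events = []
    for block in resp.json():
        events.extend(e["event"] for e in block["result"]["events"])
    return events
```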

Everything works fine with the above approach; the one thing we want to ask for help with is data integrity. Basically, we want to make sure the migrated data in the new Druid matches the old data in the old Druid. Currently, we use a simple check: for each unit time interval (a couple of hours), we run a timeseries query with a count aggregation to get the total number of segment rows (we don’t do any roll-up) for that period, which should always match the number of rows returned by our select query wrapper function for the same period. At the end, we use the same logic to double-check that the total count of migrated data over the whole time range is correct.
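The per-interval count check is essentially a timeseries query like the following (again a simplified sketch with placeholder names), compared against the number of rows our wrapper pulled:

```python
import requests

def count_rows(broker_url, datasource, interval):
    """Total row count for one interval via a count aggregation.
    Since we do no roll-up, this equals the number of raw rows."""
    query = {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "all",
        "intervals": [interval],
        "aggregations": [{"type": "count", "name": "rows"}],
    }
    resp = requests.post(broker_url, json=query)
    resp.raise_for_status()
    result = resp.json()
    return result[0]["result"]["rows"] if result else 0

# e.g. count_rows(OLD_BROKER, "events", iv) should equal
#      len(pull_interval("events", iv)) for every interval iv
```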

Again, this simple approach works fine. But is there a better way to verify data integrity, more specifically, to make sure the data pulled by the select queries (in the tail-recursive way) is correct? I’m wondering whether a Druid query can efficiently return a checksum (MD5, SHA-1, or anything similar) over all segment data for a given time period, so that we can use that checksum to verify the pulled data on the remote server.

Thanks

By Linbo


Linbo,

What’s the difference between the two clusters? If you are still using the same deep storage, all you should have to do is copy the metadata out of the segments table from the metadata store and your new cluster will have exactly the same data as the old cluster.

If you are changing deep storage, then you can just copy the segments to the new deep storage and then copy the metadata and update the paths in the “loadSpec” to point to the new location. Then it will load the exact segments from the old cluster.
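The payload edit is mechanical; a rough sketch of the transformation (assuming an HDFS-style loadSpec with a "path" field, since other deep storage types use different fields) would be:

```python
import json

def rewrite_load_spec(payload_json, old_prefix, new_prefix):
    """Point a segment descriptor's loadSpec at the new deep storage.
    `payload_json` is the JSON payload stored per segment in the
    metadata store's segments table; this sketch assumes an
    HDFS-style loadSpec with a "path" field."""
    descriptor = json.loads(payload_json)
    load_spec = descriptor["loadSpec"]
    load_spec["path"] = load_spec["path"].replace(old_prefix, new_prefix, 1)
    return json.dumps(descriptor)
```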

Lastly, if there’s a reason you really must re-index the data, then queries like what you are doing should be fine. You could also do topN queries on each dimension and make sure they come out exactly the same between the clusters.
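For example, a per-dimension topN comparison could look something like this (datasource, dimension, and broker URLs are placeholders; run it against both brokers and diff the results):

```python
import requests

def top_values(broker_url, datasource, dimension, interval, threshold=1000):
    """Row counts for the top values of one dimension; compare the
    output from the old and new brokers for each dimension."""
    query = {
        "queryType": "topN",
        "dataSource": datasource,
        "dimension": dimension,
        "metric": "rows",
        "threshold": threshold,
        "granularity": "all",
        "intervals": [interval],
        "aggregations": [{"type": "count", "name": "rows"}],
    }
    resp = requests.post(broker_url, json=query)
    resp.raise_for_status()
    result = resp.json()
    return result[0]["result"] if result else []

# assert top_values(old_broker, ...) == top_values(new_broker, ...)
```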

If you want to generate a “checksum” for the segment you could do that with a JavaScript aggregator that walks over all columns. Just be careful with this, because the reindexing process could choose to lay out the data differently, so it might be the exact same set of rows but in a different layout. If that’s the case, a naive checksum could tell you that they are different when they aren’t.
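One way to sidestep the layout problem is to make the checksum order-insensitive, for example by summing a per-row hash so the total doesn’t depend on row order. A rough sketch of such an aggregator spec (column names are placeholders, JavaScript aggregators must be enabled on the cluster, and the summed value can lose precision for very large row counts):

```python
# JavaScript aggregator spec that sums a simple per-row hash over two
# placeholder columns, so the result is independent of row order/layout.
checksum_aggregator = {
    "type": "javascript",
    "name": "checksum",
    "fieldNames": ["dim1", "metric1"],  # placeholder columns
    "fnAggregate": """
        function(current, dim1, metric1) {
          var s = String(dim1) + '|' + String(metric1);
          var h = 0;
          for (var i = 0; i < s.length; i++) {
            h = (h * 31 + s.charCodeAt(i)) % 1000000007;
          }
          return current + h;
        }""",
    "fnCombine": "function(a, b) { return a + b; }",
    "fnReset": "function() { return 0; }",
}

# Plug this into the "aggregations" of a timeseries query against both
# clusters and compare the resulting checksum values per interval.
```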

–Eric

Hi Eric,

Thanks for your detailed answer. I can give more info about our use case:

  1. The two clusters are completely separate and do not share the same deep storage, so we cannot simply migrate the metadata in the segments table.

  2. What we really want to migrate is only a part of the whole data set (data belonging to a specific dimension value), so we need to query the data (which means splitting data out of the raw segments) rather than moving whole segments.

  3. The new cluster is not empty: it is also running its normal indexing tasks at the same time, so there are already segments in the same time range as the data being migrated. We therefore need to re-index the data, more specifically, merge the migrated data into the existing data for the same time periods. We used delta ingestion before and have now switched to the Kafka indexing service.

Thanks for your suggestions. I will try topN queries on each dimension to make sure they come out exactly the same between the clusters, and will also try the JavaScript aggregator.

By Linbo