We recently needed to migrate part of the data in an old Druid cluster into a new one. Intuitively, we use Select queries for this task: we split the whole time range into small unit intervals (a couple of hours each) and use a tail-recursive Druid Select query wrapper function to pull data from the old cluster and index it into the new one, one interval at a time.
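For context, a minimal sketch of what the wrapper does, assuming a Broker for the old cluster at a placeholder OLD_BROKER URL and a Druid version whose Select pagingSpec supports "fromNext"; the pull_interval name is our own:

```python
import requests

OLD_BROKER = "http://old-broker:8082/druid/v2/"  # placeholder Broker URL

def pull_interval(datasource, interval, paging_identifiers=None, rows=None):
    """Tail-recursively page through a Select query for one unit interval."""
    rows = [] if rows is None else rows
    query = {
        "queryType": "select",
        "dataSource": datasource,
        "granularity": "all",
        "intervals": [interval],  # e.g. "2015-01-01/2015-01-01T04"
        "pagingSpec": {
            "pagingIdentifiers": paging_identifiers or {},
            "threshold": 1000,
            "fromNext": True,  # advance paging offsets automatically
        },
    }
    result = requests.post(OLD_BROKER, json=query).json()
    events = result[0]["result"]["events"] if result else []
    if not events:
        return rows  # this interval is exhausted
    rows.extend(e["event"] for e in events)
    # recurse with the returned pagingIdentifiers to fetch the next page
    return pull_interval(datasource, interval,
                         result[0]["result"]["pagingIdentifiers"], rows)
```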
Everything works fine with the above approach; the one thing we want to ask for help with is data integrity. Basically, we want to make sure the migrated data in the new Druid cluster matches the original data in the old one. Currently we use a simple check: for each unit interval (a couple of hours), we run a timeseries query with a count aggregation to get the total number of segment rows for that period (we don't do any roll-up), which should always match the number of rows returned by our Select query wrapper function for the same period. At the end, we use the same logic to double-check that the total count of migrated data for the whole time range is correct.
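In code, the per-interval check looks roughly like this (reusing OLD_BROKER and pull_interval from the sketch above; the datasource name and interval are placeholders):

```python
def row_count(datasource, interval):
    """Total row count for one interval via a timeseries count aggregation."""
    query = {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "all",
        "intervals": [interval],
        "aggregations": [{"type": "count", "name": "rows"}],
    }
    result = requests.post(OLD_BROKER, json=query).json()
    return result[0]["result"]["rows"] if result else 0

# per-interval check: the count on the old cluster should equal the number
# of rows the Select wrapper pulled for the same interval (no roll-up, so
# the count aggregator counts raw rows)
interval = "2015-01-01/2015-01-01T04"
pulled = pull_interval("my_datasource", interval)
assert row_count("my_datasource", interval) == len(pulled)
```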
Again, this simple check works fine. But is there a better way to verify data integrity? More specifically, how can we make sure the data pulled by the Select query (in the tail-recursive way) is correct? I'm wondering whether a Druid query could efficiently return a checksum (MD5, SHA-1, or anything similar) over the whole segment data for a given time period, so that we could use this checksum to verify the pulled data on the remote server.
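For concreteness, this is the kind of fingerprint we have in mind. As far as we know Druid has no built-in checksum aggregator, so today we could only compute it client-side over rows pulled from each cluster; dataset_checksum is a hypothetical helper, and the canonical-JSON encoding is an assumption about our row format:

```python
import hashlib
import json

def dataset_checksum(rows):
    """Order-insensitive fingerprint: hash each row's canonical JSON and
    sum the digests mod 2**128, so paging order doesn't matter."""
    acc = 0
    for row in rows:
        digest = hashlib.md5(
            json.dumps(row, sort_keys=True, default=str).encode("utf-8")
        ).digest()
        acc = (acc + int.from_bytes(digest, "big")) % (1 << 128)
    return format(acc, "032x")

# matching checksums (plus matching counts) would give strong evidence that
# the new cluster holds the same multiset of rows as the old one
```

If Druid could compute something like this server-side per interval, we could compare fingerprints instead of shipping all rows twice.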