Druid periodically reloads all datasources and gets stuck at a very low percentage

My group is migrating from an old Druid cluster to a new one.
Once the new cluster was set up, we redirected every datasource's pipeline to write to both clusters and compared the results. Up to that point, the new cluster performed just as well as the old one.

After it had been running for several days, we decided to move the old data stored in the old Druid cluster to the new one.

We used to keep a record of all the data sent to the old Druid cluster. However, that record is partially damaged for some reason.
So this is how I tried to migrate:

For dataSource A:

  1. Remove all of A's segments from the new Druid's deep storage.

  2. Go to the old Druid's deep storage and copy all of A's segment data to the desired segment directory (the one the new Druid will use).

  3. Go to the old Druid's metadata storage and dump the records in the druid_segment table where dataSource='A' to a local file B.

  4. Edit the dumped metadata so that it points to the correct directory. Keep everything else the same.

  5. Delete the records in the new Druid's druid_segment table where dataSource='A'.

  6. Load B into the new Druid's druid_segment table.

  7. Restart the whole Druid cluster, then disable and re-enable A in the coordinator console.
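
Step 4 is the easiest one to get wrong, so here is a minimal sketch of it in Python. It assumes each dumped row carries a JSON payload whose loadSpec has a "path" field, as is the case for local or HDFS deep storage; other deep storage types use different fields and are left untouched here. The function name and the example paths are hypothetical, not part of Druid.

```python
import json


def rewrite_load_spec_path(payload_json, old_prefix, new_prefix):
    """Return the segment payload with loadSpec.path moved to new_prefix.

    Assumption: the payload is a JSON object with a loadSpec.path string.
    Payloads that do not match are returned unchanged rather than guessed at.
    """
    payload = json.loads(payload_json)
    load_spec = payload.get("loadSpec", {})
    path = load_spec.get("path")
    if path is None or not path.startswith(old_prefix):
        return payload_json
    # Swap only the leading directory; everything else stays the same.
    load_spec["path"] = new_prefix + path[len(old_prefix):]
    return json.dumps(payload)


# Hypothetical payload and directories, for illustration only.
example = json.dumps({
    "dataSource": "A",
    "loadSpec": {"type": "local", "path": "/old/deep-storage/A/seg0/index.zip"},
})
print(rewrite_load_spec_path(example, "/old/deep-storage", "/new/deep-storage"))
```

Running this over every row of file B before step 6 keeps all other fields of the payload (interval, version, size, and so on) exactly as dumped.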

Everything worked well for a week after migrating datasource A. It showed the same data as the old Druid cluster (we have a UI for verifying this, since we compare lots of query results), and the coordinator showed the loading status as 100% available.

I then began migrating the data for the other datasources in the old Druid cluster.

Initially, all the datasources performed great. However, after a couple of days, only a portion of the data was loaded for every datasource, something like 80% or 70% available. If I query, some data is missing. Moreover, this affects all datasources in the new Druid cluster, not only the ones I migrated from the old one.
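
To track when the percentages start to fall, the load status can be polled instead of read off the console. The sketch below is only an illustration: it assumes the coordinator's /druid/coordinator/v1/loadstatus endpoint returns a JSON map of datasource name to percent loaded, and the helper names and the sample numbers are made up.

```python
import json
from urllib.request import urlopen


def fetch_load_status(coordinator_url):
    """Fetch the datasource -> percent-loaded map from the coordinator.

    Assumption: coordinator_url looks like "http://host:port" and the
    endpoint below is reachable without authentication.
    """
    with urlopen(coordinator_url + "/druid/coordinator/v1/loadstatus") as resp:
        return json.load(resp)


def under_loaded(load_status, threshold=100.0):
    """Return only the datasources whose loaded percentage is below threshold."""
    return {ds: pct for ds, pct in load_status.items() if pct < threshold}


# Sample response shaped like the percentages described above.
sample = {"A": 100.0, "B": 80.0, "C": 70.0}
print(under_loaded(sample))
```

Logging the output of under_loaded(fetch_load_status(...)) every few minutes would show whether the drop is gradual or happens all at once.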

Then I restarted the historical nodes and all datasources became 100% available immediately! All the data was verified, and the results were the same between the old and new Druid clusters.

What I found in the historical nodes' logs is that, after I restart all the historical nodes and every datasource is available again, the historical nodes keep dropping some old data, which I cannot explain.


Interesting issue… Would you share your coordinator and historical logs, if possible?