Switching deep storage from HDFS to Swift

Hi all,

We are running a Druid cluster with HDFS configured as deep storage and intend to move deep storage to Swift.

In order to do this without having to re-index all our data, we were roughly thinking of the following strategy. The assumption is that the coordinator picks up the segments in Swift as new segments, so that they are added to the loadQueue and the metastore and processed as if they were freshly indexed segments:

  • Copy all data from HDFS to Swift, keeping the same directory structure

  • Replace the loadSpec.path in descriptor.json so that it points to the new Swift path (a rough sketch of what we mean follows this list)

  • Reconfigure the cluster to use Swift as deep storage

  • Restart the cluster

  • Wait until the historicals finish loading all the “new” segments from Swift

  • Delete all the “old” segments from the historicals
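
To make the descriptor step a bit more concrete, this is roughly the kind of script we had in mind for rewriting the copied descriptor.json files. It is only a sketch: the segment root, the container name, the "swift" loadSpec fields and the "/druid/segments/" prefix are all assumptions on our side, since we don't know yet exactly what loadSpec the Swift extension expects.

    import json
    from pathlib import Path

    SEGMENT_ROOT = Path("/mnt/segments-copy")  # hypothetical local copy of the segment tree
    SWIFT_CONTAINER = "druid-segments"         # hypothetical Swift container name

    def rewrite_descriptor(descriptor_file: Path) -> None:
        segment = json.loads(descriptor_file.read_text())
        hdfs_path = segment["loadSpec"]["path"]
        # Keep the same relative layout, only swap the storage prefix.
        # Assumes all HDFS paths contain "/druid/segments/".
        relative = hdfs_path.split("/druid/segments/", 1)[1]
        segment["loadSpec"] = {
            "type": "swift",              # assumed type name for the Swift extension
            "container": SWIFT_CONTAINER,
            "path": relative,
        }
        descriptor_file.write_text(json.dumps(segment, indent=2))

    for descriptor in SEGMENT_ROOT.rglob("descriptor.json"):
        rewrite_descriptor(descriptor)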

However, since we are not sure about the behavior of the coordinator and the historicals in this scheme, we would like to know what happens when we execute this plan. More specifically:

  • Will the coordinator pick up the segments in the new deep storage (Swift)?

  • If so, will they be treated and processed as new segments?

Or maybe this strategy is not feasible at all and there are better ways of going about this.

Thanks in advance and a Happy New Year to all!

Kees

Hey Kees,

The process you outlined sounds about right, except for the last couple of steps. As far as Druid is concerned, the segment IDs will not change after the migration, so the historicals won’t re-download them from Swift. That means there won’t be any loading of “new” segments or deleting of “old” segments.

Hey Gian,

Thanks for your response.

I understand that, after the migration, the cluster still thinks the segments are stored on HDFS, because all the segment metadata still points to HDFS paths.

Therefore, what I essentially need to do after the migration is update the metadata on all the existing segments in Druid, modifying all the deep storage paths to point to Swift instead of HDFS.

I could do this by manually updating the descriptors in the historical segment cache and the metastore (I don’t know if there are any other places), but this seems like a very hacky solution.

As an alternative, I thought about using the IngestSegmentFirehose, but as far as I can tell from the documentation:

  • the segments need to exist in Druid in order to do this

  • there is no explicit mention of using this firehose for changing metadata on a segment.

Do you have any suggestions as to how I would re-submit the copied segments in Swift to the Druid cluster after the migration?

Thanks again!

Kees

I think it’s actually fine to update the payloads in the metadata store only. The descriptor.jsons aren’t used for anything other than restoring segment metadata if you lose your metadata store for some reason (the insert-segment-to-db tool reads them). But that approach in general, reloading descriptor.jsons into a metadata store, is frowned upon, since it potentially loses valuable information, such as which segments were really published and which were rolled back. So I would plan on not doing that, and rely on the metadata store.
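
If it helps, rewriting the loadSpec in each payload could look roughly like the sketch below, assuming a MySQL metadata store with the default druid_segments table. The Swift loadSpec fields, the connection details and the "/druid/segments/" prefix are placeholders, so treat it as an outline rather than something to run as-is, and back up the metadata store (or test on a copy) first.

    import json
    import pymysql  # assuming a MySQL metadata store

    conn = pymysql.connect(host="metadata-host", user="druid",
                           password="CHANGE_ME", database="druid")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id, payload FROM druid_segments")
            for segment_id, payload in cur.fetchall():
                segment = json.loads(payload)
                old_path = segment["loadSpec"]["path"]
                # Swap the HDFS loadSpec for a Swift one; the field names here
                # are assumptions and depend on the Swift extension you use.
                segment["loadSpec"] = {
                    "type": "swift",
                    "container": "druid-segments",
                    "path": old_path.split("/druid/segments/", 1)[1],
                }
                cur.execute("UPDATE druid_segments SET payload = %s WHERE id = %s",
                            (json.dumps(segment), segment_id))
        conn.commit()
    finally:
        conn.close()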

Ahh, I see. Well, that is great! We have backup policies in place for the metastore, so the probability of us having to use that tool is quite small. Thanks a lot for this answer!