Migration of Druid data between environments

Hi,
I’ve had a Druid platform running for the last few months, and I’m now moving to a new platform (a completely different environment) that is currently ingesting in parallel with the old one.

I’d like to have the previous data in the new environment. What is the best way to migrate data from one platform to the other?

Thanks

Maurizio

I took a brief look at

https://github.com/druid-io/druid/blob/master/indexing-service/src/main/java/io/druid/indexing/common/task/MoveTask.java

to see if it would fit your use case, and I don’t think it will, because the mover assumes you’re not going across loadSpec types (i.e., you’re staying within S3, HDFS, Cassandra, Azure, etc., not moving from one type to another).

As such, it would require development effort to build a task that does proper locking, copies the segments locally and then remotely, updates the segment metadata, and verifies the result.

I’m curious whether there’s a way to get a Hadoop task to do this as part of distcp or similar. I’ll try to ping one of the other devs about it.

Cheers,

Charles Allen

I should also note that moving the metadata store without downtime is non-trivial and will require the assistance of a DBA or other expert.

Moving the old data requires:

1) Moving the actual segments
2) Copying the metadata in the segments table on the metadata store
3) Updating the metadata to point to the new location of the files. If you are changing the type of deep storage, that can require adjusting not just the path but also other parts of the "payload" portion of the segments table. If you compare the segments that are currently loading in parallel (i.e., ones written natively by the new cluster) with the payloads you've copied over and line them up, it should work (see the sketch below).
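To make point 3 concrete: the part of the payload that ties a segment to deep storage is the "loadSpec". The exact fields depend on your Druid version and storage extension, so treat the following only as an illustration with placeholder paths. An HDFS segment typically carries something like

  "loadSpec": {
    "type": "hdfs",
    "path": "hdfs://host:port/druid/storage/wikipedia/.../index.zip"
  }

while the same segment on S3 would look roughly like

  "loadSpec": {
    "type": "s3_zip",
    "bucket": "my-bucket",
    "key": "druid/storage/wikipedia/.../index.zip"
  }

so when the deep storage type changes, a plain path rewrite is not enough; the loadSpec type and its fields change too.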

--Eric

Hi Eric,

Found this thread as I am currently trying to migrate data. I’m at point 3 of your steps listed above, where I need to update the “payload” column of the segments table. I understand it’s stored as a JSON blob, so my thought would be to write a script that parses the blob as JSON, edits it, serializes it back, and updates the “payload” column. Apart from this, is there an easier way to achieve this goal?
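For reference, roughly what I have in mind (a sketch only, assuming a MySQL metadata store with the default table/column names and the pymysql client; connection details and paths are placeholders, and this only covers a path change within the same deep storage type):

import json
import pymysql

# Placeholder connection details for the metadata store.
conn = pymysql.connect(host="metadata-host", user="druid", password="diurd", db="druid")

try:
    with conn.cursor() as cur:
        # "druid_segments" is the default segments table name; yours may differ.
        cur.execute("SELECT id, payload FROM druid_segments WHERE used = 1")
        for segment_id, payload_blob in cur.fetchall():
            payload = json.loads(payload_blob)
            load_spec = payload["loadSpec"]
            # Rewrite the path to point at the new deep storage location.
            # If the deep storage *type* changes, the whole loadSpec has to be
            # rewritten (type, bucket/key vs. path, ...), not just the path.
            load_spec["path"] = load_spec["path"].replace(
                "hdfs://old-namenode:8020", "hdfs://new-namenode:8020"
            )
            cur.execute(
                "UPDATE druid_segments SET payload = %s WHERE id = %s",
                (json.dumps(payload).encode("utf-8"), segment_id),
            )
    conn.commit()
finally:
    conn.close()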

Regards,
Jason

Hi Jason,
what I personally did was:

  • dump the data from MySQL

  • replace the path in the payload column using sed/awk (see the example below)

  • restore the modified dump into the new MySQL
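For the replace step, anything that does a textual substitution on the dump works (this assumes the payload is dumped as readable text, which it was in my case). For example, instead of sed, a few lines of Python; file names and paths here are placeholders:

# Rewrite deep storage paths in the MySQL dump, line by line.
old_path = "hdfs://old-namenode:8020/druid/storage"
new_path = "hdfs://new-namenode:8020/druid/storage"

with open("druid_dump.sql") as src, open("druid_dump_new.sql", "w") as dst:
    for line in src:
        dst.write(line.replace(old_path, new_path))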

Thanks

Maurizio

Hi Jason,
Just to mention, there is a tool added by bingkun in 0.9.0 that allows you to update the payload of MySQL segment entries.

Docs on how to use it can be found here:

http://druid.io/docs/0.9.0-rc3/operations/insert-segment-to-db.html

Hi Maurizio and Nishant,

Thanks for the suggestions. I’ll try the insert-segment-to-db tool, since it will be released soon.

Regards,
Jason

Would anybody please add an S3 deep storage example to http://druid.io/docs/0.9.0/operations/insert-segment-to-db.html?

If this is for HDFS:

--workingDir hdfs://host:port/druid/storage/wikipedia

would it work with S3 deep storage like this?

--workingDir s3://bucket/druid/storage/wikipedia

I’m planning to have 2 clusters sharing a single S3 deep storage location. Only one cluster will be indexing, so I should just create a new cluster and use the migration tool to hook into the existing segments on S3…

Hi Jakub,
Right now the insert-segment tool does not work for S3.

To make it work, I believe you would need to implement an S3DataSegmentFinder, similar to https://github.com/druid-io/druid/blob/master/extensions-core/hdfs-storage/src/main/java/io/druid/storage/hdfs/HdfsDataSegmentFinder.java

Feel free to create an issue or submit a PR for it.

Hi fellas,

I implemented it in https://github.com/druid-io/druid/pull/3446, in case anybody needs it.