Selective data migration/restore between Druid clusters

Hi Druid User Group

What is the best way to migrate data selectively (using a dimension filter) from one Druid cluster to another?

We need this for client-wise data migration, so every migration needs a filter that picks data only for the given client.

Any information/thoughts will be helpful

Thanks

Ashish

Hi Ashish, without fully understanding what you are trying to do, I don’t think you need multiple Druid clusters for multiple clients. You can have multiple datasources for clients, or even one large datasource for all client data, and have your application do the filtering.

See: http://druid.io/docs/0.9.1.1/querying/multitenancy.html

Thanks Fangjin for your response.

The requirement is to move data selectively to another Druid installation.

If druid supports some kind of backup-restore function, where backup can be made using a dimension filter, and it can be restored on another Druid installation, that serves the purpose.

Even if backup-restore can be done only for the whole datasource (without filter), we can create separate datasources, and that serves the purpose.

So is there a way to clone a datasource to another Druid installation?

It would be even better if it could be done with a filter to pick data selectively.

The easiest way to migrate data to another Druid cluster, assuming the same deep storage can be used, is to copy the original metadata segment table into the metadata table of the new installation. You can selectively copy only the information about the segments you require.
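To make the suggestion above concrete, here is a minimal sketch of a selective metadata-row copy. It uses Python’s sqlite3 standard library purely as a stand-in for the two clusters’ real Postgres metadata stores, and it assumes a simplified version of Druid’s segment table (named `druid_segments` here, with only a handful of its real columns); the datasource names are hypothetical.

```python
import sqlite3

# A simplified stand-in for Druid's druid_segments metadata table
# (the real schema has more columns, e.g. version and created_date).
SCHEMA = """CREATE TABLE druid_segments (
    id TEXT PRIMARY KEY,
    dataSource TEXT,
    "start" TEXT,
    "end" TEXT,
    used INTEGER,
    payload TEXT
)"""

src = sqlite3.connect(":memory:")   # cluster-1 metadata store
dst = sqlite3.connect(":memory:")   # cluster-2 metadata store
src.execute(SCHEMA)
dst.execute(SCHEMA)

# Segments registered in cluster-1 for two hypothetical client datasources.
src.executemany(
    "INSERT INTO druid_segments VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("clientA_2016-01-01", "clientA", "2016-01-01", "2016-01-02", 1, "{...}"),
        ("clientA_2016-01-02", "clientA", "2016-01-02", "2016-01-03", 1, "{...}"),
        ("clientB_2016-01-01", "clientB", "2016-01-01", "2016-01-02", 1, "{...}"),
    ],
)

def copy_segments(src_db, dst_db, datasource):
    """Copy the metadata rows for one datasource into the other cluster's
    table. Only rows are copied; the segment files themselves stay in the
    shared deep storage and are never touched."""
    rows = src_db.execute(
        "SELECT * FROM druid_segments WHERE dataSource = ? AND used = 1",
        (datasource,),
    ).fetchall()
    dst_db.executemany(
        "INSERT INTO druid_segments VALUES (?, ?, ?, ?, ?, ?)", rows
    )
    dst_db.commit()
    return len(rows)

copied = copy_segments(src, dst, "clientA")
```

With separate datasources per client, the `WHERE dataSource = ?` clause is the “filter” for a client-wise migration; cluster-2 never learns about clientB’s segments.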

– FJ

In this case, will multiple Druid coordinators (one per cluster, each cluster having its own coordinator and Postgres metadata store) be trying to manage the same segments?

Can there be conflict situations where a segment is replaced by reindexing in one cluster but the other cluster still sees the old segment?

Hi Ashish, segments are all backed up in deep storage, and historicals make a local copy from deep storage, so having multiple coordinators in different clusters manage the same segments isn’t really an issue as long as they have different metadata tables.

If I understand correctly, you are suggesting that both druid clusters can write and read their own segments and keep metadata of the segment files that they wrote.

In this case, will the only way for one cluster to see the other’s data be to run insert-segment-to-db, or is there some other way as well?

Hi Ashish,

Apologies for the confusion. Each cluster requires its own dedicated set of tables for metadata storage; however, multiple Druid clusters can share the same deep storage. You can copy the contents of one cluster’s metadata store into another cluster’s metadata store, and the new cluster should then be able to copy segments from the common deep storage.
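The shared-deep-storage arrangement described above can be sketched as follows. This is only an illustration: a temporary directory stands in for deep storage, sqlite3 stands in for each cluster’s separate metadata store, and the table, segment id, and `loadSpec` payload are simplified stand-ins for the real schema.

```python
import json
import os
import sqlite3
import tempfile

# Pretend deep storage: one shared location both clusters can read.
deep_storage = tempfile.mkdtemp()
segment_path = os.path.join(deep_storage, "events", "2016-01-01", "index.zip")
os.makedirs(os.path.dirname(segment_path))
with open(segment_path, "wb") as f:
    f.write(b"segment bytes")

# Each cluster has its OWN metadata table, but both rows carry a
# payload whose loadSpec points at the same deep-storage location.
payload = json.dumps({"loadSpec": {"type": "local", "path": segment_path}})
clusters = [sqlite3.connect(":memory:") for _ in range(2)]
for db in clusters:
    db.execute("CREATE TABLE druid_segments (id TEXT, payload TEXT)")
    db.execute(
        "INSERT INTO druid_segments VALUES (?, ?)",
        ("events_2016-01-01", payload),
    )

def load_segment(db):
    """Roughly what a historical does: look up the loadSpec in its own
    cluster's metadata store, then pull the segment from deep storage."""
    (p,) = db.execute("SELECT payload FROM druid_segments").fetchone()
    path = json.loads(p)["loadSpec"]["path"]
    with open(path, "rb") as f:
        return f.read()
```

Both clusters end up reading the very same bytes from the shared location; neither cluster’s coordinator ever writes to the other’s metadata tables, which is why sharing deep storage is safe.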

Best,

FJ

Thanks Fangjin for clarifying.

In the case where these two clusters manage their own data under the same datasource name, and segments from cluster-1 are imported into cluster-2 using ‘insert-segment-to-db’, will the cluster-2 index have data from both, or will the new data overwrite the existing cluster-2 data for the same time range (with used marked false in the metadata)?

Thanks

Ashish