Dealing with multiple datasources

Hi everyone,

After reading the thread [1], it seems like the best and easiest way (from the perspective of ease of re-indexing) to handle dimensions with isolated cardinality is to have multiple Druid datasources.

Since I’m trying to use Tranquility with Storm, creating datasources on the fly doesn’t seem feasible, because the datasource information has to be defined statically [2] as a BeamFactory.
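For context, the static wiring looks roughly like this (a trimmed sketch based on the Storm example in the README [2]; the ZooKeeper address, datasource name, and schema are placeholders):

```scala
import backtype.storm.task.IMetricsContext
import com.metamx.common.Granularity
import com.metamx.tranquility.beam.{Beam, ClusteredBeamTuning}
import com.metamx.tranquility.druid.{DruidBeams, DruidLocation, DruidRollup, SpecificDruidDimensions}
import com.metamx.tranquility.storm.BeamFactory
import io.druid.granularity.QueryGranularity
import io.druid.query.aggregation.CountAggregatorFactory
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.BoundedExponentialBackoffRetry
import org.joda.time.{DateTime, Period}

class MyBeamFactory extends BeamFactory[Map[String, Any]]
{
  // Called once per worker; note that the datasource is fixed here at build time.
  def makeBeam(conf: java.util.Map[_, _], metrics: IMetricsContext): Beam[Map[String, Any]] = {
    val curator = CuratorFrameworkFactory.newClient(
      "zk.example.com:2181",
      new BoundedExponentialBackoffRetry(100, 3000, 5)
    )
    curator.start()

    DruidBeams
      .builder((event: Map[String, Any]) => new DateTime(event("timestamp")))
      .curator(curator)
      .discoveryPath("/druid/discovery")
      .location(DruidLocation.create("overlord", "druid:firehose:%s", "my_datasource"))
      .rollup(DruidRollup(
        SpecificDruidDimensions(Seq("dim1", "dim2")),
        Seq(new CountAggregatorFactory("events")),
        QueryGranularity.MINUTE
      ))
      .tuning(ClusteredBeamTuning(
        segmentGranularity = Granularity.HOUR,
        windowPeriod = new Period("PT10M")
      ))
      .buildBeam()
  }
}
```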

I am curious how other people are dealing with multiple datasources.

Thanks.

[1] https://groups.google.com/forum/#!msg/druid-user/lR9AVV-Y1-c/OET2QpdFLtAJ
[2] https://github.com/metamx/tranquility#storm

Hi Prajwal,

We approach this by having a separate Samza job (similar to Storm topology) for each datasource. This makes sense for us because each datasource is generally big enough to deserve at least one dedicated Samza container. This may be the case for you too, especially since each Druid datasource needs at least one dedicated indexing task anyway.

But if you do want to load into all of your Druid datasources with a single Storm topology, you can do that. You’ll have to write a custom Storm Bolt that dynamically creates Tranquility Beams at runtime. You won’t be able to use Tranquility’s built-in BeamBolt for this, but its code might give you some ideas about how to write your own.
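As a very rough sketch of that approach (this is not Tranquility’s actual BeamBolt; `beamForDataSource` is a hypothetical hook that builds a Beam via DruidBeams the same way a BeamFactory would, and it must be serializable since Storm serializes bolts):

```scala
import backtype.storm.task.{OutputCollector, TopologyContext}
import backtype.storm.topology.OutputFieldsDeclarer
import backtype.storm.topology.base.BaseRichBolt
import backtype.storm.tuple.Tuple
import com.metamx.tranquility.beam.Beam
import com.twitter.util.Await
import scala.collection.mutable

// Routes each tuple to a Beam chosen by datasource name, creating Beams
// lazily the first time a datasource is seen.
class DynamicBeamBolt(beamForDataSource: String => Beam[Map[String, Any]])
  extends BaseRichBolt
{
  @transient private var collector: OutputCollector = null
  @transient private var beams: mutable.Map[String, Beam[Map[String, Any]]] = null

  override def prepare(
    conf: java.util.Map[_, _],
    context: TopologyContext,
    collector: OutputCollector
  ): Unit = {
    this.collector = collector
    this.beams = mutable.Map()
  }

  override def execute(tuple: Tuple): Unit = {
    val dataSource = tuple.getStringByField("dataSource")
    val event = tuple.getValueByField("event").asInstanceOf[Map[String, Any]]
    val beam = beams.getOrElseUpdate(dataSource, beamForDataSource(dataSource))
    try {
      // Beam.propagate is async (it returns a Twitter Future); blocking keeps
      // this sketch simple, but a real bolt should batch and ack asynchronously.
      Await.result(beam.propagate(Seq(event)))
      collector.ack(tuple)
    } catch {
      case e: Exception => collector.fail(tuple)
    }
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {}

  override def cleanup(): Unit = {
    beams.values.foreach(beam => Await.result(beam.close()))
  }
}
```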

You can also look at Tranquility’s built-in Samza BeamProducer, which already has the ability to dynamically add new datasources.
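The Samza side looks roughly like this (per the README; the config keys and the exact factory signature may vary across versions):

```scala
import com.metamx.tranquility.beam.Beam
import com.metamx.tranquility.samza.BeamFactory
import org.apache.samza.config.Config
import org.apache.samza.system.SystemStream

// Wired in via the job config, roughly:
//   systems.druid.samza.factory = com.metamx.tranquility.samza.BeamSystemFactory
//   systems.druid.beam.factory  = com.example.DataSourcePerStreamBeamFactory
class DataSourcePerStreamBeamFactory extends BeamFactory
{
  // BeamProducer asks the factory for a Beam per outgoing stream, so writing
  // to a new stream name is enough to start feeding a new datasource at runtime.
  override def makeBeam(stream: SystemStream, config: Config): Beam[AnyRef] = {
    val dataSource = stream.getStream // use the stream name as the datasource
    // Placeholder: build the Beam here with the usual DruidBeams wiring,
    // substituting `dataSource`.
    ???
  }
}
```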

Hi Gian,

> We approach this by having a separate Samza job (similar to Storm topology) for each datasource. This makes sense for us because each datasource is generally big enough to deserve at least one dedicated Samza container. This may be the case for you too, especially since each Druid datasource needs at least one dedicated indexing task anyway.

That’s a pretty interesting approach. I am assuming you either have isolated stream sources (an individual Kafka topic per job) or some kind of custom routing that sends data to the different Samza jobs?

> But if you do want to load into all of your Druid datasources with a single Storm topology, you can do that.

Currently, I am opting to create multiple datasources within a single topology. I am using Tranquility’s Direct API with Storm’s Trident API, similar to storm-hbase’s HBaseState [1].

So, my question is:

Since each Druid datasource maintains at least one realtime task, each one keeps at least one outgoing TCP connection open. I am currently pooling those connections and closing them after a few minutes of inactivity (there might be no data to ingest for certain periods). Once those connections are closed, there will be no realtime task process for certain datasources. How will this affect indexing, segment handoff, and querying while no realtime task is active for a datasource? I believe this should not have any effect, but I just want to confirm.
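For reference, my pooling is essentially this (a sketch using Guava’s expiring cache; `beamForDataSource` is a hypothetical function that builds the Beam, and the five-minute idle period is arbitrary):

```scala
import com.google.common.cache.{CacheBuilder, CacheLoader, RemovalListener, RemovalNotification}
import com.metamx.tranquility.beam.Beam
import com.twitter.util.Await
import java.util.concurrent.TimeUnit

// Pool of Beams keyed by datasource: entries expire after a few minutes
// without access, and the removal listener closes the idle Beam.
class BeamPool(beamForDataSource: String => Beam[Map[String, Any]])
{
  private val beams = CacheBuilder.newBuilder()
    .expireAfterAccess(5, TimeUnit.MINUTES)
    .removalListener(new RemovalListener[String, Beam[Map[String, Any]]] {
      override def onRemoval(n: RemovalNotification[String, Beam[Map[String, Any]]]): Unit = {
        Await.result(n.getValue.close())
      }
    })
    .build(new CacheLoader[String, Beam[Map[String, Any]]] {
      override def load(dataSource: String): Beam[Map[String, Any]] =
        beamForDataSource(dataSource)
    })

  def get(dataSource: String): Beam[Map[String, Any]] = beams.get(dataSource)

  // Guava only evicts lazily as the cache is used; call this periodically
  // (e.g. from a timer) to force idle entries to actually expire.
  def cleanUp(): Unit = beams.cleanUp()
}
```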

Thanks in advance.

[1] https://github.com/apache/storm/blob/v0.9.4/external/storm-hbase/src/main/java/org/apache/storm/hbase/trident/state/HBaseState.java#L41

We have separate Kafka topics for each datasource.

Closing and re-opening Tranquility Beams won’t affect the workings of the Druid tasks, so feel free to do that.