We approach this by having a separate Samza job (similar to Storm topology) for each datasource. This makes sense for us because each datasource is generally big enough to deserve at least one dedicated Samza container. This may be the case for you too, especially since each Druid datasource needs at least one dedicated indexing task anyway.
That’s a pretty interesting approach. I assume you have either an isolated stream source per datasource (an individual Kafka topic) or some kind of custom routing that sends data to the different Samza jobs?
But if you do want to load into all of your Druid datasources with a single Storm topology, you can do that.
Currently, I am opting to create multiple datasources within a single topology. I am using Tranquility’s Direct API with Storm’s Trident API, in a way similar to hbase-storm.
So, my question is:
Since each Druid datasource maintains at least one realtime task, my topology opens at least one outgoing TCP connection per datasource. I am currently pooling those connections and closing them after a few minutes of inactivity (there may be no data to ingest for some time). When those connections are closed, there will be no realtime task running for certain datasources. How will this affect indexing, segment handoff, and querying while no realtime task is active for a datasource? I believe it should not have any effect, but I just want to confirm.
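The idle-timeout pooling described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `IdleClosingPool`, `get`, and `evictIdle` are hypothetical names and not part of Tranquility or Storm, and the pooled type is any `AutoCloseable` standing in for whatever per-datasource sender is actually used. The clock value is passed in explicitly so the eviction logic is easy to exercise.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical pool holding one sender per datasource and closing idle ones. */
public class IdleClosingPool<S extends AutoCloseable> {

    /** A pooled sender plus the last time it was handed out. */
    private static final class Pooled<S> {
        final S sender;
        long lastUsedMillis;

        Pooled(S sender, long nowMillis) {
            this.sender = sender;
            this.lastUsedMillis = nowMillis;
        }
    }

    private final Map<String, Pooled<S>> pooled = new HashMap<>();
    private final Function<String, S> factory;   // creates a sender for a datasource name
    private final long idleTimeoutMillis;

    public IdleClosingPool(Function<String, S> factory, long idleTimeoutMillis) {
        this.factory = factory;
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    /** Returns the sender for a datasource, creating it on first use or after eviction. */
    public synchronized S get(String dataSource, long nowMillis) {
        Pooled<S> p = pooled.get(dataSource);
        if (p == null) {
            p = new Pooled<>(factory.apply(dataSource), nowMillis);
            pooled.put(dataSource, p);
        }
        p.lastUsedMillis = nowMillis;
        return p.sender;
    }

    /** Closes and removes every sender idle for at least the timeout; returns how many were closed. */
    public synchronized int evictIdle(long nowMillis) throws Exception {
        int closed = 0;
        Iterator<Map.Entry<String, Pooled<S>>> it = pooled.entrySet().iterator();
        while (it.hasNext()) {
            Pooled<S> p = it.next().getValue();
            if (nowMillis - p.lastUsedMillis >= idleTimeoutMillis) {
                it.remove();
                p.sender.close();
                closed++;
            }
        }
        return closed;
    }

    public synchronized int size() {
        return pooled.size();
    }
}
```

In a setup like this, `evictIdle` would be called periodically (for example from a scheduled task), and the next `get` for an evicted datasource simply creates a fresh sender.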
Thanks in advance.