Merge Data when having multiple realtime nodes


I am new to Druid, and while setting up a staging Druid cluster I ran into something that really confuses me. Please help.

To begin, I am running Druid 0.8.3 and using standalone realtime nodes to ingest realtime data from Kafka. This matches the production environment we are using.

Suppose I have multiple realtime nodes, say 3, consuming data from one Kafka topic. As I understand it, each message goes to only one of the realtime nodes and is later pushed to deep storage, in this case HDFS. My concern is that three realtime nodes will push three segments to HDFS. In my case this seems to use a lot more space in deep storage, especially since I am using the sketch data structure in the hyperUnique aggregator.

For example, originally one keyword would keep only one sketch structure, but now one keyword may keep up to three sketch structures because there are three realtime nodes.

Is there any solution to this?

I was told that I can re-index the data in HDFS with batch ingestion. But after some research, I am still quite confused about how this should work.

If anyone with experience could help, I would appreciate it.

Thank you,

Will Yan

Hey Will,

You can solve this by partitioning your data by key on the Kafka topic side, so that messages that should roll up together go to the same partition (and therefore to the same realtime node). You can do this by setting a key on the Kafka producer.
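To illustrate why keying helps, here is a rough sketch in plain Python. It does not use the Kafka client at all: `crc32` is only a stand-in for the producer's real key hash, and the partition count, the keywords, and the one-partition-per-realtime-node mapping are all assumptions made for the example. It also assumes every message falls into the same rollup time bucket.

```python
import zlib
from collections import defaultdict
from itertools import cycle

NUM_PARTITIONS = 3  # assumption: one partition per realtime node


def partition_for(key: str) -> int:
    # Stand-in for the producer's key hash (Kafka uses its own hash function).
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def rows_after_rollup(messages, assign):
    # Each partition is read by one node, and each node keeps one
    # rolled-up row (one sketch) per keyword it has seen.
    seen = defaultdict(set)
    for keyword in messages:
        seen[assign(keyword)].add(keyword)
    return sum(len(keywords) for keywords in seen.values())


# Hypothetical stream: three keywords, three messages each.
messages = ["druid"] * 3 + ["kafka"] * 3 + ["hdfs"] * 3

# Without a key, messages are spread across partitions regardless of keyword.
round_robin = cycle(range(NUM_PARTITIONS))
unkeyed_rows = rows_after_rollup(messages, lambda _keyword: next(round_robin))

# With the keyword as the key, a keyword always lands on the same partition.
keyed_rows = rows_after_rollup(messages, partition_for)

print(unkeyed_rows, keyed_rows)
```

Without a key, each keyword ends up with a row on every node; with the keyword as the key, each keyword is rolled up on exactly one node.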

Thank you, this is something I never thought of, and it is definitely much easier than reindexing. But after some discussion with others in the group, we still decided to reindex the batch data, since we want to filter out some of the data in the process.

As I understand it now, is this reindexing process the same as what is described under “Reindexing and Delta Ingestion with Hadoop Batch Ingestion”? That is, I just need to send the spec to the overlord to index the batch data, and then Druid will start using the new version of the segments automatically?

Yeah, that’s the idea. If you run a batch indexing job, it replaces the old data for the same “intervals” as defined in the job.
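For reference, a partial sketch of what such a reindexing task might look like, built as a Python dict and serialized to JSON. The field names follow my recollection of the Hadoop batch ingestion docs (`dataSource` inputSpec), so please check them against the documentation for your Druid version. The datasource name, interval, granularities, and filter dimension are all hypothetical, and the parser and `metricsSpec` are omitted, so this is not a complete spec.

```python
import json

DATA_SOURCE = "keywords"  # hypothetical datasource name
INTERVAL = "2016-01-01T00:00:00Z/2016-01-02T00:00:00Z"  # hypothetical interval

task = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {
            "dataSource": DATA_SOURCE,
            # parser and metricsSpec are omitted; a real spec needs them.
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",   # assumption
                "queryGranularity": "HOUR",    # assumption
                "intervals": [INTERVAL],       # segments here get replaced
            },
        },
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {
                # Read the existing Druid segments instead of raw files.
                "type": "dataSource",
                "ingestionSpec": {
                    "dataSource": DATA_SOURCE,
                    "intervals": [INTERVAL],
                    # Hypothetical filter: drop rows for one keyword.
                    "filter": {
                        "type": "not",
                        "field": {
                            "type": "selector",
                            "dimension": "keyword",
                            "value": "spam",
                        },
                    },
                },
            },
        },
    },
}

payload = json.dumps(task, indent=2)
print(payload)
```

The resulting JSON is what you would send to the overlord as a task. Keeping the input interval and the `granularitySpec` interval the same means the new segments cover exactly the data they are meant to replace.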