I am new to Druid, and while setting up a staging Druid cluster I ran into something that really confuses me. Please help.
To begin, I am running Druid 0.8.3 and using standalone realtime nodes to ingest realtime data from Kafka. This matches the production environment we are using.
My question is this: suppose I have multiple realtime nodes, say 3, consuming from a Kafka topic. As I understand it, each message goes to only one of the realtime nodes and is later pushed to deep storage, in this case HDFS. So with three realtime nodes, three segments get pushed to HDFS for the same interval. In my case this seems to lead to a lot more space usage in deep storage, especially since I am using the sketch data structure in the hyperUnique aggregator.
For example, where one keyword would originally keep a single sketch structure, it may now keep up to three sketch structures because there are three realtime nodes.
Is there any solution to this?
I was told that I can re-index the data in HDFS with batch ingestion (http://druid.io/docs/0.9.1.1/ingestion/update-existing-data.html), but after some research I am still quite confused about how this should work. My current understanding of the task spec is sketched below.
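From my reading of that page, I think the re-index is a Hadoop batch task whose ioConfig uses a "dataSource" inputSpec to read the already-ingested segments back out of deep storage. A trimmed-down sketch of what I have so far is below; the datasource name "keywords", the interval, and the "unique_users" metric are placeholders for my setup, and I may well have misread the docs:

    {
      "type": "index_hadoop",
      "spec": {
        "dataSchema": {
          "dataSource": "keywords",
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "DAY",
            "queryGranularity": "NONE",
            "intervals": ["2016-06-01/2016-06-02"]
          },
          "metricsSpec": [
            { "type": "hyperUnique", "name": "unique_users", "fieldName": "unique_users" }
          ]
        },
        "ioConfig": {
          "type": "hadoop",
          "inputSpec": {
            "type": "dataSource",
            "ingestionSpec": {
              "dataSource": "keywords",
              "intervals": ["2016-06-01/2016-06-02"]
            }
          }
        },
        "tuningConfig": { "type": "hadoop" }
      }
    }

If I understand correctly, running this would merge the three per-node segments for each interval into a single segment set, combining the hyperUnique sketches in the process, but I am not sure whether that is actually how it behaves.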
If anyone with experience could help me, I would really appreciate it.