I’m new to Druid. I’ve followed the docs, set up the servers, and ingested data without problems (just simple cases, of course).
But when I try to understand the internals of Druid, I’ve run into some difficulties. The white paper (http://static.druid.io/docs/druid.pdf) seems to be outdated, and it doesn’t clearly explain some key concepts related to scalability and replication either.
My current doubts are as follows. If there are any better docs/links explaining this stuff, that would be great; if not, I hope someone here can help. I hope this is the right place to ask these questions. Thanks.
(1) The white paper describes real-time nodes. However, the docs on the Druid website have no setup tutorial for real-time nodes; if you want to ingest stream data, the tutorial has you use Tranquility. So I guess the real-time node is now deprecated?
From my understanding, the white paper says that a real-time node itself doesn’t replicate: a real-time node will generally read from Kafka, so you can (manually) set up TWO or more real-time nodes to read the same stream to get replication. Correct me if I’m wrong.
It also says each real-time node can be set up to ingest a portion of a stream (maybe via Kafka’s partition concept). So in my understanding, I would need to set up many real-time nodes, each ingesting a portion of the stream, and some of the real-time nodes would use different Kafka consumer group ids, so there would be replication too? Is my understanding correct?
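To make sure I’m reading this right, here is a toy model of the Kafka consumer-group semantics I have in mind (plain Python, illustrative names only, no Druid APIs): within one group, partitions are split among consumers, and each *additional* group re-reads the full topic, which would be how two sets of real-time ingesters end up holding replicated data.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment, roughly what Kafka's partition assignors do
    for the consumers inside a single consumer group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2, 3]

# Group A: two real-time ingesters share the topic -> each gets a disjoint portion.
group_a = assign_partitions(partitions, ["rt-a1", "rt-a2"])

# Group B: a second consumer group re-reads the whole topic -> a full replica.
group_b = assign_partitions(partitions, ["rt-b1", "rt-b2"])

# Together, group A and group B each cover all four partitions,
# so every event exists in two places.
```

Is that the replication scheme the white paper is hinting at?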
A real-time node pushes a segment into deep storage after the segment is complete, so how do replicated real-time nodes push their data into deep storage? Will they somehow know that they belong to the same stream/segment, and know during the push that they are replicas of each other? If two real-time nodes each ingest a portion of a stream via Kafka, how is the data merged into one segment before being pushed into deep storage? Can data be lost, or duplicated, during the push? The white paper gives no details on these questions.
(2) Tranquility. The Tranquility tutorial says it will transparently create realtime indexing tasks (via the Overlord & MiddleManager nodes). But in the Druid docs, the pages on the indexing service (http://druid.io/docs/0.9.1.1/design/indexing-service.html) and on tasks (http://druid.io/docs/0.9.1.1/ingestion/tasks.html) say nothing about a “realtime indexing task”.
After some searching, I found it said that a “realtime indexing task” and a “realtime node” do almost the same thing, the difference being that a realtime indexing task can be created and destroyed dynamically. Is that so?
So in my understanding, Tranquility takes care of creating the realtime indexing tasks? Specifically: will it create multiple realtime indexing tasks for the same stream/segment for replication purposes (same stream, but different Kafka consumer groups)? Will it create multiple realtime indexing tasks to ingest the same stream in parallel (same Kafka consumer group)? Is replication taken care of? What is Tranquility’s policy for creating realtime indexing tasks (e.g., how many replicas per segment? when does it decide to let more than one indexing task read from the same stream/segment/Kafka consumer group?), and can that policy be written in config files?
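For what it’s worth, from my reading of the Tranquility server README there seem to be `task.partitions` and `task.replicants` properties, which look like exactly this policy knob. A minimal sketch of what I mean (the property names, dataSource name, and placeholder values are my guesses, not something I’ve verified end-to-end):

```json
{
  "dataSources": {
    "my-datasource": {
      "spec": "…the same dataSchema/tuningConfig you would give a realtime task…",
      "properties": {
        "task.partitions": "2",
        "task.replicants": "2"
      }
    }
  },
  "properties": {
    "http.port": "8200"
  }
}
```

If that’s right, does `task.replicants: 2` mean Tranquility launches two tasks per partition that both receive every event for that partition?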
The above assumes that Tranquility reads from Kafka.
What about without Kafka? In the Druid tutorial, I can ingest my stream via the Tranquility HTTP server; will Tranquility internally create realtime indexing task(s)? What about replication in this case? Will it be taken care of too, i.e., will Tranquility create several realtime indexing tasks and push the incoming data to each of them when data arrives via the Tranquility HTTP server?
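To be concrete about the HTTP case, this is roughly how I’m sending events now (the port 8200 and the `/v1/post/{dataSource}` path are my assumptions from the Tranquility server defaults; adjust to your config):

```python
import json
import urllib.request

TRANQUILITY_HOST = "http://localhost:8200"  # assumed default http.port

def build_url(data_source):
    """Tranquility server exposes one push endpoint per dataSource."""
    return "{}/v1/post/{}".format(TRANQUILITY_HOST, data_source)

def post_event(data_source, event):
    """POST a single JSON event and return the HTTP status code."""
    req = urllib.request.Request(
        build_url(data_source),
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example (requires a running Tranquility server):
# post_event("pageviews", {"timestamp": "2016-08-01T00:00:00Z", "page": "home", "count": 1})
```

When I POST like this, does Tranquility fan the event out to every replica task itself, or does each POST land on only one task?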
(3) Tranquility dynamically creates realtime indexing tasks, so in my understanding a new dataSource (i.e., one with a different schema, with data from a completely new source) can be created on the fly without restarting Tranquility or the Druid servers. Is my understanding correct?
Thanks for reading.