Doubts regarding replication policies in realtime nodes and Tranquility

Hi all,

I’m new to Druid. I’ve followed the docs, set up the servers, and ingested data without problems (just simple data, of course).

But when I tried to understand the internals of Druid, I ran into some difficulties. The white paper http://static.druid.io/docs/druid.pdf seems to be outdated, and it doesn’t clearly explain some key concepts related to scalability/replication either.

My current doubts are listed below. If there are better docs/links explaining this stuff, that would be great; if not, I hope someone can help. I hope this is the right place to ask these questions. Thanks.

(1) The white paper describes real-time nodes. However, in the docs on the Druid website, there is no setup tutorial for real-time nodes; if you want to ingest stream data, the tutorial has you use Tranquility. So I guess real-time nodes are now deprecated?

The white paper says, as I understand it, that a real-time node itself doesn’t replicate; a real-time node will generally read from Kafka, so you can (manually) set up TWO or MORE real-time nodes reading the same stream to get replication. Correct me if I’m wrong.

It also says each real-time node can be set up to ingest a portion of a stream (maybe via Kafka’s partition concept). So, in my understanding, I probably need to set up many real-time nodes, each ingesting a portion of the stream, and some of the real-time nodes would use different Kafka consumer group IDs, so there would be replication too? Is my understanding correct?

A real-time node pushes a segment into deep storage once the segment is complete. So how do replicated real-time nodes push their data into deep storage? Do they somehow know that they belong to the same stream/segment, and that they are replicas, during the push? If two real-time nodes each ingest a portion of a stream via Kafka, how are the portions merged into one segment before being pushed into deep storage? Can data be lost or duplicated during the push? The white paper simply gives no details on these questions.

(2) Tranquility. The Tranquility tutorial says that it will smartly create realtime indexing tasks (via the Overlord & MiddleManager nodes). But in the Druid docs, the indexing-service page (http://druid.io/docs/0.9.1.1/design/indexing-service.html) and the tasks page (http://druid.io/docs/0.9.1.1/ingestion/tasks.html) say nothing about a “realtime indexing task”.

After some searching, it seems the “realtime indexing task” and the “realtime node” do almost the same thing, the difference being that a realtime indexing task can be dynamically created and destroyed. Is that so?

So, in my understanding, Tranquility takes care of creating realtime indexing tasks? Does it create multiple realtime indexing tasks for the same stream/segment for replication purposes (same stream but different Kafka consumer groups)? Does it create multiple realtime indexing tasks to ingest the same stream (same Kafka consumer group)? Is replication taken care of? What is Tranquility’s policy for creating realtime indexing tasks (i.e., how many replicas per segment? when does it decide to let more than one indexing task read from the same stream/segment/Kafka consumer group? etc.)? Can the policy be set in config files?

The above assumes that Tranquility reads from Kafka.

What about without Kafka? Following the Druid tutorial, I can push my stream via the Tranquility HTTP server; Tranquility will internally create realtime indexing task(s)? What about replication in this case? Is replication taken care of too, i.e., will Tranquility create several realtime indexing tasks and push the data to each one as it arrives via the Tranquility HTTP server?

(3) Tranquility dynamically creates realtime indexing tasks, so, in my understanding, a new dataSource (i.e., one with a different data schema, with data from a completely new source) can be created on the fly without restarting Tranquility or the Druid servers. Is my understanding correct?

Thanks for reading.

Please find a few of the answers inline.

Hi all,

I’m new to Druid. I’ve followed the docs, set up the servers, and ingested data without problems (just simple data, of course).

But when I tried to understand the internals of Druid, I ran into some difficulties. The white paper http://static.druid.io/docs/druid.pdf seems to be outdated, and it doesn’t clearly explain some key concepts related to scalability/replication either.

My current doubts are listed below. If there are better docs/links explaining this stuff, that would be great; if not, I hope someone can help. I hope this is the right place to ask these questions. Thanks.

(1) The white paper describes real-time nodes. However, in the docs on the Druid website, there is no setup tutorial for real-time nodes; if you want to ingest stream data, the tutorial has you use Tranquility. So I guess real-time nodes are now deprecated?

Yes, Druid now encourages using Tranquility for real-time ingestion rather than standalone real-time nodes. You can read http://druid.io/docs/latest/ingestion/stream-pull.html#limitations to understand the limitations of standalone real-time nodes.

The white paper says, as I understand it, that a real-time node itself doesn’t replicate; a real-time node will generally read from Kafka, so you can (manually) set up TWO or MORE real-time nodes reading the same stream to get replication. Correct me if I’m wrong.

It also says each real-time node can be set up to ingest a portion of a stream (maybe via Kafka’s partition concept). So, in my understanding, I probably need to set up many real-time nodes, each ingesting a portion of the stream, and some of the real-time nodes would use different Kafka consumer group IDs, so there would be replication too? Is my understanding correct?

A real-time node pushes a segment into deep storage once the segment is complete. So how do replicated real-time nodes push their data into deep storage? Do they somehow know that they belong to the same stream/segment, and that they are replicas, during the push? If two real-time nodes each ingest a portion of a stream via Kafka, how are the portions merged into one segment before being pushed into deep storage? Can data be lost or duplicated during the push? The white paper simply gives no details on these questions.
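[Editor’s note: for what it’s worth, this reading matches the stream-pull docs linked above: realtime nodes that share a Kafka group.id split the stream between them (partitioning), while nodes with the same shardSpec partitionNum but different group.ids act as replicas of each other. A rough sketch of the relevant spec fragments for one such node, with placeholder hosts and names:

    {
      "ioConfig": {
        "type": "realtime",
        "firehose": {
          "type": "kafka-0.8",
          "consumerProps": {
            "zookeeper.connect": "localhost:2181",
            "group.id": "druid-replica-1"
          },
          "feed": "my-topic"
        }
      },
      "tuningConfig": {
        "type": "realtime",
        "shardSpec": { "type": "linear", "partitionNum": 0 }
      }
    }

A second node with the same partitionNum but group.id "druid-replica-2" would be a replica. Since each replica consumes Kafka independently, nothing guarantees they end up with exactly the same rows, which is one of the limitations discussed at the link above.]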

(2) Tranquility. The Tranquility tutorial says that it will smartly create realtime indexing tasks (via the Overlord & MiddleManager nodes). But in the Druid docs, the indexing-service page (http://druid.io/docs/0.9.1.1/design/indexing-service.html) and the tasks page (http://druid.io/docs/0.9.1.1/ingestion/tasks.html) say nothing about a “realtime indexing task”.

After some searching, it seems the “realtime indexing task” and the “realtime node” do almost the same thing, the difference being that a realtime indexing task can be dynamically created and destroyed. Is that so?
Yes, Tranquility is a way to create realtime indexing tasks programmatically.
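[Editor’s note: for context, a “realtime indexing task” is just a task spec of type index_realtime submitted to the Overlord’s task endpoint (POST /druid/indexer/v1/task); Tranquility builds and submits these specs for you. A heavily trimmed sketch of the shape of such a spec (a real spec also carries the parser, metrics, firehose, and so on):

    {
      "type": "index_realtime",
      "spec": {
        "dataSchema": { "dataSource": "pageviews" },
        "ioConfig": { "type": "realtime" },
        "tuningConfig": { "type": "realtime", "windowPeriod": "PT10M" }
      }
    }
]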
So, in my understanding, Tranquility takes care of creating realtime indexing tasks? Does it create multiple realtime indexing tasks for the same stream/segment for replication purposes (same stream but different Kafka consumer groups)? Does it create multiple realtime indexing tasks to ingest the same stream (same Kafka consumer group)? Is replication taken care of? What is Tranquility’s policy for creating realtime indexing tasks (i.e., how many replicas per segment? when does it decide to let more than one indexing task read from the same stream/segment/Kafka consumer group? etc.)? Can the policy be set in config files?

The user decides the replication factor via a configuration when using Tranquility. Tranquility reads the data once and sends the same data to multiple indexing tasks (as many as the number of replicas), ensuring that each replica gets exactly the same set of data. With realtime nodes, there was no way to ensure that all replicas got the same set of data.
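[Editor’s note: concretely, with Tranquility Server the replication factor is the task.replicants property (and the partition count is task.partitions) in the dataSource’s properties block of server.json. A minimal sketch, with the dataSchema details elided and a hypothetical dataSource name:

    {
      "dataSources": [
        {
          "spec": {
            "dataSchema": { "dataSource": "pageviews" },
            "tuningConfig": { "type": "realtime", "windowPeriod": "PT10M" }
          },
          "properties": {
            "task.partitions": "1",
            "task.replicants": "2"
          }
        }
      ],
      "properties": {
        "zookeeper.connect": "localhost:2181",
        "http.port": "8200"
      }
    }

With task.replicants set to 2, Tranquility creates two tasks per partition and sends every event to both, which is how the replicas stay identical.]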

The above assumes that Tranquility reads from Kafka.

What about without Kafka? Following the Druid tutorial, I can push my stream via the Tranquility HTTP server; Tranquility will internally create realtime indexing task(s)? What about replication in this case? Is replication taken care of too, i.e., will Tranquility create several realtime indexing tasks and push the data to each one as it arrives via the Tranquility HTTP server?

As stated above, in both cases, i.e., using Tranquility as a client library or running Tranquility Server, the input is read only once and the data is sent to multiple tasks.
Note that Tranquility is a push-based mechanism: the user’s application reads the data from the input source (e.g., Kafka) and pushes it to Druid via Tranquility. AFAIK, Tranquility doesn’t pull data from Kafka by itself.
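[Editor’s note: to illustrate the push model, here is a minimal sketch of sending events to Tranquility Server over HTTP, assuming a dataSource named "pageviews" is configured in server.json and the server listens on the default quickstart port 8200 (the /v1/post/<dataSource> endpoint is Tranquility’s; the dataSource name and event fields are placeholders):

    import json
    import urllib.request

    # A batch of events for the (hypothetical) "pageviews" dataSource.
    events = [
        {"timestamp": "2016-08-01T12:00:00Z", "page": "home", "views": 1},
    ]

    # Tranquility Server's event-push endpoint: /v1/post/<dataSource>
    req = urllib.request.Request(
        "http://localhost:8200/v1/post/pageviews",
        data=json.dumps(events).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

    # The response reports how many events were received and sent onward,
    # e.g. {"result": {"received": 1, "sent": 1}}
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8"))

Whatever arrives at this endpoint is read once and fanned out to all replica tasks by Tranquility itself.]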

(3) Tranquility dynamically creates realtime indexing tasks, so, in my understanding, a new dataSource (i.e., one with a different data schema, with data from a completely new source) can be created on the fly without restarting Tranquility or the Druid servers. Is my understanding correct?

Yes, a new dataSource can be created at runtime without restarting any service.

This might help:
https://groups.google.com/forum/#!searchin/druid-development/fangjin$20yang$20"thoughts"/druid-development/aRMmNHQGdhI/muBGl0Xi_wgJ