Druid realtime with replication

Hi all,

I want to replicate real-time nodes. The ingestion is done via a standalone realtime node.
If I have 2-3 standalone realtime nodes pulling from the same topic with a single partition, will the realtime nodes be able to decide which one does the handoff? How is it done?
How does the broker know about the realtime nodes, and how does it decide which one to read from?

Thanks, Vadim.

Hi Vadim,

There’s some interesting information on this under ‘Ingest from Kafka’ here: http://druid.io/docs/latest/ingestion/overview.html

If you only have one Kafka partition, you should be able to have multiple realtime nodes using different consumer groups to pull identical copies of the data from Kafka. When it comes time to persist the segment, they will all persist their identical data and the last one to write would ‘win’. The broker would know that multiple realtime nodes are serving data for the current time period and would pick one of them to query according to some selection strategy so that won’t be an issue. The issue arises when your data volume increases and you have to split your topic into multiple partitions. Once this happens, running multiple realtime nodes will lead to inconsistencies.
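To make the "different consumer groups" part concrete, here is a rough sketch of the firehose portion of a realtime spec (names and layout are from memory of the kafka-0.8 firehose and may differ in your version; the topic, ZooKeeper address, and group name below are placeholders). Each node runs the same spec except for a distinct `group.id`, so each one receives a full copy of the topic:

```json
{
  "firehose": {
    "type": "kafka-0.8",
    "consumerProps": {
      "zookeeper.connect": "zk.example.com:2181",
      "group.id": "druid-realtime-node-1",
      "auto.offset.reset": "largest"
    },
    "feed": "mytopic"
  }
}
```

On the second node you would change only `group.id` (e.g. to `druid-realtime-node-2`); if both nodes shared a group, Kafka would split the partition's messages between them instead of duplicating them.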

Hey Vadim,

To build on what David has said, another option instead of realtime nodes is to use Tranquility, which supports partitioned replicas. You can write a program that combines a Kafka consumer with a Tranquility sender and use that rather than realtime nodes if you need replication.

Hi David,

Suppose I have 2 standalone real-time nodes and one of them crashes. After a restart it will re-ingest the data for the windowPeriod, but it may end up with less data for the last period (say, an hour), because some messages that were still inside the window 10 minutes ago no longer are. When the time to hand off comes, the real-time nodes will hand different data to the historical nodes. Is there any solution to this problem?

Actually, using more standalone real-time nodes means that if any one of them crashes, I'll have data inconsistency after the restart.

Hey Vadim,

In general, realtime ingestion is considered best-effort and even with a single realtime node there are conditions that could lead to data inconsistency. What many teams do is to run a realtime pipeline to allow for immediate querying of data (while tolerating some possible data discrepancies) and then run a batch ingestion pipeline to periodically rebuild an exact copy of the segment.
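For the batch half of that pipeline, the rebuild is typically a Hadoop-based indexing task scoped to the interval you want to make exact. A minimal sketch of such a task spec (field names from memory of the `index_hadoop` task; datasource, interval, and path are placeholders) might look like:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "mydatasource",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "intervals": ["2015-09-01T00:00:00Z/2015-09-02T00:00:00Z"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "hdfs://namenode/events/2015-09-01/"
      }
    }
  }
}
```

Segments produced by the batch task for those intervals get a newer version and overshadow the realtime-built segments, so the exact copy transparently replaces the best-effort one.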

For the realtime pipeline, I’d also echo Gian in recommending taking a look at Tranquility which will help you manage ingestion, especially as the volume of data gets larger.

FYI, there’s a ton of work being done right now on reworking the real-time ingestion to provide more correctness guarantees. If you’re curious, you can read about it here: https://github.com/druid-io/druid/issues/1642

Is there no way to replicate using some sort of write-ahead log? That would be the most consistent approach, and could even be used to maintain cross-DC consistency.

Hey Guruprasad,

Newer Druid versions support the Kafka indexing service, which is able to provide exactly-once ingestion guarantees and can handle both historical and real-time data. For more information, check out:

Blog: https://imply.io/post/2016/07/05/exactly-once-streaming-ingestion.html
Docs: http://druid.io/docs/

This is essentially using Kafka as a WAL.
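For context, ingestion with the Kafka indexing service is driven by a supervisor spec submitted to the overlord; replication becomes a single setting rather than something you coordinate yourself. A trimmed sketch (field names from memory of the supervisor spec; broker address, topic, and datasource are placeholders):

```json
{
  "type": "kafka",
  "dataSchema": { "dataSource": "mydatasource" },
  "ioConfig": {
    "topic": "mytopic",
    "consumerProperties": { "bootstrap.servers": "kafka01:9092" },
    "taskCount": 1,
    "replicas": 2,
    "taskDuration": "PT1H"
  }
}
```

Here `replicas: 2` runs two tasks reading the same offsets, and because both replay identical offset ranges from Kafka, they stay consistent even across restarts, which addresses the inconsistency Vadim described earlier in the thread.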