How does Druid merge historical and real-time results?

I know Druid can merge historical and real-time results before returning a final consolidated result for a query.

But how exactly does it do this, and what kind of strategy does it use?

Since Druid is different from HBase, which is more of a key-value DB, I suppose that the timestamp of a record is critical.

Can anyone provide some details?

Hi Eric,

The Druid broker maintains a timeline that tracks which time ranges are available on historical nodes and which are available via realtime indexes. It queries both historical nodes and realtime workers, then merges the results. Realtime results are merged with historical results in the same way as two historical results would be merged together. The exact merging algorithm differs slightly for each query type, but generally each node computes the query result based on the data it has available, and then the broker does a final aggregation on top of those partial aggregations.

The whitepaper goes into some more detail on this (http://static.druid.io/docs/druid.pdf).
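To make the "final aggregation on top of partial aggregations" idea concrete, here is a minimal sketch in Python. It is not Druid code: the function name `merge_partials`, the row layout, and the sample values are all made up for illustration, and it assumes a timeseries-style query whose aggregators (count, sum) can simply be added together. Other query types merge differently.

```python
from collections import defaultdict

def merge_partials(*partials):
    """Combine per-node partial aggregates into one result per time bucket.

    Each partial is a list of (time_bucket, {"count": int, "sum": float})
    rows, as a hypothetical node might return them. Hypothetical layout,
    not Druid's wire format.
    """
    merged = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for rows in partials:
        for bucket, agg in rows:
            # Additive aggregators can be merged by adding the partials.
            merged[bucket]["count"] += agg["count"]
            merged[bucket]["sum"] += agg["sum"]
    # Return rows ordered by time bucket, as plain dicts.
    return sorted(((b, dict(a)) for b, a in merged.items()), key=lambda r: r[0])

# Illustrative partial results: one from a historical node, one from a realtime worker.
historical = [
    ("2015-01-01T00", {"count": 10, "sum": 50.0}),
    ("2015-01-01T01", {"count": 4, "sum": 8.0}),
]
realtime = [
    ("2015-01-01T01", {"count": 1, "sum": 2.0}),
    ("2015-01-01T02", {"count": 3, "sum": 9.0}),
]

result = merge_partials(historical, realtime)
```

The merge step does not care whether a partial came from a historical node or a realtime worker, which matches the point above that both kinds of results are merged the same way.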

Hi Gian, thanks for the tip that the exact merging algorithm differs for each query type.
At first I was curious about what brokers do when historical results overlap or conflict with real-time results. I suppose this might happen because a user can inject batch data directly into the historical nodes. Anyway, this blog indicates that batch data will replace real-time data for the same interval, and I think this basically solves my problem.

Hi Eric,

Yes, you are right. Druid uses an MVCC swapping protocol to replace old segments with newer segments that have the latest version.

FWIW, reindexing data for an interval will create segments with a newer version, which replace any older segments present for the same interval.
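As a rough illustration of the version-based replacement described above, here is a small Python sketch. It is a simplification under stated assumptions, not Druid's actual timeline logic: it assumes segments cover exactly identical intervals (partial overlaps are not handled), that versions are strings that sort chronologically, and the names `visible_segments`, `interval`, `version`, and `source` are hypothetical.

```python
def visible_segments(segments):
    """For each interval, keep only the segment with the highest version.

    Simplified model: intervals must match exactly, and versions are
    assumed to be strings that sort in chronological order.
    """
    latest = {}
    for seg in segments:
        key = seg["interval"]
        # A newer version replaces (overshadows) any older one for the same interval.
        if key not in latest or seg["version"] > latest[key]["version"]:
            latest[key] = seg
    return latest

# Illustrative segments: realtime-built data, then a later batch reindex of the same day.
segments = [
    {"interval": "2015-01-01/2015-01-02", "version": "2015-01-01T00:00:00Z", "source": "realtime"},
    {"interval": "2015-01-01/2015-01-02", "version": "2015-01-05T12:00:00Z", "source": "batch"},
    {"interval": "2015-01-02/2015-01-03", "version": "2015-01-02T00:00:00Z", "source": "realtime"},
]

visible = visible_segments(segments)
```

In this sketch the batch segment wins for the first day because its version is newer, while the second day is still served from the realtime-built segment.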

Hi Nishant/Gian,

We have a setup with multiple druid-realtime services (pulling data from Kafka using a different group.id for the given datasource), and we wanted to understand how the broker dispatches queries to these druid-realtime services. If one of the druid-realtime services is not able to serve the request, will the original user query fail? Or will the broker return a partial result (given that the second druid-realtime service returns a result)?

Any help on this would be greatly appreciated.

Thanks,
JP

Hi Jvalant,

What do you mean by "not able to serve the request"?

In normal scenarios, Druid would return partial results.

Thanks

Abhishek