Hybrid Pipeline Architecture

Hello folks,

We currently run a real-time ingestion pipeline into Druid that handles close to 400K QPS using Tranquility and Spark Streaming.

We are certainly very happy with the performance.

However, we have cases where a large number of events are delayed, with delays ranging from 6 to 48 hours.

Most of these events are dropped, causing data loss that we would like to avoid.

My use case involves storing data at a queryGranularity of 1 minute and a segment granularity of 6 hours.
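To make the two granularities concrete, here is a minimal Python sketch of how an event timestamp maps to a 1-minute query bucket and a 6-hour segment interval, and how delayed events could be grouped by the segment they belong to. This is plain date arithmetic, not Druid or Tranquility code; the function names (`floor_to`, `segment_interval`, `group_by_segment`) are hypothetical, and it assumes UTC timestamps with buckets aligned to the Unix epoch.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

QUERY_GRANULARITY = timedelta(minutes=1)   # rollup bucket from the post
SEGMENT_GRANULARITY = timedelta(hours=6)   # segment size from the post
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def floor_to(ts, step):
    """Truncate ts down to a multiple of step since the Unix epoch (assumed alignment)."""
    return ts - ((ts - EPOCH) % step)


def segment_interval(ts):
    """Return the (start, end) of the 6-hour interval that contains ts."""
    start = floor_to(ts, SEGMENT_GRANULARITY)
    return start, start + SEGMENT_GRANULARITY


def group_by_segment(timestamps):
    """Group event timestamps by the start of the segment interval they fall into."""
    groups = defaultdict(list)
    for ts in timestamps:
        groups[segment_interval(ts)[0]].append(ts)
    return dict(groups)


if __name__ == "__main__":
    event = datetime(2018, 6, 21, 14, 37, 42, tzinfo=timezone.utc)
    print("query bucket:", floor_to(event, QUERY_GRANULARITY).isoformat())
    start, end = segment_interval(event)
    print("segment interval:", start.isoformat(), "/", end.isoformat())
```

An event that arrives 6 to 48 hours late falls into an interval that is one to eight segments older than the one currently being written, which is why the delayed events target segments the real-time tasks are no longer handling.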

Has anyone faced a problem like this, and what is the ideal architecture to work around it?

The questions which are bothering me are:

  1. How does the reindexing happen for delayed events?

  2. What should be the intermediate store for delayed events?

I would really appreciate any help here.



I am not familiar with Tranquility. Why don't you try the indexing service? It can guarantee that data is consumed exactly once, with no data loss. As for the delayed events, I guess they are caused by a small peon heap size or too few workers.

On Friday, June 22, 2018 at 2:39:11 AM UTC+8, Pranav Sawant wrote: