Hello Druid Experts,
I'm exploring ways to integrate Druid ingestion with our streaming pipeline.
As I understand it, there are currently two options:
- Use the Tranquility Java APIs, which internally talk to the Overlord and spawn indexing tasks.
- Use the Kafka Indexing Service.
We are considering a different route:
Integrate Druid's low-level APIs directly into our streaming pipeline (Spark), so that from the reducer phase we use the StreamAppenderatorDriver class (or other classes/modules) to create segments the way KafkaIndexingTask does.
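To make the idea concrete, here is a minimal sketch of the shape we have in mind. Note that `SegmentWriterSketch` and its methods are hypothetical stand-ins we made up for illustration, not the real `StreamAppenderatorDriver` API; the real driver also handles segment allocation, persistence, handoff, and publishing to the metadata store.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for Druid's StreamAppenderatorDriver: buffers rows
// and "publishes" a segment once the batch for a partition is complete.
class SegmentWriterSketch {
    private final List<String> rows = new ArrayList<>();
    private final List<List<String>> publishedSegments = new ArrayList<>();

    // In the real driver this would be something like driver.add(row, ...).
    void add(String row) {
        rows.add(row);
    }

    // In the real driver this would be the publish/handoff step.
    void publish() {
        publishedSegments.add(new ArrayList<>(rows));
        rows.clear();
    }

    List<List<String>> segments() {
        return publishedSegments;
    }

    // Shape of the reducer-side logic: consume one Spark partition's rows
    // (e.g. inside foreachPartition), append each to the writer, then
    // publish the segment for that batch.
    static SegmentWriterSketch processPartition(Iterator<String> partition) {
        SegmentWriterSketch writer = new SegmentWriterSketch();
        while (partition.hasNext()) {
            writer.add(partition.next());
        }
        writer.publish();
        return writer;
    }
}
```

The point of the sketch is only the control flow: one writer per partition, append during the reduce, publish at batch end.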
Our reasons: the Tranquility API publishes data over the network again, so the data crosses the network twice (Kafka to the streaming job, then the streaming job to Tranquility).
Tranquility can also drop or duplicate data, which we want to avoid.
We are also exploring the idea of using Spark as the compute engine for Druid, since we already run and support it.
If we use the Kafka Indexing Service, we have to run the Overlord, MiddleManagers, and Peons as a distributed compute layer, which increases operational overhead.
Known drawbacks of our route:
- No out-of-the-box realtime query servers; can we do something about that?
- No exactly-once delivery, which the Kafka Indexing Service provides.
Answer: we are able to manage deduplication in our own pipeline.
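For reference, the deduplication we have in mind is offset-based: track the highest Kafka offset already ingested per (topic, partition) and skip anything at or below it on replay. A minimal sketch follows; the class and method names are ours for illustration, not from Druid or Kafka.

```java
import java.util.HashMap;
import java.util.Map;

// Offset-based dedup sketch: ingest a record only if its offset is higher
// than the last offset recorded for that (topic, partition).
class OffsetDedup {
    private final Map<String, Long> committed = new HashMap<>();

    // Returns true if the record is new and should be ingested,
    // false if it is a replayed duplicate.
    boolean shouldIngest(String topic, int partition, long offset) {
        String key = topic + "-" + partition;
        long last = committed.getOrDefault(key, -1L);
        if (offset <= last) {
            return false; // already ingested, skip the duplicate
        }
        committed.put(key, offset);
        return true;
    }
}
```

In a real pipeline this map would have to be persisted atomically with the published segments so that a restart cannot re-ingest records, which is essentially what the Kafka Indexing Service does with offsets in the segment metadata.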
So, if anyone has experience trying this route, or thoughts on how it sounds, please share your suggestions.