Batch indexing & Tranquility

Hey everybody,

My Java application needs to send historic data to a remote Druid cluster for indexing. I need a solution without Hadoop.

  1. Can I use Tranquility for this? The Tranquilizer API looks very appealing, but from some older posts it seems that Tranquility does not (yet?) support batch ingestion: it does real-time ingestion and hence will drop events outside its window. Is this info still correct?
  2. If Tranquility doesn’t work, the next best API according to the documentation seems to be the Index Task API. Since my application and Druid are not on the same file system, I am thinking of using the EventReceiverFirehose endpoint (/druid/worker/v1/chat//push-events/). Some questions:
     a. Can I make multiple calls to /druid/worker/v1/chat//push-events/? (Looking at DruidBeam.scala, it seems so.)
     b. Does a call to /druid/worker/v1/chat//push-events/ block until the events are processed?
     c. If such a call doesn’t block, how can I apply back pressure (i.e., make sure I don’t overwhelm Druid with requests)?

Thanks a lot for any feedback!

Kaspar

Hey Kaspar,

Yes, Tranquility still drops events outside the realtime window – it’s really designed for ingestion of realtime events.
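
To make the "outside the window" behaviour concrete, here is a simplified sketch of a window check. It is only a model for illustration, not Tranquility's actual code: the class `WindowCheck`, the method `inWindow`, the symmetric window around the current time, and the 10-minute value are all assumptions. The real acceptance rule depends on how the window period and segment granularity are configured.

```java
import java.time.Duration;
import java.time.Instant;

class WindowCheck {
    // Hypothetical helper: true if eventTime lies within +/- window of now.
    // A realtime-only ingester would drop any event for which this is false.
    static boolean inWindow(Instant eventTime, Instant now, Duration window) {
        return !eventTime.isBefore(now.minus(window))
            && !eventTime.isAfter(now.plus(window));
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2016-09-01T12:00:00Z");
        Duration window = Duration.ofMinutes(10); // assumed window size

        // An event from five minutes ago is accepted: prints true
        System.out.println(inWindow(Instant.parse("2016-09-01T11:55:00Z"), now, window));

        // A historic event from months ago is dropped: prints false
        System.out.println(inWindow(Instant.parse("2016-01-01T00:00:00Z"), now, window));
    }
}
```

Under this model, any backfill of historic data would be rejected almost entirely, which is why Tranquility is not a fit for batch loads.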

If you’re willing to use Kafka, the easiest thing is probably to use the Kafka indexing service. It is designed to ingest both realtime and historical data streams with exactly-once semantics. We have a tutorial here: https://imply.io/docs/latest/tutorial-kafka-indexing-service.html. Note that it is experimental at this time, but if it fits your needs I encourage you to give it a shot and let us know if it works for you!
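
On the back-pressure question: whichever ingestion path you pick, one generic client-side approach is to cap the number of batches that have been sent but not yet acknowledged. The sketch below uses only the JDK and is not tied to any Druid or Tranquility API. `BoundedSender` and the `send` function are hypothetical names; `send` stands for whatever asynchronous call your application uses to ship a batch, and is assumed to return a future that completes when the batch has been acknowledged.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

class BoundedSender<T> {
    private final Semaphore permits;
    // Hypothetical async sender supplied by the application.
    private final Function<T, CompletableFuture<Void>> send;

    BoundedSender(int maxInFlight, Function<T, CompletableFuture<Void>> send) {
        this.permits = new Semaphore(maxInFlight);
        this.send = send;
    }

    // Blocks the caller while maxInFlight batches are still unacknowledged,
    // so the producer cannot run ahead of the receiver.
    CompletableFuture<Void> submit(T batch) {
        permits.acquireUninterruptibly();
        CompletableFuture<Void> result;
        try {
            result = send.apply(batch);
        } catch (RuntimeException e) {
            permits.release(); // the send never started, so free the slot
            throw e;
        }
        // Free the slot once the batch succeeds or fails.
        return result.whenComplete((ok, err) -> permits.release());
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

With `maxInFlight` set to, say, 4, the fifth call to `submit` waits until one of the earlier batches completes. Retry and error handling are left out here and would need to be added for real use.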