Import from Kafka

Hi,

I understand that the following approaches to loading data from Kafka are available, and perhaps more:

  • ingest into realtime nodes (not sure how)

  • use Map/Reduce with a Kafka InputFormat to feed the Druid batch side

  • use Tranquility? (I understood it is an adaptor for push-based streaming solutions such as Storm; not sure how it would work with a pull-based one such as Kafka)

  • direct/native integration?

Related to that, I understood realtime nodes are optional, and more recent options are available.

We do not need true realtime; loading every 10 minutes would be fine (our focus is on batch and low cost).

Input will likely go through Kafka, directly or indirectly, as we will likely use it as a bus for all systems.

Please advise,

Nicu

Hey Nicolae,

I think it’s worth giving the realtime stuff a try, as you may find it is actually not that expensive to run.

In a bit more detail the options are:

  1. Standalone Realtime nodes ingesting directly from Kafka with the Kafka firehose (http://druid.io/docs/latest/ingestion/firehose.html)

  2. Have a process somewhere that pulls from Kafka and pushes to Druid with Tranquility. This could be a stream processor like Storm, but it could also be a standalone process running a simple loop.

  3. Load Kafka data into S3 or HDFS using something like Camus (https://github.com/linkedin/camus) or Secor (https://github.com/pinterest/secor), then index it in Druid with Map/Reduce.

#3 is the batch method. #1 and #2 are both realtime methods.
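
To make the "standalone process running a simple loop" in #2 a bit more concrete, here is a minimal sketch in Python. It is not Tranquility's or Kafka's actual API: `poll`, `send`, and `commit` are hypothetical callables standing in for a Kafka consumer poll, a Tranquility-style push, and an offset commit, and the (offset, event) record shape is assumed for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical record shape: (offset, event). Real clients expose richer objects.
Record = Tuple[int, dict]


def pump(poll: Callable[[], List[Record]],
         send: Callable[[List[dict]], None],
         commit: Callable[[int], None],
         max_polls: int) -> int:
    """Pull batches from a source and push them downstream.

    Returns the number of events sent.
    """
    sent = 0
    for _ in range(max_polls):
        batch = poll()
        if not batch:
            continue  # nothing new yet; poll again
        send([event for _, event in batch])
        # Commit only after the send succeeded. If the process dies between
        # send and commit, the batch is replayed on restart, so this loop is
        # at-least-once, not exactly-once.
        commit(batch[-1][0] + 1)
        sent += len(batch)
    return sent
```

Committing after the send is a design choice: it trades possible duplicates for no data loss, which is one reason a realtime path like this is usually paired with a later batch pass.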

Thanks, some questions:

  1. Exactly once processing is only available with option 3, correct?

  2. Does option 2 require existence of real time nodes?

  3. Typically batch would be required anyway to complement the stream with an idempotent override, so I am thinking of starting with batch, but I love Kafka too much… hard one :)

Thanks,

Nicu

Hey Nicolae,

Yeah, exactly once / transactional processing is currently only available with batch indexing. There’s some work underway to make it possible with Kafka, but that is not the case right now.
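
As a rough illustration of why batch gives you the idempotent override mentioned above, here is a small Python sketch. It is a toy model, not Druid code: `store` is a plain dict standing in for the indexed data, the 10-minute bucket size is taken from the question, and it assumes each run is given all events for every bucket it touches.

```python
from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SECONDS = 600  # 10-minute batches, as suggested in the question


def bucket_start(ts: datetime) -> datetime:
    """Floor a timezone-aware timestamp to the start of its 10-minute bucket."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % BUCKET_SECONDS, tz=timezone.utc)


def reindex(store: dict, events: list) -> None:
    """Rebuild every bucket touched by `events` and replace it wholesale.

    `events` is a list of (timestamp, count) pairs and is assumed to hold
    ALL events for the buckets it touches.
    """
    rebuilt = defaultdict(int)
    for ts, count in events:
        rebuilt[bucket_start(ts)] += count
    # Replace rather than add: rerunning with the same input leaves the
    # store unchanged, which is what makes the batch run idempotent.
    store.update(rebuilt)
```

Running `reindex` twice over the same input yields the same `store`, whereas an append-style load would double the counts on a replay.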

Option 2 (Tranquility) does not need realtime nodes, but it does need an indexing service (overlord + middleManagers).