Kafka Indexing Extension

Hi all

We have a few questions regarding the new Kafka indexing extension, which we’d like to play with since it addresses some of the issues we have with Tranquility.

Does the extension require the 0.9.1 code base, or is it compatible with 0.9.0 (which we are running now)? Also, what level of stability should we expect from it in its current state? Ideally, we’d like to run it alongside our existing Tranquility pipeline and compare results.

Thanks for any tips. We’re also interested in anything we can do to help/speed development in this area.

Regards,

Max

Hey Max,

Yes, the extension requires 0.9.1, as it uses some hooks into the Overlord that were not implemented in 0.9.0. There should be a 0.9.1 RC within the next week or so.

Stability-wise, it’s been tested continuously over a two-week period at low data volumes, but it hasn’t yet been run at high ingestion rates; it’ll be tested in a more demanding environment over the next few weeks.

It would be awesome if your team were able to run it alongside Tranquility and help with validation and finding any remaining issues.

Excellent. Sounds like we’ll wait for the 0.9.1 RC to land and then give it a try.

We are also looking forward to trying out the Kafka indexer. It would be great if one could register a callback function that performs a flatMap() operation, converting an input record fetched from the topic into possibly several input records for Druid. As far as I can tell from looking into some classes related to batch indexing, the parsers only have a conversion callback that takes a single record and returns a single record.
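
To make this concrete, here is a rough sketch of what I mean; the interface names below are purely illustrative and are not existing Druid extension points, they just contrast the current one-in/one-out callback shape with a flatMap-style one:

```java
import java.util.List;

import io.druid.data.input.InputRow;

// Hypothetical sketch only -- neither interface exists in Druid today.

// Roughly the shape of today's parser callback: one Kafka record in, one Druid row out.
interface SingleRecordConversion<T>
{
  InputRow parse(T record);
}

// A flatMap-style callback would instead return zero or more Druid rows per Kafka record.
interface FlatMapConversion<T>
{
  List<InputRow> parse(T record);
}
```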

With the Kafka indexer and its exactly-once guarantee, it seems even more important that it can do sophisticated conversions in place. If we instead had to run a streaming-ETL app that consumes from a Kafka topic, converts the records into a format suitable for Druid ingestion, and writes them into another topic that the Kafka indexer reads from, then that app would also need an exactly-once guarantee.

So, is there a way to perform a flatMap-like record conversion within the Kafka indexer task by specifying a custom conversion class in the ingestion spec?

Thanks,

Sascha

Hey Sascha,

This is a little tricky: if one Kafka message became multiple input rows for Druid, that would interfere with our current strategy for providing exactly-once guarantees. A single Kafka message might need to be split across Druid segments, or we might only be able to insert one of its rows, and in that case there’s no correct offset to store in the ingestion metadata. It could probably be made to work, but it isn’t something that works today.
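
To illustrate the offset problem (the class and field names below are made up for the example and are not the real ingestion metadata schema):

```java
// Illustrative only: shows why a one-message-to-many-rows conversion has no single
// correct offset to commit under an offset-based exactly-once scheme.
class KafkaCommitMetadata
{
  // Everything before this offset is assumed to be fully contained in published
  // segments; on restart the task resumes reading from here, with no gaps and no
  // duplicates.
  long nextOffsetToRead;
}

// Suppose the record at offset 41 flatMaps into rows A and B, and only A makes it
// into the segment being published:
//   nextOffsetToRead = 41  -> row B is silently dropped
//   nextOffsetToRead = 42  -> row A is re-ingested after a restart (duplicate)
// Neither value is correct, which is why splitting one message into many rows breaks
// the current strategy.
```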

Personally, I’m hoping a “streaming ETL with an exactly-once guarantee” will become available, and then you can just stick that in front of Druid :slight_smile: