Indexing via Java API

Hello,

Where can I find an example of indexing data into Druid using its Java API?

Basically, my input data comes from Kafka. It is then transformed by a Java application and pushed into a second dedicated Kafka topic. This topic is consumed by a Druid realtime node. I would like to eliminate the need for the second Kafka topic and have the Java application insert into Druid directly. How do I do this? Alternatively, is there a way of plugging data transformations/filters into the realtime node setup?

Thank you,

/David

Where can I find an example of indexing data into Druid using its Java API?

I'm looking for that too.

I have similar requirements.

Hey David,

Tranquility (https://github.com/druid-io/tranquility) looks like a good fit for your requirements. It works with the Druid indexing service instead of realtime nodes and is the recommended way to ingest data into Druid: it gives you better replication and scalability, and allows things like schema rollover without downtime.

There is a tranquility-kafka service (https://github.com/druid-io/tranquility/blob/master/docs/kafka.md) that handles ingesting data from Kafka using Tranquility, but it runs as a separate process and is intended for people who want to use Kafka without writing any code. Since you want to reduce the number of processes in your ingestion pipeline, you'll probably want to use Tranquility directly, though you may be able to take some inspiration from the tranquility-kafka code.
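For reference, tranquility-kafka is driven by a JSON config file mapping Kafka topics to Druid datasources. A rough sketch of its shape is below; the datasource name, topic, and connection strings are placeholders, and the exact set of required fields (parser, tuningConfig, etc.) is in the kafka.md docs linked above:

```json
{
  "dataSources": {
    "transformed-events": {
      "spec": {
        "dataSchema": {
          "dataSource": "transformed-events"
        }
      },
      "properties": {
        "topicPattern": "transformed-topic"
      }
    }
  },
  "properties": {
    "zookeeper.connect": "localhost:2181",
    "kafka.zookeeper.connect": "localhost:2181",
    "kafka.group.id": "tranquility-kafka"
  }
}
```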

At this time, Druid doesn't have any built-in mechanism for pre-ingestion transforms.

Hey folks,

As David Lim mentioned, tranquility-kafka is a good way to read from a Kafka topic directly into Druid, without transformation.

If you want to do transformations too, you can embed tranquility-core as a library into your own Java applications. Docs and sample code are here: https://github.com/druid-io/tranquility/blob/master/docs/core.md
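The linked core.md docs have the authoritative API (DruidBeams and friends). As a hedged sketch of the overall shape, the snippet below shows where your transform step would sit before each event is handed off; `EventSender` here is a hypothetical stand-in for Tranquility's sender, and the field names are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a transform-then-send loop for embedding in a Java app.
// EventSender is a hypothetical stand-in for Tranquility's sender;
// see the linked core.md docs for the real API.
class TransformingIngest {

    // A Druid event is essentially a flat map of column -> value,
    // including a timestamp column.
    interface EventSender {
        void send(Map<String, Object> event);
    }

    // The transformation previously done between the two Kafka topics
    // can live here instead: enrich, rename, or drop fields.
    static Map<String, Object> transform(Map<String, Object> raw) {
        Map<String, Object> out = new HashMap<>(raw);
        // Example: normalize a field name and drop an internal field
        // (illustrative field names, not from the thread).
        Object user = out.remove("user_name");
        if (user != null) {
            out.put("user", user);
        }
        out.remove("debugInfo");
        return out;
    }

    // Called once per record consumed from the first Kafka topic.
    static void process(Map<String, Object> raw, EventSender sender) {
        Map<String, Object> event = transform(raw);
        if (event != null) { // a null result could signal "filtered out"
            sender.send(event);
        }
    }
}
```

The point of the shape is that the second Kafka topic disappears: your consumer calls `process` directly, and the sender pushes the transformed event into Druid.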

Hello,

Thanks for the information, I’ll test embedding tranquility-core!

/David

Hello,

Is it feasible to use the low-level API you mentioned inside a custom Hadoop M/R job in order to do batch indexing into Druid? The motivation is the same as above: the ability to transform records before indexing.

Thanks,

/David

Hi David, Druid has a built-in Hadoop indexer, so you can use Hadoop to load data into Druid. Perhaps I am not fully understanding your use case.

Hi Fangjin,

Thank you for your reply. The reason for wanting to use tranquility-core directly is twofold: a) I need to transform events before indexing, and b) our data is on GCS, which until recently was not supported by Druid as an input data source, if I understand correctly.

Hi David, that makes sense. Using Tranquility to transmit events to Druid as the last stage of a stream processor is a common setup we see.