100% Kafka

I was wondering what would be required to run Druid completely on Kafka. If I use S3 for long-term storage I no longer need HDFS, and I’m already using Kafka for realtime indexing, so I’ve got this entire Hadoop cluster just to reprocess historical data. Is there a way to do with Kafka what was previously done with Hadoop for bulk processing? I was hoping that something like KSQL might make that a bit easier. What I’d like to be able to do is set up Druid as Druid+S3+Kafka.

Hey Zachary!

Many Druid users who focus only on real-time data use a workflow like this: Kafka > enrich the data with something like KSQL, Flink or Spark Streaming > write the enriched data back into Kafka > ingest into Druid. This is a very common method of bringing enriched data into Druid.

Thanks for the response. It seems like it’s 80% of the way there but what about something like doing compactions or reindexing? Wouldn’t those operations still require Hadoop? Would someone going with a pure Kafka solution have to forgo those operations or is there a way they can be done in Kafka?

No need for Hadoop, as the work is being done by the MiddleManager.
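As a rough sketch of what that looks like, a compaction task can be built as JSON and POSTed to the Overlord's task endpoint, with no Hadoop involved. The datasource name, interval, and Overlord URL below are placeholders, and the spec shape assumed here (an `ioConfig` with an interval `inputSpec`) is the one used by newer Druid releases, so check the docs for the version you run.

```python
import json
import urllib.request


def build_compaction_task(datasource, interval):
    """Build a minimal native compaction task (assumed newer-style spec)."""
    return {
        "type": "compact",
        "dataSource": datasource,
        "ioConfig": {
            "type": "compact",
            "inputSpec": {"type": "interval", "interval": interval},
        },
    }


def submit_task(overlord_url, task):
    """POST a task spec to the Overlord; the URL is supplied by the caller."""
    req = urllib.request.Request(
        overlord_url.rstrip("/") + "/druid/indexer/v1/task",
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    # "my_datasource" and the interval are hypothetical values.
    task = build_compaction_task("my_datasource", "2019-01-01/2019-02-01")
    print(json.dumps(task, indent=2))
    # submit_task("http://overlord.example:8090", task)  # hypothetical host
```
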

Rommel Garcia

Interesting. I don’t think I had seen parallel native indexing tasks the last time I looked. Maybe it’s new, or I overlooked it before.


Now I see how you can have a complete Kafka solution.

The Kafka Indexing Service provides the parallelism to ingest from Kafka through the MiddleManagers (MM).
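To illustrate, the parallelism is set in the supervisor spec through `taskCount` in the `ioConfig`. Here is a minimal sketch that builds such a spec; the topic, broker address, column names, and granularities are all made-up values, and the layout (a top-level `spec` wrapper with `inputFormat`) is an assumption that matches newer Druid versions rather than 0.12/0.13.

```python
import json


def build_kafka_supervisor_spec(datasource, topic, bootstrap_servers, task_count):
    """Build a minimal Kafka supervisor spec (assumed newer-style layout)."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Hypothetical timestamp column and dimensions.
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["user", "page"]},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "HOUR",
                    "queryGranularity": "NONE",
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                # Number of reading tasks; Druid will not run more reading
                # tasks than the topic has partitions.
                "taskCount": task_count,
                "replicas": 1,
                "taskDuration": "PT1H",
            },
        },
    }


if __name__ == "__main__":
    spec = build_kafka_supervisor_spec("events", "events-enriched", "kafka:9092", 4)
    print(json.dumps(spec, indent=2))
```

A spec like this would be POSTed to the Overlord at `/druid/indexer/v1/supervisor`.
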

Rommel Garcia

Hi Zachary,

If you’re using the Kafka Indexing Service, you’re on a version of Druid new enough that any kind of reprocessing task can be run directly on the Druid MiddleManagers. That is how we use the system: the Kafka Indexing Service ingests live data, and if we ever need to reprocess old data segments to change something, we submit native tasks for them. I don’t see any need for Hadoop unless maybe your data is completely humongous, but we have a good amount of data ourselves.
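For reference, a native reprocessing task of that kind might look roughly like the sketch below: an `index_parallel` task that reads existing segments back from Druid and rewrites them. The datasource, interval, and dimension names are invented, `metricsSpec` is left out for brevity, and the `druid` input source plus `maxNumConcurrentSubTasks` are assumptions that apply to newer releases (older ones used a firehose and a differently named tuning option).

```python
import json


def build_reindex_task(datasource, interval, dimensions, max_subtasks):
    """Build a minimal native parallel reindexing task (assumed newer-style spec)."""
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Existing segments expose their timestamp as __time in millis.
                "timestampSpec": {"column": "__time", "format": "millis"},
                "dimensionsSpec": {"dimensions": dimensions},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",
                    "queryGranularity": "NONE",
                    "intervals": [interval],
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                # Read the segments of the same datasource for this interval.
                "inputSource": {
                    "type": "druid",
                    "dataSource": datasource,
                    "interval": interval,
                },
                # Overwrite the interval instead of appending to it.
                "appendToExisting": False,
            },
            "tuningConfig": {
                "type": "index_parallel",
                "maxNumConcurrentSubTasks": max_subtasks,
            },
        },
    }


if __name__ == "__main__":
    task = build_reindex_task("events", "2019-01-01/2019-01-08", ["user", "page"], 4)
    print(json.dumps(task, indent=2))
```

The resulting JSON goes to the same Overlord task endpoint (`/druid/indexer/v1/task`) as any other native task.
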

The last time I ran Druid was on 0.12.0 with the Kafka indexing service, but it looks like native parallel indexing was added in 0.13.0, so at the time I still needed Hadoop. I’m looking forward to not needing it in my next setup, and I’m hoping it will be easier to get a small install going. I’m amazed at how often I see people cobbling together solutions where I end up saying, “ya know, there’s this thing called Druid.” It would also be nice to be able to just throw it on the ol’ Kubernetes cluster with Minio for S3 and go, even though Kafka on Kubernetes is still a bit of a pain.