Why doesn't the Kafka indexing service save consumer offsets in Kafka?

I recently started ingesting from Kafka using the Kafka Indexing Service.
I noticed that the service does not commit consumer offsets in Kafka.

Instead, the service manages offsets in Druid's metadata store.

I would like to know the reasoning behind the design decision not to use Kafka's built-in offset management feature.



The rationale is explained pretty well in the docs for Kafka itself: https://kafka.apache.org/0101/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html, see “Storing offsets outside Kafka”:

“The consumer application need not use Kafka’s built-in offset storage, it can store offsets in a store of its own choosing. The primary use case for this is allowing the application to store both the offset and the results of the consumption in the same system in a way that both the results and offsets are stored atomically. This is not always possible, but when it is it will make the consumption fully atomic and give ‘exactly once’ semantics that are stronger than the default ‘at-least once’ semantics you get with Kafka’s offset commit functionality.”

This other bullet point from the docs describes something very similar to what Druid does:

“If the results are being stored in a local store it may be possible to store the offset there as well. For example a search index could be built by subscribing to a particular partition and storing both the offset and the indexed data together. If this is done in a way that is atomic, it is often possible to have it be the case that even if a crash occurs that causes unsync’d data to be lost, whatever is left has the corresponding offset stored as well. This means that in this case the indexing process that comes back having lost recent updates just resumes indexing from what it has ensuring that no updates are lost.”
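To illustrate the pattern the docs describe, here is a minimal, hypothetical sketch (not Druid's actual implementation) of storing the consumed data and the next offset in the same local store within a single transaction, using SQLite as the stand-in store. A real consumer would also disable Kafka's auto-commit and `seek()` to the stored offset on startup; the table and function names here are illustrative only.

```python
import sqlite3

def init_store(conn):
    # The indexed data and the checkpointed offset live in the SAME store.
    conn.execute("CREATE TABLE IF NOT EXISTS segments (offset INTEGER, doc TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS checkpoint "
                 "(id INTEGER PRIMARY KEY CHECK (id = 0), next_offset INTEGER)")
    conn.execute("INSERT OR IGNORE INTO checkpoint VALUES (0, 0)")
    conn.commit()

def resume_offset(conn):
    # On restart, resume from the offset stored alongside the data,
    # instead of asking Kafka for a committed offset.
    return conn.execute("SELECT next_offset FROM checkpoint").fetchone()[0]

def index_batch(conn, messages, start_offset):
    # Write the indexed data AND the advanced offset in one transaction:
    # after a crash, either both survive or neither does, so whatever
    # data is left always has a matching offset -> exactly-once into
    # this store, as the Kafka docs describe.
    with conn:  # sqlite3 connection context manager = one transaction
        for i, msg in enumerate(messages):
            conn.execute("INSERT INTO segments VALUES (?, ?)",
                         (start_offset + i, msg))
        conn.execute("UPDATE checkpoint SET next_offset = ?",
                     (start_offset + len(messages),))

conn = sqlite3.connect(":memory:")
init_store(conn)
index_batch(conn, ["a", "b"], resume_offset(conn))
index_batch(conn, ["c"], resume_offset(conn))
print(resume_offset(conn))  # 3
```

The key point is that there is no separate "commit offset" step that can get out of sync with the data write, which is exactly why committing back to Kafka would be redundant here.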

Hi Gian,

Thanks for the quick reply in 8 min! Yeah, I'm aware of that doc, and figured Druid needed precise control over offsets.

I'm going to set up offset monitoring, and thought I could use one of the readily available Kafka monitoring tools.

As an alternative idea, it would be great if the supervisor emitted the offset at every commit via Druid's emitter.

I know the /status endpoint can expose this info, but I'd prefer pushing the metrics out over polling for them.



That sounds like a nice feature to add, a patch would be welcome :slight_smile: