Recently we found an issue where our Coordinator was not being discovered as part of the cluster, and eventually we saw Broker issues as well.
We also found that supervisor tasks had failed because the Coordinator node was unavailable.
We found the error “Cannot allocate enough memory”, so we updated the process with the required -Xms and -Xmx settings.
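For reference, these heap settings go in the process's jvm.config; a minimal sketch (the sizes below are placeholders, not recommendations):

```
# conf/druid/coordinator/jvm.config -- sizes are illustrative only
-server
-Xms2g
-Xmx2g
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
```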
We restarted both the Coordinator and the Broker, after which we could see the datasources listed in the console.
Now we see that the supervisor resumed and its tasks succeeded, but it picked up the latest offset from the Kafka topic.
Instead, it should have resumed from the offset where it left off when it failed.
As a result there was about one week of data loss; the Kafka indexing service didn't pick up from the offset where it had last succeeded.
Can you help here?
Your best bet might be to ingest from the earliest offset available.
See “auto.offset.reset”: https://kafka.apache.org/documentation/#consumerconfigs
Not sure if this will work, because the consumer will only use this config if it can’t find the current offset. Might be worth a shot, though.
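If you do try it, the consumer config in a Kafka indexing service supervisor is passed through ioConfig.consumerProperties; a hedged sketch (the topic and broker address here are placeholders):

```json
"ioConfig": {
  "topic": "your-topic",
  "consumerProperties": {
    "bootstrap.servers": "kafka-broker:9092",
    "auto.offset.reset": "earliest"
  }
}
```

Keep in mind the supervisor manages offsets itself, so this may only take effect when no stored offset can be found.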
Also, is it possible you set a consumer group that is shared with other consumers, which might have consumed the topic while your supervisor was off?
I have the same problem. After a failed task, all messages read from Kafka by the failed task are lost. I didn't try the earliest auto offset reset, but reading the Kafka documentation it seems that consumers then begin reading from the oldest message in the topic, i.e. they re-read all messages, including messages already loaded into Druid.
Instead it should begin from the latest offset successfully loaded into Druid, not from the latest offset read. Is that possible?
With the introduction of the supervisor, ZooKeeper no longer keeps track of consumer offsets.
When I tried checking the consumer group for druid-, there was no value on the ZooKeeper end.
As per the latest supervisor status for this datasource in Druid, I see it matches the latest offset of the Kafka topic.
I tried setting the supervisor for this datasource to use the earliest offset, by updating its spec with the useEarliestOffset: true option.
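For reference, this is roughly where that flag sits in my supervisor spec (the datasource and topic names are placeholders, and the rest of the spec is elided):

```json
{
  "type": "kafka",
  "ioConfig": {
    "topic": "my-topic",
    "useEarliestOffset": true
  }
}
```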
My understanding is that Druid will avoid duplicating data because of the timestamp value in each message.
I did see that the immediate request showed the earliest offset being used on the supervisor/datasource status page, but subsequent requests didn't show any increments.
I am not sure whether it can read all the missed messages between the offsets in one indexing task.
Also, the Coordinator console for this datasource doesn't show any new shards/segments loaded after 15th May, which makes me wonder: if the tasks succeeded, why aren't they updated?
How do I backfill the missed Kafka messages?
So I am still not sure what steps are needed to get the supervisor working properly and the data ingested into Historicals for querying.
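One thing I may try next, based on the supervisor HTTP API (the Overlord host/port and datasource name below are placeholders): resetting the supervisor, which clears its stored offsets so it starts over according to useEarliestOffset:

```
# Placeholder host/port and supervisor id; note this clears the stored offsets
curl -X POST http://OVERLORD_HOST:8090/druid/indexer/v1/supervisor/my-datasource/reset
```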