We are trying to use kafka indexing service to ingest stream data in druid (using Druid v0.13). In the configuration file we set
“ioConfig”: {
“topic”: “###”,
“replicas”: 1,
“useEarliestOffset”: true,
“taskDuration”: “PT5M”,
“completionTimeout”: “PT10M”,
“consumerProperties”: {
“bootstrap.servers”: “****:9092”
}
}
For some reasons, overlord node failed, we need to restart it. I am just wondering how kafka indexing service will get resumed and pick up the offset correctly. Currently, I don’t see how we can find out how much data has been ingest from the topics.
Do you mean on Overlord console? It will show task failure, but I am wondering how it detect how much message has been ingest and where to pickup for the next task.
I don’t know if there are docs, but I learned it by reading the source. The operation that the indexing service does to publish new segments is the SegmentTransactionalInsertAction, which eventually translates to a SQL transaction which
Reads the data source’s record in druid_datasources and confirms that the offsets stored there are the same as the ones the task had when it started
Writes new offsets to druid_datasources
Inserts new segments into druid_segments
So essentially the exactly-once comes down to the ability to run a SQL transaction over cluster metadata.