KIS task failing with no meaningful reason

Hi Team,

We have lots of ingestion tasks (Kafka Indexing Service) failing with the message below - any hints, please?

2017-09-28T15:18:30,517 ERROR [task-runner-0-priority-0] io.druid.indexing.kafka.KafkaIndexTask - Resetting Kafka offsets for datasource [XXXX]: {class=io.druid.indexing.kafka.KafkaIndexTask, partitions=[1]}



The job starts from "partitionOffsetMap" : "1" : 252317290, and after about 10 minutes of normal processing we also get a warning message:
2017-09-28T15:18:30,496 WARN [task-runner-0-priority-0] io.druid.indexing.kafka.KafkaIndexTask - OffsetOutOfRangeException with message [Offsets out of range with no configured reset policy for partitions: {XXXX-1=252337201}]
2017-09-28T15:18:30,502 INFO [task-runner-0-priority-0] io.druid.indexing.common.actions.RemoteTaskActionClient - Performing action for task[XXXX_0a10ea4d91031aa_ddbohkdj]: ResetDataSourceMetadataAction{dataSource='XXXXX', resetMetadata=KafkaDataSourceMetadata{kafkaPartitions=KafkaPartitions{topic='XXXX', partitionOffsetMap={1=252337201}}}}
2017-09-28T15:18:30,506 INFO [task-runner-0-priority-0] io.druid.indexing.common.actions.RemoteTaskActionClient - Submitting action for task[XXXX_0a10ea4d91031aa_ddbohkdj] to overlord[http://YYYY:8090/druid/indexer/v1/action]: ResetDataSourceMetadataAction{dataSource='XXXX', resetMetadata=KafkaDataSourceMetadata{kafkaPartitions=KafkaPartitions{topic='XXXXX', partitionOffsetMap={1=252337201}}}}

Thanks,
Dan

Hi,

You can try resetting the supervisor for this datasource and see if the error goes away.
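
As a rough sketch (the Overlord host and supervisor id below are placeholders taken from the redacted log, not real values), a hard reset can be issued against the Overlord's supervisor API, e.g. from Python:

import requests

overlord = "http://YYYY:8090"   # Overlord URL, as in the log above (redacted host)
supervisor_id = "XXXX"          # supervisor id, normally the datasource name

# A hard reset clears the offsets the supervisor has stored in the metadata
# store, so the next tasks pick a fresh starting point (per useEarliestOffset).
resp = requests.post(overlord + "/druid/indexer/v1/supervisor/" + supervisor_id + "/reset")
resp.raise_for_status()
print(resp.status_code, resp.text)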

Hong

Thanks Hong for your response.

I tried to reset multiple times without luck - also, playing around with some parameters (resetOffsetAutomatically, auto.offset.reset, useEarliestOffset) didn’t help
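
For reference, a sketch of where those three settings go in the supervisor spec (the topic, broker and values below are illustrative placeholders, not our real spec):

# useEarliestOffset and consumerProperties sit in ioConfig,
# resetOffsetAutomatically in tuningConfig.
spec_fragment = {
    "ioConfig": {
        "topic": "XXXX",
        "useEarliestOffset": True,          # starting point when no offsets are stored
        "consumerProperties": {
            "bootstrap.servers": "kafka-broker:9092",
            # As far as I understand, the indexing tasks manage offsets
            # themselves and pin the consumer's reset policy, which would
            # explain the "no configured reset policy" error above.
        },
    },
    "tuningConfig": {
        "type": "kafka",
        "resetOffsetAutomatically": True,   # let tasks jump past unavailable offsets
    },
}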

It's worth mentioning some strange behavior in our metadata DB (Aurora): some of the datasources are missing, and for others not all Kafka topic partitions have an offset in the payload.

I didn't find the ZooKeeper node where the Kafka offsets per partition are stored.

Any other ideas, please?

We recently had a similar issue with exactly the same error. It turned out to be a setting on Kafka: although our data retention in Kafka is set to 24hr, there is another parameter, log.retention.bytes, which was set to only 10MB, so Kafka messages actually got purged after 5-6 minutes due to this limit. I would suggest you check the earliest Kafka offset and the latest offset, keep inserting messages, and test how long it actually takes for your latest-offset message to become the earliest offset and get purged out of Kafka.
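
Something like the following (a rough sketch using the kafka-python client; broker and topic names are placeholders) shows how far apart the earliest and latest offsets are for the partition in the log:

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="kafka-broker:9092")  # placeholder broker
tp = TopicPartition("XXXX", 1)                                   # partition 1 from the log above

earliest = consumer.beginning_offsets([tp])[tp]   # oldest offset still on the broker
latest = consumer.end_offsets([tp])[tp]           # next offset to be written
print("earliest:", earliest, "latest:", latest)
consumer.close()

# If the offset Druid is asking for (252337201 in the log) is already below
# "earliest", the messages were purged, e.g. by log.retention.bytes.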

Hong