Workaround after ingestion failure

Hi,

I was going through the following issue:

https://github.com/apache/druid/issues/7926

Can someone comment on how to recover from such an issue without having to deploy new code?

In my case, I am seeing the following error. There was a failure in the ZK cluster and the MiddleManager service went down.

2019-12-30 00:09:52 WARN [KafkaSupervisor-flowlogs-Reporting-0] org.apache.druid.indexing.kafka.supervisor.KafkaSupervisor - Lag metric: Kafka partitions [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39] do not match task partitions

2019-12-30 00:09:57 INFO [KafkaSupervisor-flowlogs] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - [flowlogs] supervisor is running.

2019-12-30 00:09:57 INFO [KafkaSupervisor-flowlogs] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - Creating new task group [0] for partitions [0, 32, 2, 34, 4, 36, 6, 38, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]

2019-12-30 00:09:57 ERROR [KafkaSupervisor-flowlogs] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - SeekableStreamSupervisor[flowlogs] failed to handle notice: {class=org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor, exceptionType=class org.apache.druid.java.util.common.IAE, exceptionMessage=Expected instance of org.apache.druid.indexing.seekablestream.SeekableStreamEndSequenceNumbers, got org.apache.druid.indexing.seekablestream.SeekableStreamStartSequenceNumbers, noticeClass=RunNotice}

org.apache.druid.java.util.common.IAE: Expected instance of org.apache.druid.indexing.seekablestream.SeekableStreamEndSequenceNumbers, got org.apache.druid.indexing.seekablestream.SeekableStreamStartSequenceNumbers

Thanks,

Dhiman

Hey Dhiman,

I’m not totally sure based on your logs (they don’t paint the full picture), but this might be related to https://github.com/apache/druid/pull/8305. At any rate, please try updating to the latest version of Druid, where this and other bugs related to start vs. end sequence numbers have been fixed.

Gian

Hi Gian,

The Druid cluster is a production setup, and any upgrade needs to go through a process.

There was a network change and, as a result, the ZK cluster went down along with other Druid services. Those services were brought back up, and ever since I have been seeing the error messages related to start and end sequence numbers. I have tried restarting different services, but the ingestion task is still failing. Is there no way to bring the ingestion tasks back up? I'm looking for a temporary workaround.

Thanks,

Dhiman

Hey Dhiman,

If you can’t upgrade, you might be able to fix this by manually editing the metadata in the metadata store so that it is of the ‘end’ type rather than the ‘start’ type. Doing a manual reset might help too, although that will cause your ingestion to lose its place in the Kafka stream and start over from earliest or latest (which may or may not be acceptable to you).
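For anyone hitting the same wall: the manual edit amounts to rewriting the supervisor's committed metadata payload (the JSON stored in the metadata store, e.g. in the `druid_dataSource` table's `commit_metadata_payload` column in a default setup) so that its `partitions` object is of the ‘end’ form. This is only a sketch under the assumption that the payload has the shape shown below, with a `"type": "start"` discriminator and an `exclusivePartitions` field in the start form; dump and inspect your own payload first, and back up the metadata store before changing anything:

```python
import json


def start_to_end(payload: str) -> str:
    """Rewrite a committed Kafka supervisor metadata payload so the
    'partitions' object is of the 'end' form instead of the 'start' form.

    ASSUMPTION: the field names here ('partitions', 'type',
    'exclusivePartitions') reflect a guessed payload shape; verify them
    against the actual commit_metadata_payload in your cluster.
    """
    meta = json.loads(payload)
    partitions = meta.get("partitions", {})
    if partitions.get("type") == "start":
        partitions["type"] = "end"
        # The 'end' form carries no exclusive-partition set.
        partitions.pop("exclusivePartitions", None)
    return json.dumps(meta)


# Made-up example payload of the assumed shape:
before = json.dumps({
    "type": "kafka",
    "partitions": {
        "type": "start",
        "stream": "flowlogs",
        "partitionSequenceNumberMap": {"0": 100, "1": 250},
        "exclusivePartitions": [],
    },
})
after = json.loads(start_to_end(before))
print(after["partitions"]["type"])  # prints "end"
```

The manual reset Gian mentions, by contrast, needs no metadata surgery: it is a `POST` to the Overlord's supervisor reset endpoint (`/druid/indexer/v1/supervisor/<supervisorId>/reset`), with the offset-loss caveat described above.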