Overlord turns to LOST_CONTACT_WITH_STREAM every time a task rollover happens, and the publishing tasks fail

We are using Druid 0.16.1-incubating.

Every time the tasks of a data source roll over, the overlord status changes to LOST_CONTACT_WITH_STREAM.

  • Sometimes it recovers by itself by resetting all the Kafka partitions (we have 100 partitions). This takes anywhere from 10 to 30 minutes

  • Sometimes we have to manually reset the data source to get the overlord to recover (see the sketch after the log excerpt below)

  • Meanwhile, all the publishing tasks start failing. These failing tasks sometimes show success in the logs but are marked failed in the status field of the overlord UI. Please see the attached file (failed-task-log) for the logs

    Even after a full recovery, this happens again on the next rollover as well (the rollover period is 1 hour).
    overlord logs:

    {id='index_kafka_argos-kafka_24ba0fc4bed083b_jjknmfki', startTime=2020-04-17T04:44:39.249Z, remainingSeconds=1373}, {id='index_kafka_argos-kafka_f01ceb0e53da389_iebbfnfj', startTime=2020-04-17T04:44:39.263Z, remainingSeconds=1373}, {id='index_kafka_argos-kafka_ba198aa83bccd99_pfkacocn', startTime=2020-04-17T04:44:39.848Z, remainingSeconds=1373}, {id='index_kafka_argos-kafka_86542a37b5a9175_gcbpghdb', startTime=2020-04-17T04:44:39.637Z, remainingSeconds=1373}, {id='index_kafka_argos-kafka_1fc0f5304049afd_gehbjmpf', startTime=2020-04-17T04:44:39.567Z, remainingSeconds=1373}], suspended=false, healthy=false, state=UNHEALTHY_SUPERVISOR, detailedState=LOST_CONTACT_WITH_STREAM, recentErrors=[org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisorStateManager$SeekableStreamExceptionEvent@6b458840, org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisorStateManager$SeekableStreamExceptionEvent@30fb4880, org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisorStateManager$SeekableStreamExceptionEvent@2e526108]}}
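For reference, this is roughly how we do the manual reset mentioned above: a minimal sketch against the overlord's supervisor API, where the overlord address and supervisor id are placeholders for our setup. Note that a hard reset clears the offsets stored for the supervisor.

import requests

OVERLORD = "http://localhost:8090"      # placeholder: our overlord address
SUPERVISOR_ID = "argos-kafka"           # placeholder: supervisor/datasource id

# Check the supervisor's detailed state via the overlord's supervisor API.
status = requests.get(
    f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/status").json()
detailed_state = status["payload"]["detailedState"]
print("detailedState:", detailed_state)

# Hard-reset only when the supervisor is actually stuck. Because the reset
# clears the stored Kafka offsets, tasks may re-read or skip data on restart.
if detailed_state == "LOST_CONTACT_WITH_STREAM":
    resp = requests.post(
        f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/reset")
    resp.raise_for_status()
    print("reset submitted:", resp.text)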
    

Below are my cluster configs:
Common config (shared among all the components):

druid.host=${AWS_LOCAL_IPV4}

druid.discovery.curator.path=${ZK_DISCOVERY}

druid.zk.service.host=${ZK_HOST}

druid.zk.service.sessionTimeoutMs=180000

druid.zk.paths.base=${ZK_PATH}

druid.startup.logging.logProperties=true

druid.enablePlaintextPort=true

druid.emitter.statsd.dimensionMapPath=conf/druid/_common/metrics.json

druid.indexing.doubleStorage=double

druid.server.hiddenProperties=["druid.s3.accessKey","druid.s3.secretKey","druid.metadata.storage.connector.password"]

overlord config:

druid.service=druid/overlord

druid.indexer.runner.type=remote

druid.indexer.storage.type=metadata

druid.indexer.queue.startDelay=PT5S

druid.indexer.logs.kill.enabled=true

druid.indexer.logs.kill.durationToRetain=259200000
coordinator config:

druid.service=druid/coordinator

druid.coordinator.startDelay=PT10S

druid.coordinator.period=PT5S

druid.coordinator.balancer.strategy=diskNormalized

druid.coordinator.kill.on=true

druid.coordinator.kill.period=P1D

druid.coordinator.kill.durationToRetain=P1D

druid.coordinator.kill.maxSegments=10

middleManager config:

druid.service=druid/middlemanager

druid.cache.type=caffeine

druid.cache.expireAfter=3600000

druid.cache.sizeInBytes=805306368

druid.realtime.cache.useCache=true

druid.realtime.cache.populateCache=true

druid.worker.capacity=2

druid.worker.ip=10.152.9.104

druid.indexer.runner.javaOpts=-server -Xmx7680m -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:MaxDirectMemorySize=7680m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/mnt/dump/peon-gc-%t.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=3 -XX:GCLogFileSize=10M

druid.indexer.task.baseTaskDir=/mnt/persistent/task/

druid.indexer.fork.property.druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]

druid.indexer.fork.property.druid.server.http.numThreads=2

druid.indexer.fork.property.druid.processing.numThreads=7

druid.indexer.fork.property.druid.processing.numMergeBuffers=2

druid.indexer.fork.property.druid.processing.buffer.sizeBytes=536870912

druid.indexer.fork.property.druid.segmentCache.locations=[{"path": "/mnt/druid/segments", "maxSize": 0}]

druid.indexer.fork.property.druid.storage.type=s3

druid.indexer.fork.property.druid.storage.bucket=argos-druid-ecomprod

druid.indexer.fork.property.druid.storage.baseKey=druid/ecomprod/master/storage

druid.indexer.fork.property.druid.storage.archiveBucket=argos-druid-ecomprod

druid.indexer.fork.property.druid.storage.archiveBaseKey=druid/ecomprod/master/archive

historical config:

druid.service=druid/historical

druid.cache.type=caffeine

druid.cache.expireAfter=10800000

druid.cache.sizeInBytes=4637851648

druid.historical.cache.useCache=true

druid.historical.cache.populateCache=true

druid.historical.cache.unCacheable=

druid.processing.numThreads=15

druid.processing.numMergeBuffers=4

druid.processing.buffer.sizeBytes=1073741824

druid.server.maxSize=4164084039680

druid.server.http.queueSize=288

druid.server.http.numThreads=96

druid.server.http.defaultQueryTimeout=60000

druid.server.http.maxQueryTimeout=120000

druid.segmentCache.locations=[{"path":"/mnt/druid/segments","maxSize":2082042019840},{"path":"/mnt2/druid/segments","maxSize":2082042019840}]

broker config:

druid.service=druid/broker

druid.cache.type=caffeine

druid.cache.expireAfter=10800000

druid.cache.sizeInBytes=2251292672

druid.broker.cache.useCache=false

druid.broker.cache.populateCache=false

druid.broker.cache.unCacheable=

druid.broker.http.numConnections=16

druid.broker.http.readTimeout=PT5M

druid.processing.numThreads=7

druid.processing.numMergeBuffers=2

druid.processing.buffer.sizeBytes=1073741824

druid.server.http.queueSize=144

druid.server.http.numThreads=48

druid.server.http.defaultQueryTimeout=30000

druid.server.http.maxQueryTimeout=60000

druid.sql.enable=true

druid.sql.avatica.enable=true

Attachment: failed-task-log (636 KB)

Hello all, I’m ingesting data from a Kafka topic that is partitioned on customer id across 150 partitions. Because of that I see many segments on the Druid side even though my segment granularity is 1 hour. Is it possible to repartition (and merge the partitions) at compaction time? Something like the sketch below is what I have in mind.
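(A sketch only, assuming the compact task type and the standard task-submission endpoint; the overlord address, datasource, and tuning values are placeholders, and the exact tuningConfig fields may differ by Druid version.)

import requests

OVERLORD = "http://localhost:8090"   # placeholder: overlord address

# One compaction task per hour interval; the idea is to rewrite the many
# small segments produced by the 150 Kafka partitions into fewer, larger ones.
compaction_task = {
    "type": "compact",
    "dataSource": "my_datasource",                          # placeholder
    "interval": "2020-04-17T00:00:00/2020-04-17T01:00:00",  # hour to merge
    "tuningConfig": {
        "type": "index",             # compaction runs as a native index task
        "maxRowsPerSegment": 5000000
    },
}

resp = requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=compaction_task)
resp.raise_for_status()
print("submitted:", resp.json())  # the overlord returns the new task id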

Hi Murat,

I have responded in the Druid slack channel. We can communicate further over there.

Thanks,

Hemanth

What’s the link for the Druid Slack channel? How can I be added there?

Thanks

Hey Aditya,

Please check https://druid.apache.org/community/

I am facing this issue as well. Can someone please help me with a solution?