We are using Druid 0.16.1-incubating.
Every time the tasks of a data source roll over, the supervisor goes unhealthy and its detailed state changes to LOST_CONTACT_WITH_STREAM.
- Sometimes it recovers by itself by resetting all the Kafka partitions (we have 100 partitions); this takes anywhere from 10 to 30 minutes.
- Sometimes we have to manually reset the data source to get the supervisor to recover (see the sketch after this list).
- Meanwhile, all the publishing tasks start failing. These failing tasks sometimes show success in their logs but are marked as failed in the status field of the Overlord UI. Please check the attached file (failed-task-log) for the logs. Even after a full recovery, the same thing happens again on the next rollover (the rollover period is 1 hour).
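For context, this is roughly how we check the supervisor and perform the manual reset, using the Overlord's supervisor API (GET /druid/indexer/v1/supervisor/<id>/status and POST /druid/indexer/v1/supervisor/<id>/reset). This is only a minimal Python sketch; the Overlord host and the supervisor id are placeholders, and a reset discards the stored Kafka offsets for the data source, so we treat it as a last resort.

import json
import urllib.request

# Placeholders; our real Overlord host/port and supervisor id differ.
OVERLORD = "http://overlord-host:8090"
SUPERVISOR_ID = "argos-kafka"

def supervisor_status():
    # GET the supervisor status; payload.detailedState shows LOST_CONTACT_WITH_STREAM when the issue hits.
    url = f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/status"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def reset_supervisor():
    # POST an empty body to reset the supervisor (drops the stored Kafka offsets, so last resort only).
    url = f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/reset"
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

status = supervisor_status()
print(status["payload"]["state"], status["payload"]["detailedState"])
if status["payload"]["detailedState"] == "LOST_CONTACT_WITH_STREAM":
    print(reset_supervisor())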
overlord logs:
{id='index_kafka_argos-kafka_24ba0fc4bed083b_jjknmfki', startTime=2020-04-17T04:44:39.249Z, remainingSeconds=1373}, {id='index_kafka_argos-kafka_f01ceb0e53da389_iebbfnfj', startTime=2020-04-17T04:44:39.263Z, remainingSeconds=1373}, {id='index_kafka_argos-kafka_ba198aa83bccd99_pfkacocn', startTime=2020-04-17T04:44:39.848Z, remainingSeconds=1373}, {id='index_kafka_argos-kafka_86542a37b5a9175_gcbpghdb', startTime=2020-04-17T04:44:39.637Z, remainingSeconds=1373}, {id='index_kafka_argos-kafka_1fc0f5304049afd_gehbjmpf', startTime=2020-04-17T04:44:39.567Z, remainingSeconds=1373}], suspended=false, healthy=false, state=UNHEALTHY_SUPERVISOR, detailedState=LOST_CONTACT_WITH_STREAM, recentErrors=[org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisorStateManager$SeekableStreamExceptionEvent@6b458840, org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisorStateManager$SeekableStreamExceptionEvent@30fb4880, org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisorStateManager$SeekableStreamExceptionEvent@2e526108]}}
Below are my cluster configs:
Common config (shared among all the components):
druid.host=${AWS_LOCAL_IPV4}
druid.discovery.curator.path=${ZK_DISCOVERY}
druid.zk.service.host=${ZK_HOST}
druid.zk.service.sessionTimeoutMs=180000
druid.zk.paths.base=${ZK_PATH}
druid.startup.logging.logProperties=true
druid.enablePlaintextPort=true
druid.emitter.statsd.dimensionMapPath=conf/druid/_common/metrics.json
druid.indexing.doubleStorage=double
druid.server.hiddenProperties=["druid.s3.accessKey","druid.s3.secretKey","druid.metadata.storage.connector.password"]
overlord config:
druid.service=druid/overlord
druid.indexer.runner.type=remote
druid.indexer.storage.type=metadata
druid.indexer.queue.startDelay=PT5S
druid.indexer.logs.kill.enabled=true
druid.indexer.logs.kill.durationToRetain=259200000
coordinator config:
druid.service=druid/coordinator
druid.coordinator.startDelay=PT10S
druid.coordinator.period=PT5S
druid.coordinator.balancer.strategy=diskNormalized
druid.coordinator.kill.on=true
druid.coordinator.kill.period=P1D
druid.coordinator.kill.durationToRetain=P1D
druid.coordinator.kill.maxSegments=10
middleManager config:
druid.service=druid/middlemanager
druid.cache.type=caffeine
druid.cache.expireAfter=3600000
druid.cache.sizeInBytes=805306368
druid.realtime.cache.useCache=true
druid.realtime.cache.populateCache=true
druid.worker.capacity=2
druid.worker.ip=10.152.9.104
druid.indexer.runner.javaOpts=-server -Xmx7680m -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:MaxDirectMemorySize=7680m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/mnt/dump/peon-gc-%t.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=3 -XX:GCLogFileSize=10M
druid.indexer.task.baseTaskDir=/mnt/persistent/task/
druid.indexer.fork.property.druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]
druid.indexer.fork.property.druid.server.http.numThreads=2
druid.indexer.fork.property.druid.processing.numThreads=7
druid.indexer.fork.property.druid.processing.numMergeBuffers=2
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=536870912
druid.indexer.fork.property.druid.segmentCache.locations=[{"path": "/mnt/druid/segments", "maxSize": 0}]
druid.indexer.fork.property.druid.storage.type=s3
druid.indexer.fork.property.druid.storage.bucket=argos-druid-ecomprod
druid.indexer.fork.property.druid.storage.baseKey=druid/ecomprod/master/storage
druid.indexer.fork.property.druid.storage.archiveBucket=argos-druid-ecomprod
druid.indexer.fork.property.druid.storage.archiveBaseKey=druid/ecomprod/master/archive
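As a sanity check on the peon sizing above, here is the rough arithmetic we use (a sketch only; it ignores the middleManager JVM's own footprint), based on Druid's documented direct-memory requirement of (numThreads + numMergeBuffers + 1) * buffer.sizeBytes:

# Rough peon memory arithmetic for the settings above.
GiB = 1024 ** 3
MiB = 1024 ** 2

worker_capacity = 2              # druid.worker.capacity
heap_per_peon = 7680 * MiB       # -Xmx7680m in druid.indexer.runner.javaOpts
direct_cap_per_peon = 7680 * MiB # -XX:MaxDirectMemorySize=7680m

num_threads = 7                  # fork.property ... processing.numThreads
num_merge_buffers = 2            # fork.property ... processing.numMergeBuffers
buffer_size = 536870912          # fork.property ... processing.buffer.sizeBytes (512 MiB)

# Documented minimum direct memory per process:
# (numThreads + numMergeBuffers + 1) * buffer.sizeBytes
direct_needed = (num_threads + num_merge_buffers + 1) * buffer_size
print(f"direct memory needed per peon: {direct_needed / GiB:.1f} GiB "
      f"(cap {direct_cap_per_peon / GiB:.1f} GiB)")        # 5.0 GiB vs 7.5 GiB cap
print(f"worst case per middleManager: "
      f"{worker_capacity * (heap_per_peon + direct_cap_per_peon) / GiB:.1f} GiB")  # 30.0 GiB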
historical config:
druid.service=druid/historical
druid.cache.type=caffeine
druid.cache.expireAfter=10800000
druid.cache.sizeInBytes=4637851648
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
druid.historical.cache.unCacheable=
druid.processing.numThreads=15
druid.processing.numMergeBuffers=4
druid.processing.buffer.sizeBytes=1073741824
druid.server.maxSize=4164084039680
druid.server.http.queueSize=288
druid.server.http.numThreads=96
druid.server.http.defaultQueryTimeout=60000
druid.server.http.maxQueryTimeout=120000
druid.segmentCache.locations=[{"path":"/mnt/druid/segments","maxSize":2082042019840},{"path":"/mnt2/druid/segments","maxSize":2082042019840}]
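Applying the same direct-memory rule to the historical settings above (again just a sketch; the historical jvm.config is not shown here, so whether MaxDirectMemorySize actually covers this is an assumption to verify):

# Minimum direct memory implied by the historical processing settings above.
GiB = 1024 ** 3
num_threads = 15          # druid.processing.numThreads
num_merge_buffers = 4     # druid.processing.numMergeBuffers
buffer_size = 1073741824  # druid.processing.buffer.sizeBytes (1 GiB)

direct_needed = (num_threads + num_merge_buffers + 1) * buffer_size
print(f"historical needs at least {direct_needed / GiB:.0f} GiB of direct memory")  # 20 GiB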
broker config:
druid.service=druid/broker
druid.cache.type=caffeine
druid.cache.expireAfter=10800000
druid.cache.sizeInBytes=2251292672
druid.broker.cache.useCache=false
druid.broker.cache.populateCache=false
druid.broker.cache.unCacheable=
druid.broker.http.numConnections=16
druid.broker.http.readTimeout=PT5M
druid.processing.numThreads=7
druid.processing.numMergeBuffers=2
druid.processing.buffer.sizeBytes=1073741824
druid.server.http.queueSize=144
druid.server.http.numThreads=48
druid.server.http.defaultQueryTimeout=30000
druid.server.http.maxQueryTimeout=60000
druid.sql.enable=true
druid.sql.avatica.enable=true
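Since SQL is enabled on the broker, this is roughly how we list the recent failed publishing tasks instead of clicking through the Overlord UI, by querying the sys.tasks system table through the broker's /druid/v2/sql endpoint (a sketch; the broker host/port is a placeholder and the 'argos-kafka' data source name is inferred from the task ids above):

import json
import urllib.request

BROKER = "http://broker-host:8082"  # placeholder

SQL = """
SELECT task_id, created_time, status, error_msg
FROM sys.tasks
WHERE datasource = 'argos-kafka' AND status = 'FAILED'
ORDER BY created_time DESC
LIMIT 20
"""

req = urllib.request.Request(
    f"{BROKER}/druid/v2/sql",
    data=json.dumps({"query": SQL}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for row in json.load(resp):
        print(row["task_id"], row["created_time"], row["error_msg"])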
failed-task-log (636 KB)