restoreTasksOnRestart doesn't work

We're using Druid 0.9.0 with restoreTasksOnRestart turned on, but after we restart the middle manager the tasks show as FAILED on the coordinator and they aren't restored.

We use systemctl to manage the service.
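For reference, a minimal sketch of the kind of unit file we use (the user, paths, and classpath here are illustrative assumptions, not our exact setup):

```
# /etc/systemd/system/druid-middlemanager.service -- a sketch, not our real unit
[Unit]
Description=Druid MiddleManager
After=network.target

[Service]
User=druid
WorkingDirectory=/opt/druid
# Classpath and main class follow the stock Druid 0.9.0 layout; adjust to taste.
ExecStart=/usr/bin/java -cp conf/druid/middleManager:lib/* io.druid.cli.Main server middleManager
# systemd stops services with SIGTERM by default, which is what gives the
# middle manager a chance to shut tasks down gracefully on `systemctl restart`.
Restart=on-failure

[Install]
WantedBy=multi-user.target
```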

```
druid.host=
druid.port=8080
druid.service=druid/middlemanager

# Task Logging
druid.indexer.logs.type=file

# MiddleManager Service
druid.indexer.runner.allowedPrefixes=["com.metamx","druid","io.druid","user.timezone","file.encoding"]
druid.indexer.runner.compressZnodes=true
druid.indexer.runner.javaCommand=java
druid.indexer.runner.javaOpts=-server -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
druid.indexer.runner.maxZnodeBytes=524288
druid.indexer.runner.startPort=8100
druid.worker.ip=localhost
druid.worker.version=0

# Peon Configs
druid.indexer.fork.property.druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]
druid.indexer.fork.property.druid.segmentCache.locations=[{"path": "/mnt/persistent/zk_druid", "maxSize": 0}]
druid.indexer.fork.property.druid.processing.numThreads=7
druid.indexer.fork.property.druid.server.http.numThreads=50
druid.indexer.fork.property.druid.storage.archiveBaseKey=ci-druid-archive
druid.indexer.fork.property.druid.storage.archiveBucket=cn-dev
druid.indexer.fork.property.druid.storage.baseKey=ci/druid
druid.indexer.fork.property.druid.storage.bucket=cn-dev
druid.indexer.fork.property.druid.storage.type=s3
druid.indexer.fork.property.druid.indexer.task.restoreTasksOnRestart=true
druid.peon.mode=remote
druid.indexer.task.baseDir=/tmp
druid.indexer.task.baseTaskDir=/tmp/persistent/tasks
druid.indexer.task.hadoopWorkingPath=/tmp/druid-indexing
druid.indexer.task.defaultRowFlushBoundary=50000
druid.indexer.task.defaultHadoopCoordinates=["org.apache.hadoop:hadoop-client:2.3.0"]
druid.indexer.task.chathandler.type=announce

# Remote Peon Configs
druid.peon.taskActionClient.retry.minWait=PT1M
druid.peon.taskActionClient.retry.maxWait=PT10M
druid.peon.taskActionClient.retry.maxRetryCount=10
```

The task is a realtime task; it's from Tranquility.

Try setting druid.indexer.task.restoreTasksOnRestart=true instead of druid.indexer.fork.property.druid.indexer.task.restoreTasksOnRestart=true.
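That is, set it directly in the middle manager's runtime.properties rather than behind the fork prefix:

```
druid.indexer.task.restoreTasksOnRestart=true
```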

However, looking at the logs, they show that the task was gracefully shut down:

```
2016-07-07T14:40:30,502 INFO [sparrow-firehose-web-incremental-persist] io.druid.segment.ReferenceCountingSegment - Closing sparrow-firehose-web_2016-07-07T14:00:00.000Z_2016-07-07T15:00:00.000Z_2016-07-07T14:00:00.367Z
2016-07-07T14:40:30,502 INFO [sparrow-firehose-web-incremental-persist] io.druid.segment.ReferenceCountingSegment - Closing sparrow-firehose-web_2016-07-07T14:00:00.000Z_2016-07-07T15:00:00.000Z_2016-07-07T14:00:00.367Z, numReferences: 0
2016-07-07T14:40:30,502 INFO [task-runner-0-priority-0] io.druid.indexing.common.task.RealtimeIndexTask - Gracefully stopping.
2016-07-07T14:40:30,502 INFO [task-runner-0-priority-0] io.druid.indexing.common.task.RealtimeIndexTask - Job done!
2016-07-07T14:40:30,503 INFO [Thread-54] io.druid.indexing.overlord.ThreadPoolTaskRunner - Graceful shutdown of task[index_realtime_sparrow-firehose-web_2016-07-07T14:00:00.000Z_0_1] finished in 818ms with status[SUCCESS].
2016-07-07T14:40:30,506 INFO [Thread-54] LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-07-07T14:40:30.504Z","service":"druid/middlemanager","host":"10.91.39.204:8100","metric":"task/interrupt/count","value":1,"dataSource":"sparrow-firehose-web","error":"false","graceful":"true","task":"index_realtime_sparrow-firehose-web_2016-07-07T14:00:00.000Z_0_1"}]
2016-07-07T14:40:30,506 INFO [Thread-54] LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-07-07T14:40:30.506Z","service":"druid/middlemanager","host":"10.91.39.204:8100","metric":"task/interrupt/elapsed","value":819,"dataSource":"sparrow-firehose-web","error":"false","graceful":"true","task":"index_realtime_sparrow-firehose-web_2016-07-07T14:00:00.000Z_0_1"}]
2016-07-07T14:40:30,506 INFO [Thread-54] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[io.druid.curator.discovery.ServerDiscoverySelector@4c577186].
2016-07-07T14:40:30,508 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_realtime_sparrow-firehose-web_2016-07-07T14:00:00.000Z_0_1",
  "status" : "SUCCESS",
  "duration" : 3024414
}
```

However, when I look at the coordinator, it shows that the task FAILED.

Should I ignore what the coordinator says?

Hey Noppanit,

Did you set druid.indexer.task.restoreTasksOnRestart=true? Without that, the task executor will stop gracefully, but the middleManager watching it won’t be expecting that and will mark it failed anyway.
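Roughly, the distinction looks like this (a sketch; the second line is just one of your existing fork properties, shown for contrast):

```
# Read by the middle manager process itself; this is what tells it to
# expect graceful task shutdowns on restart and to restore those tasks:
druid.indexer.task.restoreTasksOnRestart=true

# druid.indexer.fork.property.* entries are only forwarded as system
# properties to the forked peon JVMs; the middle manager itself doesn't
# act on them:
druid.indexer.fork.property.druid.processing.numThreads=7
```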

Hi Gian,

Yes, I set that on the middle manager:

druid.indexer.fork.property.druid.indexer.task.restoreTasksOnRestart=true

druid.indexer.task.restoreTasksOnRestart, not druid.indexer.fork.property.druid.indexer.task.restoreTasksOnRestart.


Gian


Thanks to both Gian and David. I was a bit confused by the documentation. It works now after setting druid.indexer.task.restoreTasksOnRestart.