Druid 0.8.3: unable to get restorable indexing tasks to work

Hi folks,

I'm experimenting with the restorable indexing tasks feature introduced in 0.8.3, but haven't been able to make it work so far.

In middleManager runtime.properties, I have:
druid.indexer.task.restoreTasksOnRestart=true

I kill the middleManager with:
kill $(ps aux | grep 'middleManager' | awk '{print $2}')

Then restart it:
nohup java $ARGS -classpath $CLASSPATH io.druid.cli.Main server $ROLE > $DRUID_LOG_PATH 2>&1&
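In full, the stop/start sequence is roughly the following (a sketch: I wait for the old JVM to exit before starting the new one, and the pgrep pattern is simplified and assumes a single middleManager JVM per host):

# stop the middleManager and wait for the old JVM to exit completely
MM_PID=$(pgrep -f 'io.druid.cli.Main server middleManager')
kill $MM_PID
while kill -0 $MM_PID 2>/dev/null; do sleep 1; done
# only then start the new one
nohup java $ARGS -classpath $CLASSPATH io.druid.cli.Main server $ROLE > $DRUID_LOG_PATH 2>&1 &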

In the log after the middleManager was restarted, I can see the JVM argument -Ddruid.indexer.task.restoreTasksOnRestart=true being passed, but all the indexing tasks failed and were not restarted. I saw some errors while the middleManager was being shut down, but I'm not sure whether they are related at all.

Looked at: https://groups.google.com/forum/#!searchin/druid-user/druid.indexer.task.restoreTasksOnRestart/druid-user/eKKMLWvNwAc/-Pn2F7WDCAAJ

"Middle Managers can be updated one at a time in a rolling fashion when you set druid.indexer.task.restoreTasksOnRestart=true (it's off by default)" — what does "updated one at a time" mean? I kill all my middleManager nodes at around the same time; would that be an issue?

Any help would be appreciated; this is one of the key features we need in order to move Druid to production.

Thanks
Shuai

2016-03-03T19:56:37,602 INFO [forking-task-runner-4] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[143] for task: index_realtime_task_2016-03-03T19:00:00.000Z_2_0
2016-03-03T19:56:37,605 INFO [forking-task-runner-4] io.druid.indexing.overlord.ForkingTaskRunner - Exception caught during execution
java.io.FileNotFoundException: /local/druid/indexer/index_realtime_task_2016-03-03T19:00:00.000Z_2_0/log (No such file or directory)
at java.io.FileInputStream.open0(Native Method) ~[?:1.8.0_72]
at java.io.FileInputStream.open(FileInputStream.java:195) ~[?:1.8.0_72]
at java.io.FileInputStream.<init>(FileInputStream.java:138) ~[?:1.8.0_72]
at com.google.common.io.Files$FileByteSource.openStream(Files.java:126) ~[guava-16.0.1.jar:?]
at com.google.common.io.Files$FileByteSource.openStream(Files.java:116) ~[guava-16.0.1.jar:?]
at com.google.common.io.ByteSource.copyTo(ByteSource.java:253) ~[guava-16.0.1.jar:?]
at com.google.common.io.Files.copy(Files.java:442) ~[guava-16.0.1.jar:?]
at io.druid.indexing.common.tasklogs.FileTaskLogs.pushTaskLog(FileTaskLogs.java:51) ~[druid-indexing-service-0.8.3-rc3.jar:0.8.3-rc3]
at io.druid.indexing.overlord.ForkingTaskRunner$1.call(ForkingTaskRunner.java:370) [druid-indexing-service-0.8.3-rc3.jar:0.8.3-rc3]
at io.druid.indexing.overlord.ForkingTaskRunner$1.call(ForkingTaskRunner.java:175) [druid-indexing-service-0.8.3-rc3.jar:0.8.3-rc3]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_72]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_72]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_72]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_72]
2016-03-03T19:56:41,727 INFO [forking-task-runner-10] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[143] for task: index_realtime_task_2016-03-03T19:00:00.000Z_2_0
2016-03-03T19:56:41,730 INFO [forking-task-runner-10] io.druid.indexing.overlord.ForkingTaskRunner - Exception caught during execution
java.io.FileNotFoundException: /local/druid/indexer/index_realtime_task_2016-03-03T19:00:00.000Z_2_0/log (No such file or directory)
at java.io.FileInputStream.open0(Native Method) ~[?:1.8.0_72]
at java.io.FileInputStream.open(FileInputStream.java:195) ~[?:1.8.0_72]
at java.io.FileInputStream.<init>(FileInputStream.java:138) ~[?:1.8.0_72]
at com.google.common.io.Files$FileByteSource.openStream(Files.java:126) ~[guava-16.0.1.jar:?]
at com.google.common.io.Files$FileByteSource.openStream(Files.java:116) ~[guava-16.0.1.jar:?]
at com.google.common.io.ByteSource.copyTo(ByteSource.java:253) ~[guava-16.0.1.jar:?]
at com.google.common.io.Files.copy(Files.java:442) ~[guava-16.0.1.jar:?]
at io.druid.indexing.common.tasklogs.FileTaskLogs.pushTaskLog(FileTaskLogs.java:51) ~[druid-indexing-service-0.8.3-rc3.jar:0.8.3-rc3]
at io.druid.indexing.overlord.ForkingTaskRunner$1.call(ForkingTaskRunner.java:370) [druid-indexing-service-0.8.3-rc3.jar:0.8.3-rc3]
at io.druid.indexing.overlord.ForkingTaskRunner$1.call(ForkingTaskRunner.java:175) [druid-indexing-service-0.8.3-rc3.jar:0.8.3-rc3]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_72]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_72]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_72]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_72]

Adding a little more insight:

Looks like the tasks themselves picked up the restart=true config, so they shut themselves down gracefully (https://github.com/gianm/druid/blob/bad270b6c4a751df6941e02c817bbb44f44eb4c0/indexing-service/src/main/java/io/druid/indexing/overlord/ThreadPoolTaskRunner.java#L116); see the log below.
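A quick way to confirm this across all tasks, assuming each peon writes its log under its task directory inside the baseTaskDir (as the FileNotFoundException above suggests), is something like:

# list the task dirs whose peon logged a graceful stop
grep -l 'Gracefully stopping' {baseTaskDir}/*/log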

Task Log:

2016-03-04T03:13:45,921 INFO [task-runner-0] io.druid.indexing.common.task.RealtimeIndexTask - Gracefully stopping.
2016-03-04T03:13:45,921 INFO [task-runner-0] io.druid.indexing.common.task.RealtimeIndexTask - Job done!
2016-03-04T03:13:45,923 INFO [Thread-32] io.druid.indexing.overlord.ThreadPoolTaskRunner - Graceful shutdown of task[index_realtime_task_2016-03-04T03:00:00.000Z_1_0] finished in 1,192ms with status[SUCCESS].
2016-03-04T03:13:45,923 INFO [Thread-32] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[io.druid.curator.discovery.ServerDiscoverySelector@19bedeb8].
2016-03-04T03:13:45,928 INFO [Thread-32] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.announcement.Announcer.stop()] on object[io.druid.curator.announcement.Announcer@32f5ecc4].
2016-03-04T03:13:45,928 INFO [task-runner-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
"id" : "index_realtime_task_2016-03-04T03:00:00.000Z_1_0",
"status" : "SUCCESS",
"duration" : 750153
}
2016-03-04T03:13:45,928 INFO [Thread-32] io.druid.curator.announcement.Announcer - unannouncing [/druid/segments/{hostname}:8135/{hostname}:8135_realtime__default_tier_2016-03-04T03:01:16.877Z_97fe063fd77a4aa88aa1f93b37f995d50]

However, the Overlord claims the worker wrote a FAILED status and therefore asks the middleManager to clean up the task. That is probably what caused the (No such file or directory) errors mentioned in my last post: the peon (realtime task) had already cleaned up its task files, but the Overlord asked the middleManager to do it again.

Can someone help out here? What could be wrong, and why does the Overlord think the task failed even though the peon marked the task status as SUCCESS after the graceful shutdown?

Hey Shuai,

From those logs it almost seems like you have two middleManager processes running at the same time on the same machine, and they are clobbering each other. Could you double-check that your old middleManager processes are completely exited before you are starting new ones?

Thanks Gian. Unfortunately I did not have much luck. What I did:
kill pid(middleManager)
ps aux | grep middle   (wait until all realtime nodes disappeared, including the middleManager itself)
restart the middleManager
Please see the highlighted text below. I feel like once the middleManager comes back, it writes a FAILED status for all of its old tasks. The peon wrote an exit status of [2] but also said it finished successfully, which does not seem related.
Killing MiddleManager (Overlord log)
2016-03-04T05:06:58,269 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Kaboom! Worker[{hostname}:8084] removed!
2016-03-04T05:06:58,269 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - [{hostname}:8084]: Found [index_realtime_task_2016-03-04T05:00:00.000Z_0_1] running
During MiddleManager restart (MiddleManager log)
2016-03-04T05:06:55,673 INFO [forking-task-runner-0] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[2] for task: index_realtime_task_2016-03-04T05:00:00.000Z_0_1
2016-03-04T05:06:55,677 INFO [forking-task-runner-0] io.druid.indexing.common.tasklogs.FileTaskLogs - Wrote task log to: log/index_realtime_task_2016-03-04T05:00:00.000Z_0_1.log
2016-03-04T05:06:55,442 INFO [task-runner-0] io.druid.indexing.common.task.RealtimeIndexTask - Gracefully stopping.
2016-03-04T05:06:55,442 INFO [task-runner-0] io.druid.indexing.common.task.RealtimeIndexTask - Job done!
2016-03-04T05:06:55,444 INFO [Thread-32] io.druid.indexing.overlord.ThreadPoolTaskRunner - Graceful shutdown of task[index_realtime_task_2016-03-04T05:00:00.000Z_0_1] finished in 1,512ms with status[SUCCESS].
2016-03-04T05:06:55,444 INFO [Thread-32] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[io.druid.curator.discovery.ServerDiscoverySelector@19bedeb8].
2016-03-04T05:06:55,448 INFO [Thread-32] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.announcement.Announcer.stop()] on object[io.druid.curator.announcement.Announcer@32f5ecc4].
2016-03-04T05:06:55,448 INFO [Thread-32] io.druid.curator.announcement.Announcer - unannouncing [/druid/segments/{hostname}:8100/{hostname}:8100_realtime__default_tier_2016-03-04T05:00:39.156Z_9d94750541bb44d59c289e0b588608d10]
2016-03-04T05:06:55,449 INFO [task-runner-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
"id" : "index_realtime_task_2016-03-04T05:00:00.000Z_0_1",
"status" : "SUCCESS",
"duration" : 377356
}
2016-03-04T05:06:55,449 INFO [Thread-32] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[io.druid.curator.discovery.ServerDiscoverySelector@5b5b59].
2016-03-04T05:06:55,451 INFO [Thread-32] io.druid.curator.CuratorModule - Stopping Curator
2016-03-04T05:06:55,454 INFO [Thread-32] org.apache.zookeeper.ZooKeeper - Session: 0x1533fc1f3d4005b closed
2016-03-04T05:06:55,454 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down
2016-03-04T05:06:55,454 INFO [Thread-32] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void com.metamx.http.client.NettyHttpClient.stop()] on object[com.metamx.http.client.NettyHttpClient@22ad1bae].
2016-03-04T05:06:55,468 INFO [Thread-32] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void com.metamx.metrics.MonitorScheduler.stop()] on object[com.metamx.metrics.MonitorScheduler@58a2d9f9].
2016-03-04T05:06:55,468 INFO [Thread-32] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void com.metamx.emitter.service.ServiceEmitter.close() throws java.io.IOException] on object[com.metamx.emitter.service.ServiceEmitter@32456db0].
2016-03-04T05:06:55,468 INFO [Thread-32] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner.stop()] on object[io.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner@4e224df5].
After the MiddleManager restart (Overlord log)
2016-03-04T05:07:48,181 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[{hostname}:8084] reportin' for duty!
2016-03-04T05:07:48,181 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Cancelling Worker[{home}:8084] scheduled task cleanup
2016-03-04T05:07:48,185 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[{hostname}:8084] wrote RUNNING status for task: index_realtime_task_2016-03-04T05:00:00.000Z_0_1
2016-03-04T05:07:48,570 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[{hostname}:8084] wrote FAILED status for task: index_realtime_task_2016-03-04T05:00:00.000Z_0_1
2016-03-04T05:07:48,570 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[{hostname}:8084] completed task[index_realtime_task_2016-03-04T05:00:00.000Z_0_1] with status[FAILED]
2016-03-04T05:07:48,570 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.TaskQueue - Received FAILED status for task: index_realtime_task_2016-03-04T05:00:00.000Z_0_1
2016-03-04T05:07:48,570 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_realtime_task_2016-03-04T05:00:00.000Z_0_1] on worker[{hostname}:8084]
2016-03-04T05:07:48,571 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.TaskLockbox - Removing task[index_realtime_task_2016-03-04T05:00:00.000Z_0_1] from activeTasks
2016-03-04T05:07:48,571 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.TaskLockbox - Removing task[index_realtime_task_2016-03-04T05:00:00.000Z_0_1] from TaskLock[index_realtime_task]
2016-03-04T05:07:48,573 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.MetadataTaskStorage - Deleting TaskLock with id[4839]: TaskLock{groupId=index_realtime_task, dataSource=task, interval=2016-03-04T05:00:00.000Z/2016-03-04T06:00:00.000Z, version=2016-03-04T05:00:38.240Z}
2016-03-04T05:07:48,587 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.MetadataTaskStorage - Updating task index_realtime_task_2016-03-04T05:00:00.000Z_0_1 to status: TaskStatus{id=index_realtime_task_2016-03-04T05:00:00.000Z_0_1, status=FAILED, duration=0}
2016-03-04T05:07:48,601 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.TaskQueue - Task done: RealtimeIndexTask{id=index_realtime_task_2016-03-04T05:00:00.000Z_0_1, type=index_realtime, dataSource=task}
2016-03-04T05:07:48,601 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.TaskQueue - Task FAILED: RealtimeIndexTask{id=index_realtime_task_2016-03-04T05:00:00.000Z_0_1, type=index_realtime, dataSource=task} (0 run duration)
2016-03-04T05:07:48,601 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Task[index_realtime_task_2016-03-04T05:00:00.000Z_0_1] went bye bye.

Another attempt to gather better insight; please check the attached log file for details. Summary:

19:07
MM being shut down
>
Overlord found the worker removed, but all the indexing tasks remain (Found [index_realtime_task_2_60_2016-03-04T18:00:00.000Z_1_0] running)
>
Indexing task (realtime node) shut down (Task completed with status: SUCCESS)

19:11
MM restarts (log looks totally normal, same as a fresh start, no previous task info at all)
>
Overlord
- found the worker again (Worker[{middlemanager-hostname}:8084] reportin' for duty!)
- cancelled the worker's scheduled task cleanup (Cancelling Worker[{middlemanager-hostname}:8084] scheduled task cleanup)
- Worker[{middlemanager-hostname}:8084] wrote RUNNING status for task: index_realtime_task_1_2016-03-04T19:00:00.000Z_3_0
- Worker[{middlemanager-hostname}:8084] completed task[index_realtime_task_1_2016-03-04T19:00:00.000Z_2_0] with status[FAILED]

The newly restarted middleManager seems to simply write a FAILED status for all of its old tasks. While the middleManager was gone, those tasks remained in RUNNING, but after the restart they were written as FAILED; not sure what I've missed. Posting my configs to see if anything is off. I've tried putting druid.indexer.task.restoreTasksOnRestart=true in both the middleManager and Overlord configs, but neither worked.

MM:
druid.host={worker-hostname}

druid.port=8084
druid.service=druid/prod/middlemanager

# Resources for peons

#druid.indexer.runner.javaOpts=-server -Xmx25g -Xms1g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
druid.indexer.task.baseTaskDir=/local/var/druid/indexer
druid.indexer.task.chathandler.type=announce
druid.indexer.task.restoreTasksOnRestart=true

# Peon properties

druid.indexer.fork.property.druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=536870912
druid.indexer.fork.property.druid.processing.numThreads=1
druid.indexer.task.defaultRowFlushBoundary = 50000
druid.indexer.fork.property.druid.segmentCache.locations=[{"path": "/local/var/druid/segment-cache", "maxSize": 300000000000}]
druid.indexer.fork.property.druid.server.http.numThreads=30

druid.worker.capacity=12
druid.worker.ip={worker-hostname}

Overlord:

druid.host={overlord-hostname}

druid.port=8090
druid.service=druid/prod/overlord

# Remote workers (middleManager + peons)

druid.indexer.runner.type=remote

# Allows the Overlord to restart, or another Overlord to take leadership

druid.indexer.storage.type=metadata
druid.indexer.task.restoreTasksOnRestart=true

restartable_indexing_tasks.log (34 KB)

Discussed a bit with crallen@ over IRC; open questions listed below:

  1. Why did the MM write a FAILED status for the task in ZK?
  2. The MM shutdown log shows tasks exiting with status [2] even though the tasks themselves logged a graceful shutdown; this is a bit concerning.

I've tried upgrading to 0.9.0 just to try my luck, but it's still the same issue. The only difference is that there is no exception (Unable to register instance) in the task log when it is closed; however, the MM log still shows status[2] for the index task. So I guess the exception below, which I saw in the 0.8.3 task log, is not related?

io.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner.stop()] on object[io.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner@4e224df5]
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_72]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_72]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_72]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_72]
        at com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler.stop(Lifecycle.java:337) [java-util-0.27.4.jar:?]
        at com.metamx.common.lifecycle.Lifecycle.stop(Lifecycle.java:261) [java-util-0.27.4.jar:?]
        at io.druid.cli.CliPeon.run(CliPeon.java:248) [druid-services-0.8.3-rc3.jar:0.8.3-rc3]
        at io.druid.cli.Main.main(Main.java:99) [druid-services-0.8.3-rc3.jar:0.8.3-rc3]
Caused by: java.lang.IllegalStateException: Expected state [STARTED] found [STOPPED]
        at io.druid.common.config.Log4jShutdown.stop(Log4jShutdown.java:102) ~[druid-common-0.8.3-rc3.jar:0.8.3-rc3]
        at io.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner.stop(Log4jShutterDownerModule.java:109) ~[druid-server-0.8.3-rc3.jar:0.8.3-rc3]
        ... 8 more

Gian mentioned on IRC that the status[2] is OK, so the focus would be on why the MM wrote a FAILED status while starting up.

Just attached restorable_task_debug.log, which has more detailed states before the MM shutdown, after the MM shutdown, and after the MM restarts.

restorable_task_debug.log (44.9 KB)

        [user@{worker-hostname}]/% ls {root}/var/druid/indexer
        ls: cannot access {root}/var/druid/indexer: No such file or directory


That seems odd

I believe this is due to the fact that the MM tells the Overlord the task failed, and the Overlord then tells the MM to clean the tasks up.

io.druid.indexing.overlord.ForkingTaskRunner - Removing task directory


That entry appears quite a bit, and shouldn’t. It seems the MM is not picking up that it is cleanly shutting down?

Yup, I think so, too.

The MM somehow wrote a FAILED status for the task once it started reporting for duty to the Overlord; I guess that's what triggered the cleanup?

2016-03-07T20:55:58,277 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[{worker-hostname}:8084] reportin' for duty!
        2016-03-07T20:55:58,277 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Cancelling Worker[{worker-hostname}:8084] scheduled task cleanup
        2016-03-07T20:55:58,280 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[{worker-hostname}:8084] wrote RUNNING status for task: index_realtime_task_2016-03-07T20:00:00.000Z_1_0
        2016-03-07T20:55:58,280 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[{worker-hostname}:8084] wrote RUNNING status for task: index_realtime_task_2016-03-07T20:00:00.000Z_2_0
        2016-03-07T20:55:58,280 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[{worker-hostname}:8084] wrote RUNNING status for task: index_realtime_task_2016-03-07T20:00:00.000Z_0_0
        2016-03-07T20:55:58,280 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[{worker-hostname}:8084] wrote RUNNING status for task: index_realtime_task_2016-03-07T20:00:00.000Z_5_0
        2016-03-07T20:55:58,280 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[{worker-hostname}:8084] wrote RUNNING status for task: index_realtime_task_2016-03-07T20:00:00.000Z_3_0
        2016-03-07T20:55:58,280 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[{worker-hostname}:8084] wrote RUNNING status for task: index_realtime_task_2016-03-07T20:00:00.000Z_4_0
        2016-03-07T20:55:58,301 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[{worker-hostname}:8084] wrote FAILED status for task: index_realtime_task_2016-03-07T20:00:00.000Z_1_0

I tried setting -Djava.io.tmpdir and druid.indexer.task.baseTaskDir to the same directory, but it does not seem to help.
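For what it's worth, that attempt amounts to something along these lines (a sketch; the flag is passed on the middleManager command line and points at the baseTaskDir from my config above):

nohup java $ARGS -Djava.io.tmpdir=/local/var/druid/indexer -classpath $CLASSPATH io.druid.cli.Main server $ROLE > $DRUID_LOG_PATH 2>&1 &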

After adding a couple of log lines around https://github.com/druid-io/druid/blob/master/indexing-service/src/main/java/io/druid/indexing/overlord/ForkingTaskRunner.java#L141, it is clear that the check if (restoreFile.exists()) always returns false, even though restore.json is there. A couple of thoughts (a quick check for each is sketched below):

  1. Permission issue: the middleManager cannot read the file. But the file has permissions of 710, and root/user/group all have read permission.
  2. Symlink: I use a symlink as the taskBaseDir; maybe File.exists() does not like that?
  3. Something deletes restore.json before the MM starts up (very unlikely, but please correct me if this can happen by any chance).
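To tell these apart, a few quick checks from the MM host ("druid" below stands in for whatever user the middleManager runs as, and {taskBaseDir} for the configured druid.indexer.task.baseTaskDir):

ls -la {taskBaseDir}/restore.json    # 1: does the file exist, and with what owner/permissions?
sudo -u druid cat {taskBaseDir}/restore.json > /dev/null && echo readable    # 1: can the MM's user actually read it?
readlink -f {taskBaseDir}    # 2: resolve the symlink to the directory the JVM really sees
ls -la {taskBaseDir}/    # 3: run right before starting the MM, to catch anything wiping the dir

For what it's worth, java.io.File.exists() should follow symlinks, so a symlinked taskBaseDir by itself shouldn't make the check return false.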

2016-03-08T10:45:55,806 INFO [main] io.druid.indexing.worker.WorkerTaskMonitor - my-debug: start()
2016-03-08T10:45:55,806 INFO [main] io.druid.indexing.overlord.ForkingTaskRunner - my-debug: FTR restore()
2016-03-08T10:45:55,806 INFO [main] io.druid.indexing.overlord.ForkingTaskRunner - my-debug: FTR getting restoreFile
2016-03-08T10:45:55,807 INFO [main] io.druid.indexing.overlord.ForkingTaskRunner - my-debug: FTR restoreFile.getName() = restore.json
2016-03-08T10:45:55,807 INFO [main] io.druid.indexing.overlord.ForkingTaskRunner - my-debug: FTR restoreFile.getAbsoluteFile().exists() = false
2016-03-08T10:45:55,807 INFO [main] io.druid.indexing.overlord.ForkingTaskRunner - my-debug: FTR restoreFile.getPath() = {taskBaseDir}/restore.json
2016-03-08T10:45:55,807 INFO [main] io.druid.indexing.overlord.ForkingTaskRunner - my-debug: FTR restoreFile.canRead() = false
2016-03-08T10:45:55,807 INFO [main] io.druid.indexing.overlord.ForkingTaskRunner - my-debug: FTR restoreFile does not exist, returning empty list
2016-03-08T10:45:55,807 INFO [main] io.druid.indexing.worker.WorkerTaskMonitor - my-debug: restored.size() 0

Finally got things working!!

It turned out to be our restart script, which clears the indexer working dir, so restore.json was gone. This seems like a really stupid problem; however, the lack of logging made the investigation extremely difficult. I ended up adding logging to ForkingTaskRunner to identify what went wrong, and that debug logic eventually helped fix the problem.
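For anyone else hitting this: double-check that nothing in your restart tooling touches druid.indexer.task.baseTaskDir. The bad step in our script was essentially something like this (illustrative, not the literal script):

# wrong: clearing the whole indexer working dir on every restart also deletes restore.json,
# so there is nothing left for the restarted middleManager to restore
rm -rf /local/var/druid/indexer/*

The fix was simply to stop clearing that directory across restarts; the middleManager removes finished task directories on its own, and restore.json is what lets it re-attach to the tasks that were still running.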