Druid 0.8.3 (Stable): questions regarding restartable indexing tasks

This post is related to https://groups.google.com/forum/#!topic/druid-user/JzDjMR41gXg, but is at a higher level, to clarify some basics of how restartable indexing tasks work. I got really stuck in the other post, so I am trying to step back a bit and rule out some possibilities:

  1. Has anyone successfully made the middle manager totally safe to restart by setting druid.indexer.task.restoreTasksOnRestart=true? If yes, would you mind sharing your configs (MM, overlord, even the _common)?

  2. What’s the expected behavior for restorable tasks? Say you kill the middle manager process, wait until all the realtime node (peon) processes disappear, and restart the middle manager: will the indexing tasks be restored? Will the middle manager log reflect this restore process? Currently the middle manager log shows nothing about trying to restore the tasks at all. As soon as I restart the middle manager, the Overlord log says the middle manager wrote a FAILED status and all tasks “went bye bye”.

  3. If restartable tasks work for you, would you mind sharing how you stop/start the middle manager process? I first stop the java process with “kill pid”; after all JVM processes are gone, I run “java -cp classpath cli server middlemanager”. I wonder whether I’m not restarting the process properly.

  4. I can see restore.json in the indexer dir, and all the task IDs are listed in the file. I think the WorkerTaskMonitor tries to restore here: https://github.com/druid-io/druid/blob/master/indexing-service/src/main/java/io/druid/indexing/worker/WorkerTaskMonitor.java#L105, but I cannot find any such activity in the overlord, middle manager or peon logs. Is this activity logged anywhere?

  5. While shutting down the middle manager, I always see the following in the task (Peon) log: ERROR [main] org.apache.curator.x.discovery.details.ServiceDiscoveryImpl - Could not unregister instance: task-05-0006-0001. This happens whenever a Peon is going down. It does not seem to affect any functionality (indexing + handoff), but I want to confirm whether this error is related to restartability at all.
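
For anyone who wants to repeat the check from item 4, here is a minimal sketch of listing the task IDs recorded in restore.json before restarting the middle manager. It is only an illustration: the function name is hypothetical, and it assumes the file is a JSON object with the task IDs in a list under a "runningTasks" key. That key name is an assumption, so adjust it to whatever your file actually contains.

```python
import json
from pathlib import Path


def read_restorable_task_ids(restore_file, key="runningTasks"):
    """Return the task IDs listed in a restore.json-style file.

    Assumption: the file holds a JSON object whose `key` entry is a list
    of task ID strings. An empty list is returned if the file is missing
    or does not have that shape.
    """
    path = Path(restore_file)
    if not path.is_file():
        return []
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        return []
    ids = data.get(key, [])
    if not isinstance(ids, list):
        return []
    return [str(task_id) for task_id in ids]
```

Comparing this list against the tasks the overlord reports as FAILED after the restart would at least show whether the middle manager had the tasks on record before it came back up.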

I believe this is a KEY Druid feature operations-wise for my system (probably the same for yours, too), and I’m sure it should work as listed in the release notes. Digging into the Druid code helps me understand the flow a bit, but I am still having a hard time making any progress. Any input will be highly appreciated; others in the community who run into similar situations might benefit from the discussion.

Thanks in advance