Druid Batch Indexer on Dataproc

Hi folks,

I’m trying to use Dataproc to index my data. When I submit a job to the overlord, the overlord logs something like “io.druid.indexing.overlord.TaskLockbox - Adding task[…] to activeTasks”, but I do not see the job running in the Dataproc console, and the task stays active (active = 1 in the metadata store) forever.

I followed most of the tips found in this thread, such as copying the Hadoop XML config files from the Dataproc master to the conf/druid/_common directory.


How can I debug this?


Only jobs submitted through the Dataproc API are visible in the Dataproc console. Since Druid submits the job directly to Hadoop, you will never see it in the Dataproc console.
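Because the job goes straight to YARN rather than through the Dataproc API, one way to confirm it actually reached the cluster is to list YARN applications on the Dataproc master. A sketch (cluster name and zone are placeholders for your own cluster):

```shell
# Assumption: "my-cluster-m" and the zone are placeholders for your Dataproc master.
# A Druid Hadoop indexing task that reached the cluster should appear in this
# list even though the Dataproc console does not display it.
gcloud compute ssh my-cluster-m --zone=us-central1-a -- \
  "yarn application -list -appStates ACCEPTED,RUNNING"
```

If nothing shows up here, the job never made it to YARN and the problem is on the Druid side (typically classpath or Hadoop config issues in the middle manager/peon).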

Hm. That makes sense. Is there a way to see whether a command was actually submitted to Hadoop by the overlord?

I currently only see this line in stdout
2017-08-14T09:20:47,107 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 1 tasks from storage (0 tasks added, 0 tasks removed).

The task remains unchanged in the metadata store, so it’s hard to tell what the overlord is doing or why the task is stuck.
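For situations like this, the overlord’s HTTP API is usually more informative than its stdout. A sketch, assuming the default overlord port 8090 and a placeholder task id (substitute your own host and the task id from the TaskLockbox log line):

```shell
# Assumption: overlord host and task id below are placeholders.
OVERLORD="http://overlord-host:8090"
TASK_ID="index_hadoop_mydatasource"

# Current status of the task as the overlord sees it (RUNNING, SUCCESS, FAILED):
curl -s "${OVERLORD}/druid/indexer/v1/task/${TASK_ID}/status"

# Task log, proxied through the overlord; errors during Hadoop job submission
# generally appear here rather than in the overlord's own stdout:
curl -s "${OVERLORD}/druid/indexer/v1/task/${TASK_ID}/log"
```

The task log is typically where a failed or hung Hadoop submission (missing XML configs, classpath conflicts, unreachable ResourceManager) will leave a stack trace.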