Indexing service: task not allotted to worker

Hi,
I’m trying to ingest some test data to understand how the indexing service works.

I used this command to post a task to the overlord:

curl -X 'POST' -H 'Content-Type: application/json' -d @examples/indexing/wikipedia_index_task.json localhost:8090/druid/indexer/v1/task

The overlord log shows the task is received and added to the pending list, but it’s not reaching my worker process.

2015-10-28T22:37:57,468 INFO [qtp2016593579-93] io.druid.indexing.overlord.HeapMemoryTaskStorage - Inserting task index_wikipedia_2015-10-28T22:37:57.465Z with status: TaskStatus{id=index_wikipedia_2015-10-28T22:37:57.465Z, status=RUNNING, duration=-1}

2015-10-28T22:37:57,472 INFO [TaskQueue-Manager] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_wikipedia_2015-10-28T22:37:57.465Z]: LockTryAcquireAction{interval=2013-08-31T00:00:00.000Z/2013-09-01T00:00:00.000Z}

2015-10-28T22:37:57,475 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskLockbox - Created new TaskLockPosse: TaskLockPosse{taskLock=TaskLock{groupId=index_wikipedia_2015-10-28T22:37:57.465Z, dataSource=wikipedia, interval=2013-08-31T00:00:00.000Z/2013-09-01T00:00:00.000Z, version=2015-10-28T22:37:57.474Z}, taskIds=}

2015-10-28T22:37:57,476 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskLockbox - Added task[index_wikipedia_2015-10-28T22:37:57.465Z] to TaskLock[index_wikipedia_2015-10-28T22:37:57.465Z]

2015-10-28T22:37:57,476 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_wikipedia_2015-10-28T22:37:57.465Z

2015-10-28T22:37:57,476 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Added pending task index_wikipedia_2015-10-28T22:37:57.465Z

2015-10-28T22:37:57,479 WARN [pool-12-thread-1] io.druid.indexing.overlord.RemoteTaskRunner - No worker selections strategy set. Using default.

2015-10-28T22:38:52,543 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 1 tasks from storage (0 tasks added, 0 tasks removed).

2015-10-28T22:39:52,543 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 1 tasks from storage (0 tasks added, 0 tasks removed).

This is the middle manager config:

cat config/middleManager/runtime.properties

#druid.host=xxxxxx3en0003.xxx.xxxxx.com

druid.port=8080

druid.service=druid/middlemanager

# Store task logs in deep storage

druid.indexer.logs.type=file

druid.indexer.logs.directory=/var/log/druid/indexerlogs

# Resources for peons

druid.indexer.runner.javaOpts=-server -Xmx3g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

druid.indexer.task.baseTaskDir=/druid01/peon/tasks

# Peon properties

druid.indexer.fork.property.druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]

druid.indexer.fork.property.druid.processing.buffer.sizeBytes=536870912

druid.indexer.fork.property.druid.processing.numThreads=2

druid.indexer.fork.property.druid.segmentCache.locations=[{"path": "/druid01/zk_druid", "maxSize": 0},{"path": "/druid02/zk_druid", "maxSize": 0},{"path": "/druid03/zk_druid", "maxSize": 0}]

druid.indexer.fork.property.druid.server.http.numThreads=50

druid.indexer.fork.property.druid.storage.archiveBaseKey=prod

druid.indexer.fork.property.druid.storage.archiveBucket=

druid.indexer.fork.property.druid.storage.baseKey=prod/v1

druid.indexer.fork.property.druid.storage.bucket=druid

druid.indexer.fork.property.druid.storage.type=local

druid.worker.capacity=9

druid.worker.ip=#{IP_ADDR}

druid.worker.version=#{WORKER_VERSION}

Overlord config:

druid.host=xxxxxx3en0003.xxx.xxxxxx.com

druid.port=8090

druid.service=overlord

# Run the overlord in local mode with a single peon to execute tasks

# This is not recommended for production.

druid.indexer.queue.startDelay=PT0M

# This setting is too small for real production workloads

druid.indexer.runner.javaOpts=-server -Xmx1g

# These settings are also too small for real production workloads

# Please see our recommended production settings in the docs (http://druid.io/docs/latest/Production-Cluster-Configuration.html)

druid.indexer.fork.property.druid.processing.numThreads=6

druid.indexer.fork.property.druid.computation.buffer.size=100000000

# Uncomment the following property if you are running a middle manager

druid.indexer.runner.type=remote

When I started the worker process, the overlord log showed this:

2015-10-28T22:31:23,620 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[xxxxxx3en0003.xxx.xxxxxx.com:8080] reportin’ for duty!

It looks like the worker identified itself to the overlord correctly, but somehow the overlord is not able to hand the task over to the worker.

Thanks for your help,

Nikhil.

Hi Nikhil, are there any other tasks running in the cluster or is it only the one? If the task is stuck in pending state, it should be waiting on a lock. Do you have additional logs from your overlord or perhaps a screenshot of the overlord console?

– FJ

No Fangjin, there is only one task submitted to run. In fact, when I submitted the same task again, the log showed it waiting for the lock:
2015-10-29T01:32:52,573 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 2 tasks from storage (0 tasks added, 0 tasks removed).

2015-10-29T01:32:52,574 INFO [TaskQueue-Manager] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_wikipedia_2015-10-29T01:24:08.099Z]: LockTryAcquireAction{interval=2013-08-31T00:00:00.000Z/2013-09-01T00:00:00.000Z}

2015-10-29T01:33:52,573 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 2 tasks from storage (0 tasks added, 0 tasks removed).

2015-10-29T01:33:52,574 INFO [TaskQueue-Manager] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_wikipedia_2015-10-29T01:24:08.099Z]: LockTryAcquireAction{interval=2013-08-31T00:00:00.000Z/2013-09-01T00:00:00.000Z}

I eventually want to run the overlord in remote mode, but to try things out I copied the middle manager config into the overlord config and commented out druid.indexer.runner.type=remote. The task made it to the peon and threw this error:

2015-10-29T01:40:07,594 INFO [task-runner-0] io.druid.segment.realtime.firehose.LocalFirehoseFactory - Found files: [/usr/lib/druid/examples/indexing/wikipedia_data.json]

2015-10-29T01:40:07,656 ERROR [task-runner-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Uncaught Throwable while running task[IndexTask{id=index_wikipedia_2015-10-29T01:40:00.155Z, type=index, dataSource=wikipedia}]

java.lang.ExceptionInInitializerError

    at io.druid.indexing.common.task.IndexTask.convertTuningConfig(IndexTask.java:122) ~[druid-indexing-service-0.8.1.jar:0.8.1]

    at io.druid.indexing.common.task.IndexTask.generateSegment(IndexTask.java:344) ~[druid-indexing-service-0.8.1.jar:0.8.1]

    at io.druid.indexing.common.task.IndexTask.run(IndexTask.java:205) ~[druid-indexing-service-0.8.1.jar:0.8.1]

    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:235) [druid-indexing-service-0.8.1.jar:0.8.1]

    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:214) [druid-indexing-service-0.8.1.jar:0.8.1]

    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [?:1.7.0_79]

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_79]

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_79]

    at java.lang.Thread.run(Thread.java:745) [?:1.7.0_79]

Caused by: java.lang.IllegalStateException: Failed to create directory within 10000 attempts (tried 1446082807605-0 to 1446082807605-9999)

    at com.google.common.io.Files.createTempDir(Files.java:600) ~[guava-16.0.1.jar:?]

    at io.druid.segment.indexing.RealtimeTuningConfig.<clinit>(RealtimeTuningConfig.java:43) ~[druid-server-0.8.1.jar:0.8.1]

    ... 9 more

2015-10-29T01:40:07,667 ERROR [main] io.druid.cli.CliPeon - Error when starting up. Failing.

java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.ExceptionInInitializerError

    at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]

    at io.druid.indexing.worker.executor.ExecutorLifecycle.join(ExecutorLifecycle.java:162) ~[druid-indexing-service-0.8.1.jar:0.8.1]

    at io.druid.cli.CliPeon.run(CliPeon.java:225) [druid-services-0.8.1.jar:0.8.1]

    at io.druid.cli.Main.main(Main.java:91) [druid-services-0.8.1.jar:0.8.1]

overlord command:

java -server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -Xms10g -Xmx10g -XX:MaxNewSize=1g -XX:NewSize=1g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSCompactAtFullCollection -Xloggc:/var/log/druid/druid-overlord-gc.log.201510281838 -XX:ErrorFile=/var/log/druid/druid-overlord-hs_err_pid.log -Dlog4j.configuration=/usr/lib/druid/config/_common/log4j.properties -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/hadoop/druid/tmp -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Ddruid.extensions.localRepository=./extensions-repo -classpath /usr/lib/druid/config/_common:/usr/lib/druid/config/overlord:/usr/lib/druid/lib/* io.druid.cli.Main server overlord

Thanks,

Nikhil.

Hi,

For the “Caused by: java.lang.IllegalStateException: Failed to create directory within 10000 attempts (tried 1446082807605-0 to 1446082807605-9999)”

does "/hadoop/druid/tmp" exist on your disk, and is it writable by the user starting the Druid process?

For the original problem, as FJ said, a screenshot of the overlord console would be useful.

– Himanshu

Hi Nikhil,

I think Himanshu is referring to this parameter in the overlord startup command:

-Djava.io.tmpdir=/hadoop/druid/tmp

The exception you see here:

Caused by: java.lang.IllegalStateException: Failed to create directory within 10000 attempts (tried 1446082807605-0 to 1446082807605-9999)

at io.druid.segment.indexing.RealtimeTuningConfig.<clinit>(RealtimeTuningConfig.java:43) ~[druid-server-0.8.1.jar:0.8.1]

is generated by a call that tries to create a temporary directory under the path specified by java.io.tmpdir.
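A quick way to check that precondition before starting the process (a sketch; the path is the one from your startup command, and the helper name is mine):

```shell
# check_tmpdir: verify that a directory exists (creating it if needed) and is
# writable by the current user -- the precondition the JVM needs for the path
# passed via -Djava.io.tmpdir before Guava's createTempDir can succeed.
check_tmpdir() {
    dir="$1"
    mkdir -p "$dir" 2>/dev/null
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "OK: $dir exists and is writable"
        return 0
    else
        echo "FAIL: $dir is missing or not writable" >&2
        return 1
    fi
}

# The path from the overlord startup command in this thread:
check_tmpdir /hadoop/druid/tmp || echo "fix permissions or point -Djava.io.tmpdir elsewhere"
```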

– Jon

Thanks Jonathan Wei.

Yes, I mean for tmp dir specified in the command.

– Himanshu

Thanks, guys.

Yes, I was able to change the -Djava.io.tmpdir param to point to a different directory and got it working. That’s the reason I removed my response from this thread.
So the overlord is working fine in local mode now. I’m going to switch back to test it in remote mode soon.

For remote mode I’m going to use NFS mounts as deep storage for now. I have a folder on the overlord machine which I’ve exported to the historical and coordinator nodes in the cluster.

Please let me know if it’s OK to proceed with NFS (I plan to switch to HDFS at some point).

Hi FJ,
To make sure that no tasks were already running, I restarted the overlord, middle managers, historical, and coordinator.

When I started the middle managers I could see these messages in the overlord log:

2015-11-05T01:05:52,571 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[xxxxxen0003.xxx.xxxxxx.com:8080] reportin’ for duty!

2015-11-05T01:05:52,571 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Cancelling Worker[xxxxxx3en0003.xxx.xxxxxx.com:8080] scheduled task cleanup

2015-11-05T01:06:21,121 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 0 tasks from storage (0 tasks added, 0 tasks removed).

2015-11-05T01:06:58,734 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[xxxxx3en0004.xxx.xxxxxx.com:8080] reportin’ for duty!

2015-11-05T01:06:58,734 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Cancelling Worker[xxxxx3en0004.xxx.xxxxxx.com:8080] scheduled task cleanup

I’m not sure what the ‘Cancelling Worker’ message indicates.

Again I tried submitting the wikipedia index task to the overlord via a POST request. I got the same log messages as a few days before and did not see anything in the worker logs:

2015-11-05T01:31:23,007 INFO [TaskQueue-Manager] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_wikipedia_2015-11-05T01:31:23.000Z]: LockTryAcquireAction{interval=2015-08-31T00:00:00.000Z/2015-09-01T00:00:00.000Z}

2015-11-05T01:31:23,010 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskLockbox - Created new TaskLockPosse: TaskLockPosse{taskLock=TaskLock{groupId=index_wikipedia_2015-11-05T01:31:23.000Z, dataSource=wikipedia, interval=2015-08-31T00:00:00.000Z/2015-09-01T00:00:00.000Z, version=2015-11-05T01:31:23.008Z}, taskIds=}

2015-11-05T01:31:23,010 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskLockbox - Added task[index_wikipedia_2015-11-05T01:31:23.000Z] to TaskLock[index_wikipedia_2015-11-05T01:31:23.000Z]

2015-11-05T01:31:23,011 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_wikipedia_2015-11-05T01:31:23.000Z

2015-11-05T01:31:23,011 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Added pending task index_wikipedia_2015-11-05T01:31:23.000Z

2015-11-05T01:31:23,013 WARN [pool-12-thread-1] io.druid.indexing.overlord.RemoteTaskRunner - No worker selections strategy set. Using default.

2015-11-05T01:32:21,126 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 1 tasks from storage (0 tasks added, 0 tasks removed).

2015-11-05T01:33:21,126 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 1 tasks from storage (0 tasks added, 0 tasks removed).

2015-11-05T01:34:21,126 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 1 tasks from storage (0 tasks added, 0 tasks removed).

The coordinator console is not accessible to me as of now. Is there a REST API which I can use to get a list of all tasks?

I’m running the overlord in remote mode on port 8090 and the middle managers on 8080.

Thanks for your help,

Nikhil.

Hi,

If you use NFS as deep storage, Hadoop indexing will be either very difficult or impossible, depending on how your Hadoop cluster is set up. So you would have to use the Index task, which has to process all input data on a single node, and you will be limited in terms of scaling your batch ingestion.

If you don’t have batch ingestion, then it should be fine I guess.
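If you do go with an NFS mount, deep storage is configured as the "local" type pointing at the mount. A minimal sketch (property names from the 0.8.x docs; the mount path is illustrative):

```properties
# NFS-backed deep storage: every node must see the same mount at this path.
druid.storage.type=local
druid.storage.storageDirectory=/mnt/druid-deep-storage
```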

– Himanshu

Thanks Himanshu.
Could you please also give me some pointers about the ‘overlord not delegating tasks to workers’ problem (please check my last comment on this thread)?

Do I need to have the middle manager and overlord on separate machines? Right now I have one middle manager on the same machine as the overlord and another on a separate machine.

Thanks,

Nikhil.

Hi,

Technically, it is OK to run the overlord and middle manager processes on the same machine. The overlord is Druid’s poor man’s task scheduling/management framework, and the middle managers act as the workers. You can visit "http://overlord_host:overlord_port/console.html" to see what tasks are running and which middle managers are in the system (recognized by the overlord). There is a REST API for everything the overlord console shows (the console works by making those calls); see http://druid.io/docs/0.8.1/design/indexing-service.html#overlord-node .
I would look at the overlord console to see whether all the middle managers are recognized (maybe not, if there was a config issue), whether the task runs at all, and what the task log says.
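For example, you can hit the overlord’s task endpoints directly with curl (endpoint names are from the 0.8.x indexing-service docs; the helper function and host/port default are mine, so adjust to your setup):

```shell
# Query the overlord's REST API for task and worker state.
OVERLORD="${OVERLORD:-localhost:8090}"

overlord_get() {
    # -s silences the progress bar; --max-time keeps the call from hanging
    # if the overlord is down. Fall back to a message on failure.
    curl -s --max-time 5 "http://$OVERLORD/druid/indexer/v1/$1" \
        || echo "(overlord not reachable at $OVERLORD)"
}

overlord_get pendingTasks   # tasks waiting for a worker slot
overlord_get runningTasks   # tasks currently assigned to workers
overlord_get workers        # middle managers the overlord knows about
```

A task that is stuck in pendingTasks while workers shows your middle managers would point at assignment (capacity/version) rather than discovery.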

– Himanshu

Yes, you are right. I enabled ‘debug’ logging for the overlord process and found out that there was a configuration issue with the workers.
I fixed that and got it to work fine.

Thanks for help.

Nikhil.