Peons not finishing tasks (All tasks running)

I am running a Druid cluster on EC2 with the following configuration:

  • 2 Overlord nodes on m3.large
  • 3 MiddleManagers on m3.2xlarge
  • 2 Historical nodes on m3.2xlarge
  • ZooKeeper cluster on t2.micro
  • 2 Coordinator nodes on m3.large

We are inserting data through Tranquility and recently moved to a remote setup by setting druid.indexer.runner.type=remote. I see Java peon processes picking up tasks, but they have not been finishing for hours. Here are some observations.

  • I have pretty beefy Historical nodes and they don’t seem stressed out. I have learned that Historical node resources could be an issue (https://github.com/druid-io/tranquility/blob/master/docs/trouble.md#my-tasks-are-never-exiting).
  • I still see task logs being updated for a task that started 11 hours ago, so the peons are still working on tasks.
  • When I was running with the local setting, segments were properly created and indexed in S3 and the metadata store.

Here is my overlord config.

druid.host=ip-xxx-xx-xx-xxx.us-west-2.compute.internal
druid.port=8080
druid.service=druid/overlord

# INDEXING SERVICE SETTINGS
# Run in remote mode (distributes tasks among middleManagers)
druid.indexer.runner.type=remote
#druid.indexer.runner.minWorkerVersion=#{WORKER_VERSION}

# Store all task state in the metadata storage (local)
druid.indexer.storage.type=metadata

Here is my MiddleManager config.

druid.host=ip-xxx-xxx-xxx-xxx.us-west-2.compute.internal
druid.port=8080
druid.service=druid/middlemanager

# Worker properties (m3.2xlarge has 8 cores)
druid.worker.capacity=7
#druid.worker.ip=localhost
druid.worker.version=0.8.2

# Resources for peons
druid.indexer.task.baseTaskDir=/var/druid/cache/middlemanager/tasks

What could have gone wrong here? I am definitely missing some important detail in the config.

Attached are some screenshots, in case they help.

So I made some changes to the config, thinking it might be a peon resource/config issue, but now tasks just exit with FAILED status. I can’t find any error in the task logs.

Any ideas?

MiddleManager config:

druid.host=ip-xxx-xxx-xxx-xxx100.us-west-2.compute.internal
druid.port=8080
druid.service=druid/middlemanager

# Worker properties
druid.worker.capacity=2
#druid.worker.ip=localhost
druid.worker.version=0.8.2

# Resources for peons
druid.indexer.runner.javaOpts=-server -Xms4g -Xmx4g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
druid.indexer.task.baseTaskDir=/var/druid/cache/middlemanager/tasks

# Peon properties
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=536870912
druid.indexer.fork.property.druid.processing.numThreads=2
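As a sanity check on these numbers (my own back-of-the-envelope arithmetic, not something from the thread): each peon needs roughly its JVM heap plus direct memory of about (processing.numThreads + 1) × processing.buffer.sizeBytes. Verify the direct-memory rule against your Druid version's docs, but with the values above it works out like this:

```python
# Rough per-peon memory estimate for the MiddleManager config above.
# The (numThreads + 1) * buffer.sizeBytes direct-memory rule is an
# assumption about how Druid sizes its processing buffers.
heap_bytes = 4 * 1024**3        # -Xmx4g from druid.indexer.runner.javaOpts
buffer_size = 536870912          # druid.processing.buffer.sizeBytes (512 MiB)
num_threads = 2                  # druid.processing.numThreads
worker_capacity = 2              # druid.worker.capacity

direct_per_peon = (num_threads + 1) * buffer_size
per_peon = heap_bytes + direct_per_peon
total = worker_capacity * per_peon

print(per_peon / 1024**3)   # 5.5 GiB per peon
print(total / 1024**3)      # 11.0 GiB for all peons
```

At ~5.5 GiB per peon, two peons (~11 GiB) fit comfortably on an m3.2xlarge's 30 GB; the earlier capacity of 7 with similar per-peon sizing would have been much tighter.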

**Last task log entries:**

9006.244: [GC pause (G1 Evacuation Pause) (young), 0.0040717 secs]
   [Parallel Time: 1.6 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 9006244.4, Avg: 9006244.6, Max: 9006245.5, Diff: 1.1]
      [Ext Root Scanning (ms): Min: 0.0, Avg: 0.5, Max: 1.3, Diff: 1.3, Sum: 3.7]
      [Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.2]
         [Processed Buffers: Min: 0, Avg: 2.9, Max: 10, Diff: 10, Sum: 23]
      [Scan RS (ms): Min: 0.0, Avg: 0.2, Max: 0.3, Diff: 0.2, Sum: 1.5]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.3]
      [Object Copy (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.4]
      [Termination (ms): Min: 0.0, Avg: 0.3, Max: 0.4, Diff: 0.4, Sum: 2.6]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 8]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.2]
      [GC Worker Total (ms): Min: 0.3, Avg: 1.1, Max: 1.3, Diff: 1.1, Sum: 9.0]
      [GC Worker End (ms): Min: 9006245.7, Avg: 9006245.8, Max: 9006245.8, Diff: 0.1]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.5 ms]
   [Other: 2.0 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.2 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.3 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 1.2 ms]
   [Eden: 2454.0M(2454.0M)->0.0B(2454.0M) Survivors: 2048.0K->2048.0K Heap: 2713.8M(4096.0M)->259.9M(4096.0M)]
 [Times: user=0.01 sys=0.00, real=0.01 secs]
9007.778: [GC pause (G1 Evacuation Pause) (young), 0.0040403 secs]
   [Parallel Time: 1.5 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 9007777.7, Avg: 9007778.0, Max: 9007778.8, Diff: 1.1]
      [Ext Root Scanning (ms): Min: 0.0, Avg: 0.5, Max: 1.3, Diff: 1.2, Sum: 4.0]
      [Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.2]
         [Processed Buffers: Min: 0, Avg: 3.5, Max: 9, Diff: 9, Sum: 28]
      [Scan RS (ms): Min: 0.0, Avg: 0.2, Max: 0.3, Diff: 0.2, Sum: 1.5]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.3]
      [Object Copy (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.5]
      [Termination (ms): Min: 0.0, Avg: 0.3, Max: 0.4, Diff: 0.4, Sum: 2.2]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 8]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.3]
      [GC Worker Total (ms): Min: 0.3, Avg: 1.1, Max: 1.4, Diff: 1.1, Sum: 9.1]
      [GC Worker End (ms): Min: 9007779.1, Avg: 9007779.1, Max: 9007779.1, Diff: 0.1]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.5 ms]
   [Other: 2.0 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.3 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.2 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 1.2 ms]
   [Eden: 2454.0M(2454.0M)->0.0B(2454.0M) Survivors: 2048.0K->2048.0K Heap: 2713.9M(4096.0M)->259.8M(4096.0M)]
 [Times: user=0.02 sys=0.00, real=0.01 secs]
9009.315: [GC pause (G1 Evacuation Pause) (young), 0.0043862 secs]
   [Parallel Time: 1.7 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 9009314.8, Avg: 9009315.1, Max: 9009316.2, Diff: 1.4]
      [Ext Root Scanning (ms): Min: 0.0, Avg: 0.5, Max: 1.2, Diff: 1.2, Sum: 3.8]
      [Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.1]
         [Processed Buffers: Min: 0, Avg: 3.0, Max: 12, Diff: 12, Sum: 24]
      [Scan RS (ms): Min: 0.0, Avg: 0.2, Max: 0.3, Diff: 0.3, Sum: 1.3]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.2]
      [Object Copy (ms): Min: 0.0, Avg: 0.2, Max: 0.7, Diff: 0.7, Sum: 1.6]
      [Termination (ms): Min: 0.0, Avg: 0.3, Max: 0.4, Diff: 0.4, Sum: 2.1]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 8]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.2]
      [GC Worker Total (ms): Min: 0.1, Avg: 1.2, Max: 1.5, Diff: 1.4, Sum: 9.4]
      [GC Worker End (ms): Min: 90093

Hi,
What are the configured segmentGranularity and windowPeriod?

Can you share the task payload and the complete logs for more details?

Thanks for your response, Nishant. The segment granularity is HOUR and the window period is 10 minutes.

Task logs and payload are attached.

TaskPayload2016-02-10.txt (2.09 KB)

TaskLog2016-02-10.txt (1.78 MB)
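For reference (my own arithmetic, not stated in the thread): with HOUR segmentGranularity and a 10-minute windowPeriod, a Tranquility realtime task is expected to live for roughly its segment interval plus the window plus handoff time. A quick sketch of what "normal" would look like, which makes an 11-hour runtime point at stuck handoff rather than slow indexing:

```python
from datetime import timedelta

# Expected lifetime of a Tranquility realtime task (rough sketch):
# it accepts events for its segment interval, keeps accepting late
# events for windowPeriod, then builds the segment and waits for
# handoff to the historicals.
segment_granularity = timedelta(hours=1)   # segmentGranularity = HOUR
window_period = timedelta(minutes=10)      # windowPeriod = PT10M

expected_minimum = segment_granularity + window_period
print(expected_minimum)               # 1:10:00 before handoff even starts

observed = timedelta(hours=11)
print(observed - expected_minimum)    # 9:50:00 presumably spent waiting on handoff
```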

Karan, can you attach the full task logs? The ones you’ve included only contain a very small portion of the logs and don’t show any actual useful information. Make sure to remove sensitive info from your task logs.

Fj - I am attaching the full logs from the overlord console.

Thanks.

taskLog_2016-02-10T22-00-00.000Z_0_0.txt.zip (3.19 MB)

index_realtime_EntityAuth-streaming_2016-02-10T22-00-00.000Z_0_0 (2.09 KB)

Fj / Nishant - just checking whether you got a chance to look at the logs.

I have played around with the config settings but am not sure what is keeping the tasks pending.

Hi Karan,
Segments are not being handed over, since the task was created with the "none" rejection policy.

From the attached logs:

"rejectionPolicy" : {
  "type" : "none"
},

You need to change it to serverTime in order for segments to be handed over.

rejectionPolicy "none" is normal for Tranquility; I think what’s going on here is that handoff is not working (the historicals are not loading the segment "EntityAuth-streaming_2016-02-10T22:00:00.000Z_2016-02-10T23:00:00.000Z_2016-02-10T22:42:10.250Z" that was created by this task).

See https://github.com/druid-io/tranquility/blob/master/docs/trouble.md#my-tasks-are-never-exiting for some more details.
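When checking in the coordinator console whether the historicals ever loaded that segment, it can help to break the identifier into its parts; segment IDs of this era follow the pattern dataSource_intervalStart_intervalEnd_version (with an optional trailing partition number, absent here). A small sketch of my own, splitting from the right since data source names may themselves contain underscores:

```python
def parse_segment_id(segment_id: str) -> dict:
    """Split a Druid segment identifier into its components.

    IDs look like dataSource_start_end_version; the ISO-8601
    timestamps contain no underscores, so rsplit is safe even
    when the data source name has underscores in it.
    """
    data_source, start, end, version = segment_id.rsplit("_", 3)
    return {"dataSource": data_source, "start": start,
            "end": end, "version": version}

parts = parse_segment_id(
    "EntityAuth-streaming_2016-02-10T22:00:00.000Z_"
    "2016-02-10T23:00:00.000Z_2016-02-10T22:42:10.250Z"
)
print(parts["dataSource"])  # EntityAuth-streaming
print(parts["version"])     # 2016-02-10T22:42:10.250Z
```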