Peon process becomes a zombie after successfully completing an indexing task

Hi

Based on the peon logs I can see that it finished, because there is a message about it:

```
INFO 2017-01-20T02:42:27,573 - Task completed with status: {
  "id" : "index_cfu_20170120_024555203UTC",
  "status" : "SUCCESS",
  "duration" : 1772
} [task-runner-0-priority-0][com.metamx.common.logger.Logger.info(Logger.java:70)]
```

I also see that the segment was successfully published, so it looks like the peon did its main job.

Meanwhile, the overlord console still shows this task in the Running Tasks section. We are running the overlord inside a Docker container, and ps shows:

```
root@c8bdaeb50a57:~# ps axu | grep java
root 20 0.1 3.6 4251516 555876 ? Sl Jan18 3:59 java -Xmx512M -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/druid/overlord/gc.log -XX:+UseConcMarkSweepGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/overlord/overlord-oom.hprof -Dconfig.dir=/etc/druid/overlord -cp /opt/druid/druid-0.9-SNAPSHOT.jar -Ddruid.host=XX.XX.XX.XX:31040 -Ddruid.port=31040 -Duser.timezone=UTC -Dfile.encoding=UTF-8 com.mycompany.Runner server overlord
root 5009 99.7 0.0 0 0 ? Zl 02:42 740:46 [java]

```

where 5009 is the peon process. As you can see, its state is `Zl`: the `Z` means the peon has become a zombie. When and why it became a zombie is the main question. Also pay attention to the CPU usage: it is extremely high!
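As background (this is not the overlord's actual code): on Linux a child process stays in the `Z` state after it exits until its parent reaps it with waitpid(). In a JVM parent that normally happens through Process.waitFor() or the JDK's internal process-reaper thread. A minimal sketch:

```java
public class ReapChild {
    public static void main(String[] args) throws Exception {
        // Fork a short-lived child process (a stand-in for the peon).
        Process child = new ProcessBuilder("sleep", "1").start();

        // Between the moment the child exits and the moment the parent calls
        // waitpid() on it, ps reports the child in the Z (zombie) state.
        // waitFor() performs that reap (the JDK may also do it from its own
        // process-reaper thread) and returns the child's exit code.
        int exitCode = child.waitFor();
        System.out.println("child exited with code " + exitCode);
    }
}
```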

We also use a custom Runner, but it doesn't do any magic: it just reads application-specific config and calls com.pleeco.druid.cli.Main.
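For reference, the wrapper is roughly equivalent to the following (a hypothetical sketch; the loadApplicationConfig() helper and its use of the config.dir property are made up for illustration):

```java
public class Runner {
    public static void main(String[] args) throws Exception {
        // Hypothetical helper: read our application-specific config, e.g. from
        // the directory passed via -Dconfig.dir, before Druid starts up.
        loadApplicationConfig(System.getProperty("config.dir"));

        // Then delegate straight to Druid's CLI entry point with the original
        // arguments ("server overlord" in our case).
        com.pleeco.druid.cli.Main.main(args);
    }

    private static void loadApplicationConfig(String configDir) {
        // application-specific property loading would go here
    }
}
```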

The overlord thread dump contains one interesting stack trace:

```
forking-task-runner-1 - priority:5 - threadId:0x00007efd4403d000 - nativeId:0x7e - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
    at java.io.FileInputStream.readBytes(Native Method)
    at java.io.FileInputStream.read(FileInputStream.java:255)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    - locked <0x00000000ebcff120> (a java.lang.UNIXProcess$ProcessPipeInputStream)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at com.google.common.io.ByteStreams.copy(ByteStreams.java:175)
    at io.druid.indexing.overlord.ForkingTaskRunner$1.call(ForkingTaskRunner.java:379)
    at io.druid.indexing.overlord.ForkingTaskRunner$1.call(ForkingTaskRunner.java:179)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```

As you can see, the ByteStreams.copy() method is called from ForkingTaskRunner. It copies the logs from an InputStream to a file, and that input stream belongs to the java.lang.Process object that holds the reference to the peon child JVM.

In theory ByteStreams.copy() should finish when the peon completes its job, but that doesn't happen: it's stuck.
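For illustration, the log-copying pattern looks roughly like this (a simplified sketch, not the actual ForkingTaskRunner code; the forked command and log path are placeholders):

```java
import com.google.common.io.ByteStreams;

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class PeonLogCopy {
    public static void main(String[] args) throws Exception {
        // Fork a child process (stand-in for the peon command shown in the overlord
        // log), merging its stderr into stdout so one pipe carries all of its output.
        Process peon = new ProcessBuilder("java", "-version")
                .redirectErrorStream(true)
                .start();

        // ByteStreams.copy() only returns once the InputStream reports end-of-stream;
        // until then the copying thread sits blocked in read(), which matches the
        // FileInputStream.readBytes frame visible in the thread dump above.
        // On a plain Unix pipe, end-of-stream is only reached after every holder of
        // the pipe's write end has closed it.
        try (InputStream peonOutput = peon.getInputStream();
             OutputStream logFile = new FileOutputStream("/tmp/peon-log-copy-example.log")) {
            ByteStreams.copy(peonOutput, logFile);
        }

        // In this sketch the exit status is only read after the copy completes.
        int status = peon.waitFor();
        System.out.println("peon exited with status " + status);
    }
}
```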

Overlord logs regarding this task:

```
[INFO] 2017-01-20 02:42:18,675 io.druid.indexing.overlord.HeapMemoryTaskStorage info - Inserting task index_cfu_20170120_024555203UTC with status: TaskStatus{id=index_cfu_20170120_024555203UTC, status=RUNNING, duration=-1}
[INFO] 2017-01-20 02:42:18,675 io.druid.indexing.overlord.TaskLockbox info - Adding task[index_cfu_20170120_024555203UTC] to activeTasks
[INFO] 2017-01-20 02:42:18,676 io.druid.indexing.common.actions.LocalTaskActionClient info - Performing action for task[index_cfu_20170120_024555203UTC]: LockTryAcquireAction{interval=2016-01-02T00:00:00.000Z/2016-01-03T00:00:00.000Z}
[INFO] 2017-01-20 02:42:18,676 io.druid.indexing.overlord.TaskLockbox info - Created new TaskLockPosse: TaskLockPosse{taskLock=TaskLock{groupId=index_cfu_20170120_024555203UTC, dataSource=cfu, interval=2016-01-02T00:00:00.000Z/2016-01-03T00:00:00.000Z, version=2017-01-20T02:42:18.676Z}, taskIds=}
[INFO] 2017-01-20 02:42:18,676 io.druid.indexing.overlord.TaskLockbox info - Added task[index_cfu_20170120_024555203UTC] to TaskLock[index_cfu_20170120_024555203UTC]
[INFO] 2017-01-20 02:42:18,676 io.druid.indexing.overlord.TaskQueue info - Asking taskRunner to run: index_cfu_20170120_024555203UTC
[INFO] 2017-01-20 02:42:18,680 io.druid.indexing.overlord.ForkingTaskRunner info - Running command: java -cp /opt/druid/druid-0.9-SNAPSHOT.jar -server -Xmx1G -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC -Ddruid.indexer.task.baseTaskDir=/var/druid/indexer/tasks -Ddruid.host=172.31.11.208:31040 -Ddruid.discovery.curator.path=/druid/discovery -Ddruid.processing.buffer.sizeBytes=100000000 -Ddruid.storage.cassandra.keyspace=druid -Ddruid.storage.cassandra.replication=‘class’ : ‘SimpleStrategy’, ‘replication_factor’ : 1 -Ddruid.metadata.storage.cassandra.replication=‘class’ : ‘SimpleStrategy’, ‘replication_factor’ : 1 -Ddruid.emitter=logging -Ddruid.metadata.storage.cassandra.keyspace=druid -Duser.timezone=UTC -Dfile.encoding.pkg=sun.io -Ddruid.metadata.storage.cassandra.contactPoints=172.31.21.32:31229 -Ddruid.selectors.indexing.serviceName=overlord -Ddruid.indexer.task.baseDir=/var/druid/indexer -Ddruid.monitoring.emissionPeriod=PT60m -Ddruid.indexer.queue.startDelay=PT0M -Ddruid.metadata.storage.connector.createTables=true -Ddruid.port=31040 -Ddruid.worker.capacity=2 -Ddruid.storage.cassandra.contactPoints=172.31.21.32:31229 -Ddruid.storage.cassandra.password=cassandra -Ddruid.indexer.runner.javaCommand=java -Ddruid.service=overlord -Ddruid.metadata.storage.type=cassandra -Ddruid.metadata.storage.cassandra.password=cassandra -Ddruid.metadata.storage.connector.connectURI=not_applicable_for_cassandra -Djava.io.tmpdir=/tmp -Ddruid.metadata.storage.cassandra.username=cassandra -Ddruid.indexer.runner.startPort=31041 -Ddruid.storage.cassandra.username=cassandra -Ddruid.processing.numThreads=4 -Ddruid.zk.service.host=172.31.21.32:31407 -Ddruid.indexer.logs.directory=/var/druid/indexer/logs -Ddruid.properties.file=/etc/druid/overlord/overlord.properties -Ddruid.zk.paths.base=/druid -Dfile.encoding=UTF-8 -Ddruid.storage.type=cassandra -Ddruid.metrics.emitter.dimension.dataSource=cfu -Ddruid.metrics.emitter.dimension.taskId=index_cfu_20170120_024555203UTC -Ddruid.host=172.31.11.208 -Ddruid.port=31041 io.druid.cli.Main internal peon /var/druid/indexer/tasks/index_cfu_20170120_024555203UTC/task.json /var/druid/indexer/tasks/index_cfu_20170120_024555203UTC/f110a537-e0f4-4446-ba89-dc95e4fdbd2f/status.json
[INFO] 2017-01-20 02:42:18,690 io.druid.indexing.overlord.ForkingTaskRunner info - Logging task index_cfu_20170120_024555203UTC output to: /var/druid/indexer/tasks/index_cfu_20170120_024555203UTC/log
[INFO] 2017-01-20 02:42:25,742 io.druid.indexing.common.actions.LocalTaskActionClient info - Performing action for task[index_cfu_20170120_024555203UTC]: LockTryAcquireAction{interval=2016-01-02T00:00:00.000Z/2016-01-03T00:00:00.000Z}
[INFO] 2017-01-20 02:42:25,742 io.druid.indexing.overlord.TaskLockbox info - Task[index_cfu_20170120_024555203UTC] already present in TaskLock[index_cfu_20170120_024555203UTC]
[INFO] 2017-01-20 02:42:25,828 io.druid.indexing.common.actions.LocalTaskActionClient info - Performing action for task[index_cfu_20170120_024555203UTC]: LockListAction{}
[INFO] 2017-01-20 02:42:27,535 io.druid.indexing.common.actions.LocalTaskActionClient info - Performing action for task[index_cfu_20170120_024555203UTC]: SegmentInsertAction{segments=[DataSegment{size=37170, shardSpec=HashBasedNumberedShardSpec{partitionNum=0, partitions=1}, metrics=[count, Volume, Volume_EUR, Volume_USD, PositionQty, PositionValue, PositionValue_EUR, PositionValue_USD, Match_Qty, Match_ValueEUR, Match_ValueUSD, Spread, DailyPL, DailyPL_EUR, DailyPL_USD, NDTA_Value, NDTA_Value_EUR, NDTA_Value_USD, Receivables_Value, Receivables_Value_USD, Receivables_Value_EUR], dimensions=[Region, Source_Region, Use_Region, BizDay, Cusip, SeId, Sedol, Isin, Ticker, SecurityDescription, SecClass, LiquidityBucket, SubBucket, MktSegment, MktSector, SecIssueCountry, SecExchangeCountry, Classification, SecPrice, SecPriceCurrency, Source_HighLevel, Source, Source_PB_CommitmentTerm, Source_LegalEntity, Use_HighLevel, Use, Use_PB_CommitmentTerm, Use_LegalEntity, ValueCurrency, Source_Customer_ParentID, Source_Customer_ParentName, Use_Customer_ParentID, Use_Customer_ParentName, Source_BusinessArea, Source_ManagementArea, Source_MgmtArea_Name, Use_BusinessArea, Use_ManagementArea, Use_MgmtArea_Name, SourceTradingDeskCode, UseTradingDeskCode], version=‘2017-01-20T02:42:18.676Z’, loadSpec={type=cassandra, key=19b49dcd-d369-46ad-96ab-acae324749a1, chunks_count=1}, interval=2016-01-02T00:00:00.000Z/2016-01-02T00:01:00.000Z, dataSource=‘cfu’, binaryVersion=‘9’}]}

```

Does anyone have any idea why the peon process can get stuck like this?