Real-time index tasks failing with MappableContainerException

Hi all!

I am running into a production issue with some of my real-time index tasks. Since 11 am this morning, some of them have been failing and I can't figure out why.

I can see the following exception in some of my task logs:

SEVERE: The exception contained within MappableContainerException could not be mapped to a response, re-throwing to the HTTP container
com.fasterxml.jackson.databind.JsonMappingException: Early EOF (through reference chain: java.util.ArrayList[209])

I am not sure this is the root cause of the problem, but I cannot see anything else that looks like an error in my task logs. Also, on the overlord UI the status of the task is "failed", but I cannot find that status anywhere in the task logs. Usually they end with something like:

Task completed with status: {
  "id" : "index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm",
  "status" : "SUCCESS",
  "duration" : 8692022
}

Another interesting thing is that I also have some tasks that reported a failed status to my overlord even though their task logs say they completed with a successful status.

It looks to me like the tasks effectively failed, as I am missing data for some hours today.

Would you have any idea why this is happening? I have attached my task logs if you want to take a look: one with a success status in the task logs (but a failed status sent to the overlord) and one with no status at all and a MappableContainerException.

Thanks for your help!

index_task_logs_success_status.txt (527 KB)

index_task_logs_with_no_status.txt (1.42 MB)

This part of the logs actually looks really interesting to us:

task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm]
java.io.IOException: Scary HTTP status returned: 500 Server Error. Check your overlord[ec2-54-91-252-223.compute-1.amazonaws.com:8080] logs for exceptions.
	at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:119) [druid-indexing-service-0.7.0.jar:0.7.0]
	at io.druid.indexing.common.TaskToolbox.pushSegments(TaskToolbox.java:202) [druid-indexing-service-0.7.0.jar:0.7.0]
	at io.druid.indexing.common.task.RealtimeIndexTask$TaskActionSegmentPublisher.publishSegment(RealtimeIndexTask.java:351) [druid-indexing-service-0.7.0.jar:0.7.0]
	at io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:452) [druid-server-0.7.0.jar:0.7.0]
	at io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:40) [druid-common-0.7.0.jar:0.7.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_80]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_80]
	at java.lang.Thread.run(Thread.java:745) [?:1.7.0_80]
2015-09-30T14:25:00,796 ERROR [rtb_user_sync_statistics-2015-09-30T12:00:00.000-07:00-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[rtb_user_sync_statistics]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.IOException, exceptionMessage=Scary HTTP status returned: 500 Server Error. Check your overlord[ec2-54-91-252-223.compute-1.amazonaws.com:8080] logs for exceptions., interval=2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00}
java.io.IOException: Scary HTTP status returned: 500 Server Error. Check your overlord[ec2-54-91-252-223.compute-1.amazonaws.com:8080] logs for exceptions.
	at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:119) ~[druid-indexing-service-0.7.0.jar:0.7.0]
	at io.druid.indexing.common.TaskToolbox.pushSegments(TaskToolbox.java:202) ~[druid-indexing-service-0.7.0.jar:0.7.0]
	at io.druid.indexing.common.task.RealtimeIndexTask$TaskActionSegmentPublisher.publishSegment(RealtimeIndexTask.java:351) ~[druid-indexing-service-0.7.0.jar:0.7.0]
	at io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:452) [druid-server-0.7.0.jar:0.7.0]
	at io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:40) [druid-common-0.7.0.jar:0.7.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_80]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_80]
	at java.lang.Thread.run(Thread.java:745) [?:1.7.0_80]
2015-09-30T14:25:00,800 INFO [rtb_user_sync_statistics-2015-09-30T12:00:00.000-07:00-persist-n-merge] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2015-09-30T14:25:00.797-07:00","service":"druid/prod/worker","host":"ec2-54-146-229-235.compute-1.amazonaws.com:8083","severity":"component-failure","description":"Failed to persist merged index[rtb_user_sync_statistics]","data":{"class":"io.druid.segment.realtime.plumber.RealtimePlumber","exceptionType":"java.io.IOException","exceptionMessage":"Scary HTTP status returned: 500 Server Error. Check your overlord[ec2-54-91-252-223.compute-1.amazonaws.com:8080] logs for exceptions.","exceptionStackTrace":"java.io.IOException: Scary HTTP status returned: 500 Server Error. Check your overlord[ec2-54-91-252-223.compute-1.amazonaws.com:8080] logs for exceptions.\n\tat io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:119)\n\tat io.druid.indexing.common.TaskToolbox.pushSegments(TaskToolbox.java:202)\n\tat io.druid.indexing.common.task.RealtimeIndexTask$TaskActionSegmentPublisher.publishSegment(RealtimeIndexTask.java:351)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:452)\n\tat io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:40)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat java.lang.Thread.run(Thread.java:745)\n","interval":"2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00"}}]

It looks like the task cannot push its segment to S3 even after multiple retries. Could you confirm that's the issue?
Also, how does Druid push segments into S3? Is it an API call?
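
For reference, this datasource uses Druid's S3 deep storage, which (as far as I understand) is configured with properties roughly like the following on our workers. The bucket and base key are taken from the loadSpec in the log above; the credentials are redacted:

druid.storage.type=s3
druid.storage.bucket=gumgum-druid
druid.storage.baseKey=prod-realtime
druid.s3.accessKey=<redacted>
druid.s3.secretKey=<redacted>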

We are also wondering why the task is marked as successful by the worker even though the segment was not pushed to S3. The overlord also seems to get a successful status from the worker:

com.metamx.common.ISE: Segments not covered by locks for task: index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm
2015-09-30T14:25:00,812 INFO [qtp557493895-49] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm]: LockReleaseAction{interval=2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00}
2015-09-30T14:25:00,812 ERROR [qtp557493895-49] io.druid.indexing.overlord.TaskLockbox - Lock release without acquire: {class=io.druid.indexing.overlord.TaskLockbox, task=index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm, interval=2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00}
2015-09-30T14:25:00,812 INFO [qtp557493895-49] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2015-09-30T14:25:00.812-07:00","service":"druid/prod/indexer","host":"ec2-54-91-252-223.compute-1.amazonaws.com:8080","severity":"component-failure","description":"Lock release without acquire","data":{"class":"io.druid.indexing.overlord.TaskLockbox","task":"index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm","interval":"2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00"}}]
2015-09-30T14:25:01,299 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ec2-54-146-229-235.compute-1.amazonaws.com:8080] wrote SUCCESS status for task: index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm
2015-09-30T14:25:01,299 WARN [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ec2-54-146-229-235.compute-1.amazonaws.com:8080] announced a status for a task I didn’t know about, adding to runningTasks: index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm
2015-09-30T14:25:01,299 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ec2-54-146-229-235.compute-1.amazonaws.com:8080] completed task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] with status[SUCCESS]
2015-09-30T14:25:27,557 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] on worker[ec2-54-146-229-235.compute-1.amazonaws.com:8080]
2015-09-30T14:25:27,559 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] went bye bye.

But the task is marked as failed on the overlord UI. How is that possible?
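
If it helps, querying the task status directly from the overlord API (assuming the standard route is what the console reads from) also seems like a reasonable cross-check:

curl http://ec2-54-91-252-223.compute-1.amazonaws.com:8080/druid/indexer/v1/task/index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm/status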

I have attached the overlord logs related to this task in case it helps!

overlord_logs.txt (26.3 KB)

Hi Guillaume,

This log line:

“2015-09-30T14:25:00,796 java.io.IOException: Scary HTTP status returned: 500 Server Error. Check your overlord[ec2-54-91-252-223.compute-1.amazonaws.com:8080] logs for exceptions.”

indicates which overlord's logs to check. There should be an exception in there around 2015-09-30T14:25:00,796.
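
Something along these lines should surface it, assuming the overlord logs to a file (adjust the path to wherever your overlord writes its log):

grep -C 20 '2015-09-30T14:25:00' overlord.log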

Hi Fangjin,

Thanks for your answer. If you look at the overlord logs I posted, it looks like the overlord tries to release the task lock without having acquired it. It's the only error I can see on the overlord around 2015-09-30T14:25:00 for this task:

2015-09-30T14:25:00,812 INFO [qtp557493895-49] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2015-09-30T14:25:00.812-07:00","service":"druid/prod/indexer","host":"ec2-54-91-252-223.compute-1.amazonaws.com:8080","severity":"component-failure","description":"Lock release without acquire","data":{"class":"io.druid.indexing.overlord.TaskLockbox","task":"index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm","interval":"2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00"}}]

2015-09-30T14:25:00,812 ERROR [qtp557493895-49] io.druid.indexing.overlord.TaskLockbox - Lock release without acquire: {class=io.druid.indexing.overlord.TaskLockbox, task=index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm, interval=2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00}

Then it looks like the worker reports a SUCCESS status for the task, but the overlord doesn't know about the task:

2015-09-30T14:25:01,299 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ec2-54-146-229-235.compute-1.amazonaws.com:8080] wrote SUCCESS status for task: index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm

2015-09-30T14:25:01,299 WARN [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ec2-54-146-229-235.compute-1.amazonaws.com:8080] announced a status for a task I didn’t know about, adding to runningTasks: index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm

2015-09-30T14:25:01,299 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ec2-54-146-229-235.compute-1.amazonaws.com:8080] completed task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] with status[SUCCESS]

2015-09-30T14:25:27,557 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] on worker[ec2-54-146-229-235.compute-1.amazonaws.com:8080]

2015-09-30T14:25:27,559 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] went bye bye.

Looking at the overlord logs, I can see it got a FAILED status for the task around 2015-09-30T12:05:

2015-09-30T12:00:09,192 INFO [qtp557493895-39] io.druid.indexing.overlord.TaskLockbox - Added task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] to TaskLock[index_realtime_rtb_user_sync_statistics]
2015-09-30T12:00:09,192 INFO [qtp557493895-39] io.druid.indexing.overlord.MetadataTaskStorage - Adding lock on interval[2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00] version[2015-09-30T12:00:09.192-07:00] for task: index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm
2015-09-30T12:00:09,333 INFO [qtp557493895-46] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm]: LockAcquireAction{interval=2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00}
2015-09-30T12:00:09,381 INFO [qtp557493895-46] io.druid.indexing.overlord.TaskLockbox - Task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] already present in TaskLock[index_realtime_rtb_user_sync_statistics]
2015-09-30T12:05:06,581 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - [ec2-54-146-229-235.compute-1.amazonaws.com:8080]: Found [index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] running
2015-09-30T12:05:06,679 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Failing task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm]
2015-09-30T12:05:06,680 INFO [PathChildrenCache-0] io.druid.indexing.overlord.TaskQueue - Received FAILED status for task: index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm
2015-09-30T12:05:06,682 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Can't shutdown! No worker running task index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm
2015-09-30T12:05:06,683 INFO [PathChildrenCache-0] io.druid.indexing.overlord.MetadataTaskStorage - Updating task index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm to status: TaskStatus{id=index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm, status=FAILED, duration=-1}
2015-09-30T12:05:06,691 INFO [PathChildrenCache-0] io.druid.indexing.overlord.TaskLockbox - Removing task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] from TaskLock[index_realtime_rtb_user_sync_statistics]
2015-09-30T12:05:06,700 INFO [PathChildrenCache-0] io.druid.indexing.overlord.TaskQueue - Task done: RealtimeIndexTask{id=index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm, type=index_realtime, dataSource=rtb_user_sync_statistics}
2015-09-30T12:05:06,700 INFO [PathChildrenCache-0] io.druid.indexing.overlord.TaskQueue - Task FAILED: RealtimeIndexTask{id=index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm, type=index_realtime, dataSource=rtb_user_sync_statistics} (-1 run duration) 

2015-09-30T13:10:00,584 INFO [qtp557493895-63] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm]: SegmentInsertAction{segments=[DataSegment{size=2224, shardSpec=LinearShardSpec{partitionNum=0}, metrics=[received, received_valid_id], dimensions=[bidder_id], version='2015-09-30T12:00:09.192-07:00', loadSpec={type=s3_zip, bucket=gumgum-druid, key=prod-realtime/rtb_user_sync_statistics/2015-09-30T12:00:00.000-07:00_2015-09-30T13:00:00.000-07:00/2015-09-30T12:00:09.192-07:00/0/index.zip}, interval=2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00, dataSource='rtb_user_sync_statistics', binaryVersion='9'}]}

I guess that's why the overlord UI says this task failed. But I still don't get why the overlord received a FAILED status for the task around 2015-09-30T12:05 even though the task kept running and, at the end, reported a success status that the overlord did not acknowledge because it no longer knew about the task.

What caused the first failure? I have attached my task logs if you want to take a look. The only thing I can see is a reconnection to ZooKeeper around 2015-09-30T12:05, but I don't think that caused any issue, and it happened slightly after the overlord got the FAILED status for the task.

Do you have any idea what happened here?

Thanks for your help!

index_task_logs_success_status.txt (527 KB)

Can you post the full overlord logs? There is likely an explicit exception I'm not seeing in the snippet you posted, and it should occur before the lock release warnings.

I have attached the full overlord logs. Thanks for looking into this Fangjin!

xab.tar.gz (3.09 MB)

These are not the full overlord logs; this is only a snippet of them.

Jona,

Have you ever figured out the issue of

com.fasterxml.jackson.databind.JsonMappingException: Early EOF (through reference chain: java.util.ArrayList[209])?


I am seeing the same issue on my server and am trying to understand what could be causing it.

Thx,
--Roman


Can you post the full log?

Er, the full stack trace.