Container killed by the ApplicationMaster. Container killed on request. Exit code is 143

I have started having trouble getting stand-alone Hadoop indexing tasks to finish. All map tasks for the first job succeed, but each one finishes with the message: “Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143”. All containers are killed this way, and the reduce phase never begins. That would suggest an error in the map phase, but the map logs appear to complete normally.

Here is an example of the container log:

```
2015-06-25 19:43:56,773 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:20 ScheduledMaps:95 ScheduledReds:0 AssignedMaps:4 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:5 ContRel:0 HostLocal:0 RackLocal:5
2015-06-25 19:43:57,211 INFO [IPC Server handler 17 on 55176] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1435260232430_0001_m_000004_0 is : 1.0
2015-06-25 19:43:57,214 INFO [IPC Server handler 18 on 55176] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement from attempt_1435260232430_0001_m_000004_0
2015-06-25 19:43:57,215 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1435260232430_0001_m_000004_0 TaskAttempt Transitioned from RUNNING to SUCCESS_CONTAINER_CLEANUP
2015-06-25 19:43:57,215 INFO [ContainerLauncher #6] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1435260232430_0001_01_000007 taskAttempt attempt_1435260232430_0001_m_000004_0
2015-06-25 19:43:57,216 INFO [ContainerLauncher #6] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1435260232430_0001_m_000004_0
2015-06-25 19:43:57,218 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1435260232430_0001_m_000004_0 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2015-06-25 19:43:57,218 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1435260232430_0001_m_000004_0
2015-06-25 19:43:57,219 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1435260232430_0001_m_000004 Task Transitioned from RUNNING to SUCCEEDED
2015-06-25 19:43:57,219 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 2
2015-06-25 19:43:57,773 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:20 ScheduledMaps:95 ScheduledReds:0 AssignedMaps:4 AssignedReds:0 CompletedMaps:2 CompletedReds:0 ContAlloc:5 ContRel:0 HostLocal:0 RackLocal:5
2015-06-25 19:43:57,777 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1
2015-06-25 19:43:57,777 INFO [RMCommunicator Allocator] org.apache.hadoop.yarn.util.RackResolver: Resolved ip-10-12-1-163.ec2.internal to /default-rack
2015-06-25 19:43:57,777 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1435260232430_0001_01_000008 to attempt_1435260232430_0001_m_000005_0
2015-06-25 19:43:57,777 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=1536
2015-06-25 19:43:57,777 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 5
2015-06-25 19:43:57,777 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:20 ScheduledMaps:94 ScheduledReds:0 AssignedMaps:5 AssignedReds:0 CompletedMaps:2 CompletedReds:0 ContAlloc:6 ContRel:0 HostLocal:0 RackLocal:6
2015-06-25 19:43:57,777 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved ip-10-12-1-163.ec2.internal to /default-rack
2015-06-25 19:43:57,778 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1435260232430_0001_m_000005_0 TaskAttempt Transitioned from UNASSIGNED to ASSIGNED
2015-06-25 19:43:57,779 INFO [ContainerLauncher #7] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_LAUNCH for container container_1435260232430_0001_01_000008 taskAttempt attempt_1435260232430_0001_m_000005_0
2015-06-25 19:43:57,779 INFO [ContainerLauncher #7] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Launching attempt_1435260232430_0001_m_000005_0
2015-06-25 19:43:57,781 INFO [ContainerLauncher #7] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Shuffle port returned by ContainerManager for attempt_1435260232430_0001_m_000005_0 : 13562
2015-06-25 19:43:57,782 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: TaskAttempt: [attempt_1435260232430_0001_m_000005_0] using containerId: [container_1435260232430_0001_01_000008 on NM: [ip-10-12-1-163.ec2.internal:9103]
2015-06-25 19:43:57,782 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1435260232430_0001_m_000005_0 TaskAttempt Transitioned from ASSIGNED to RUNNING
2015-06-25 19:43:57,782 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1435260232430_0001_m_000005 Task Transitioned from SCHEDULED to RUNNING
2015-06-25 19:43:58,590 INFO [IPC Server handler 16 on 55176] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1435260232430_0001_m_000001_0 is : 0.59880215
2015-06-25 19:43:58,781 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1435260232430_0001: ask=3 release= 0 newContainers=0 finishedContainers=1 resourcelimit=<memory:4608, vCores:-5> knownNMs=2
2015-06-25 19:43:58,781 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1435260232430_0001_01_000007
2015-06-25 19:43:58,781 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=4608
2015-06-25 19:43:58,781 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 5
2015-06-25 19:43:58,781 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:20 ScheduledMaps:94 ScheduledReds:0 AssignedMaps:4 AssignedReds:0 CompletedMaps:2 CompletedReds:0 ContAlloc:6 ContRel:0 HostLocal:0 RackLocal:6
2015-06-25 19:43:58,781 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1435260232430_0001_m_000004_0: Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
```

After a few more tests, it seems that the “container killed” message is independent of whether the job succeeds or not, so I suspect exit code 143 is just the SIGTERM (128 + 15) the container receives when the ApplicationMaster reclaims it after the task has already succeeded, rather than a real failure.

On some submissions, “determine-partitions-groupby” will finish (both map and reduce), only to have “determine-partitions-dimselection” get stuck (100% map, 0% reduce). On other submissions, it gets stuck during “determine-partitions-groupby” (again 100% map, 0% reduce). Regardless of where things stop, most map tasks complete successfully with the container killed message.

When things stop, all reduce tasks are in state SCHEDULED.

I have been able to resolve this issue by increasing the size of my EMR cluster. It is unclear to me whether the “stuck” job would have eventually succeeded on the smaller cluster, or if it was indeed stuck.
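For what it's worth, here is the rough arithmetic I am now doing when deciding whether the cluster is big enough, in case my mental model of how YARN places containers is off and someone can correct it. The numbers below are illustrative placeholders, not my real settings; the property names in the comments are just the knobs I mean.

```python
# Rough sizing check for a small MapReduce-on-YARN cluster. All values are
# made-up placeholders standing in for yarn.nodemanager.resource.memory-mb,
# yarn.app.mapreduce.am.resource.mb, mapreduce.map.memory.mb and
# mapreduce.reduce.memory.mb.

nodes = 2
node_memory_mb = 5632      # YARN memory offered by each node
am_memory_mb = 1536        # MRAppMaster container
map_memory_mb = 1536       # each map container request
reduce_memory_mb = 3072    # each reduce container request

cluster_memory_mb = nodes * node_memory_mb

# A reduce container can only ever be placed if it fits on a single node...
reduce_fits_on_a_node = reduce_memory_mb <= node_memory_mb

# ...and, once the ApplicationMaster is running, there has to be enough
# headroom left in the cluster to actually start it.
headroom_mb = cluster_memory_mb - am_memory_mb
max_concurrent_maps = headroom_mb // map_memory_mb

print(f"cluster memory:              {cluster_memory_mb} MB")
print(f"headroom after AM:           {headroom_mb} MB")
print(f"maps that can run at once:   {max_concurrent_maps}")
print(f"reduce fits on a single node: {reduce_fits_on_a_node}")
```

If the reduce request were larger than what any single node offers to YARN, I assume the request would simply never be satisfied and the reduce tasks would sit in SCHEDULED indefinitely, which looks a lot like what I am seeing on the smaller cluster.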