Failing subtasks without any logs

Hi,

I’m trying to migrate my old data sources into a new one (merging 3-4 data sources into 1). For some sources, reindexing works as expected; others threw an OOM (maxDirectMemory) error, and some failed without any error at all. I’m trying to understand the latter.

Attached is my task spec, where I’m trying to migrate data from the “events_app1” data source into “events” while injecting an additional field (orgId) into the table. While this spec worked for many apps, it doesn’t work for some. The parent task and a child task are submitted successfully, but since the child task dies immediately, 2 more child tasks are spawned, and they end up failing too.
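For anyone reading along, the spec is roughly of the following shape (a minimal sketch reconstructed from the description above rather than the attached file; the dimension names, the interval and the constant orgId value are placeholders):

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "timestampSpec": { "column": "__time", "format": "millis" },
      "dimensionsSpec": { "dimensions": ["orgId", "userId", "eventType"] },
      "transformSpec": {
        "transforms": [
          { "type": "expression", "name": "orgId", "expression": "'app1'" }
        ]
      },
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "NONE",
        "rollup": false
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "app1_events",
        "interval": "2020-04-01/2020-06-01"
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 2
    }
  }
}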

Below you can see the error I get for the submitted (parent) task. There are no logs at all for the child tasks. I also checked the coordinator, overlord and middleManager logs but couldn’t find anything relevant.

Any ideas about how to enable logs for child tasks and what might be the reason for this behaviour?

WindowedSegmentId{segmentId='app1_events_2020-04-13T13:00:00.000Z_2020-04-13T14:00:00.000Z_2020-04-13T16:33:48.071Z', intervals=[2020-04-13T13:00:00.000Z/2020-04-13T14:00:00.000Z]}, WindowedSegmentId{segmentId='app1_events_2020-05-04T12:00:00.000Z_2020-05-04T13:00:00.000Z_2020-05-04T11:22:41.715Z_2', intervals=[2020-05-04T12:00:00.000Z/2020-05-04T13:00:00.000Z]}]}]
2020-05-15T07:45:46,942 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Submitted a new task[single_phase_sub_task_events_events_ipangpal_2020-05-15T07:45:46.776Z] for spec[index_parallel_events_events_cegnibin_2020-05-15T07:45:38.318Z_0]
2020-05-15T07:45:47,043 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Waiting for subTasks to be completed
2020-05-15T07:45:47,781 WARN [task-monitor-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - task[single_phase_sub_task_events_events_ipangpal_2020-05-15T07:45:46.776Z] failed!
2020-05-15T07:45:47,781 INFO [task-monitor-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - We still have more chances[1/3] to process the spec[index_parallel_events_events_cegnibin_2020-05-15T07:45:38.318Z_0].
2020-05-15T07:45:47,880 INFO [task-monitor-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Submitted a new task[single_phase_sub_task_events_events_dhgpmndj_2020-05-15T07:45:47.781Z] for retrying spec[index_parallel_events_events_cegnibin_2020-05-15T07:45:38.318Z_0]
2020-05-15T07:45:48,770 WARN [task-monitor-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - task[single_phase_sub_task_events_events_dhgpmndj_2020-05-15T07:45:47.781Z] failed!
2020-05-15T07:45:48,770 INFO [task-monitor-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - We still have more chances[2/3] to process the spec[index_parallel_events_events_cegnibin_2020-05-15T07:45:38.318Z_0].
2020-05-15T07:45:48,850 INFO [task-monitor-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Submitted a new task[single_phase_sub_task_events_events_ahhfnoch_2020-05-15T07:45:48.770Z] for retrying spec[index_parallel_events_events_cegnibin_2020-05-15T07:45:38.318Z_0]
2020-05-15T07:45:49,770 WARN [task-monitor-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - task[single_phase_sub_task_events_events_ahhfnoch_2020-05-15T07:45:48.770Z] failed!
2020-05-15T07:45:49,771 ERROR [task-monitor-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - spec[index_parallel_events_events_cegnibin_2020-05-15T07:45:38.318Z_0] failed after [3] tries
2020-05-15T07:45:49,774 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Failed because of the failed sub task[single_phase_sub_task_events_events_ahhfnoch_2020-05-15T07:45:48.770Z]
2020-05-15T07:45:49,774 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Cleaning up resources
2020-05-15T07:45:49,775 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Stopped taskMonitor
2020-05-15T07:45:49,780 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_parallel_events_events_cegnibin_2020-05-15T07:45:38.318Z",
  "status" : "FAILED",
  "duration" : 4842,
  "errorMsg" : null,
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
  }
}
2020-05-15T07:45:49,789 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2020-05-15T07:45:49,792 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]
2020-05-15T07:45:49,797 INFO [main] org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@6bdedbbd{HTTP/1.1,[http/1.1]}{0.0.0.0:8130}
2020-05-15T07:45:49,798 INFO [main] org.eclipse.jetty.server.session - node0 Stopped scavenging
2020-05-15T07:45:49,800 INFO [main] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@2cdb5974{/,null,UNAVAILABLE}
2020-05-15T07:45:49,804 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [NORMAL]
2020-05-15T07:45:49,805 INFO [main] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting graceful shutdown of task[index_parallel_events_events_cegnibin_2020-05-15T07:45:38.318Z].
2020-05-15T07:45:49,807 INFO [main] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Cleaning up resources

Thanks!

example.json (2.17 KB)

Hi Alper Kanat,
What I have come across is: if memory is not available on the node, the OS will kill the JVM (OOM killer). In that case there will not be any trace in the logs of why it failed.

Thanks

Hi Tijo,

Thanks for your response.

Actually, there are 5 middleManagers, each with 128 GB of RAM. RAM usage per server seems to be around 70 GB. Each MM has around a 4 GB heap and is nowhere near it; each instance seems to use ~1 GB. Peon heap, on the other hand, is 1 GB (with 1 GB max direct memory). But in that case I would usually see an OOM log message.

What is more interesting is that when I changed maxNumConcurrentSubTasks from 2 to 1, the task succeeded (the tuningConfig change is sketched after the log excerpt below). Compared to other data sources, the one I’m migrating is relatively small, yet it took 10 minutes to migrate (according to the task status). Also, the logs are full of messages like the ones below:

2020-05-15T12:52:06,199 WARN [task-runner-0-priority-0] org.apache.druid.indexing.input.DruidSegmentInputEntity - Could not clean temporary segment file: var/druid/task/index_parallel_events_events_jeeddeip_2020-05-15T12:44:20.459Z/work/indexing-tmp/app1_events/2020-04-13T19:00:00.000Z_2020-04-13T20:00:00.000Z/2020-04-13T18:49:10.148Z/18
2020-05-15T12:52:06,199 WARN [task-runner-0-priority-0] org.apache.druid.indexing.input.DruidSegmentInputEntity - Could not clean temporary segment file: var/druid/task/index_parallel_events_events_jeeddeip_2020-05-15T12:44:20.459Z/work/indexing-tmp/app1_events/2020-04-13T20:00:00.000Z_2020-04-13T21:00:00.000Z/2020-04-13T19:54:32.193Z/0
2020-05-15T12:52:06,202 WARN [task-runner-0-priority-0] org.apache.druid.indexing.input.DruidSegmentInputEntity - Could not clean temporary segment file: var/druid/task/index_parallel_events_events_jeeddeip_2020-05-15T12:44:20.459Z/work/indexing-tmp/app1_events/2020-04-13T20:00:00.000Z_2020-04-13T21:00:00.000Z/2020-04-13T19:54:32.193Z/1
2020-05-15T12:52:06,205 WARN [task-runner-0-priority-0] org.apache.druid.indexing.input.DruidSegmentInputEntity - Could not clean temporary segment file: var/druid/task/index_parallel_events_events_jeeddeip_2020-05-15T12:44:20.459Z/work/indexing-tmp/app1_events/2020-04-13T20:00:00.000Z_2020-04-13T21:00:00.000Z/2020-04-13T19:54:32.193Z/2
2020-05-15T12:52:06,206 WARN [task-runner-0-priority-0] org.apache.druid.indexing.input.DruidSegmentInputEntity - Could not clean temporary segment file: var/druid/task/index_parallel_events_events_jeeddeip_2020-05-15T12:44:20.459Z/work/indexing-tmp/app1_events/2020-04-13T20:00:00.000Z_2020-04-13T21:00:00.000Z/2020-04-13T19:54:32.193Z/3
2020-05-15T12:52:06,206 WARN [task-runner-0-priority-0] org.apache.druid.indexing.input.DruidSegmentInputEntity - Could not clean temporary segment file: var/druid/task/index_parallel_events_events_jeeddeip_2020-05-15T12:44:20.459Z/work/indexing-tmp/app1_events/2020-04-13T20:00:00.000Z_2020-04-13T21:00:00.000Z/2020-04-13T19:54:32.193Z/4
2020-05-15T12:52:06,207 WARN [task-runner-0-priority-0] org.apache.druid.indexing.input.DruidSegmentInputEntity - Could not clean temporary segment file: var/druid/task/index_parallel_events_events_jeeddeip_2020-05-15T12:44:20.459Z/work/indexing-tmp/app1_events/2020-04-13T20:00:00.000Z_2020-04-13T21:00:00.000Z/2020-04-13T19:54:32.193Z/5
2020-05-15T12:52:06,207 WARN [task-runner-0-priority-0] org.apache.druid.indexing.input.DruidSegmentInputEntity - Could not clean temporary segment file: var/druid/task/index_parallel_events_events_jeeddeip_2020-05-15T12:44:20.459Z/work/indexing-tmp/app1_events/2020-04-13T20:00:00.000Z_2020-04-13T21:00:00.000Z/2020-04-13T19:54:32.193Z/6
2020-05-15T12:52:06,208 WARN [task-runner-0-priority-0] org.apache.druid.indexing.input.DruidSegmentInputEntity - Could not clean temporary segment file: var/druid/task/index_parallel_events_events_jeeddeip_2020-05-15T12:44:20.459Z/work/indexing-tmp/app1_events/2020-04-13T20:00:00.000Z_2020-04-13T21:00:00.000Z/2020-04-13T19:54:32.193Z/7
2020-05-15T12:52:06,208 WARN [task-runner-0-priority-0] org.apache.druid.indexing.input.DruidSegmentInputEntity - Could not clean temporary segment file: var/druid/task/index_parallel_events_events_jeeddeip_2020-05-15T12:44:20.459Z/work/indexing-tmp/app1_events/2020-04-13T20:00:00.000Z_2020-04-13T21:00:00.000Z/2020-04-13T19:54:32.193Z/8
2020-05-15T12:52:06,208 WARN [task-runner-0-priority-0] org.apache.druid.indexing.input.DruidSegmentInputEntity - Could not clean temporary segment file: var/druid/task/index_parallel_events_events_jeeddeip_2020-05-15T12:44:20.459Z/work/indexing-tmp/app1_events/2020-04-13T20:00:00.000Z_2020-04-13T21:00:00.000Z/2020-04-13T19:54:32.193Z/9
2020-05-15T12:52:06,210 WARN [task-runner-0-priority-0] org.apache.druid.indexing.input.DruidSegmentInputEntity - Could not clean temporary segment file: var/druid/task/index_parallel_events_events_jeeddeip_2020-05-15T12:44:20.459Z/work/indexing-tmp/app1_events/2020-04-13T21:00:00.000Z_2020-04-13T22:00:00.000Z/2020-04-13T21:00:08.806Z/0
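For reference, the only change was in the tuningConfig of the same spec, roughly like this (a sketch; everything else was left as before):

{
  "type": "index_parallel",
  "maxNumConcurrentSubTasks": 1
}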

Hi Alper,

As shown, the warning is about being unable to delete the temporary segment files created by the peon. You may want to check the permissions on that directory.

May I know where this task is trying to ingest the data from?

I feel you can play around with the indexing task configs, like maxBytesInMemory, to get rid of the heap OOM. May I know the size of the file being ingested, if this is a native index?
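For example, a tuningConfig along these lines would cap the in-memory state before the task persists to disk (the values here are only illustrative, not a recommendation for your cluster):

{
  "type": "index_parallel",
  "maxRowsInMemory": 100000,
  "maxBytesInMemory": 50000000
}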

Does this data source have more dimensions and measures compared to the other data sources?

You mentioned that, compared to other data sources, the one you’re migrating is relatively small but took 10 minutes to migrate. By migrating, do you mean reindexing from another data source, or natively indexing into Druid?

Hi Tijo,

All Druid components are running inside Docker Swarm containers using the official Druid image (0.18.0). I think the middleManager is trying to delete the temporary segments inside the Docker container? If so, a permission issue would affect everyone using this image, am I right?

I’m reindexing from one data source into another (actually merging multiple data sources, one at a time, into one).

The data sources ingest data from Kafka (realtime), and the Kafka topic has 150 partitions. I think I’ve incorrectly configured the segment granularity as HOUR, so I have a lot of small segments (~25k over ~2 months). The average segment size is 3 KB (total size: 80 MB) and the row count is ~300k.
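For reference, coarsening the segmentGranularity in the reindex spec’s granularitySpec would consolidate those tiny HOUR segments into fewer, larger ones; a minimal sketch (DAY is only an example value):

{
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "NONE",
  "rollup": false
}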
