Error in Batch Ingestion

Relates to Apache Druid 24.0.2

Hello,

In a batch ingestion job (Parquet files stored in HDFS) covering about 1,000 small files, I get this error in the subtasks:

“errorMsg”: “The worker that this task was assigned disappeared and did not report cleanup within timeout[PT15M]…”

The job eventually succeeds by starting new subtasks; however, I’m wondering where this timeout is set and what it means.

From the coordinator logs:

WARN [HttpRemoteTaskRunner-Worker-Cleanup-0] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Failing task because worker disappeared and did not report within cleanup timeout[PT15M].
INFO [HttpRemoteTaskRunner-Worker-Cleanup-0] org.apache.druid.indexing.overlord.MetadataTaskStorage - Updating task to status: TaskStatus{id=, status=FAILED, duration=-1, errorMsg=The worker that this task was assigned disappeared and did not report cleanup within timeout[PT15M]…}

For a bigger job, say about 100k small Parquet files, the index_parallel task fails with the same error and the individual subtasks don’t start.

Any pointers on what/how I should troubleshoot this issue?

Thanks in advance for your help!

What is maxNumConcurrentSubTasks set to in your ingestion spec, and how many worker slots do you have?

Hello,

maxNumConcurrentSubTasks = 10
workers = 6

In the payload I pass 40 subtasks, but I guess it only uses 10 since the number of files is not that big.

Can you reduce maxNumConcurrentSubTasks to 5 and check? My thinking is that tasks sit in a pending state until they time out.
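
Something like this in the tuningConfig of your ingestion spec (just a sketch; the values other than maxNumConcurrentSubTasks are illustrative and should be adapted to your spec):

"tuningConfig": {
  "type": "index_parallel",
  "maxNumConcurrentSubTasks": 5,
  "maxRowsInMemory": 1000000
}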

Hello,

No, they time out with the error after running for close to or over an hour. Nothing is in a pending state.

“errorMsg”: “The worker that this task was assigned disappeared and did not report cleanup within timeout[PT15M]…”

Nevertheless, I had already tried decreasing the number of subtasks as well as maxRowsInMemory. It doesn’t make any difference.

I’m just not sure which parameter supplies the “PT15M” value it uses before timing out. I thought the parameter was druid.indexer.runner.taskCleanupTimeout, so I changed that value to PT30M. However, the behavior is still the same.
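
For reference, this is the line I added (in the Overlord’s runtime.properties, assuming that is the right place for druid.indexer.runner.* settings):

druid.indexer.runner.taskCleanupTimeout=PT30M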

Can you paste the log file from the task here?

Also attach the MiddleManager and Overlord logs.

The error message from the subtask is only this:

“errorMsg”: “The worker that this task was assigned disappeared and did not report cleanup within timeout[PT15M]…”

From the overlord:

WARN [HttpRemoteTaskRunner-Worker-Cleanup-0] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Failing task because worker disappeared and did not report within cleanup timeout[PT15M].
INFO [HttpRemoteTaskRunner-Worker-Cleanup-0] org.apache.druid.indexing.overlord.MetadataTaskStorage - Updating task to status: TaskStatus{id=, status=FAILED, duration=-1, errorMsg=The worker that this task was assigned disappeared and did not report cleanup within timeout[PT15M]…}

I think I have a hunch about why this occurs. Whenever a subtask hits the direct memory limit configured for the MiddleManager peons, MaxDirectMemorySize, perhaps it terminates and stops reporting back to the Overlord, and hence the timeout error?
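
If that is the case, I suppose the fix on my side would be to give the peons more direct memory via the MiddleManager’s runtime.properties, something along these lines (just a sketch with example values; if I understand the docs correctly, MaxDirectMemorySize needs to cover roughly (druid.processing.numThreads + druid.processing.numMergeBuffers + 1) * druid.processing.buffer.sizeBytes):

druid.indexer.runner.javaOptsArray=["-server","-Xms1g","-Xmx1g","-XX:MaxDirectMemorySize=2g"]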