Overlord issuing Shutdown requests to Tasks with no Error messages

Hey Everyone,

We are on Druid v0.22.1 (recently upgraded from v0.19). We have been encountering failures during ingestion of datasources from AWS S3.
The ingestion uses single_dim partitioning, and for some reason the ingestion for multiple datasources keeps failing at the partial_index_generic_merge phase of the ingestion process.
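For context, the tuningConfig for these ingestions looks roughly like the sketch below (the dimension name and sizing values are illustrative placeholders, not our exact spec):

{
  "type": "index_parallel",
  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": {
      "type": "single_dim",
      "partitionDimension": "some_dimension",
      "targetRowsPerSegment": 5000000
    },
    "maxNumConcurrentSubTasks": 4,
    "totalNumMergeTasks": 10
  }
}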
Digging into the Overlord and MiddleManager logs, we can see that the task is set up and transitions to running status just fine; see the screenshots below of the logs belonging to the task.



But somewhere within that 12-second window (09:12:06 to 09:12:19 in the screenshot below), something happened and the Overlord added this task to the ShutDownTasks list.

Below is the log line from the screenshot:

Shutdown [partial_index_generic_merge_ffs_dataset_blue_bbdifihc_2022-09-14T09:11:18.524Z] because: [task is not in knownTaskIds[[partial_range_index_generate_ffs_postdate_metrics_monthly_summary_green_fieoaodb_2022-09-14T09:10:43.990Z
<Whole list of taskIDs from other jobs that failed because of the same reason>
, partial_index_generic_merge_ffs_dataset_blue_eadofjid_2022-09-14T09:10:58.833Z]]]

Below is a screenshot of the error around the suspicious time when the task was killed.

There seems to be an open socket connection error from the MiddleManager to the master; could that have been the cause of the issue?

Any help is appreciated.

P.S.: I do not have the machine logs with me, as the EC2 instances were terminated as part of our ETL process, so I had to attach screenshots from our logs.

Welcome @Vinith2704! Thanks for including the screenshots.

Can you share a bit about your cluster and configuration? Were you encountering any of these issues prior to upgrading?

Did anything change with these particular datasources?

Hopefully others will chime in, but based on the screenshot of the error, I wonder if it's a ZK/leader election issue?

Best,

Mark

Hi Vinith -

Without seeing the overlord logs, I can’t be confident, but a couple of similar cases were resolved by either

  1. increasing the number of merge tasks (totalNumMergeTasks), or
  2. increasing the size of the ZooKeeper znodes (maxZnodeBytes) for the file handles.

In those cases, the large number of intermediate files being written and then merged caused issues. In one case, #1 helped; in another, #2 did.
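To make #2 concrete: assuming the default remote task runner (where task payloads and statuses go through ZooKeeper), I believe the znode limit is raised via druid.indexer.runner.maxZnodeBytes in the Overlord (and MiddleManager) runtime.properties, roughly as sketched below; the value is only an example, not a recommendation:

# Example only: raise the max znode size from the 512 KiB default to 1 MiB
druid.indexer.runner.maxZnodeBytes=1048576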

Hi Mark,

Thanks for getting back. Please find the cluster information below.

Zookeeper Instances:
i3.large EC2 machines (2-core CPU, 15.25 GiB memory, 1 x 475 GB NVMe SSD storage, up to 10 Gigabit network performance).
Master Instances:
r5d.4xlarge (16-core CPU, 128 GiB memory, 2 x 300 GB NVMe SSD storage, up to 10 Gigabit network performance).
Middle Managers:
i3.4xlarge (16-core CPU, 122 GiB memory, 2 x 1,900 GB NVMe SSD storage, up to 10 Gigabit network performance).

This issue did not occur on the previous version; in fact, it is not happening consistently on the v0.22.1 cluster either, just intermittently. Sometimes a simple restart of the failed ingestion runs to successful completion.

As of this note, this is the status of the cluster, and a job has just failed with the same error…

Let me see if I can get the Overlord logs for the same.

Thanks,
Vinith

Hi Ben,

Thanks for the suggestions.
#1: totalNumMergeTasks is already on the higher side for the ingestion that failed, so I'm not sure whether over-provisioning could cause this.
#2: maxZnodeBytes, is this the Overlord config? In that case it is not currently set, which means it would default to 512 KB. Will increasing this help? I'm just confused because the issue is intermittent and sometimes gets fixed with a restart… If yes, what would be a recommended size for the cluster config above (screenshot in the previous post)?

I have pasted the Overlord logs from today's failure. The issue can be found for the task
partial_index_generic_merge_ffs_postdate_metrics_monthly_summary_blue_dnniedjj_2022-09-15T09:41:21.036Z, which ran on ip-10-134-70-112.

I'm not sure how to share large logs, as the forum is not letting me upload or paste them. Is there another medium through which I can share the logs?

Thanks,
Vinith