Overlord(0.13-incub) >30 secs to respond to HTTP requests

Hello Druid Users,

It takes > 30 secs to respond to GET/POST requests associated with submitting new Kafka Ingestion job, getting task status & logs.

**Setup **

Druid version: 0.13-incubating

Batch Ingestion (Hadoop): 6 jobs which run nightly

Reatime Ingestion (Kafka) : 18 Kafka & ~230 Running Tasks.

Meta data storage : Mysql RDS (Aurora). 15 open connections to RDS. CPU Util peaks every 15 mins to 30%.

2 Overlords in HA Mode

30 Nodes with historical & MiddelManager process co-located. (Attached config details)

No Errors/WARN in Overlord logs that seem indicative of the slowness. DEBUG logs too verbose and no obvious pattern indicating slowness.

Attaching configs, heap & thread dump of the overlord.

Would highly appreciate any pointers on how I can root cause the slowness and fix it.

Thanks a lot in advance !

jstack0122 (537 KB)

heap_dump.hprof (1.02 MB)

druid_Configs (4.57 KB)


Looking at the jstack, I see a few threads trying to make metadata calls (org.apache.druid.indexing.overlord.MetadataTaskStorage.*).

Can you check metrics on your metadata store side to see things look ok? Maybe clean up a few tables like (druid_pendingSegments, druid_tasks, etc.) mentioned here - https://druid.apache.org/docs/latest/configuration/index.html#metadata-storage
to see if that helps latency.

I would highly encourage to enable the config
druid.coordinator.kill.pendingSegments.onIn our setup, we saw that the create segment operations grew significantly slower because this table grew too big.