Real-time data handling breaks after a few hours

Hello folks:

We have a Druid deployment that processes about 5 GB of data per day. The setup uses a dedicated virtual machine (VM) for our broker, another dedicated VM for the middleManager, 2 VMs for historical nodes, and 1 VM to host everything else (Kafka, Tranquility, coordinator, overlord, PostgreSQL).

The issue is that after a few hours we can no longer query the real-time data (the last 24 hours). We can reproduce the issue on demand by querying data for the last 48 hours or longer.

The symptoms are as follows:

1- Once it breaks, we can still pull historical data but cannot query any data from the last 24 hours.

2- When we query the broker for recent data, we see no packet exchange between the broker and the middleManager host machine. We confirmed this via tcpdump.

3- Tranquility is hosted on the same machine as ZooKeeper and Kafka, and it immediately starts dropping all incoming messages before they reach Druid.
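For anyone wanting to reproduce the tcpdump check in symptom 2, it was along these lines (the host placeholder and port range are assumptions about a typical setup; peons listen on ports starting at druid.indexer.runner.startPort, which defaults to 8100):

```shell
# Run on the broker VM while a recent-data query is in flight.
# MM_HOST is a placeholder for your middleManager host.
sudo tcpdump -i any -nn host MM_HOST and portrange 8100-8199
```

When real-time querying is broken, this capture stays silent for the duration of the query.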

The only way we have found to fix this is:

1- Stop the Tranquility, overlord, and middleManager services

2- Remove /tranquility/beams from ZooKeeper

3- Remove /druid/indexers/tasks/ from ZooKeeper

4- Start the services again
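For reference, a sketch of those cleanup steps as commands (service names and the ZooKeeper address are assumptions based on a typical systemd deployment; `rmr` is the pre-3.5 ZooKeeper CLI verb, replaced by `deleteall` in 3.5+):

```shell
# Assumed service names; adjust to your init system and deployment.
sudo systemctl stop tranquility druid-overlord druid-middlemanager

# Clear stale beam and task state from ZooKeeper.
# On ZooKeeper 3.5+ use `deleteall` instead of `rmr`.
zkCli.sh -server localhost:2181 rmr /tranquility/beams
zkCli.sh -server localhost:2181 rmr /druid/indexers/tasks

sudo systemctl start druid-middlemanager druid-overlord tranquility
```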

This works, but the issue comes back after a few hours. We noticed some complaints about memory in middleManager.log, so we increased the JVM heap, but we are still experiencing the issue.

Where should we focus to troubleshoot this?


Do you see the index_realtime_* tasks failing? If so, do their logs show anything interesting? It sounds like you might have some kind of blockage caused by tasks taking a long time to exit.

Other than that, make sure you have enough worker capacity to run all the concurrent tasks that Tranquility needs. That is usually 2 * partitions * replicas, assuming windowPeriod << segmentGranularity as recommended. If you are getting tripped up by this, you should see a bunch of ingestion tasks stuck in a pending state rather than all of them running smoothly.
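As a rough illustration of that capacity math (the partition, replica, and worker-capacity numbers below are made-up examples, not values from this cluster):

```python
# During segment handoff, tasks for two consecutive segmentGranularity
# periods can overlap, hence the factor of 2.
partitions = 3   # task partitions per datasource (example value)
replicas = 2     # replication factor (example value)

required_slots = 2 * partitions * replicas
print(required_slots)  # 12

# Compare against total capacity: the sum of druid.worker.capacity
# across all middleManagers.
worker_capacity = 8  # example: one middleManager with 8 slots
if required_slots > worker_capacity:
    print("Tasks will sit pending; add capacity or reduce partitions/replicas")
```

If required_slots exceeds your total capacity, the overflow tasks queue as pending, which matches the "not all running smoothly" symptom above.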