We have a Druid deployment that ingests roughly 5 GB of data per day. The setup uses a dedicated virtual machine (VM) for the broker, another dedicated VM for the MiddleManager, two VMs for historical nodes, and one VM hosting everything else (Kafka, Tranquility, coordinator, overlord, PostgreSQL).
The issue is that after a few hours we can no longer query the real-time data (the last 24 hours). We can reliably reproduce the failure by issuing any query that covers the last 48 hours or longer.
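For reference, the failure can be triggered with any query whose interval spans the real-time segments, e.g. a native timeseries query POSTed to the broker (the datasource name and interval below are placeholders, not our real values):

```json
{
  "queryType": "timeseries",
  "dataSource": "our_datasource",
  "granularity": "hour",
  "aggregations": [{ "type": "count", "name": "rows" }],
  "intervals": ["2016-01-01T00:00:00Z/2016-01-03T00:00:00Z"]
}
```

Queries whose interval falls entirely within already-handed-off historical segments keep working.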
The symptoms are as follows:
1- Once it breaks, we can still pull historical data, but we cannot query any data for the last 24 hours.
2- While the broker is being queried for recent data, we see no packet exchange between the broker and the MiddleManager host machine. This was confirmed via tcpdump.
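The check in (2) was along these lines, run on the broker VM while issuing a recent-data query (the interface name and the IP placeholder are specific to our setup; 8100-8199 is the default peon port range, which may differ in your config):

```shell
# Watch for any traffic between the broker and the MiddleManager host
# while a real-time query is in flight; we see zero packets once it breaks.
sudo tcpdump -i eth0 host <middlemanager-ip> and portrange 8100-8199
```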
3- Tranquility (hosted on the same machine as ZooKeeper and Kafka) immediately starts dropping all incoming messages instead of pushing them to Druid.
The only way I have found to fix this is:
1- Stop tranquility, overlord and middlemanager services
2- Remove /tranquility/beams from zookeeper
3- Remove /druid/indexers/tasks/ from zookeeper
4- Start the services again
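The recovery steps above amount to the following, assuming systemd-managed services and `zkCli.sh` on the ZooKeeper host (the service names and the zkCli path are assumptions about our setup, not standard names):

```shell
# 1- Stop the ingestion-side services (names are assumptions; adjust to your init system)
sudo systemctl stop tranquility overlord middlemanager

# 2- & 3- Remove the stale state from ZooKeeper
# (deleteall on ZooKeeper 3.5+; use rmr on older versions)
/opt/zookeeper/bin/zkCli.sh -server localhost:2181 deleteall /tranquility/beams
/opt/zookeeper/bin/zkCli.sh -server localhost:2181 deleteall /druid/indexers/tasks

# 4- Start the services again
sudo systemctl start middlemanager overlord tranquility
```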
This works, but the issue returns after a few hours. We noticed memory-related complaints in middlemanager.log, so we increased the memory allocated to Java, but we are still experiencing the issue.
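One detail that may matter here: the heap given to the MiddleManager process itself is separate from the heap given to the peon task JVMs it forks, so if only the former was raised, the indexing tasks may still be memory-starved. The peon settings live in the MiddleManager's runtime.properties (the values below are illustrative, not recommendations):

```properties
# middleManager/runtime.properties
# JVM options passed to each forked peon (indexing task) process
druid.indexer.runner.javaOpts=-server -Xmx2g -XX:MaxDirectMemorySize=4g
# Cap on concurrently running tasks on this MiddleManager
druid.worker.capacity=3
```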
Where should we focus our troubleshooting?