Increased query latency at granularity boundaries

Hi group

We’re experiencing an issue with our Druid cluster that we’re having some trouble resolving.

We have a fairly large set of tasks (30ish) that are indexing with Tranquility at a six-hour granularity. At each granularity boundary (00:00, 06:00, 12:00, etc.) we get a brief blip where a large number of our queries time out, due to increased latency during the roll-over. There are no errors or warnings in any of the logs at this time, nor does it appear to be GC-related. The impact lasts about a minute.

We have tried tweaking the warming period and window period, but this does not seem to have much effect. We have actually ended up with code that adds some randomness to the warming and window periods so that task start-up and handoff times are staggered and the load on the indexing service is spread over a longer period. This seems to keep CPU usage down on the MiddleManager during the switch, but it still does not mitigate the blip we see for a minute exactly at the grain boundary.
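
Roughly, the staggering works like the sketch below (not our exact code; it assumes Tranquility’s ClusteredBeamTuning API, and the period and jitter values are only placeholders):

```scala
// Rough sketch of the jitter idea (placeholder values, not our production code).
import com.metamx.common.Granularity
import com.metamx.tranquility.beam.ClusteredBeamTuning
import org.joda.time.Period
import scala.util.Random

// Up to two minutes of per-task jitter so task start-up and handoff
// are not all aligned on the same instant.
val jitterSeconds = Random.nextInt(120)

val tuning = ClusteredBeamTuning(
  segmentGranularity = Granularity.SIX_HOUR,
  warmingPeriod = new Period("PT10M").plusSeconds(jitterSeconds),
  windowPeriod = new Period("PT10M").plusSeconds(jitterSeconds),
  partitions = 1,
  replicants = 2
)
```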

We’d like to solve this problem, as it translates into several hundred timeouts for end-user requests at each of these times, each day.

Has anyone experienced this, or have any troubleshooting ideas?

Regards,

Max

Hi Max, performance-related problems are always the most difficult to answer :stuck_out_tongue:

A few general questions:

  1. What types of queries are you issuing that are timing out?

  2. Is the bottleneck at the historical, realtime, or broker level? The query/* metrics (http://druid.io/docs/0.9.1.1/operations/metrics.html) report segment scan latencies, merging times, etc., which should provide a bit more insight into where the actual problem is.
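
If metrics aren’t being emitted anywhere yet, a quick way to get the query/* numbers into the service logs is the logging emitter, e.g. (illustrative snippet; set it on whichever nodes you’re investigating):

```
# Illustrative only: write query/* metrics to the service log.
druid.emitter=logging
druid.emitter.logging.logLevel=info
```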

As for possible causes, that’ll require a bit more thought.

– FJ

What could be happening is something like this:

  1. New tasks get spun up when the granularity rolls over, and they take a while to get going

  2. Tranquility blocks a bit on pushing to the new tasks, accumulating messages for both the old and new tasks

  3. Tranquility eventually gets unblocked on the new tasks and then has to emit a backlog to both the old and new tasks

  4. Either the old or new tasks have all their Jetty threads tied up dealing with the backlog from Tranquility and can’t respond to queries

0.9.1.1 adds a feature to help with this, druid.indexer.server.maxChatRequests, which limits the number of concurrent Jetty threads that can be allocated to messages from Tranquility (“chat”). If you set this smaller than druid.server.http.numThreads, the difference will be reserved and available for other requests (like queries).
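
For example, something like this in the task (peon) runtime properties; the numbers are only illustrative, not tuned recommendations:

```
# Illustrative values only: cap Tranquility "chat" pushes at 10 of the 40
# HTTP threads so the remainder stays free for queries and other requests.
druid.server.http.numThreads=40
druid.indexer.server.maxChatRequests=10
```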

Gian, that’s a good theory, and that sounds like a helpful config option.

I think we have determined that the problem is likely due to our application’s query patterns. Almost all of our queries run from some point in the past up to now, and we see that the internal (broker -> MiddleManager) query volume jumps significantly during the rollover, since each query must hit an additional segment. This subsides as the older segment is handed off and the caches for it fill up. So the timeouts are just a result of increased load and query time.
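
For example (hypothetical data source and timestamps), a query like the one below, issued just after the 06:00 rollover, now has to fan out to both the segment that is about to be handed off and the freshly created realtime segment:

```json
{
  "queryType": "timeseries",
  "dataSource": "events",
  "granularity": "minute",
  "aggregations": [ { "type": "count", "name": "rows" } ],
  "intervals": [ "2016-08-01T05:30:00Z/2016-08-01T06:00:30Z" ]
}
```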

We’ve been able to make some more general performance improvements, and this is looking like less of a problem.

Thanks for the suggestions.

– Max

Hello Max

Can you please share what performance improvements you made to reduce this latency? We are also facing similar issues.

Thanks

Bhaskar

If you make many queries for intervals that extend up to the current moment, it can put a lot of pressure on the realtime segments, of which there generally aren’t many. Additionally, the realtime tasks service queries much more slowly than historical nodes do, so we were often seeing the realtime part take the most time during a typical query.