How to report on expensive queries?

We could use some advice on figuring out which queries are expensive in our cluster (Druid 11.x).

A very small number of queries (generally from our larger customers) result in long scan times and make many other queries slow at the same time. But it is tricky for us to figure out which query is the one that was expensive.

We gather all of the normal aggregate metrics from the cluster, but this usually doesn’t give us enough to work with. We haven’t been using the query request log either, since it also doesn’t give us the detail we need: to diagnose the problem, we would need to know the actual set of filters on each query. We have hundreds of queries per second, so just logging everything is a cumbersome option.

Thanks for any tips.

Max

> A very small number of queries (generally from our larger customers) result in long scan times and make many other queries slow at the same time. But it is tricky for us to figure out which query is the one that was expensive.

You might be able to identify the query responsible for bottlenecking vs. the queries being bottlenecked by looking at query/segment/time vs. query/wait/time on the historical-level metrics. I would expect truly heavy queries to have a high segment scanning time (the former metric), while lightweight queries that are getting queued would have a relatively low segment/time but high wait/time.
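If you ship those metric events somewhere you can process them, that comparison is easy to script. Here is a minimal sketch of the idea; it assumes you have the emitted metric events as JSON lines with `metric`, `value`, and `queryId` fields (the exact shape depends on which emitter you use, so adjust the field names to match your setup):

```python
import json
from collections import defaultdict

def classify_queries(lines, wait_ratio=5.0):
    """Split queries into 'heavy' (doing real scan work) vs. 'queued'
    (mostly waiting behind something else), based on per-query totals
    of query/segment/time and query/wait/time."""
    totals = defaultdict(lambda: {"segment": 0, "wait": 0})
    for line in lines:
        event = json.loads(line)
        qid = event.get("queryId")
        if not qid:
            continue
        if event.get("metric") == "query/segment/time":
            totals[qid]["segment"] += event["value"]
        elif event.get("metric") == "query/wait/time":
            totals[qid]["wait"] += event["value"]

    heavy, queued = [], []
    for qid, t in totals.items():
        # A query whose wait time dwarfs its scan time is a victim of
        # queueing, not the culprit; the culprits show up as high
        # segment time relative to wait time.
        if t["wait"] > wait_ratio * max(t["segment"], 1):
            queued.append(qid)
        else:
            heavy.append(qid)
    return heavy, queued
```

Sorting the `heavy` list by total segment time should surface the handful of queries worth investigating first.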

You would still need some request logging to get the details of the queries. There is a “filtered request logger” that only logs queries whose query/time metric exceeds some threshold: http://druid.io/docs/latest/configuration/index.html#filtered-request-logging. That might be helpful.
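For illustration, a filtered-logger setup might look something like the sketch below in the Broker's runtime.properties. The property names are from the linked docs page; the 1-second threshold and file path are just placeholder values, and you should verify the exact names against your Druid version:

```properties
# Only log requests whose query/time exceeds 1000 ms,
# delegating the actual logging to a file-based logger.
druid.request.logging.type=filtered
druid.request.logging.queryTimeThresholdMs=1000
druid.request.logging.delegate.type=file
druid.request.logging.delegate.dir=/var/log/druid/requests
```

That keeps the log volume manageable at hundreds of queries per second, while still capturing the full query body (including the filter set) for the slow ones.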

Thanks,

Jon