Query on realtime indexing task takes too long (query/wait/time on task is very high)

I have a Druid cluster, and I use Spark Streaming + Tranquility to push streamed events to Druid. While ingesting, I also run queries on the data. 90% of the queries return within 1 s, but occasionally a query takes > 100 s. I checked the query metrics on the broker, which showed that query/node/ttfb was about 100 s and that the query was served by worker-node:8100. The task metrics on that worker node showed that query/wait/time was really long:

2018-07-17T09:10:47,985 INFO [timeseries_apm_metrics_[2018-07-17T09:00:00.000Z/2018-07-17T09:04:50.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2018-07-17T09:10:47.985Z","service":"druid/middleManager","host":"emr-worker-1.cluster-64941:8100","version":"0.11.0","metric":"query/wait/time","value":107970,"dataSource":"apm_metrics","duration":"PT290S","hasFilters":"true","id":"994f07d2-0fd0-4a5c-8583-2b05fa1a107d","interval":["2018-07-17T09:00:00.000Z/2018-07-17T09:04:50.000Z"],"numComplexMetrics":"0","numMetrics":"4","segment":"apm_metrics_2018-07-17T09:00:00.000Z_2018-07-17T09:05:00.000Z_2018-07-17T09:02:39.915Z_1","type":"timeseries"}]

My stream rate is about 10k events/s, and the query rate is about 30/s. I also tried increasing the number of partitions of the stream, but it didn't seem to help.

Can anybody give me a hint?

Problem solved. For "descending" timeseries queries the achievable QPS is about 10, while for "ascending" timeseries queries it is about 250. The root cause is the descending iteration over the ConcurrentSkipListMap that backs the realtime (incremental) index, which is expensive.
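To illustrate why descending traversal is the slow path: a skip list only keeps forward pointers, so ascending iteration is O(1) per step, while `ConcurrentSkipListMap.descendingMap()` has to re-descend the structure to find each predecessor, i.e. O(log n) per step. A minimal standalone sketch (the class name, map contents, and sizes are made up for illustration; this is not Druid code):

```java
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class DescendingScanDemo {
    // Sum all values by iterating the map in its natural order.
    static long sumValues(NavigableMap<Long, Integer> m) {
        long sum = 0;
        for (int v : m.values()) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        // Toy stand-in for a timestamp-keyed realtime index.
        ConcurrentSkipListMap<Long, Integer> index = new ConcurrentSkipListMap<>();
        for (long ts = 0; ts < 100_000; ts++) index.put(ts, (int) (ts % 7));

        // Ascending scan: skip-list nodes carry forward pointers,
        // so each iterator step is O(1).
        long t0 = System.nanoTime();
        long asc = sumValues(index);
        long ascNanos = System.nanoTime() - t0;

        // Descending scan: there are no back pointers, so the descending
        // view's iterator re-descends the skip list to locate each
        // predecessor, costing O(log n) per step.
        t0 = System.nanoTime();
        long desc = sumValues(index.descendingMap());
        long descNanos = System.nanoTime() - t0;

        // Both scans visit the same entries; only the cost differs.
        System.out.println("asc=" + asc + " desc=" + desc);
        System.out.println("ascending ns=" + ascNanos + ", descending ns=" + descNanos);
    }
}
```

Running this, the descending scan is consistently slower on my machine, which matches the QPS gap above; on historical (immutable, column-oriented) segments the descending case doesn't hit this path.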