query/segment/time becomes slow after replication=2 and recovers after Metaspace GC

Hi,

Recently, we changed the replication factor of our data sources to 2. After the change, the average query response time became slower, going from 0.05+ seconds to 0.25+ seconds. After monitoring the Druid metrics, we found that query/segment/time increased a lot, from 20+ms to 200+ms (on the historical nodes).

The query/segment/time decreased to 100+ms at some point in time, which varied across historical nodes. Then we set the replication back to 1, and the query/segment/time dropped back to 20+ms at some point in time, which also varied across historical nodes. By cross-checking with the GC logs, we found that the time decreased right after a Metaspace GC occurred.

Does anyone have any clue what is happening?

Is the slow response caused by segments being paged in and out of memory?

Why does the speed recover after a Metaspace GC occurs, or is it a coincidence?

Thanks and Regards,

Xiangqun

Hi,
In case the historical nodes don’t have enough available memory to serve both segment replicas from memory, it is possible that paging data in and out from disk is slowing down your queries. If you collect Druid metrics (http://druid.io/docs/latest/operations/metrics.html) and have sys metrics enabled, you can verify this by checking the disk read and paging in/out related metrics.
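As a rough sketch (assuming a Druid version where SysMonitor lives under org.apache.druid.java.util.metrics; older druid.io releases use a different package), enabling sys metrics on a historical looks something like this in its runtime.properties:

# Emit sys/* metrics (disk reads, swap page in/out, etc.) from this node
druid.monitoring.monitors=["org.apache.druid.java.util.metrics.SysMonitor"]

With that enabled, sys/disk/read/size and sys/swap/pageIn / sys/swap/pageOut should be the metrics to watch.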

Hi Nishant,

Thanks for your reply. We don’t have sys metrics enabled; we will enable them in the future. For now, I can still see that free memory is available and 0B of swap is used. I am also curious how a JVM Metaspace GC would affect query/segment/time. Do you have any thoughts on this?

Hi,

I have added pageIn/pageOut metrics. When I ran a stress test, I came across a similar scenario: query/segment/time increased during the stress test. However, it did not decrease even after the test finished. The query/segment/time dropped sharply after a Metaspace GC occurred.

2019-04-11T17:32:37.725+0800: 9980.413: [GC remark 2019-04-11T17:32:37.725+0800: 9980.413: [Finalize Marking, 0.0006864 secs] 2019-04-11T17:32:37.726+0800: 9980.413: [GC ref-proc, 0.0030266 secs] 2019-04-11T17:32:37.729+0800: 9980.416: [Unloading, 0.7800540 secs], 0.7862240 secs]

Are you using Druid’s javascript functionality by any chance? It can put a lot of pressure on metaspace since it compiles the JS functions each time you run a query.

Hi Gian,

Thanks for your reply. Yes, we do use a lot of JavaScript filters. Our queries are mainly topN and timeseries, and the JavaScript filter is mainly used to filter for numeric metrics greater than 0.

e.g.

{ "type": "javascript", "dimension": "gmv", "function": "function(x) { return(x > 0) }" }

Is there a better alternative to this? e.g.

{
  "type": "not",
  "field": {
    "type": "selector",
    "dimension": "gmv",
    "value": "0"
  }
}

Thanks.

Hi Gian,

We have replaced the JavaScript filter with a bound filter. It works very well. Thanks a lot!
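In case it helps anyone else, the bound filter equivalent of the JavaScript filter above (strictly greater than 0, compared numerically) looks roughly like this:

{
  "type": "bound",
  "dimension": "gmv",
  "lower": "0",
  "lowerStrict": true,
  "ordering": "numeric"
}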