Performance test results

Hey guys,

I ran some tests and can’t explain the metrics result. Any help is appreciate !

Tests on datasource “visits”, 1 segment (1 shard) per day (~400-500MB) 96 dimensions 6 metrics

QueryType: Timeseries on 6 metrics (2 HyperUnique, 2 LongSum, 1 DoubleSum)

1 broker node: c4.4xlarge

1 historical node: r3.8xlarge (32 cores with actually 31 workers in conf)

query on 7 days (7 segments):

user: 2359ms

Historical query/time: 2292ms

max(Historical query/segmentAndCache/time): 2283ms

avg(Historical query/segmentAndCache/time): 2081ms

query on 30 days (30 segments):

user: 5717ms

Historical query/time: 5649ms

max(Historical query/segmentAndCache/time): 5640ms

avg(Historical query/segmentAndCache/time): 4635ms

(other metrics are not relevant)

The fact is query/segmentAndCache/time give us “Milliseconds taken to query individual segment or hit the cache (if it is enabled on the historical node).”

The max time of this metric is expected to be the same between both queries, but we can clearly see that in the second query it take more than twice the time.

My historical has 31 worker, so he’s able to scan and compute 31 segments in parallel. I expected the time added in the second query came from merging, but it seems like … NO !

Thanks guys,


Hmmm, that seems odd. Can you post the query/segment/time times?

Hey Fangjin,

The cache was disable, so the query/segment/time times are equals to query/segmentAndCache/time.

7 segments:

max(Historical query/segmentAndCache/time): 2283ms

avg(Historical query/segmentAndCache/time): 2081ms


30 segments:

max(Historical query/segmentAndCache/time): 5640ms

avg(Historical query/segmentAndCache/time): 4635ms


Hi Benjamin,
could you also share your historical jvm config and for more details ?





HTTP server threads


Processing threads and buffers


Segment storage


GroupBy queries









Amazon ec2 -> r3.8xlarge

Any idea ?

Hi Benjamin,
Few observations/questions -

  1. you mentioned the cache was disabled during above results, but the runtime.props have set it to true, please confirm that cache was disabled when the numbers were taken ?
  2. server.maxSize is set 300G, I assume the historical was actually loading less data and all data is being served from memory, is this correct ?
  3. What is the difference that you see when you execute a Timeseries with 1 count aggregator over 1 week and 1 month of data.
  4. I guess the slowness when querying 31 segments in parallel can also be due to possible increase in GC pressure on historical. can you also monitor and compare gc activity when executing both a week and a month long query. If this is the case, gc tuning might help.



Hey Nishant,

  1. Yes, the cache is disable at query time with the “context” parameters
  2. My historical node is a r3.8xlarge with 244 Go of memory so i guess everything is in memory. Any way to check ?
  3. Timeseries with longSum on Count aggregator: 1 week: ~200ms, 1 month: ~300ms
  4. I guess not seeing the results below, what’s your point of view about this ?



Hey Ben,

I wonder if what’s going on is that r3.8xlarges don’t actually have 32 cores, they have 32 hyperthreads. So past 16 concurrent scans you wouldn’t necessarily expect the performance curve to be linear. Does that bear out in reality – do you find that the curve is linear up until 16 and then sub-linear?

Fwiw, there is also a patch in master ( that improves HLL performance and GC pressure by reducing allocations. shows 16

I think this is across two sockets (judging by physical ID reported in proc/cpuinfo) so the only thing that would REALLY scale linearly is if you had a cpuset defined such that 1, 2, 4, 8 threads all ran on the same physical socket without stepping on each other’s hyperthreads.

Larger than that and you’re going to have second order effects either cross socket (NUMA) or related to hyper threading

Thanks Gian and Charles for your precious help !

I ran into more tests, here’s my results:

It seems like the scanning segment time is linear till 8 segments and then increase. Meaning r3.8xl can handle 8 scans at the same time i guess?!

I give you more test logs if it can helps anyone to do anything :wink:



This is really cool. Can you expand a bit on how the different measurements were collected?

Also, did you happen to collect query/cpu/time?

Also, how much data is this expected to be churning through?

was there any warmup querying to make sure disk page cache wasn’t the cause?

Hey Charles,

Query: Timeseries over nb seg with 2 hyperUnique, 2 longSum and 1 doubleSum metrics.

DataSource: 1 segment per day (~400-500MB) (~5-6 million rows)

For each nb seg, the query was executed 10 times, then each metric value is the average of this values for the 10 runs.

time(ms) represent the user time, cpu is query/cpu/time,query is query/time, seg is query/segment/time, segCache is query/segmentAndCache/time, wait is query/wait/time.

There is only Historical node metrics.

More infos needed ?!

cool! any chance you can publish the data in tabular/text form?

Btw, I haven't caught up on this thread, but it came up in our dev
sync and I thought I would add some information about how to scale any
java-based data system on large boxes (Druid included).

Specifically, the recipe we've had for success is

1) Make sure that NUMA zone reclaim is turned off
2) If using RHEL (not sure about other OSs): make sure that
transparent huge pages are set to never
3) Turn off biased locking on your JVM: (-XX:-UseBiasedLocking)


Hey Eric,

Can you explain each point ?!

I tried to modify all of them but seems there is no changes. Except when the number of segments is under 16 (befor hypertheading i gues).

Take a look at this:

Where the “Série1” is y initial run and the “Série2” represents the same run after modifications you made.