Druid latency

Guys,

I believe we are only running groupBy queries, but I’m seeing what looks like mediocre performance (I’m not totally sure how to judge it). I’ve played with the heap sizes on my historicals and brokers, but I’m not sure that is doing much. There are a number of knobs to tweak that confuse me, especially in how they interact with other parameters. Some of them are:

  • memcached connections - What happens if I use one versus 100? I have an ElastiCache cluster.

  • Caching on the broker (read but not populate) - My broker has useCache enabled, so does this mean it will intelligently look up cached results before asking the historicals to merge anything? (I’ve sketched how I think the whole cache is wired up, right after the broker config below.)

  • Here is a particularly confusing one:

druid.segmentCache.locations=[{"path": "/mnt/persistent/zk_druid", "maxSize": 550000000000}]

druid.server.maxSize=550000000000

What is the difference between druid.server.maxSize and the maxSize inside druid.segmentCache.locations? And how do I pick a proper value for server.maxSize? The docs discuss it in terms of a RAM-to-disk ratio, and my understanding is that if I allow too much disk relative to RAM, my queries will end up paging. I’ve noticed in other people’s configurations that server.maxSize and the segmentCache locations maxSize are the same value, but I thought one was for caching segments on disk while the other specified how much gets loaded into memory, with overflow allowed? How do I go about tuning this when my dataset is at the terabyte scale? (My rough math on this is after the historical config below.)

  • The final confusing thing for me is how the number of threads relates to the number of HTTP connections on brokers and historicals. I’m seeing a ton of backend connection errors on the Amazon Elastic Load Balancer that sits in front of my broker nodes. Below are the current HTTP / thread configs for the broker and historical roles respectively. Increasing the HTTP server threads, memcached connections, or processing threads on my broker seems to have no effect on performance. I know from reading past configuration issues that my numThreads times the number of brokers has to be larger than the number of historical connections…? (I’ve tried to work through that arithmetic after the historical config below.) I’ve also scaled out more brokers, but it still appears slow, and the performance is somewhat random if not outright poor. When I first fire up the cluster it seems to run fast very briefly, but that could just be browser caching. Any suggestions would help. My nodes are all r4 series with more than 200 GiB of RAM.

HTTP server threads

druid.server.http.numThreads=50

druid.broker.http.readTimeout=PT5M

druid.broker.retryPolicy.numTries=2

Processing threads and buffers

druid.processing.buffer.sizeBytes=2147483647

druid.processing.numThreads=31

druid.broker.cache.useCache=true

druid.broker.cache.populateCache=false

Druid connection balancer type - we choose connectionCount, based on the fewest number of active connections

druid.broker.balancer.type=connectionCount
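
To make the two caching bullets at the top concrete: the broker flags above (useCache=true, populateCache=false) plus the historical flags further down (both true) are meant to give me read-on-broker, write-on-historical. The part I’m least sure about is the common memcached block both roles share; the property names here are from my reading of the docs and the ElastiCache endpoint is a placeholder, so please correct anything that’s off:

Common cache backend (ElastiCache memcached), shared by broker and historical runtime.properties

druid.cache.type=memcached

druid.cache.hosts=my-elasticache-endpoint:11211

druid.cache.numConnections=1

My assumption is that the broker checks memcached per segment and only asks the historicals to scan and merge the segments it couldn’t find cached results for, and that raising druid.cache.numConnections from 1 toward 100 only matters once many threads are hitting the cache concurrently - is that roughly right?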

And here is my historical config:

HTTP server threads

druid.server.http.numThreads=50

Processing threads and buffers

druid.processing.buffer.sizeBytes=1073741824

druid.processing.numThreads=31

Query cache (we use a small local cache)

druid.historical.cache.useCache=true

druid.historical.cache.populateCache=true

Segment storage

druid.segmentCache.locations=[{"path": "/mnt/persistent/zk_druid", "maxSize": 550000000000}]

druid.server.maxSize=550000000000
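
On the server.maxSize question above, here is my rough arithmetic. I’m assuming an r4.8xlarge-class historical with 244 GiB of RAM (the exact instance size is my assumption); the heap and direct memory figures are from the jvm.config further down:

heap 32 GiB + direct memory 56 GiB ≈ 88 GiB taken by the historical JVM

244 GiB total RAM - 88 GiB ≈ 156 GiB left for the OS page cache

druid.server.maxSize = 550 GB (≈512 GiB) of segments vs. ~156 GiB of page cache ≈ a 3.3:1 disk-to-memory ratio

My reading is that druid.server.maxSize is the total segment size the coordinator will assign to the historical, while the maxSize inside druid.segmentCache.locations is how much that particular disk path may hold, which is why the two usually end up identical when there is a single location - and the real tuning question is whether a ~3.3:1 ratio keeps enough of the working set memory-mapped to avoid constant paging at terabyte scale. Is that the right way to think about it?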
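
And on the thread-versus-connection question, this is the relationship I’ve pieced together; druid.broker.http.numConnections is not in my config, so the per-broker pool size is a guess at the default, and the broker count is just an example:

each broker opens up to druid.broker.http.numConnections connections to each historical (20 per broker, if the default is what I think it is)

4 brokers x 20 connections = up to 80 concurrent requests arriving at one historical

that historical only has druid.server.http.numThreads=50 threads to serve them, so the rest queue

If that’s right, then the rule I half-remembered is actually the other way around: druid.server.http.numThreads on each historical should be a bit higher than the sum of druid.broker.http.numConnections across all brokers. Separately, I wonder whether the ELB is adding to the noise: its default idle timeout is 60 seconds while my druid.broker.http.readTimeout is PT5M, so any query running longer than a minute could surface as an ELB backend connection error rather than a Druid timeout. Does that match what others have seen?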

Oh, I forgot to show my jvm.configs for brokers and historicals respectively:

Broker:

-server

-Xmx32g

-Xms32g

-XX:NewSize=6g

-XX:MaxNewSize=6g

-XX:MaxDirectMemorySize=90g

-XX:+UseConcMarkSweepGC

-XX:+PrintGCDetails

-XX:+PrintGCTimeStamps

-Duser.timezone=UTC

-Dfile.encoding=UTF-8

-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

-Djava.io.tmpdir=/mnt/tmp


Historical:

-server

-Xmx32g

-Xms32g

-XX:NewSize=12g

-XX:MaxNewSize=12g

-XX:MaxDirectMemorySize=56g

-XX:+UseConcMarkSweepGC

-XX:+PrintGCDetails

-XX:+PrintGCTimeStamps

-Duser.timezone=UTC

-Dfile.encoding=UTF-8

-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

-Djava.io.tmpdir=/mnt/tmp

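
One last piece of arithmetic, in case the processing buffers are part of the problem. My understanding of the direct-memory sizing rule is MaxDirectMemorySize >= druid.processing.buffer.sizeBytes x (druid.processing.numThreads + druid.processing.numMergeBuffers + 1); I haven’t set numMergeBuffers anywhere, so I’m ignoring it below, which may be wrong:

broker: ~2 GiB x (31 + 1) = ~64 GiB needed, MaxDirectMemorySize is 90 GiB

historical: 1 GiB x (31 + 1) = 32 GiB needed, MaxDirectMemorySize is 56 GiB

So on paper the direct memory looks big enough on both node types, and it still seems to fit even if a handful of merge buffers have to be counted - but please tell me if I’m double-counting something or if that rule has changed.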