Druid 0.7.0 - Middle managers configuration

Hi all,

I recently upgraded my Druid cluster to the latest version 0.7.0.

I used the same configuration for my middle managers as before:

Instance type: c3.4xlarge (16 CPU, 30 GB Memory)

runtime.properties:

druid.host=ec2-54-90-201-171.compute-1.amazonaws.com
druid.indexer.firehoseId.prefix=druid:prod:chat
druid.indexer.fork.property.druid.computation.buffer.size=536870912
druid.indexer.fork.property.druid.indexer.hadoopWorkingPath=/mnt/druid-indexing
druid.indexer.fork.property.druid.processing.numThreads=2
druid.indexer.fork.property.druid.request.logging.dir=logs/request_logs/
druid.indexer.fork.property.druid.request.logging.type=file
druid.indexer.fork.property.druid.segmentCache.locations=[{"path": "/mnt/druid/persistent/zk_druid", "maxSize": 0}]
druid.indexer.fork.property.druid.storage.baseKey=prod-realtime
druid.indexer.fork.property.druid.storage.bucket=gumgum-druid
druid.indexer.fork.property.druid.storage.type=s3
druid.indexer.logs.s3Bucket=gumgum-druid
druid.indexer.logs.s3Prefix=tasks-logs
druid.indexer.logs.type=s3
druid.indexer.runner.startPort=8081
druid.indexer.task.baseDir=/mnt/tmp/
druid.indexer.task.baseTaskDir=/mnt/tmp/persistent/tasks/
druid.indexer.task.chathandler.type=announce
druid.indexer.runner.javaOpts=-server -Xmx2560m -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=512K -XX:+PrintGCDateStamps -Xloggc:logs/gc-peon.log -XX:+PrintGCDetails -Djava.io.tmpdir=/mnt/tmp
druid.port=8080
druid.selectors.indexing.serviceName=druid:prod:indexer
druid.server.http.numThreads=20
druid.service=druid/prod/worker
druid.worker.capacity=6
druid.worker.ip=localhost
druid.worker.version=5
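(For reference, here is a rough back-of-envelope sketch of what this configuration commits per node, assuming the usual rule of thumb that each peon needs about (processing.numThreads + 1) processing buffers of direct memory on top of its heap; the numbers below are taken from the properties above, not measured:)

// Back-of-envelope memory sizing for the middle manager configuration above.
// Assumption: direct memory per peon ~ (processing.numThreads + 1) * buffer size.
public class PeonSizing {
    public static void main(String[] args) {
        long heapPerPeon = 2560L << 20;   // -Xmx2560m from druid.indexer.runner.javaOpts
        long bufferSize = 536870912L;     // druid.computation.buffer.size (512 MB)
        int numThreads = 2;               // druid.processing.numThreads
        int capacity = 6;                 // druid.worker.capacity

        long directPerPeon = (numThreads + 1) * bufferSize;
        long perPeon = heapPerPeon + directPerPeon;
        long total = capacity * perPeon;

        System.out.printf("per peon: ~%.1f GB, all %d peons: ~%.1f GB of 30 GB%n",
                perPeon / 1e9, capacity, total / 1e9);
    }
}

Under those assumptions, the six peons alone come to roughly 26 GB on a 30 GB box.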

I have only 6 workers per machine and 2 processing threads. However, if you look at the Ganglia graph I have attached, you will see my middle manager is overloaded. The load peaks at 35 (load/proc).

I don’t understand why this happens. In my previous cluster I had 8 workers and 2 processing threads, and the load on my nodes was fine… Any idea why my nodes are now overloaded with this configuration?

Thanks a lot for your help!

Guillaume

Yay data!

From just a cursory glance it looks like most of the threads are waiting on network IO rather than crunching numbers. Are you seeing a measurable difference in throughput, or are you simply concerned that the behavior does not match what you saw before in the Ganglia reports?

Thanks,

Charles Allen

I wonder if this is related to https://groups.google.com/forum/#!topic/druid-development/k70Aa_LMY24, investigating. Will update both threads.

Hi Torche, is it possible to provide us with a thread dump of your middle manager and one of the peons?

Hi Gian,

Sorry for the delay. I have attached a thread dump of one of my peons and one of my middle managers.

I have also attached part of a stacktrace from one of my Storm workers. You will see I get a lot of errors like this one:

STDIO [ERROR] Mar 24, 2015 8:55:00 AM com.twitter.finagle.loadbalancer.LoadBalancerFactory$$anon$2$$anonfun$1 apply

I’m not sure what it means; I googled it but didn’t find anything about this error. I get tons of these errors from my Storm cluster periodically. Take a look at the Kibana error graph I have attached.

Any idea what’s happening there?

mmThreadDump.txt (90.8 KB)

peonThreadDump.txt (44.5 KB)

stacktrace_storm_worker.txt (31.4 KB)
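(For anyone skimming dumps like these, a quick way to summarize them is to tally the thread states. A minimal sketch, assuming the standard jstack-style format where each thread carries a "java.lang.Thread.State:" line:)

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Tally thread states in a jstack-style dump, e.g. mmThreadDump.txt or peonThreadDump.txt.
public class ThreadStateSummary {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            String trimmed = line.trim();
            if (trimmed.startsWith("java.lang.Thread.State:")) {
                String state = trimmed.substring("java.lang.Thread.State:".length()).trim();
                counts.merge(state, 1, Integer::sum);
            }
        }
        counts.forEach((state, n) -> System.out.println(state + ": " + n));
    }
}

Running it over both dumps gives a quick sense of how many threads are RUNNABLE versus parked or waiting on IO, which is the distinction Charles raised above.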

Torche, do things still generally work for you, and you’re just wondering why your CPU use went up? Or are things actually broken, in that ingestion is not working in some way? I don’t see much strange in the thread dumps. One of them is in the middle of sending a response somewhere, but that’s probably fine (unless it’s taking a long time, but that’s hard to tell from one thread dump…)

Also, which version of tranquility are you using?

Also, you may want to route those finagle logs through slf4j so you can get them logged with the same logging mechanism that you’re using for everything else. See here for more info: http://www.slf4j.org/api/org/slf4j/bridge/SLF4JBridgeHandler.html
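For example, a minimal sketch of installing that bridge early in the worker’s startup, assuming the org.slf4j:jul-to-slf4j artifact is on the classpath and that the finagle messages are coming through java.util.logging (which is what the linked SLF4JBridgeHandler handles):

import org.slf4j.bridge.SLF4JBridgeHandler;

// Route java.util.logging records into SLF4J so they end up in the same log files
// as everything else. Call once, as early as possible during startup.
public final class JulToSlf4j {
    public static void install() {
        // Drop the default JUL console handlers so nothing is logged twice.
        SLF4JBridgeHandler.removeHandlersForRootLogger();
        // From here on, JUL records flow to whatever SLF4J backend is configured.
        SLF4JBridgeHandler.install();
    }
}

One caveat worth noting: the bridge adds some per-record overhead for disabled JUL log statements unless the backend propagates log levels to JUL (e.g. logback’s LevelChangePropagator), which may matter on a node that is already loaded.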

Ingestion is still working for me. I am just wondering why my middle managers are overloaded even though the number of events I am indexing is pretty low compared to a few months ago.

This happened when we upgraded Druid from 0.6.x to 0.7.0. We also upgraded Tranquility from 2.9.1:0.2.3 to 2.10:0.3.1.

Plus, I previously used 8 workers with the same number of processing threads on the same instance type.

I will take a series of thread dumps and post them as soon as possible. They might give us more clues!

Thanks for the link, I will take a look and try to log finagle through slf4j.