Understanding and tuning Middle Manager / Peon properties

Hi,

I’m trying to scale our indexing service appropriately and I’m having a little trouble understanding the Middle Manager and Peon properties.

First of all, I’m looking at Gian’s comment in this thread:

Each peon is going to use its heap to merge query results for certain query types, and to store up to maxRowsInMemory * (2 + maxPendingPersists) aggregated rows for indexing. It’ll write everything else to disk and then memory-map that. And like historical nodes, they each have a processing thread pool with one processing buffer per pool.
For best performance you usually want to have enough memory set aside per peon for its heap, its processing buffers, and for the OS to be able to keep all/most of its memory-mapped data in the page cache.
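
To make sure I’m reading the heap part of that right, here it is with purely illustrative numbers (not necessarily the defaults):

# maxRowsInMemory     = 75000   (example value)
# maxPendingPersists  = 0       (example value)
# rows held in heap  <= 75000 * (2 + 0) = 150,000 aggregated rows,
# plus whatever heap the query-merging work needs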

I’m trying to apply this wisdom to the example production cluster configuration and some things are confusing. Here are what seem to be the relevant properties from that example:

Hardware:

r3.8xlarge (Cores: 32, Memory: 244 GB, SSD)

Runtime.properties:

# Resources for peons
druid.indexer.runner.javaOpts=-server -Xmx3g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

# Peon properties
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=536870912
druid.indexer.fork.property.druid.processing.numThreads=2
druid.indexer.fork.property.druid.server.http.numThreads=50

druid.worker.capacity=9

On the threads & cores side, we have 9 workers x 2 threads = 18 processing threads, which is quite a lot lower than the 32 cores available. Sure, there are 50 http threads, but those are mostly going to be idle. Is there a reason to leave so many cores apparently unoccupied?

On the memory side, we have 9 workers x 3g JVM heap = 27g. (The middle manager itself is only using 64m, I think.) The processing buffer is allocated within the JVM, right? Even if not, and even if there is a separate buffer per thread, that’s only about 9gb. So, at most the workers are using 36g out of the 244g available. Do we really need to reserve ~200gb for memory mapping?
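
Laying out my arithmetic (assuming roughly one processing buffer per processing thread; if there is an extra buffer per peon for merging, the totals only shift by a few GB):

# Threads (as configured):
#   9 peons x 2 processing threads       = 18 query threads (vs 32 cores)
#   plus per-peon HTTP threads, which should be mostly idle
#
# Memory (as configured):
#   9 peons x 3g heap                    = 27g
#   9 peons x 2 x 512MB buffers          = ~9g (whether on- or off-heap)
#   middle manager heap                  = ~64m
#   total JVM footprint                  = ~36g
#   left over on a 244g box              = ~208g for page cache / mmapped data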

Anecdotally, someone else at our company experimented with different EC2 instance classes and found he got much better performance from C4 (compute-optimized) instances than from R3 (RAM-optimized) instances. That might be because the data we are ingesting is probably a lot smaller than the cases envisioned for the production example. Our day-granularity segments cap out at just over 300 MB, and the majority are in the 100-200 MB range. The source data is a day chunk in JSON format that is typically around 1.5 GB.

We do have a somewhat clunky ingestion protocol that I won’t go into in detail here, but it involves source data indexed by UTC being ingested into Druid in local time, which unfortunately forces us to ingest across three segments at once (using 5 source data files). This is a slightly different setup from what my colleague was investigating earlier, where I think he was ingesting a single segment at a time. But now it’s taking ~35-50 minutes to run a single task on a c4.large (running a single worker with a single thread).

I was also thinking about a related issue. processing.numThreads and worker.capacity both default to the number of available processors minus one, right? Say I leave these as defaults and deploy to a 4 core VM, so both have a value of 3. Does that mean I could get 3 Peons created, each of which attempts to use 3 threads? That seems like a bad configuration for 4 cores, doesn’t it?

On a 4 core VM, would it be better to have 3 workers with 1 thread each, or 1 worker with 3 threads?
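
To make the comparison concrete, here is roughly how the two layouts would look in the middle manager’s runtime.properties (same property names as the example config above):

# Option A: 3 workers, 1 processing thread each
druid.worker.capacity=3
druid.indexer.fork.property.druid.processing.numThreads=1

# Option B: 1 worker, 3 processing threads
druid.worker.capacity=1
druid.indexer.fork.property.druid.processing.numThreads=3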

Hey Ryan,

You’re right that the default for processing.numThreads doesn’t tend to work out well for peons; most deployments end up customizing it. When customizing it, remember to also account for an ingestion thread and a persist thread for each peon. They aren’t part of the processing pool (that only does queries). Those 2 threads aren’t active all the time, so you can usually get away with allocating somewhere between 1 and 2 full cores total for them both.
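
In thread-budget terms, that works out to something like this (rough sketch, not a recommendation):

# per peon, with 1 processing thread configured:
#   1 processing thread              -> queries only
#   1 ingestion + 1 persist thread   -> not in the processing pool, bursty
# so budget roughly 1 core per peon for query processing, plus about
# 1-2 cores shared across all peons for the ingestion/persist bursts
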

You’re also right that the memory on the R3 is a bit overkill. You do want some extra memory for mmapping but you usually don’t need that much. The memory mapped partial segments should add up to roughly 2x the size of the eventual segment that gets created. Usually accounting for a GB per peon works well there, since most people aim for 500MB segments. So C3/C4 should still have enough memory.