I’m trying to scale our indexing service appropriately and having a little trouble understanding the Middle Manager and Peon properties.
First of all, I’m looking at Gian’s comment in this thread:
Each peon is going to use its heap to merge query results for certain query types, and to store up to maxRowsInMemory * (2 + maxPendingPersists) aggregated rows for indexing. It’ll write everything else to disk and then memory-map that. And like historical nodes, they each have a processing thread pool with one processing buffer per pool.
For best performance you usually want to have enough memory set aside per peon for its heap, its processing buffers, and for the OS to be able to keep all/most of its memory-mapped data in the page cache.
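To make sure I'm reading that formula right, here's a toy calculation. The maxRowsInMemory, maxPendingPersists, and per-row size values below are placeholders I made up for illustration, not our actual settings or Druid defaults:

```python
# Rough per-peon indexing heap implied by Gian's formula:
#   rows held in heap = maxRowsInMemory * (2 + maxPendingPersists)
# All concrete numbers here are assumed, illustrative values.

max_rows_in_memory = 75_000    # task tuning property (assumed value)
max_pending_persists = 0       # assumed value
avg_row_bytes = 1_000          # rough on-heap size of one aggregated row (assumed)

rows_held = max_rows_in_memory * (2 + max_pending_persists)
indexing_heap_bytes = rows_held * avg_row_bytes

print(f"rows held in heap: {rows_held:,}")
print(f"approx indexing heap: {indexing_heap_bytes / 2**20:.0f} MiB")
```

If I've got that right, the indexing side of the heap scales linearly with maxRowsInMemory, and the rest of the heap is whatever query merging needs on top.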
I’m trying to apply this wisdom to the example production cluster configuration and some things are confusing. Here are what seem to be the relevant properties from that example:
r3.8xlarge (Cores: 32, Memory: 244 GB, SSD)
# Resources for peons
druid.indexer.runner.javaOpts=-server -Xmx3g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

# Peon properties
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=536870912
druid.indexer.fork.property.druid.processing.numThreads=2
druid.indexer.fork.property.druid.server.http.numThreads=50

druid.worker.capacity=9
On the threads & cores side, we have 9 workers x 2 threads = 18 processing threads, well below the 32 cores available. Sure, there are 50 HTTP threads, but those are mostly going to be idle. Is there a reason to leave so many cores apparently unoccupied?
On the memory side, we have 9 workers x 3g JVM heap = 27g. (The Middle Manager itself is only using 64m, I think.) The processing buffer is allocated within the JVM, right? Even if not, and even if there is a separate buffer per thread, that's only about 9g. So at peak the workers are using 36g out of the 244g available. Do we really need to reserve ~200g for memory mapping?
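Just to show my arithmetic, here's the back-of-envelope calculation in plain Python, using the numbers from the example config above and assuming one processing buffer per processing thread:

```python
# Totals for the example r3.8xlarge config (values from the config above).
workers = 9                     # druid.worker.capacity
heap_per_peon_gb = 3            # -Xmx3g
threads_per_peon = 2            # druid.processing.numThreads
buffer_bytes = 536_870_912      # druid.processing.buffer.sizeBytes (512 MiB)

heap_total_gb = workers * heap_per_peon_gb

# Assumption: one processing buffer per processing thread per peon.
buffers_total_gb = workers * threads_per_peon * buffer_bytes / 2**30

total_gb = heap_total_gb + buffers_total_gb
headroom_gb = 244 - total_gb    # left over on the box for page cache etc.

print(f"heaps: {heap_total_gb} GB, buffers: {buffers_total_gb:.0f} GB, "
      f"total: {total_gb:.0f} GB, headroom: {headroom_gb:.0f} GB")
```

By that math the peons account for only ~36g, which is what makes the remaining ~200g look like it's all earmarked for the page cache.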
Anecdotally, someone else at our company experimented with different EC2 instance classes and found he got much better performance from C4 (compute-optimized) instances than from the R3 (memory-optimized) ones. That might be because the data we are ingesting is probably a lot smaller than the cases envisioned for the production example: our day-granularity segments cap out at just over 300 MB, and the majority are in the 100-200 MB range. The source data for a day chunk is JSON, typically around 1.5 GB. We do have a somewhat clunky ingestion protocol I don't want to go into in great detail here, but it involves source data indexed by UTC being ingested into Druid in local time, which unfortunately forces us to ingest across three segments at once (using 5 source data files). This is a little different from what my colleague was investigating earlier, where I think he was ingesting a single segment at a time. But now it's taking ~35-50 minutes to run a single task on a c4.large (running a single worker with a single thread).