I’m trying to understand how I can tune Druid indexing properties to achieve high throughput with the indexing service.
I’m using Samza + Tranquility to push events into the Druid indexing service.
I’m producing at a constant rate of ~12K events/s to two partitions with a replication factor of 1 (2 indexing tasks in total).
I don’t think my dimensions have a high cardinality, and I have verified that Samza is not the bottleneck.
Attached to this message you will find examples of my events (which are related to traffic and network behaviour), the log from one of the indexing tasks, and a few screenshots from jconsole.
You can also find attached the properties from my middleManager nodes. I have two middleManagers, each of which can spawn four tasks. Each task has 7 GB of heap.
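For quick reference, the relevant knobs in my middleManager runtime.properties look roughly like this (a minimal sketch, not the full file — the exact flags are in the attachment):

```properties
# Each middleManager can run 4 peons (tasks)
druid.worker.capacity=4

# Per-task JVM options: 7 GB heap per peon, G1 collector (paths/flags abridged)
druid.indexer.runner.javaOpts=-server -Xms7g -Xmx7g -XX:+UseG1GC
```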
With this configuration, I can’t even get close to 12K events/s indexed by Druid.
From what I can understand from the jconsole plots, GC is the bottleneck: a lot of collections are occurring, which decreases the overall throughput of the task.
I’m also seeing a few ‘to-space overflow’ collections with G1GC, which make things even worse.
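In case it matters, these are the kinds of G1 flags I’ve been considering to fight the to-space overflows (values illustrative, not tuned — these are standard HotSpot flags, but I don’t know the right settings for this workload):

```
-XX:+UseG1GC
-XX:G1ReservePercent=20                 # reserve more free heap for evacuation
-XX:InitiatingHeapOccupancyPercent=35   # start concurrent marking earlier
-XX:+PrintGCDetails -XX:+PrintGCDateStamps   # keep logging to confirm
```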
From this point, I think I have three options:
1- Increase the heap for each druid indexing task
2- Decrease the maxRows of my task
3- Increase the number of partitions
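The knobs behind options one and two live in the task’s tuningConfig (a rough sketch showing only the fields relevant here; the values are the ones I’m using or the defaults I believe apply):

```json
"tuningConfig": {
  "type": "realtime",
  "maxRowsInMemory": 220000,
  "intermediatePersistPeriod": "PT10M",
  "maxPendingPersists": 0,
  "windowPeriod": "PT10M"
}
```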
I’ve tried option number two: instead of a maxRows of 500,000, I decreased it to 220,000.
With that change, the task’s garbage collections decrease, but the number of intermediate persists increases a lot.
When my task finishes (my segment duration is 1 hour) I have around 100 intermediate persists on disk.
The task then tries to merge them, which takes a LOT of time, and eventually this fills up my middleManagers’ capacity with tasks.
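As a sanity check, that persist count lines up with simple arithmetic (a quick sketch using the rates quoted above):

```python
# Back-of-the-envelope check of the persist count (rates are the ones quoted above).
events_per_sec_total = 12_000        # constant production rate
partitions = 2                       # one indexing task per partition
segment_duration_sec = 3600          # 1-hour segment duration
max_rows_in_memory = 220_000         # the reduced maxRows setting

rows_per_task = events_per_sec_total // partitions * segment_duration_sec
persists_per_task = rows_per_task / max_rows_in_memory
print(round(persists_per_task))      # ~98, matching the ~100 persists I see on disk
```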
My questions right now are:
What is the ideal interval between intermediate persists? Should I optimize for that metric and, once achieved, just increase my heap to avoid GCs?
I don’t use autoscaling, and my worker capacity is limited. Given that, what works better: small-heap tasks with a higher partition count, or big-heap tasks with fewer partitions?
With all that said, I would also appreciate general guidelines for improving this scenario.
Any ideas? Any Druid properties I did not mention that could be relevant?
Thank you very much,
middleManagerProperties.txt (2.57 KB)
msgs.txt (97.8 KB)
taskLog.txt (242 KB)