Druid realtime node is taking a long time to merge and persist hourly segments

Hi, we encountered a problem with Druid 0.7.3 where segment merging and persisting is taking a ridiculously long time. For example, here are the log lines for the merge walk-through:

2015-09-03T18:31:00,452 INFO [clientmetrics_events_v1_prod-2015-09-03T06:00:00.000Z-persist-n-merge] io.druid.segment.IndexMerger - outDir[/mnt/storage/druid-tmp/realtime/basePersist/clientmetrics_events_v1_prod/2015-09-03T06:00:00.000Z_2015-09-03T07:00:00.000Z/merged/v8-tmp] walked 500,000/500,000 rows in 50,159 millis.

2015-09-03T18:31:37,959 INFO [clientmetrics_events_v1_prod-2015-09-03T06:00:00.000Z-persist-n-merge] io.druid.segment.IndexMerger - outDir[/mnt/storage/druid-tmp/realtime/basePersist/clientmetrics_events_v1_prod/2015-09-03T06:00:00.000Z_2015-09-03T07:00:00.000Z/merged/v8-tmp] completed walk through of 818,730 rows in 87,756 millis.

and also the logs for processing dimensions:

2015-09-03T18:32:40,946 INFO [clientmetrics_events_v1_prod-2015-09-03T06:00:00.000Z-persist-n-merge] io.druid.segment.IndexMerger - Starting dimension[content_id] with cardinality[21,943]

2015-09-03T18:34:38,628 INFO [clientmetrics_events_v1_prod-2015-09-03T06:00:00.000Z-persist-n-merge] io.druid.segment.IndexMerger - Completed dimension[content_id] in 117,682 millis.

2015-09-03T18:36:06,801 INFO [clientmetrics_events_v1_prod-2015-09-03T06:00:00.000Z-persist-n-merge] io.druid.segment.IndexMerger - Starting dimension[query_string_content_id] with cardinality[12,863]

2015-09-03T18:37:23,643 INFO [clientmetrics_events_v1_prod-2015-09-03T06:00:00.000Z-persist-n-merge] io.druid.segment.IndexMerger - Completed dimension[query_string_content_id] in 76,842 millis.

We checked the size of that hour’s data on disk, and the overall data size does not seem that large:

$ du -h --max-depth=0 /mnt/storage/druid-tmp/realtime/basePersist/clientmetrics_events_v1_prod/2015-09-03T06:00:00.000Z_2015-09-03T07:00:00.000Z/

103M /mnt/storage/druid-tmp/realtime/basePersist/clientmetrics_events_v1_prod/2015-09-03T06:00:00.000Z_2015-09-03T07:00:00.000Z/

We also checked the recommended configuration for production realtime nodes, and I believe we are following most of the recommendations. Our machines have 24 cores and 128 GB of memory, and our Druid realtime node is started with -Xmx24G, druid.processing.buffer.sizeBytes=1000000000, and druid.processing.numThreads=23. It is also the only process running on the machine right now, so interference from other programs is not a concern. Our data source uses a schema-less dimension spec and has around 10 metrics. The overall number of dimensions is around 200; I am not sure whether that causes any problem.
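For reference, those processing settings correspond roughly to the following runtime.properties fragment (the heap size is a JVM flag on the command line rather than a property; values copied from above):

```properties
# JVM flag, passed on the command line, not in this file:
#   -Xmx24G
druid.processing.buffer.sizeBytes=1000000000
druid.processing.numThreads=23
```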

We also noticed that during the merge and persist, the realtime node seems to use 100% of a single CPU core while leaving the other cores mostly idle. We would appreciate any suggestions on this problem.

Hey James, some things you could look at are:

  1. See how many intermediate persist files you’re getting per hour; if it’s a lot then that would make the final merge slower. You can reduce the number of intermediate persists by using a lower intermediatePersistPeriod and/or a higher maxRowsInMemory. For the latter, make sure you do have enough heap to accommodate the additional rows.

  2. It sounds like you have just one realtime shard running on your machine. Each shard uses just a single core for persisting and merging. You can parallelize this by running multiple shards on the same machine, i.e., by having more than one entry in your specFile. Each entry should have the same configuration with one difference: the partitionNum needs to be different on each shard (start at 0 and count upward).

  3. In general make sure you’re not blocking on disk i/o.
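Putting points 1 and 2 together, a two-shard specFile could look roughly like the sketch below (structure per the 0.7-era realtime spec; the maxRowsInMemory and intermediatePersistPeriod values are illustrative, and the dataSchema/ioConfig contents are elided):

```json
[
  {
    "dataSchema": { "comment": "same dataSchema for both shards" },
    "ioConfig":   { "comment": "same ioConfig for both shards" },
    "tuningConfig": {
      "type": "realtime",
      "maxRowsInMemory": 500000,
      "intermediatePersistPeriod": "PT10M",
      "shardSpec": { "type": "linear", "partitionNum": 0 }
    }
  },
  {
    "dataSchema": { "comment": "same dataSchema for both shards" },
    "ioConfig":   { "comment": "same ioConfig for both shards" },
    "tuningConfig": {
      "type": "realtime",
      "maxRowsInMemory": 500000,
      "intermediatePersistPeriod": "PT10M",
      "shardSpec": { "type": "linear", "partitionNum": 1 }
    }
  }
]
```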

So we updated maxRowsInMemory to a higher value for our topics, and the number of intermediate files dropped from 50-100 to 10-20. Performance for new data is back to normal and keeps up with our ingest rate. It seems the merge process doesn’t cope well with a large number of intermediate files on realtime nodes.
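For anyone checking the same thing, one way to see how many intermediate persists an interval has accumulated is to count the numbered subdirectories under basePersist. The snippet below builds a mock layout since the real path is node-specific (on a real node, point BASE at your basePersist directory); merged/ is excluded because the merge step itself creates it:

```shell
# Mock basePersist layout: one numbered dir per intermediate persist,
# plus the merged/ dir produced by the merge itself.
BASE=$(mktemp -d)
mkdir -p "$BASE/clientmetrics_events_v1_prod/2015-09-03T06/0" \
         "$BASE/clientmetrics_events_v1_prod/2015-09-03T06/1" \
         "$BASE/clientmetrics_events_v1_prod/2015-09-03T06/2" \
         "$BASE/clientmetrics_events_v1_prod/2015-09-03T06/merged"
# Count intermediate persists for the interval, excluding merged/.
find "$BASE"/clientmetrics_events_v1_prod/*/ -maxdepth 1 -mindepth 1 \
     -type d ! -name merged | wc -l   # prints 3
```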