Druid compaction confusion, what exactly is maxTotalRows?

I have been optimizing segments for my data source. My current estimate is that 30M rows per segment should be a good target (it is currently set to the default of 5M).

When I try to run compaction on my data source in the web UI, the dialog box also asks for Max total rows. Is it fine to set it to 30M too, or should it be a higher value? It says this value is used for intermediate pushing of segments, but it doesn't explain anything more.

Relates to Apache Druid 0.20.0

Hi Mostafa,

I’m curious. How did you arrive at 30M?
A compaction/reindexing task is essentially a batch ingestion using Druid as the source.
Your math on `maxTotalRows` is likely to be similar, if not identical. In many cases you will see a reduction in segment count, because multiple rolled-up segments covering the same time frame will collapse into better rollups. In terms of segment sizing, it should be very similar.
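For reference, here is a rough sketch of what a compaction task spec with those values might look like (the data source name and interval are placeholders; in recent Druid versions the per-segment row limit is set via a dynamic partitionsSpec, where maxRowsPerSegment caps rows in a single segment and maxTotalRows caps the rows buffered across segments before an intermediate push):

```json
{
  "type": "compact",
  "dataSource": "my_datasource",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2021-01-01/2021-02-01"
    }
  },
  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 30000000,
      "maxTotalRows": 30000000
    }
  }
}
```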

More info: Coordinator Process · Apache Druid
and here: Configuration reference · Apache Druid
and here: Segment Size Optimization · Apache Druid

I came to 30M because my hourly segments were way too small (several megabytes each), and that created too many small segments, causing problems.

maxRowsInMemory is something you only need to touch if you have performance issues while the ingestion task is running. It controls how often Druid persists intermediate segments during ingestion, so it has memory implications.
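As a rough illustration, it lives in the same tuningConfig as the other settings (the value shown is Druid's documented default of 1,000,000 rows, not a recommendation):

```json
"tuningConfig": {
  "type": "index_parallel",
  "maxRowsInMemory": 1000000
}
```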

While you can increase the rows per segment beyond the recommended 5M (for example, when segments come out too small at 5M), please make sure your segment scan times do not go out of whack due to this change, as that can perversely degrade query performance.
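One way to sanity-check the result after compaction, assuming Druid SQL is enabled, is to query the sys.segments system table and look at the average segment size and row count for your data source (the data source name is a placeholder):

```sql
SELECT COUNT(*)        AS num_segments,
       AVG("size")     AS avg_size_bytes,
       AVG("num_rows") AS avg_rows_per_segment
FROM sys.segments
WHERE "datasource" = 'my_datasource'
  AND is_published = 1
  AND is_overshadowed = 0
```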
