Segment Size Optimization

Environment:

Druid 0.12.1 (Installed via HDP-3.1.0.6)
2 Brokers, 2 Coordinators, 2 Overlords, 4 Historicals, 4 MiddleManagers

31 datasources total (20% of them utilize tuple and quantile datasketches)

Datasources without rollup have between 50 and 100 dimensions

100% of data ingested via Kafka Indexing Service (~1TB per day)

Daily segment granularity, Hourly query granularity

All queries are Timeseries, TopN, Scan, and Search based. GroupBy queries are avoided

Turnilo for data visualization purposes

First of all, I want to say thank you to the Druid committers and community. Druid is an amazing platform and it has far exceeded our expectations. It has quite a learning curve (the docs are great), but it has saved us significant development effort. Thank you!

Question:

Segment compaction is critical to great performance, and we're following the doc below for guidance:

https://druid.apache.org/docs/latest/operations/segment-optimization.html

Since we’re using an older version, we’re submitting tasks to manually compact segments for each of our datasources. We’re targeting 5 million rows per segment, but we’re far exceeding the 300-700 MB segment size range. The doc above says to optimize for 5 million rows first, so that’s the approach we’re pursuing, and performance so far has been excellent. However, for most of our datasources, we’re at ~4 GB per segment with ~3.5 million rows spanning approximately 5 days.
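For context, here is a rough sketch of how a manual compaction task submission can look: a small Python script POSTing a compaction task spec to the Overlord. The hostname, port, datasource name, interval, and the tuningConfig field shown are placeholders rather than our actual config, and the exact spec fields should be checked against the docs for your Druid version.

```python
# Sketch only: submit a manual compaction task to the Overlord task endpoint.
# Host, port, datasource, interval, and tuning values are placeholders.
import requests

OVERLORD_URL = "http://overlord-host:8090/druid/indexer/v1/task"  # placeholder host/port

compaction_task = {
    "type": "compact",
    "dataSource": "example_datasource",      # placeholder datasource name
    "interval": "2019-01-01/2019-01-02",     # compact one segment-granularity interval at a time
    "tuningConfig": {
        "type": "index",
        # Roughly targets the ~5M rows/segment guidance from the docs;
        # this field name may differ between Druid releases, so verify it.
        "targetPartitionSize": 5000000
    }
}

resp = requests.post(OVERLORD_URL, json=compaction_task)
resp.raise_for_status()
print("Submitted task:", resp.json().get("task"))
```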

So, should we continue to compact up to 5 million rows, knowing our segment sizes will increase proportionally, or take a different approach altogether? Any thoughts are greatly appreciated.

Hi JB:

Row count is a bigger factor than raw segment size.

If that 4 GB segment has a lot of columns, it may be fine to leave as-is.

Roughly 2-5 million rows per segment is the sweet spot, so in my opinion your datasources are in good shape.
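If you want to sanity-check rows per segment, one option is a segmentMetadata query against the Broker, which reports a row count per segment. A minimal sketch (the broker host/port, datasource name, and interval are placeholders):

```python
# Sketch only: inspect per-segment byte size and row count via segmentMetadata.
# Broker host/port, datasource name, and interval are placeholders.
import requests

BROKER_URL = "http://broker-host:8082/druid/v2"  # placeholder host/port

query = {
    "queryType": "segmentMetadata",
    "dataSource": "example_datasource",
    "intervals": ["2019-01-01/2019-01-08"],
    "merge": False,              # keep one result per segment
    "analysisTypes": ["size"]
}

for segment in requests.post(BROKER_URL, json=query).json():
    # Each result carries the segment id, its byte size, and its row count.
    print(segment["id"], segment.get("size"), segment.get("numRows"))
```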

Thanks

Thanks for the follow-up, Ming, I appreciate it.