Druid 0.12.1 (installed via HDP)
2 Brokers, 2 Coordinators, 2 Overlords, 4 Historicals, 4 MiddleManagers
31 datasources total (20% of them use tuple and quantile datasketches)
Datasources without rollup have between 50 and 100 dimensions
100% of data ingested via Kafka Indexing Service (~1TB per day)
Daily segment granularity, Hourly query granularity
All queries are Timeseries, TopN, Scan, or Search; GroupBy queries are avoided
Turnilo for data visualization purposes
First of all, I want to say thank you to the Druid committers and community. Druid is an amazing platform and has far exceeded our expectations. It has quite a learning curve (the docs are great), but it has saved us significant development effort. Thank you!
Segment compaction is critical to great performance, and we're following the segment size optimization doc for guidance.
Since we're on an older version, we submit tasks to manually compact segments for each of our datasources. We're targeting 5 million rows per segment, but we're far exceeding the recommended 300-700MB segment size range. That doc says to optimize for 5 million rows per segment first, so that's the approach we're pursuing, and so far performance has been excellent. However, for most of our datasources we're at ~4GB per segment with ~3.5 million rows, spanning approximately 5 days.
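For reference, this is roughly how we submit those manual compaction tasks. A minimal sketch, assuming the `compact` task type available in 0.12.x and an Overlord reachable at `overlord:8090`; the datasource name, interval, and the `targetPartitionSize` tuning field are illustrative assumptions, so check the docs for your exact version:

```python
# Minimal sketch: submit a manual compaction task to the Overlord.
# Assumptions: the "compact" task type (available in 0.12.x), an Overlord
# at overlord:8090, and the "targetPartitionSize" tuning field -- field
# names may differ between Druid versions.
import json
import requests

# Standard Overlord task-submission endpoint.
OVERLORD_URL = "http://overlord:8090/druid/indexer/v1/task"

task = {
    "type": "compact",
    "dataSource": "my_datasource",        # hypothetical datasource name
    "interval": "2018-06-01/2018-06-06",  # interval whose segments to merge
    "tuningConfig": {
        "type": "index",
        "targetPartitionSize": 5000000    # target ~5M rows per segment
    }
}

resp = requests.post(OVERLORD_URL, data=json.dumps(task),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()
print("Submitted task:", resp.json().get("task"))
```

As a back-of-the-envelope check on our numbers: ~4GB over ~3.5 million rows is roughly 1.1KB per row, so the 300-700MB range would correspond to only about 260K-610K rows per segment, which is why the two targets pull in opposite directions for us.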
So, should we continue to compact up to 5 million rows per segment, knowing our segment sizes will grow proportionally, or take a different approach altogether? Any thoughts are greatly appreciated.