Controlling Realtime Segment Size

Is there any way to control realtime segment sizes beyond just rolling smaller increments of time? Our segments are currently 10-250MB on disk, rolling every 15 minutes (query granularity is also 15min). Occasionally a groupBy with four or five dimensions over one segment's interval will cause large heap usage or run into memory issues with GC pauses, etc. How does one determine the best segment size to use? Would upgrading help here, given the improvements to groupBy queries?
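For reference, the kind of query in question looks roughly like this (the datasource, dimension, and metric names here are just placeholders):

    {
      "queryType": "groupBy",
      "dataSource": "events",
      "granularity": "all",
      "dimensions": ["dim1", "dim2", "dim3", "dim4", "dim5"],
      "aggregations": [
        {"type": "longSum", "name": "events", "fieldName": "count"}
      ],
      "intervals": ["2015-03-01T00:00/2015-03-01T00:15"]
    }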

We are running Druid 0.6.171 with historical nodes on r3.xlarge instances; JVM heaps are set to 28GB.

-drew

Beyond segment granularity, you can use “sharding” to further partition segments (http://druid.io/docs/latest/Realtime-ingestion.html#-sharding).
Segments of 512MB - 1GB [uncompressed] size are known to be “optimal”, but
you might need to experiment for your particular dataset.
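For example, with the realtime spec you run two (or more) realtime nodes for the same dataSource and give each its own partitionNum; a rough fragment with placeholder names, showing only the sharding-related part of the schema:

    "schema": {
      "dataSource": "events",
      "aggregators": [{"type": "count", "name": "count"}],
      "indexGranularity": "minute",
      "shardSpec": {"type": "linear", "partitionNum": 0}
    }

and on the second node, the same schema with "partitionNum": 1.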

Depending upon where you are seeing the GC pauses (broker or historicals), you might need to tune the relevant properties; see the performance FAQ: http://druid.io/docs/latest/Performance-FAQ.html
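On the historicals, the usual knobs are the processing buffers and thread count in runtime.properties, e.g. (the values below are only placeholders to experiment with, not recommendations):

    druid.processing.buffer.sizeBytes=536870912
    druid.processing.numThreads=3

and the JVM's -XX:MaxDirectMemorySize should be at least roughly buffer.sizeBytes * (numThreads + 1).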

Upgrading might be a good idea in general but I don’t think it will solve the specific problem you described.

– Himanshu

Thanks Himanshu - where do I go to get the uncompressed segment sizes?

-drew

Himanshu is probably talking about the unzipped size, which is the “size” in the segment descriptor (also stored in metadata storage and reported by the coordinator).
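For example, a segment descriptor looks roughly like this (all values below are made up); the "size" field is the unzipped size in bytes:

    {
      "dataSource": "events",
      "interval": "2015-03-01T00:00:00.000Z/2015-03-01T00:15:00.000Z",
      "version": "2015-03-01T00:00:00.000Z",
      "loadSpec": {"type": "s3_zip", "bucket": "my-bucket", "key": "events/2015-03-01/0/index.zip"},
      "dimensions": "dim1,dim2,dim3",
      "metrics": "count",
      "shardSpec": {"type": "none"},
      "binaryVersion": 9,
      "size": 734003200,
      "identifier": "events_2015-03-01T00:00:00.000Z_2015-03-01T00:15:00.000Z_2015-03-01T00:00:00.000Z"
    }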