Druid segments and compaction

Hi all,

There is a Kafka stream indexing task running on my Druid cluster with hour segment granularity, and it is generating a huge number of segments (100–300 per hour). Most of them are between 4,000 and 10,000 bytes; only a few are reasonably large.

I tried compacting one hour's worth of segments: it started with 375 segments, mostly small ones around 10 KB, and after compaction a single 300 MB segment was produced.
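For reference, I trigger the compaction by POSTing a compaction task to the Overlord, roughly like the sketch below (the datasource name, interval, and Overlord URL are placeholders, and the exact spec fields may differ between Druid versions):

import requests

# Minimal sketch of a manual compaction task for one hour of data.
# "my_datasource", the interval, and the Overlord host are placeholders.
compaction_task = {
    "type": "compact",
    "dataSource": "my_datasource",
    "interval": "2020-07-01T00:00:00/2020-07-01T01:00:00",
}

resp = requests.post(
    "http://overlord-host:8090/druid/indexer/v1/task",
    json=compaction_task,
)
print(resp.json())  # returns the task id if the task was accepted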

I wonder whether this is working as expected or my configuration is wrong. There are more than 350,000 segments on my Druid cluster now, and it really slows the console down.

Thanks.

Hi!

Sorry for the late reply. What do you have your segment sizes set to? maxRowsPerSegment, etc.? This article should help:

https://druid.apache.org/docs/latest/operations/segment-optimization.html

Cheers,

Rachel

Hi Rachel,

Thanks for the suggestion! I had read the article before. maxRowsPerSegment is set to 5,000,000 and the target segment size is around 500 MB, as recommended. The Druid version is 0.16.0.
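For context, the sizing-related part of my Kafka supervisor spec looks roughly like the fragment below (datasource and topic names are placeholders, the full spec has many more fields, and the exact field layout may differ between versions):

# Sketch of the segment-sizing parts of a Kafka supervisor spec,
# expressed as a Python dict. "my_datasource" and "my_topic" are
# placeholders and unrelated fields are omitted.
supervisor_spec_fragment = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "my_datasource",
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "HOUR",   # one segment interval per hour
        },
    },
    "tuningConfig": {
        "type": "kafka",
        "maxRowsPerSegment": 5000000,       # target rows per segment
    },
    "ioConfig": {
        "topic": "my_topic",
        "taskCount": 1,
        "taskDuration": "PT1H",
    },
}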

I'm using the Kafka indexing service, and there may be a lot of late data. Does every late event create a new segment? Could that be related?

Ruibin

Rachel Pedreschi <rachel.pedreschi@imply.io> wrote on Fri, Jul 3, 2020 at 11:49 PM:

Once a segment is created it is immutable, so yes, late data will create new segments until compaction comes around. I'm not 100% sure which version auto-compaction was released in, but if that is post-0.16, then you might benefit from upgrading.
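If your version has it, you can enable auto-compaction for a datasource through the Coordinator API, roughly like the sketch below (the datasource name, Coordinator host, and sizing fields are placeholders, and the accepted fields vary a bit between Druid versions):

import requests

# Sketch: enable auto-compaction for one datasource via the Coordinator.
# "my_datasource" and the Coordinator host are placeholders; sizing fields
# such as maxRowsPerSegment may differ by Druid version.
compaction_config = {
    "dataSource": "my_datasource",
    "maxRowsPerSegment": 5000000,
    "skipOffsetFromLatest": "P1D",  # leave the most recent day to the streaming tasks
}

resp = requests.post(
    "http://coordinator-host:8081/druid/coordinator/v1/config/compaction",
    json=compaction_config,
)
print(resp.status_code)  # 200 means the config was accepted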

The segment view on the console might help you see what is going on in more detail.

https://druid.apache.org/docs/latest/operations/druid-console.html#segments
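You can also check how many segments each hour has, and how big they are, with a SQL query against the sys.segments system table, for example (the Broker host and datasource name are placeholders):

import requests

# Sketch: count segments and total bytes per hourly interval for one datasource.
# The Broker host and 'my_datasource' are placeholders.
query = """
SELECT "start", "end", COUNT(*) AS num_segments, SUM("size") AS total_bytes
FROM sys.segments
WHERE datasource = 'my_datasource'
GROUP BY "start", "end"
ORDER BY "start"
"""

resp = requests.post(
    "http://broker-host:8082/druid/v2/sql",
    json={"query": query},
)
for row in resp.json():
    print(row)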