Can anyone give me general tips on how to find out why Druid is creating so many segments? I have 140,000 segments over a two-month period, and many of them are very small. I'm ingesting from Kafka and have a fairly large number of columns, roughly 60.
Is it because I'm indexing with __time set to the event time (which can be hours old) rather than the current time? I see a lot of "New Segment" lines in the peon output log. Is this typical?
I believe I figured this one out. Time will tell, but it looks like my problem was the Kafka tuningConfig property maxTotalRows. I had it set to 5 million for some reason; setting it to 0 makes the problem go away. I believe this property tells Druid to hand off segments to disk whenever the total number of rows across all segments being built exceeds the value. I don't see much use for this property, but at least I now know how it works (I think).
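For anyone hitting the same thing, here's roughly where that property lives in a Kafka supervisor spec. This is a hedged sketch, not a full spec — the row-count values shown are illustrative, and you should check the defaults for your Druid version:

```json
{
  "type": "kafka",
  "spec": {
    "tuningConfig": {
      "type": "kafka",
      "maxRowsPerSegment": 5000000,
      "maxTotalRows": 20000000
    }
  }
}
```

If maxTotalRows is much smaller than maxRowsPerSegment times the number of segments being built concurrently, handoff can trigger early and produce many small segments — which seems consistent with what I was seeing.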
You have to run compaction. It merges small segments into larger ones and reduces the total number of segments.
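As a rough sketch of what that looks like, here's the shape of an auto-compaction config you can submit to the Coordinator (the datasource name and values here are placeholders, so adjust for your setup and Druid version):

```json
{
  "dataSource": "my_datasource",
  "skipOffsetFromLatest": "P1D",
  "tuningConfig": {
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 5000000
    }
  }
}
```

The skipOffsetFromLatest window keeps compaction away from intervals that streaming ingestion is still actively writing to.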
+1 to @Rommel
This might be helpful too, Tony, in understanding optimal segment sizes…
Oh, and I find this sys-table SQL quite helpful from time to time for profiling the segments overall…
SELECT "datasource",
  FLOOR("num_rows" / 1000000) AS "bucket",
  COUNT(*) AS segments,
  SUM("size") / 1048576 AS totalSizeMb,
  MIN("size") / 1048576 AS minSizeMb,
  AVG("size") / 1048576 AS averageSizeMb,
  MAX("size") / 1048576 AS maxSizeMb,
  SUM("num_rows") AS totalRows,
  MIN("num_rows") AS minRows,
  AVG("num_rows") AS averageRows,
  MAX("num_rows") AS maxRows,
  (AVG("size") / AVG("num_rows")) AS avgRowSizeB
FROM sys.segments
WHERE is_available = 1 AND is_overshadowed = 0 AND is_realtime = 0
GROUP BY 1, 2
ORDER BY 1, 2