First of all, I want to congratulate you on the awesome project you are building; it has helped us greatly in getting sub-second analytics on billions of events per day.
I have 2 questions about Druid management:
- Have there been any changes in the way Druid deals with late events? Our ETL pipeline does a batch ingestion every 10 minutes of all the events collected (by another system) during those minutes, and then we continuously update the segments with late events (which can arrive with a delay of up to one week).
Every now and then we compact the segments as we update them with late events, i.e. we merge the six ten-minute segments in an hour plus the late events into a one-hour segment. Then after one day we compact the 24 one-hour segments plus late events into a one-day segment.
This workflow uses quite a lot of resources, as we end up with about 200 batch jobs per day. Is there an alternative to this (even one that involves redesigning the datasource)? To give you an idea of the scale: about 1 billion events arrive on time and about 100k events arrive late, up to one week afterwards.
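In case it makes the question more concrete, here is a sketch of what we imagine an alternative could look like: a single compaction task per interval instead of our manual merge jobs, assuming the `compact` task type is available in our Druid version (the datasource name, interval, and tuning values below are made up for illustration):

```json
{
  "type": "compact",
  "dataSource": "events",
  "interval": "2019-03-01T00:00:00/2019-03-02T00:00:00",
  "tuningConfig": {
    "type": "index",
    "maxRowsPerSegment": 5000000
  }
}
```

Would something along these lines cover both the six-to-one hourly merge and the daily roll-up, or does it have the same resource cost as what we do now?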
- The workflow we have creates a lot of unused segments that never seem to be deleted by Druid. E.g. once we overwrite segments 0-10, 10-20, …, 50-60 with a 0-60 segment, we observe that all the overwritten segments still exist.
Is there any setting in the ingestion tasks that tells Druid to remove these obsolete segments? If not, is there a way to get rid of them (e.g. is it safe to simply delete them from HDFS)?
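We did find the `kill` task type in the docs, and are wondering whether something like the sketch below, submitted to the Overlord task endpoint, is the intended way to purge overshadowed segments once they have been marked unused (datasource name and interval are again made up):

```json
{
  "type": "kill",
  "dataSource": "events",
  "interval": "2019-03-01/2019-03-08"
}
```

If that is the right tool, does it also clean up the corresponding files in HDFS deep storage, or only the metadata entries?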
Here’s an example of the overlapping/overwritten segments that we have in HDFS for one hour:
- The ten-minute segments
- The hour segment
- The day segment
Thank you for the help!