[druid-user] Re: Druid performance with late events

Hey!

That’s right - additional segments are created containing those additional events.

Query performance is affected inasmuch as each additional segment generates an additional query subtask, which runs alongside the subtasks for the segments that already exist. Indeed, small segments created this way are one of the use cases that auto-compaction was built to address.
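As a rough illustration, here is a minimal sketch of an auto-compaction config built in Python. The datasource name "events" and the chosen values are assumptions for the example, not recommendations. The field names follow the Druid auto-compaction config as I understand it; please check them against the docs for your version.

```python
import json


def build_auto_compaction_config(datasource, skip_offset="P1D",
                                 segment_granularity="DAY",
                                 max_rows_per_segment=5_000_000):
    """Build an auto-compaction config dict for one datasource.

    skip_offset keeps compaction away from the most recent data, which
    is where late events are still most likely to arrive.
    All default values here are illustrative assumptions.
    """
    return {
        "dataSource": datasource,
        "skipOffsetFromLatest": skip_offset,
        "granularitySpec": {"segmentGranularity": segment_granularity},
        "tuningConfig": {
            "partitionsSpec": {
                "type": "dynamic",
                "maxRowsPerSegment": max_rows_per_segment,
            }
        },
    }


# "events" is a hypothetical datasource name.
config = build_auto_compaction_config("events")
print(json.dumps(config, indent=2))
```

The resulting JSON would typically be submitted to the Coordinator (for example through the web console, or a POST to /druid/coordinator/v1/config/compaction). The sketch only builds the payload and does not contact a cluster.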

An alternative would be to store the full event data in something like S3 and to regularly re-ingest (overwrite, not append) particular intervals using batch.
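A sketch of what such an overwrite spec could look like, again built in Python. The bucket path, datasource name, timestamp column and interval are all hypothetical; the important part is that appendToExisting is false and that the interval to replace is listed explicitly, so the batch task overwrites that interval rather than adding to it.

```python
import json


def build_overwrite_spec(datasource, interval, s3_prefix,
                         timestamp_column="timestamp"):
    """Build a native batch (index_parallel) spec that overwrites one interval.

    timestamp_column and the JSON input format are assumptions about
    how the raw events were stored.
    """
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": timestamp_column, "format": "iso"},
                # Empty list: let Druid discover dimensions from the data.
                "dimensionsSpec": {"dimensions": []},
                "granularitySpec": {
                    "segmentGranularity": "DAY",
                    "queryGranularity": "NONE",
                    # Only this interval is replaced.
                    "intervals": [interval],
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {"type": "s3", "prefixes": [s3_prefix]},
                "inputFormat": {"type": "json"},
                # Overwrite, not append.
                "appendToExisting": False,
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }


# Hypothetical names: replace one day of the "events" datasource.
spec = build_overwrite_spec(
    "events",
    "2021-06-01/2021-06-02",
    "s3://my-bucket/events/2021-06-01/",
)
print(json.dumps(spec, indent=2))
```

Because the whole interval is rebuilt from the full event data, any late events that have landed in S3 by then end up in the same well-sized segments as everything else.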

Hope this helps?

Interesting…! I don’t think you would be able to do that though…

It might be helpful to think of segments as actual files: segments are created bit-by-bit as the ingestion is happening in the peon tasks that are being run by the Middle Managers. The files they create - i.e. the segments - are being put into deep storage regularly by those tasks. The Overlord “publishes” them to the metadata store.

Meanwhile, on a configurable cycle, the Coordinator is checking the metadata database. In there it finds information about published segments, and it uses this to tell the Historicals to load those segments from deep storage.

So you could certainly make the Coordinator check the metadata database less often, but doing so would only mean that it would be a long while before the segments are loaded and queryable on the Historicals - it wouldn’t affect the segment creation, because that all happens on the ingestion side under the Overlord.

  • Pete

In streaming, yep - the data that’s being ingested is queryable.

It would be very unusual to keep data in those ingestion processes for a long time - typically people get segments handed off quickly, i.e. within a few hours.

Yes - you’re right - auto-compaction handles that.