How to improve load speed for 1 small file

Loading 5 rows takes about 20 seconds before the data becomes available. Is there a way to speed this up?

I use the apache/druid:0.18.0 Docker environment. The steps to load data are:

  1. Call /druid/indexer/v1/task
  2. Iterate /druid/coordinator/v1/loadstatus until the datasource reports fully loaded (sketched below)
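
In pseudocode, the loop on our side is roughly like this (a simplified sketch; the Router URL, port, and datasource name are placeholders, and the ingestion spec is the JSON body we already send to the task endpoint):

```python
import time
import requests

# Placeholders: we assume the Router at localhost:8888 proxies both the Overlord task API
# and the Coordinator API; "my_table" is a hypothetical datasource name.
DRUID_URL = "http://localhost:8888"
DATASOURCE = "my_table"

def ingest_and_wait(ingestion_spec: dict, timeout_s: float = 60.0) -> None:
    # Step 1: submit the batch ingestion task.
    resp = requests.post(f"{DRUID_URL}/druid/indexer/v1/task", json=ingestion_spec)
    resp.raise_for_status()
    print("submitted task", resp.json()["task"])

    # Step 2: poll loadstatus until the datasource reports 100% of its segments loaded.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = requests.get(f"{DRUID_URL}/druid/coordinator/v1/loadstatus").json()
        if status.get(DATASOURCE, 0) >= 100:
            return
        time.sleep(0.5)
    raise TimeoutError(f"{DATASOURCE} not fully loaded after {timeout_s}s")
```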

The configuration files are under trino/plugin/trino-druid/src/test/resources in the trinodb/trino repository on GitHub.

Welcome @Yuya_Ebihara! I’ve reached out to a colleague who might have some experience with that Docker environment, so hopefully we can provide more details. Generally, though, people speed up batch ingestion by running a parallel ingestion job with many workers. Setting maxNumConcurrentSubTasks to a value greater than 1 in the tuningConfig could help (assuming that your input source is splittable); if maxNumConcurrentSubTasks is 1, the tasks run sequentially.
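
For reference, a trimmed-down fragment of what that could look like in an index_parallel spec (the value 4 is just an example, and for a 5-row file parallelism alone probably won’t buy much):

```json
{
  "type": "index_parallel",
  "spec": {
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 4
    }
  }
}
```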

Follow-up:

The ingestion process is a multi-stage, asynchronous process. The MiddleManager ingests the file, converts it into a Druid segment file, and publishes it to deep storage. The Overlord updates the metadata DB; the Coordinator then reads the metadata DB and assigns the new segment to one or more Historicals by writing the instructions to ZooKeeper (or directly, if segment management is set to HTTP). While there are some parameters which might tighten up the time it takes to load such a small batch, for example druid.segmentCache.announceIntervalMillis, we wouldn’t necessarily recommend this. If you’re looking for lower latencies in terms of data availability, you might try real-time ingestion.
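
If you do want to experiment with that property anyway, it goes in the Historical’s runtime.properties; a sketch (1000 ms is just an example value):

```properties
# historical/runtime.properties
# Interval at which the Historical announces segments as they load (default 5000 ms).
druid.segmentCache.announceIntervalMillis=1000
```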

Can you tell us a bit more about your use case?

Thanks for sharing the tips. The use case is testing Trino Druid connector on CI, not production. We create many tables with few records in the tests, so we would like to minimize the load time as much as possible.

I set druid.segmentCache.announceIntervalMillis to 1000 in historical/runtime.properties, but it still takes about 15 s to load 5 rows.

I see.
So, let’s try to figure it out, quoting from the docs:

Before any unassigned segments are serviced by Historical processes, the Historical processes for each tier are first sorted in terms of capacity, with least capacity servers having the highest priority. Unassigned segments are always assigned to the processes with least capacity to maintain a level of balance between processes. The Coordinator does not directly communicate with a historical process when assigning it a new segment; instead the Coordinator creates some temporary information about the new segment under load queue path of the historical process. Once this request is seen, the historical process will load the segment and begin servicing it.
So I think we’ll need to tighten the Coordinator’s processing frequency and its communication with the Historicals. Scanning through the configs, here’s what seems relevant to me:

  • If you are standing up a Druid cluster as part of the CI cycle, you might want to lower druid.coordinator.startDelay, which is 5 minutes by default.
  • The Coordinator settings for Segment Management may help a lot; it looks like you can use HTTP communication instead of ZooKeeper, which should make segment assignment more immediate.
  • druid.manager.segments.pollDuration defaults to 1 minute; dropping it to something like 5 seconds could help reduce the total time. I’m not sure this one is affecting you, though, since you mentioned you’ve already got it down to 15 s. (A combined sketch of these settings follows this list.)
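
Putting those together, a runtime.properties sketch for the Coordinator in a short-lived CI cluster might look like the following. The values are only suggestions to try, druid.coordinator.period is the “processing frequency” knob mentioned above, and druid.serverview.type / druid.coordinator.loadqueuepeon.type are the HTTP-based segment management switches I had in mind; please double-check the names and defaults against the 0.18.0 docs.

```properties
# coordinator/runtime.properties -- values are only suggestions to experiment with.

# Default PT300S: how long the Coordinator waits before it starts assigning segments.
druid.coordinator.startDelay=PT5S

# Default PT60S: how often the Coordinator run cycle executes.
druid.coordinator.period=PT5S

# Default PT1M: how often the metadata store is polled for newly published segments.
druid.manager.segments.pollDuration=PT5S

# HTTP-based server view and load queue instead of ZooKeeper.
druid.serverview.type=http
druid.coordinator.loadqueuepeon.type=http
```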

Let us know how it goes.