Our ingestion setup, at least for the moment, is entirely batch-based. We receive "eventually correct" updates from an upstream data warehousing service in day-granularity batches, which we turn into indexing tasks for the indexing service.
Does Druid provide any way to control the priority of these tasks? I don't see anything in the documentation, but I want to double-check that I'm not missing something.
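For context, the closest thing I've found so far is the task context `priority` field. As I understand it, this governs lock contention between tasks (a higher-priority task can preempt a lower-priority one for a segment lock) rather than the order in which the overlord hands pending tasks to workers, so it may not address the queueing problem at all. A sketch of what I mean, with a made-up spec body:

```json
{
  "type": "index_parallel",
  "spec": { },
  "context": {
    "priority": 25
  }
}
```

If that field does influence assignment order and not just locking, that would answer my question, but I haven't seen that documented.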
Here is the use case I'm worried about. In the normal course of events we'll mostly be getting periodic updates to today's data, plus occasional updates to older data trickling in. In that case I'm not too concerned about prioritization. But there will be times when we'll want to backfill large ranges of historical data, due to new dimensions becoming available and/or schema changes. If we dump six months' worth of "index this day" events into our system and create tasks for all of them at once, it will saturate our indexers. In that situation I'd want to give priority to recent (new) data rather than have it blocked waiting for six months of backfilling to finish.
Of course there are many ways we could handle this outside of Druid, before task creation: our own priority queue, throttling the backfill notifications, spinning up additional indexers, and so on. But is there any way to handle it on the Druid side?
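To make the external option concrete, here's a minimal sketch of the kind of priority queue I have in mind (the class name and payload shape are hypothetical, not anything from Druid): the key is the negated day ordinal, so the most recent day always pops first and a backfill dump can't starve today's updates.

```python
import heapq
from datetime import date


class BackfillAwareQueue:
    """Serve recent-day indexing tasks before older backfill days.

    Hypothetical sketch: each entry is a (day, payload) pair. More recent
    days pop first; within a day, entries pop in insertion order.
    """

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so same-day entries stay FIFO

    def push(self, day: date, payload):
        # Negate the ordinal so the most recent day has the smallest key.
        heapq.heappush(self._heap, (-day.toordinal(), self._counter, payload))
        self._counter += 1

    def pop(self):
        _, _, payload = heapq.heappop(self._heap)
        return payload

    def __len__(self):
        return len(self._heap)
```

With this, even if six months of backfill days are pushed first, a freshly pushed "today" task is the next one popped and turned into an indexing task.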
Another lever that might help: can I control which indexers are eligible to take a task, somehow, in the task spec? E.g., we spin up an extra set of indexers to handle a backfill and restrict the backfill tasks to those nodes, guaranteeing that one or two indexers remain free for the new data.
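The nearest mechanism I've come across is the overlord's dynamic worker configuration, which (if I'm reading the docs right) supports an affinity-based worker select strategy that pins datasources to specific middle manager hosts. Note it keys on datasource, not on individual tasks, so it would only work for us if the backfill writes to its own datasource. A sketch, with made-up datasource and host names:

```json
{
  "selectStrategy": {
    "type": "equalDistributionWithAffinity",
    "affinityConfig": {
      "affinity": {
        "backfill_datasource": ["indexer-backfill-1:8091", "indexer-backfill-2:8091"]
      },
      "strong": true
    }
  }
}
```

If there's a way to express this per task spec instead of per datasource, that would fit our case better. Is there?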