I’m trying to set up a realtime ingestion job suitable for distinct-count queries.
After some time, I figured out that this would be infeasible, because distinct-count needs single-dimension partitioning, and partitionSpecs are only available on batch ingestion jobs.
My question, then, is: are there any known limitations to realtime ingestion with partitioned data? If not, what would be a good starting point?
Realtime ingestion doesn’t let you specify a partitioning spec, so for the distinct-count extension to make sense at this point you’d need to use batch ingestion. The main reason is that Druid realtime ingestion doesn’t shuffle data and doesn’t guarantee any particular sharding scheme (it is a bit “fluid”). I think it would be tough to change this in general for all Druid realtime ingestion methods, although you might have some luck with some specific methods.
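
To illustrate why the sharding scheme matters here, below is a toy Python sketch (not Druid code). It assumes the distinct count is obtained by counting distinct values inside each shard and summing the per-shard results, which is only correct when every value of the dimension lands in exactly one shard. The function name `per_shard_distinct_sum`, the sample rows, and the byte-sum modulo used as a stand-in for dimension-based partitioning are all made up for illustration; Druid's actual single-dimension partitioning works differently in detail.

```python
from collections import defaultdict


def per_shard_distinct_sum(rows, shard_of):
    """Count distinct values inside each shard, then sum the per-shard counts.

    `shard_of(index, value)` is a hypothetical routing function that decides
    which shard a row lands in.
    """
    shards = defaultdict(set)
    for index, value in enumerate(rows):
        shards[shard_of(index, value)].add(value)
    return sum(len(values) for values in shards.values())


# Hypothetical visitor ids arriving on a stream.
rows = ["a", "b", "a", "c", "b", "a"]
NUM_SHARDS = 2

# "Fluid" sharding: rows go wherever they happen to land (round robin here),
# so the same visitor id can show up in more than one shard.
fluid = per_shard_distinct_sum(rows, lambda i, v: i % NUM_SHARDS)

# Partitioned by the dimension: the shard depends only on the value,
# so each visitor id lives in exactly one shard.
partitioned = per_shard_distinct_sum(
    rows, lambda i, v: sum(v.encode()) % NUM_SHARDS
)

true_distinct = len(set(rows))

print("true distinct:", true_distinct)
print("fluid sharding:", fluid)
print("partitioned by dimension:", partitioned)
```

With arbitrary sharding the summed count overshoots because repeated values are counted once per shard they appear in; when the shard is a function of the value alone, the sum matches the true distinct count.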