Question on Kafka ingestion and multi-tenancy

I’m planning a POC using Kafka ingestion, where I am deliberately mapping Kafka partition IDs to tenant IDs. According to the documentation, Druid uses the Kafka partition ID in combination with the time to build more granular segments. That is great if it does, but how can I take advantage of that when I query? I don’t see a Kafka partition ID I can query by, but perhaps I’m missing something.

There is not a lot written about how to take advantage of it, so I’m wondering whether that is the extent of it. If I wanted to go further, would I have to reindex older data and explicitly define a secondary partition against one of my columns to get even better query performance?

Hi Austin -

just noting for the rest that we also discussed this in ASF slack at .
Basically, we discussed reindexing with single-dimension partitioning on tenant_id, and also the “trick” of setting queryGranularity and segmentGranularity to something bigger (e.g., 1 hour) and creating a secondary timestamp field with the desired granularity for use in queries (along with tenant_id).
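
To make that a bit more concrete, here is a rough sketch of the pieces of a native batch (index_parallel) reindex spec that this would touch, built as a Python dict so it is easy to tweak and dump to JSON. The column name event_ts for the secondary timestamp and the 5,000,000 row target are assumptions for illustration only, and this is only a fragment, not a complete ingestion spec.

```python
import json


def build_reindex_fragments(tenant_column="tenant_id",
                            fine_ts_column="event_ts",  # hypothetical column name
                            target_rows=5_000_000):     # assumed target, tune for your data
    """Return the spec fragments relevant to the approach discussed above."""
    # Coarser __time: both granularities set to something bigger (here 1 hour).
    granularity_spec = {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "HOUR",
    }
    # tenant_id listed first, plus the secondary timestamp kept as a long
    # dimension so the finer-grained time survives the coarser __time.
    dimensions_spec = {
        "dimensions": [
            tenant_column,
            {"type": "long", "name": fine_ts_column},
        ]
    }
    # Secondary partitioning on the tenant column. In native batch,
    # single_dim partitioning requires forceGuaranteedRollup to be true.
    tuning_config = {
        "type": "index_parallel",
        "forceGuaranteedRollup": True,
        "partitionsSpec": {
            "type": "single_dim",
            "partitionDimension": tenant_column,
            "targetRowsPerSegment": target_rows,
        },
    }
    return {
        "granularitySpec": granularity_spec,
        "dimensionsSpec": dimensions_spec,
        "tuningConfig": tuning_config,
    }


fragments = build_reindex_fragments()
print(json.dumps(fragments, indent=2))
```

These fragments would still need to be merged into a full spec (ioConfig, timestampSpec, and so on) for your own datasource.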

I heard this all from Gian in an AMA-type meeting not too long ago, and I’m operating from memory, so I hope I got things right.
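
On the query side, the idea would be to always filter on tenant_id and use the secondary timestamp field for the finer-grained grouping. Below is a minimal sketch that only builds the JSON payload for Druid's SQL endpoint (POST to /druid/v2/sql) without sending it. The datasource name and the event_ts column are hypothetical, and I have not run this exact query against a cluster, so treat it as a starting point.

```python
import json


def build_tenant_query(datasource, tenant_id, start_iso, end_iso,
                       fine_ts_column="event_ts"):  # hypothetical column name
    """Build a parameterized Druid SQL payload filtered by tenant and time."""
    sql = (
        f'SELECT "{fine_ts_column}", COUNT(*) AS events '
        f'FROM "{datasource}" '
        # Filtering on tenant_id is what lets a tenant_id-partitioned
        # datasource skip segments that cannot contain this tenant.
        "WHERE tenant_id = ? "
        "AND __time >= TIME_PARSE(?) AND __time < TIME_PARSE(?) "
        f'GROUP BY "{fine_ts_column}" '
        f'ORDER BY "{fine_ts_column}"'
    )
    return {
        "query": sql,
        # One entry per "?" placeholder, in order.
        "parameters": [
            {"type": "VARCHAR", "value": tenant_id},
            {"type": "VARCHAR", "value": start_iso},
            {"type": "VARCHAR", "value": end_iso},
        ],
    }


payload = build_tenant_query(
    "events",                 # hypothetical datasource
    "tenant-42",              # hypothetical tenant
    "2021-06-01T00:00:00Z",
    "2021-06-02T00:00:00Z",
)
print(json.dumps(payload, indent=2))
```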

Hey Austin welcome to the party :smiley:

There’s a good multitenancy article here that might be interesting for you:

I see the use of Kafka partitions on ingestion as being about how data is partitioned inside Deep Storage, and thus how effectively queries are parallelised on each Historical. That’s a separate thing from applying filters to data at query time.

For example, if you have some tenants that only produce 5 rows a day, you will end up with tiny, tiny Druid segment files that ultimately hurt your query performance big time.
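
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the worst case where each tenant ends up with its own segment per time chunk, and it compares against the roughly 5 million rows per segment that the Druid docs suggest as a target. The function names are made up for illustration.

```python
TARGET_ROWS_PER_SEGMENT = 5_000_000  # rough guideline from the Druid docs


def avg_rows_per_segment(rows_per_day, segment_granularity_hours):
    """Average rows per segment for one tenant, assuming one segment
    per tenant per time chunk (a simplifying assumption)."""
    chunks_per_day = 24 / segment_granularity_hours
    return rows_per_day / chunks_per_day


def fraction_of_target(rows_per_day, segment_granularity_hours):
    """How full a tenant's segments are relative to the target size."""
    rows = avg_rows_per_segment(rows_per_day, segment_granularity_hours)
    return rows / TARGET_ROWS_PER_SEGMENT


# A tenant producing 5 rows a day, with DAY and HOUR segment granularity.
for hours in (24, 1):
    rows = avg_rows_per_segment(5, hours)
    frac = fraction_of_target(5, hours)
    print(f"{hours}h segments: ~{rows:.2f} rows each, {frac:.8%} of target")
```

Either way the segments are many orders of magnitude below the target, which is why grouping small tenants together (or compacting later) tends to matter.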

Hm not sure if I’m helping (!) but feel free to DM me :smiley: