How to partition Druid data segments by tenant_id using kafka-indexing-service

Hi,

We are using the kafka-indexing-service for real-time ingestion. Since we have multiple tenants in our datasource (a shared data source) and we query it by timestamp and tenant_id, I would like to know how we can automatically partition our data source by tenant_id using the kafka-indexing-service.

I saw that the Druid documentation suggests: "with realtime indexing, one option is to partition on tenant_id upfront. You'd do this by tweaking the stream you send to Druid. If you're using Kafka then you can have your Kafka producer partition your topic by a hash of tenant_id".

We already have our Kafka topic partitioned by tenant_id, but I do not see the Druid data segments being partitioned on tenant_id.
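For reference, keying each record by tenant_id is what makes a producer route all of a tenant's events to the same Kafka partition. A minimal sketch of that routing logic (using an MD5-based hash purely as a deterministic stand-in for Kafka's actual murmur2 default partitioner; the tenant IDs and partition count are made up):

```python
import hashlib

def partition_for(tenant_id: str, num_partitions: int) -> int:
    """Map a tenant_id to a Kafka partition, as a keyed producer would.

    Kafka's default partitioner uses murmur2 on the record key; MD5 is
    used here only to keep the sketch deterministic and dependency-free.
    """
    digest = hashlib.md5(tenant_id.encode("utf-8")).digest()
    key_hash = int.from_bytes(digest[:4], "big")
    return key_hash % num_partitions

# All events for one tenant land on the same partition:
assert partition_for("tenant-42", 8) == partition_for("tenant-42", 8)
```

With a real producer you would get the same effect by simply setting tenant_id as the record key and letting Kafka's default partitioner do the hashing. Note, though, that partitioning the topic this way does not by itself make Druid partition its segments by that dimension.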

When we create a batch load task, we can specify single-dimension partitioning like this:

Single-dimension partitioning

  "partitionsSpec": {
     "type": "dimension",
     "targetPartitionSize": 5000000
     "partitionDimension": tenant_id
 }
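For context, this partitionsSpec goes inside the tuningConfig of a Hadoop-based batch ingestion task; a minimal sketch of the surrounding spec (the task type shown assumes Hadoop-based batch ingestion, and the omitted dataSchema/ioConfig sections are elided):

```
{
  "type": "index_hadoop",
  "spec": {
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "dimension",
        "targetPartitionSize": 5000000,
        "partitionDimension": "tenant_id"
      }
    }
  }
}
```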

Is there such an option for the kafka-indexing-service on stream ingestion?
Does anybody have a sample supervisor spec JSON that partitions a Druid data source by tenant_id, or by any dimension other than timestamp?

Thanks
Hong

On Thursday, July 27, 2017 at 9:42:20 AM UTC+9, Hong Wang wrote:


I am looking for the same thing.
Did you ever get this to work?

I am looking for the answer too; please do share if you figure it out.
Thanks,