Hi Dave,
I don’t think there’s documentation about it yet, so let me give you an explanation here.
Before describing sharding in KIS, I’d like to clarify the relationship among the terms shard, partition, and segment.
In Druid, a dataSource is first partitioned by time; we call each resulting partition a ‘timeChunk’. A timeChunk can be further partitioned into segments, so all segments of the same timeChunk belong to the same time interval. Each segment belongs to exactly one partition, and finally, ‘partition’ and ‘shard’ mean the same thing in Druid.
Partitioning in KIS happens based only on the maximum segment size or a timeout for publishing segments. This is why KIS uses only NumberedShardSpec among the 5 types of shardSpec. NumberedShardSpec mostly just identifies shards by their IDs (i.e., partitionNum in the code) and doesn’t imply anything about what data is in the shard.
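To make the terminology concrete, here is a toy sketch (not actual Druid code; all class and field names are simplified stand-ins) of how a segment’s identity is composed of a dataSource, a timeChunk interval, a version, and the partitionNum carried by its NumberedShardSpec:

```python
from dataclasses import dataclass

# Toy model of a Druid segment identifier (illustrative, not Druid's code).
# The NumberedShardSpec carries only a partition ID; it says nothing
# about which rows ended up in the shard.

@dataclass(frozen=True)
class NumberedShardSpec:
    partition_num: int  # just an ID within the timeChunk

@dataclass(frozen=True)
class SegmentId:
    data_source: str
    interval: str   # the timeChunk, e.g. one HOUR
    version: str
    shard_spec: NumberedShardSpec

    def __str__(self):
        return (f"{self.data_source}_{self.interval}_{self.version}"
                f"_{self.shard_spec.partition_num}")

seg = SegmentId("metrics", "2018-01-01T00/2018-01-01T01", "v1",
                NumberedShardSpec(0))
print(seg)  # metrics_2018-01-01T00/2018-01-01T01_v1_0
```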
Here is a more detailed description.
Before 0.12, the supervisor launches Kafka index tasks which run for taskDuration. Each task can create multiple segments if it reads events whose timestamps fall into different timeChunks. However, a single task generates at most one segment per timeChunk. When the task’s run time reaches taskDuration, it starts publishing the generated segments. If the task generated the first segments for a timeChunk, they get partition ID 0 in their NumberedShardSpec. If the timeChunk already contained some segments, the newly generated segments get ‘max partition ID of existing segments + 1’ as their partition ID.
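The partition ID allocation described above can be sketched as a tiny simulation (illustrative only; the real assignment happens inside Druid’s segment allocation logic):

```python
# Sketch of per-timeChunk partition ID assignment: the first segment in a
# timeChunk gets ID 0; later segments get max existing ID + 1.

def next_partition_id(existing_ids):
    return 0 if not existing_ids else max(existing_ids) + 1

# One timeChunk with no segments yet, one that already has IDs 0 and 1.
time_chunks = {
    "2018-01-01T00/2018-01-01T01": [],
    "2018-01-01T01/2018-01-01T02": [0, 1],
}
for chunk, ids in time_chunks.items():
    ids.append(next_partition_id(ids))

print(time_chunks)
# {'2018-01-01T00/2018-01-01T01': [0],
#  '2018-01-01T01/2018-01-01T02': [0, 1, 2]}
```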
Please note that taskDuration might not be the same as segmentGranularity. For example, segmentGranularity can be HOUR while taskDuration is 30 mins (the default). In this case, 2 kafka index tasks can be submitted per hour (one after the other) and they might create segments for the same timeChunk. In the end, they will create at most 2 segments per timeChunk, with partition IDs 0 and 1, respectively. This can get more complicated if your Kafka topic has multiple partitions.
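The arithmetic in that example is just the ratio of the two durations (assuming tasks are submitted back to back and taskDuration divides segmentGranularity evenly):

```python
from datetime import timedelta

# segmentGranularity = HOUR, taskDuration = PT30M (the default):
# two tasks run per hour, so a timeChunk gets at most two segments,
# with partition IDs 0 and 1.
segment_granularity = timedelta(hours=1)
task_duration = timedelta(minutes=30)

tasks_per_chunk = segment_granularity // task_duration
print(tasks_per_chunk)  # 2
```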
After 0.12, things got a bit more complicated. Each kafka task can publish segments in the middle of taskDuration, so a task can generate multiple segments per timeChunk. This is controlled by ‘maxRowsPerSegment’, ‘maxTotalRows’, and ‘handoffConditionTimeout’: the task publishes segments if a segment reaches ‘maxRowsPerSegment’ rows, the task’s total reaches ‘maxTotalRows’ rows, or it hasn’t published any segments for ‘handoffConditionTimeout’. Please check http://druid.io/docs/latest/development/extensions-core/kafka-ingestion for more details about these configurations. Finally, as before 0.12, kafka tasks publish any remaining segments when taskDuration elapses.
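The early-publish conditions above can be summarized in a small predicate (a hedged sketch: the parameter names mirror the tuning configs, but the thresholds passed in are made up for illustration and the real checks live inside the Druid task code):

```python
# Illustrative decision function for post-0.12 early publishing.
# A task publishes when any one of the three conditions is met.

def should_publish(rows_in_segment, total_rows, secs_since_last_publish,
                   max_rows_per_segment, max_total_rows,
                   handoff_condition_timeout_secs):
    if rows_in_segment >= max_rows_per_segment:
        return True          # current segment is full
    if total_rows >= max_total_rows:
        return True          # task-wide row budget exhausted
    if secs_since_last_publish >= handoff_condition_timeout_secs:
        return True          # nothing published for too long
    return False

# Example with made-up thresholds:
print(should_publish(5_000_000, 5_000_000, 10,
                     max_rows_per_segment=5_000_000,
                     max_total_rows=20_000_000,
                     handoff_condition_timeout_secs=900))  # True
```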
Hope this helps. Please feel free to ask if you have more questions.
Jihoon