We are testing the Kafka indexing service in our production pipeline (segmentGranularity: fifteen_minute, taskDuration: PT1H). Data is indexed correctly, but we still have several questions:
Is it possible for the Kafka indexing service to append data to old segments created by the batch Hadoop indexing method? (Use case: our production pipeline currently uses batch Hadoop indexing and we want to migrate it to the Kafka indexing service, but newly arriving data may carry old timestamps, so Kafka indexing tasks would need to append to old segments covering the same time periods that the batch tasks created.) As far as we know, batch Hadoop indexing segments do not have numbered shard specs, so if Kafka indexing tasks (with the same segmentGranularity as the batch tasks) try to create new segments within the same time period, they trigger an error, something like: could not allocate segment for row with timestamp[xxxxxxx]. We tried to specify shard specs when creating the batch Hadoop indexing tasks, but we found that numbered and linear shard specs are only used by real-time indexing tasks.
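For reference, this is roughly where we tried to influence the shard specs on the batch side, assuming the standard tuningConfig/partitionsSpec layout of the Hadoop ingestion spec (all values here are illustrative, not our real settings):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "partitionsSpec": {
      "type": "hashed",
      "targetPartitionSize": 5000000
    }
  }
}
```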
Any suggestions on how to choose the segmentGranularity value? With fifteen_minute we see many small segments (about 300 KB each), which will definitely hurt query performance, so we are considering increasing segmentGranularity (e.g. to one day) so that each segment ends up around 500 MB. If that is the correct approach, does it mean the segmentGranularity value depends on the actual data volume and should be tuned based on data size?
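For illustration, the change we have in mind would be made in the supervisor spec's granularitySpec; a minimal sketch, assuming the usual dataSchema layout (dataSource name and queryGranularity are just examples):

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "our-datasource",
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "NONE"
    }
  }
}
```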
Similarly, any suggestions on taskDuration? With a larger segmentGranularity it seems we also need a larger taskDuration. But a larger taskDuration means more data is kept in the real-time pipeline (in memory or on middleManager disk), and it takes longer to push that data to the historical nodes. Will such pushes affect real-time indexing tasks as well as query performance? If so, does it mean we need to give middleManager nodes more resources when we set a larger taskDuration?
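For context, taskDuration lives in the supervisor's ioConfig; a minimal sketch with illustrative values (topic name and broker address are placeholders):

```json
{
  "ioConfig": {
    "topic": "our-topic",
    "taskDuration": "PT4H",
    "consumerProperties": {
      "bootstrap.servers": "kafka01:9092"
    }
  }
}
```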
We know we can update the Kafka indexing supervisor spec during the service's lifetime. For example, suppose segmentGranularity is fifteen_minute and one segment 201609090000_201609090015 has been created, and then we submit a new supervisor spec with segmentGranularity set to one hour. For newly arriving (delayed) data with timestamp 201609090012, I think the Kafka indexing service will create a new segment 201609090000_201609090100, which overlaps the old segment (201609090000_201609090015). How does Druid handle this case? Will queries return correct data?
Thank you very much.