Parallel running of Kafka task and Reindexing task

Hey,

We are ingesting data into Druid through the Kafka indexing service, and because of this a lot of small segments have been created. As a workaround, we have created a reindexing task so that the small segments can be merged together into a more optimal size.

We have a few questions here:

  1. If we reindex a particular interval by giving:

"ingestionSpec": {
  "dataSource": "prism-data-10",
  "intervals": ["2016-07-12T00:00:00Z/2016-07-14T00:00:00Z"],
  "granularity": "DAY"
}

Does this mean that only the intervals provided above will be reindexed, and not other intervals?

  2. Will the above reindexing task delete all the previous smaller segments created by Kafka and merge them into a bigger index (per the config provided to the index_hadoop task)?

  3. Can a Kafka indexing service task and a batch reindexing task run in parallel? We are seeing that our batch reindexing task waits on a lock while the Kafka indexing service task is running. How can we run both tasks in parallel?

Here is the full input of our reindex task:

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "prism-data-10",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "dimensionsSpec": {
            "dimensions": [
              "event_id", "lang", "share_clicks", "ts_bucket", "old_hash_id",
              "ab_test", "event_name", "title", "noti_opened", "fullstory_time_total",
              "ts_back_valid", "custom_title", "targeted_city", "at", "short_view_event",
              "published_dt", "short_time", "notification_type", "variants", "device_id",
              "category", "toss_opened", "noti_shown", "event_source", "score",
              "author", "bookmark", "is_video", "source", "like_count",
              "share_view", "vid_length", "content", "fullstory_view", "ts_valid",
              "targeted_country", "video_event", "shortened_url", "toss_clicked", "hashId",
              "group_id", "img_url", "is_deleted"
            ]
          },
          "timestampSpec": {
            "format": "millis",
            "column": "at"
          }
        }
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "doubleSum", "name": "fullstory_total_time", "fieldName": "fullstory_time_total" },
        { "type": "longSum", "name": "total_like_count", "fieldName": "like_count" },
        { "type": "longMax", "name": "total_share_views", "fieldName": "share_views" },
        { "type": "longMax", "name": "total_vid_length", "fieldName": "vid_length" },
        { "type": "doubleSum", "name": "total_short_time", "fieldName": "short_time" },
        { "type": "hyperUnique", "name": "distinct_user", "fieldName": "device_id" },
        { "type": "hyperUnique", "name": "distinct_event", "fieldName": "event_id" },
        { "type": "hyperUnique", "name": "distinct_hash_Id", "fieldName": "hashId" },
        { "type": "longSum", "name": "total_bookmark", "fieldName": "bookmark" },
        { "type": "longSum", "name": "total_fullstory_view", "fieldName": "fullstory_view" },
        { "type": "longSum", "name": "total_noti_opened", "fieldName": "noti_opened" },
        { "type": "longSum", "name": "total_noti_shown", "fieldName": "noti_shown" },
        { "type": "longSum", "name": "total_toss_clicked", "fieldName": "toss_clicked" },
        { "type": "longSum", "name": "total_toss_opened", "fieldName": "toss_opened" },
        { "type": "longSum", "name": "total_share_click", "fieldName": "share_clicks" },
        { "type": "longSum", "name": "total_short_views", "fieldName": "short_view_event" },
        { "type": "longSum", "name": "total_video_views", "fieldName": "video_event" },
        { "type": "longSum", "name": "total_ts_valid", "fieldName": "ts_valid" },
        { "type": "longSum", "name": "total_full_ts_valid", "fieldName": "ts_back_valid" },
        { "type": "longMax", "name": "is_ab", "fieldName": "ab_test" },
        { "type": "longMax", "name": "ab_variants", "fieldName": "variants" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": { "type": "none" },
        "intervals": ["2016-01-01T00:00:00.000Z/2017-12-30T00:00:00.000Z"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "prism-data-10",
          "intervals": ["2016-07-12T00:00:00Z/2016-07-14T00:00:00Z"],
          "granularity": "DAY"
        }
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      },
      "combineText": true
    }
  },
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.2"]
}

I am no expert, but IMO:

  1. True. Reindexing only replaces segments for the intervals specified.

  2. Yes.

  3. The locking does make sense. I would wait until the recent segments are no longer hot with data updates before re-indexing.

Hey Saurabh,

Just to elaborate on what Giri mentioned:

  1. The segment generated by the re-indexing task will overshadow the smaller segments it merged and will be used to respond to future queries, but the actual segments will not be deleted from deep storage; you will need to run something like a kill task (http://druid.io/docs/latest/ingestion/tasks.html) or use the coordinator delete endpoint if you want to remove them (a minimal kill task sketch is below, after this list).

  2. Tasks hold locks on the intervals that they are generating segments for, so multiple tasks cannot operate on the same interval simultaneously. As Giri mentioned, you should schedule your batch ingestion job for some point in the future, once you're sure the realtime indexing tasks for that interval have completed (see also the note on granularitySpec intervals below).
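
For reference, here is a minimal sketch of a kill task for the interval you merged above. Note that a kill task permanently deletes segments from deep storage, so the overshadowed segments must already be marked unused by the coordinator before it will remove them:

{
  "type": "kill",
  "dataSource": "prism-data-10",
  "interval": "2016-07-12T00:00:00Z/2016-07-14T00:00:00Z"
}

Submit it to the overlord like any other task, and double-check the interval first, since the deletion is not reversible.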
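
Also, one thing worth checking in your spec (this is an assumption on my part, based on Hadoop index tasks locking the intervals listed in their granularitySpec, not something confirmed from your logs): your granularitySpec declares intervals spanning 2016-01-01/2017-12-30, which overlaps the intervals the Kafka tasks are currently writing to, so the reindex task may be trying to lock far more than the two days it actually reads. Narrowing granularitySpec.intervals to match the ingestionSpec window might let the two run side by side, e.g.:

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": { "type": "none" },
  "intervals": ["2016-07-12T00:00:00Z/2016-07-14T00:00:00Z"]
}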