The duration of the batch index task

Hi guys,

We use the index task (http://druid.io/docs/0.9.1.1/ingestion/tasks.html) to downsample historical data and adjust the segment granularity. For example, realtime data is rolled up at minute granularity, while yesterday's data is rolled up at fifteen-minute granularity with a segment granularity of one day. In my cluster, though, the task almost always needs 24 hours or more to finish. My questions are: why does the index task take so long, and what do I need to do to decrease the time it needs?
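
For reference, the fifteen-minute roll-up with one-day segments corresponds to a granularitySpec roughly like the sketch below (illustrative only; the interval is just an example, and the full spec for the one-hour/six-hour case is included further down):

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "fifteen_minute",
  "intervals": ["2016-08-30T00:00:00.000Z/2016-08-31T00:00:00.000Z"]
}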

Are you using a local Hadoop index task? Those can be slow given the overhead of running MapReduce. That said, 24 hours is a very, very long time. Can you include your indexing spec?

No, I'm not using the Hadoop index task, just the plain index task on a VM, and the situation is worse now: the task runs for longer than 24 hours.

Dataset: about 18 GB per day; the biggest one-hour segment is about 1 GB.

Machine config for the batch index task: 16 GB memory, 4-core CPU, 80 GB SAS disk.

The middleManager conf is similar to that of the realtime index task:

druid.indexer.runner.javaCommand=/usr/lib/jvm/java-7-openjdk-amd64/bin/java
druid.indexer.runner.javaOpts=-server -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
druid.indexer.task.baseTaskDir=/srv/nbs/0/druid/task/
druid.indexer.task.restoreTasksOnRestart=true

# HTTP server threads
druid.server.http.numThreads=25

# Processing threads and buffers
druid.processing.buffer.sizeBytes=536870912
druid.processing.numThreads=2

One of the index specs (roll-up at one hour, segment granularity of six hours):

{
  "task": "index_request_2016-09-01T16:10:00.164Z",
  "payload": {
    "id": "index_request_2016-09-01T16:10:00.164Z",
    "resource": {
      "availabilityGroup": "index_request_2016-09-01T16:10:00.164Z",
      "requiredCapacity": 1
    },
    "spec": {
      "dataSchema": {
        "dataSource": "request",
        "parser": {
          "type": "map",
          "parseSpec": {
            "format": "json",
            "timestampSpec": {
              "column": "timestamp",
              "format": "auto",
              "missingValue": null
            },
            "dimensionsSpec": {
              "dimensions": [
                "app",
                "operator",
                "network",
                "host",
                "geo",
                "city",
                "path"
              ]
            }
          }
        },
        "metricsSpec": [{
          "type": "longSum",
          "name": "requestCount",
          "fieldName": "requestCount"
        }, {
          "type": "longSum",
          "name": "requestTime",
          "fieldName": "requestTime"
        }, {
          "type": "longSum",
          "name": "responseTime",
          "fieldName": "responseTime"
        }, {
          "type": "longSum",
          "name": "fpTime",
          "fieldName": "fpTime"
        }, {
          "type": "longSum",
          "name": "dnsTime",
          "fieldName": "dnsTime"
        }, {
          "type": "longSum",
          "name": "sendBytes",
          "fieldName": "sendBytes"
        }, {
          "type": "longSum",
          "name": "receivedBytes",
          "fieldName": "receivedBytes"
        }],
        "granularitySpec": {
          "type": "uniform",
          "segmentGranularity": "SIX_HOUR",
          "queryGranularity": {
            "type": "duration",
            "duration": 3600000,
            "origin": "1970-01-01T00:00:00.000Z"
          },
          "intervals": [
            "2016-08-30T00:00:00.000Z/2016-08-31T00:00:00.000Z"
          ]
        }
      },
      "ioConfig": {
        "type": "index",
        "firehose": {
          "type": "ingestSegment",
          "dataSource": "request",
          "interval": "2016-08-30T00:00:00.000Z/2016-08-31T00:00:00.000Z",
          "filter": null,
          "dimensions": null,
          "metrics": null
        }
      },
      "tuningConfig": {
        "type": "index",
        "targetPartitionSize": 10000000,
        "rowFlushBoundary": 1000000,
        "numShards": -1,
        "indexSpec": {
          "bitmap": {
            "type": "concise"
          },
          "dimensionCompression": null,
          "metricCompression": null
        },
        "buildV9Directly": true
      }
    },
    "context": null,
    "groupId": "index_request_2016-09-01T16:10:00.164Z",
    "dataSource": "request",
    "interval": "2016-08-30T00:00:00.000Z/2016-08-31T00:00:00.000Z"
  }
}

One more question: for the same dataset (e.g., one day of data), does a different segment granularity (six_hour vs. one day) affect the duration of the index task?

The index task is not designed for any production workload whatsoever; it should be used for datasets under 1 GB. Can you run the Hadoop index task instead?
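
If it helps, a re-indexing job equivalent to the spec above could be submitted as a Hadoop index task roughly like the sketch below (illustrative only: the dataSource inputSpec re-reads the existing "request" segments, the parser and metricsSpec are omitted since they would match the index task above, and the targetPartitionSize is just a starting point to tune):

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "request",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "SIX_HOUR",
        "queryGranularity": "hour",
        "intervals": ["2016-08-30T00:00:00.000Z/2016-08-31T00:00:00.000Z"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "request",
          "intervals": ["2016-08-30T00:00:00.000Z/2016-08-31T00:00:00.000Z"]
        }
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      }
    }
  }
}

The main difference is that the map/reduce work gets spread across the Hadoop cluster instead of running inside a single peon with a 2 GB heap.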