Performance expectations for the indexing service

We are experimenting with the Druid indexing service, ingesting data from S3.

The data in one file (10 seconds worth of data):

  • TSV, ~180 columns
  • ~100,000 rows per file
  • ~10 MB gzipped

Our setup:

  • Overlord and middleManager running on the same EC2 instance (r3.4xlarge)

Ingesting one file takes ~35 seconds. Could someone advise on the following questions:

  1. Is ~35 seconds the expected performance for this data set with the indexing service, or can it be significantly improved?
  2. I’ve tried submitting multiple tasks at the same time, expecting them to run in parallel. My observation is that the tasks execute serially; is this expected? Can we take advantage of some parallelism?
  3. CPU and memory are quite under-utilized; how can we make better use of these resources?

middleManager config:
druid.host=172.31.3.210
druid.port=8084
druid.service=druid/prod/middlemanager

Resources for peons

druid.indexer.runner.javaOpts="-server -Xmx8g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

Peon properties

druid.indexer.fork.property.druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=536870912
druid.indexer.fork.property.druid.processing.numThreads=9
druid.indexer.task.defaultRowFlushBoundary=50000
#druid.indexer.fork.property.druid.segmentCache.locations=[{"path": "/mnt/persistent/zk_druid", "maxSize": 0}]
druid.indexer.fork.property.druid.server.http.numThreads=30

druid.worker.capacity=9
druid.worker.ip=172.31.3.210

Posting the overlord config as well:

# Default host: localhost. Default port: 8090. If you run each node type on its own node in production, you should override these values to be IP:8080.

druid.host=172.31.3.210
druid.port=8090
druid.service=druid/prod/overlord

druid.indexer.queue.startDelay=PT0M

Looking at https://groups.google.com/forum/#!topic/druid-user/J_NcHMiTCkg, it appears that “indexing locks are a per-dataSource, per-interval thing”. Just to clarify: if either the data source is the same OR the time interval is the same for two tasks, only one task will execute at a time; is this OR logic correct?
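To make the question concrete: imagine two otherwise-identical index tasks that share the dataSource "test" but cover non-overlapping intervals (the dates here are just placeholders):

Task A: "intervals": ["2015-10-30/2015-10-31"]
Task B: "intervals": ["2015-10-31/2015-11-01"]

Under the OR reading these would still run one at a time (same dataSource); under a per-dataSource AND per-interval reading they should be able to run in parallel.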

Hi Shuai, answers inline.


Thanks Fangjin for the quick response.

** How are you ingesting the data? Are you using EMR or just using the index task? FWIW, the index task is extremely sub-optimal for ingestion and is really there for quick and dirty POCs.**

The reason I use the indexing service is that it was recommended to use the Hadoop indexer only for files > 1 GB. Our files are mostly ~10 MB, containing 100,000 rows of ~180 columns each. With our hardware setup, how long should we expect indexing one file to take? 35 seconds for a 10 MB gz file seems long; I wonder if there is an obvious pitfall I’ve run into.

I’m running the Overlord and middleManager on the same host with the config from my previous post, and I submit the task with the command below; the task config is pasted as well:
curl -X 'POST' -H 'Content-Type:application/json' -d @index_task.json localhost:8090/druid/indexer/v1/task

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "test",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "tsv",
          "timestampSpec": {
            "column": "timestamp",
            "format": "yyyyMMddHHmmss"
          },
          "columns": ["col1", "col2", …],
          "delimiter": "\t",
          "dimensionsSpec": {
            "dimensions": ["col1", "col2", …]
          }
        }
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "MINUTE",
        "queryGranularity": "NONE",
        "intervals": ["2015-10-30/2015-11-01"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "static-s3",
        "uris": ["s3://bucket/prefix/test.gz"]
      }
    },
    "tuningConfig": {
      "type": "index",
      "targetPartitionSize": 0,
      "rowFlushBoundary": 500000
    }
  }
}

Hey Shuai,

Your granularitySpec is not really conducive to best performance: MINUTE segmentGranularity over 2015-10-30/2015-11-01 (two days of data) will create over a thousand segments, each one needing its own reducer. Try segmentGranularity "DAY" instead (note you can still have more granular timestamps within the segment; that's the queryGranularity).
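For example, keeping everything else in your spec the same, the adjusted granularitySpec would look like this (just a sketch; the interval is the one from your task above):

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "NONE",
  "intervals": ["2015-10-30/2015-11-01"]
}

With queryGranularity NONE, individual row timestamps keep their full resolution; you just get two DAY segments for the two-day interval instead of thousands of MINUTE segments.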