Is there a way to speed up Hadoop indexer over S3?

Hi,

I'm using the Hadoop indexer to load data, but it is really slow - it looks like there's only a single mapper per indexer task. Is there a way to speed this up?

I tried running three different indexer tasks (there are three task slots available), and it works, but the performance is still not great.

Here is one of my index spec files (I use a different interval in each of them: "2015-11-01/2015-11-08", "2015-11-09/2015-11-16" and "2015-11-17/2015-11-24"):

{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "granularity",
        "inputPath" : "s3n://af-druid/input/inappevents",
        "dataGranularity": "day",
        "filePattern": ".*.gz",
        "pathFormat": "'dt'=yyyy-MM-dd"
      }
    },
    "dataSchema" : {
      "dataSource" : "inappevents",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-11-01/2015-11-08"]
      },
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : ["app_id", "media_source", "campaign", "partner", "fb_adgroup", "fb_adset", "af_siteid", "af_sub1", "af_sub2", "af_sub3", "af_sub4", "af_sub5", "country", "region", "city", "ip", "platform", "device_type", "event_name", "sdk_version"]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "timestamp"
          }
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        },
        {
          "name" : "monetary",
          "type" : "longSum",
          "fieldName" : "monetary"
        }
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "jobProperties" : {}
    }
  }
}


In the S3 bucket there are around 1,000 files per day, each about 8 MB.
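Since the files are gzipped they aren't splittable, so if I understand correctly each of those ~1,000 files (roughly 8 GB per day compressed) ends up as its own tiny map input. From my reading of the batch ingestion docs, I was thinking of trying something like the tuningConfig below: combineText is supposed to use CombineTextInputFormat to group many small files into fewer splits, and specifying numShards instead of targetPartitionSize should skip the determine-partitions pass. I haven't tried this yet, the numShards value is just a guess, and I'm not sure whether combineText applies to the "granularity" inputSpec or only to "static" - does that sound like the right direction?

"tuningConfig" : {
  "type" : "hadoop",
  "partitionsSpec" : {
    "type" : "hashed",
    "numShards" : 3
  },
  "combineText" : true,
  "jobProperties" : {}
}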

Thanks in advance!

Michael

Forgot to add, I’m using the latest released version of Druid: 0.8.2.