Batch ingestion with Hadoop + Indexing service

Hi, I’ve posted to several threads regarding Druid batch ingestion:
https://groups.google.com/forum/#!searchin/druid-user/druid$20batch$20shuai/druid-user/pv1Gt3xYmrQ/c_VXg-SYCQAJ
https://groups.google.com/forum/#!searchin/druid-user/druid$20batch$20shuai/druid-user/hG8ctsdt7XA/EWAHAgCgAQAJ
From the above threads I believe we’ll have to use Hadoop via the indexing service for our requirements. In the Imply documentation, I read that connecting Druid to a Hadoop cluster is required for things to work. Can someone help me understand the questions below?

  1. When I start a new Hadoop indexing task with the config below directly on Druid, I can see that the MiddleManager indexes in a single thread/core and runs everything serially, which results in unacceptably slow ingestion. In my current setup I did not copy any yarn-site or core-site config, and I am not trying to connect to a Hadoop cluster in any way. In this case, how is the indexing task (MapReduce) still running to completion?
  2. Is there a more detailed example than http://imply.io/docs/latest/ingestion-batch for hooking Druid up to Hadoop? I’m very new to Hadoop, so apologies if this sounds dumb :slight_smile: I’ve sketched my current understanding of the setup after the task spec below.
  3. My understanding is that Hadoop will process partitions in parallel: if you specify targetPartitionSize, Hadoop will index that number of partitions in parallel. In my case, though, I see everything running serially even though I’m sure I have enough capacity. I realize I did not explicitly connect to any Hadoop cluster; is this the problem?
  4. In my current setup, to backfill 1 hour’s worth of data, I’m trying to start 12 * 5-minute indexing tasks in parallel, so overall it finishes in about 1.5 hours for 1 hour’s worth of data (the submission loop I’m using is sketched after the task spec below). I assume that if everything is configured correctly, I should be able to submit a single task and Hadoop will take care of the rest; is this correct?
Thanks in advance. Here is the task I’m currently submitting:

curl -X 'POST' -H 'Content-Type:application/json' -d @/tmp/task.json localhost:8090/druid/indexer/v1/task

{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : [
          "s3://bucket/prefix/key.gz"
        ]
      }
    },
    "dataSchema" : {
      "dataSource" : "wikiticker",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "hour",
        "queryGranularity" : "none",
        "intervals" : ["2016-03-06T04:00/2016-03-06T05:00"]
      },
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "dim1"
            ]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "timestamp"
          }
        }
      },
      "metricsSpec" : [ {
        "type" : "count",
        "name" : "count"
      } ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 10000000
      },
      "rowFlushBoundary" : 500000,
      "jobProperties" : {
        "fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "io.compression.codecs" : "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
      }
    }
  }
}
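
Re: question 2, to show where I’m at: my rough understanding (an assumption on my part; both paths below are placeholders for wherever the Hadoop client configs and the Imply install actually live) is that "connecting to a Hadoop cluster" mostly means putting the cluster’s client XMLs onto Druid’s classpath, something like:

# Sketch only: copy the Hadoop cluster's client configs into Druid's common
# config directory so index_hadoop tasks talk to the real cluster instead of
# running in local mode. Both paths are placeholders from my setup.
cp /etc/hadoop/conf/core-site.xml \
   /etc/hadoop/conf/hdfs-site.xml \
   /etc/hadoop/conf/yarn-site.xml \
   /etc/hadoop/conf/mapred-site.xml \
   /path/to/imply/conf/druid/_common/

Is that roughly what the docs mean, or is there more to it (extra jobProperties, classpath flags, etc.)?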
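
And re: question 4, this is roughly how I’m fanning out the 12 five-minute tasks today (a sketch of my current workaround, not something from the docs; /tmp/task-template.json is a hypothetical copy of the spec above with __INTERVAL__ in place of the "intervals" value):

# Split 2016-03-06T04:00/05:00 into twelve 5-minute intervals and submit one
# index_hadoop task per interval to the overlord, all in the background.
for i in $(seq 0 11); do
  start=$(printf '2016-03-06T04:%02d' $((i * 5)))
  if [ "$i" -eq 11 ]; then
    end='2016-03-06T05:00'
  else
    end=$(printf '2016-03-06T04:%02d' $(((i + 1) * 5)))
  fi
  sed "s|__INTERVAL__|${start}/${end}|" /tmp/task-template.json > /tmp/task-${i}.json
  curl -X 'POST' -H 'Content-Type:application/json' \
       -d @/tmp/task-${i}.json localhost:8090/druid/indexer/v1/task &
done
wait

If a properly connected Hadoop cluster makes this unnecessary, I’d much rather submit the single one-hour task.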