Unable to run Hadoop index on remote Hadoop cluster

Hi, when I submit an index task to the Overlord:

curl -X 'POST' -H 'Content-Type:application/json' -d @hadoop_indexer_config/mm_basicactive_raw_hadoop_local.json http://overlord:8505/druid/indexer/v1/task

I found that the index task occupies only one worker capacity, and in the index task's log I see the lines below:
2015-11-12T18:31:10,364 INFO [communication thread] org.apache.hadoop.mapred.LocalJobRunner -

2015-11-12T18:31:11,087 INFO [pool-21-thread-1] org.apache.hadoop.mapred.LocalJobRunner -
2015-11-12T18:31:11,088 INFO [pool-21-thread-1] org.apache.hadoop.mapred.MapTask - Starting flush of map output
2015-11-12T18:31:11,326 INFO [pool-21-thread-1] org.apache.hadoop.mapred.MapTask - Finished spill 2
2015-11-12T18:31:11,328 INFO [pool-21-thread-1] org.apache.hadoop.mapred.Merger - Merging 3 sorted segments
2015-11-12T18:31:11,328 INFO [pool-21-thread-1] org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 3 segments left of total size: 169121063 bytes
2015-11-12T18:31:12,426 INFO [pool-21-thread-1] org.apache.hadoop.mapred.Task - Task:attempt_local521403138_0001_m_000001_0 is done. And is in the process of commiting
2015-11-12T18:31:12,466 INFO [pool-21-thread-1] org.apache.hadoop.mapred.LocalJobRunner -
2015-11-12T18:31:12,466 INFO [pool-21-thread-1] org.apache.hadoop.mapred.Task - Task 'attempt_local521403138_0001_m_000001_0' done.
2015-11-12T18:31:12,466 INFO [pool-21-thread-1] org.apache.hadoop.mapred.LocalJobRunner - Finishing task: attempt_local521403138_0001_m_000001_0
2015-11-12T18:31:12,466 INFO [pool-21-thread-1] org.apache.hadoop.mapred.LocalJobRunner - Starting task: attempt_local521403138_0001_m_000002_0
2015-11-12T18:31:12,466 WARN [pool-21-thread-1] mapreduce.Counters - Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2015-11-12T18:31:12,467 INFO [pool-21-thread-1] org.apache.hadoop.mapred.Task -  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1960ab49
2015-11-12T18:31:12,468 INFO [pool-21-thread-1] org.apache.hadoop.mapred.MapTask - Processing split: hdfs://nameservice1/druid/example_data/active_report_2015-10-21.log:134217728+134217728
2015-11-12T18:31:12,468 INFO [pool-21-thread-1] org.apache.hadoop.mapred.MapTask - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer

When I submit the Hadoop index task through "java io.druid.cli.Main index hadoop $spec_file", I also find that the indexing MapReduce job is run by LocalJobRunner and uses MR1.

I have already put core-site.xml, yarn-site.xml, and mapred-site.xml in the directory druid/config/_common, and I am sure these .xml files are included in the classpath.

So, my questions are:
    1. Does the Druid index task only support MR1, or can we submit a Druid index task to YARN?
    2. How can one index task be distributed across several peons?

Can anyone help me? Many thanks.


Here is my Overlord and MiddleManager config:
druid.indexer.runner.type=remote
druid.indexer.logs.type=hdfs
druid.indexer.logs.directory=hdfs://nameservice1/druid/log
druid.indexer.storage.type=metadata
druid.indexer.storage.recentlyFinishedThreshold=PT1H
druid.peon.mode=remote


Here is my index task JSON config:
{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "test_source",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "column" : "servertime",
            "format" : "yyyy-MM-dd HH:mm:ss"
          },
          "dimensionsSpec" : {
            "dimensions" : [ "active", "appVer", "appVerName", "c_timestamp", "channel", "countryIso", "currChannel", "dPids", "dfPids", "dsPids", "ifPids", "ip", "isPids", "model", "networkCountryIos", "networkType", "scPids", "screenHi
ght", "screenWidth", "subChannel", "type", "uid", "verRelease", "wallpaperDownload" ],
            "dimensionExclusions" : [],
            "spatialDimensions" : [ ]
          }
        }
      },
      "metricsSpec" : [ {
        "type" : "count",
        "name" : "count"
      }, {
        "type" : "longSum",
        "name" : "clickedSuccess",
        "fieldName" : "clickedSuccess"
      }],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "HOUR",
        "intervals" : [ "2015-10-20T00:00:00.000+08:00/2015-10-28T00:00:00.000+08:00" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "hdfs://nameservice1/druid/example_data/active_report/active_report_2015-10-15.log,hdfs://nameservice1/druid/example_data/active_report/active_report_2015-10-20.log"
      },
      "metadataUpdateSpec" : null,
      "segmentOutputPath" : null
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : -1,
        "maxPartitionSize" : -1,
        "assumeGrouped" : false,
        "numShards" : -1
      },
      "shardSpecs" : { },
      "indexSpec" : {
        "bitmap" : {
          "type" : "concise"
        },
        "dimensionCompression" : null,
        "metricCompression" : null
      },
      "leaveIntermediate" : false,
      "cleanupOnFailure" : true,
      "overwriteFiles" : true,
      "ignoreInvalidRows" : true,
      "jobProperties" : {

       },
      "combineText" : false,
      "persistInHeap" : false,
      "ingestOffheap" : false,
      "bufferSize" : 234217728,
      "aggregationBufferRatio" : 0.5,
      "useCombiner" : false,
      "rowFlushBoundary" : 80000
    }
  },
  "hadoopDependencyCoordinates" : ["org.apache.hadoop:hadoop-client:2.5.0-cdh5.3.1"],
  "classpathPrefix" : null,
  "dataSource" : "mm_basicactive_raw_online2",
  "resource" : {
    "availabilityGroup" : "xinglongtest",
    "requiredCapacity" : 12
  }
}

Inline.

I found that the index task occupies only one worker capacity, and in the index task's log I see the lines below:

Hi, when I submit an index task to the Overlord:

curl -X 'POST' -H 'Content-Type:application/json' -d @hadoop_indexer_config/mm_basicactive_raw_hadoop_local.json http://overlord:8505/druid/indexer/v1/task

The index task just acts as a driver for a remote Hadoop cluster. If you want to use a remote Hadoop cluster, you should include the appropriate configuration files in the classpath of whatever node the job is being submitted from. There are more details on running against remote Hadoop clusters here: http://imply.io/docs/latest/ingestion.html
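
For example, a minimal sketch of what that can look like when starting the MiddleManager, assuming a stock Druid 0.8.x layout and that /etc/hadoop/conf holds the remote cluster's core-site.xml, yarn-site.xml, and mapred-site.xml (all paths here are illustrative, not prescriptive):

# Sketch: put the Hadoop conf dir on the classpath of the node that submits the job.
# /etc/hadoop/conf is an assumed location; substitute your cluster's conf dir.
java -cp "config/_common:config/middleManager:/etc/hadoop/conf:lib/*" \
  io.druid.cli.Main server middleManager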

When I submit the Hadoop index task through "java io.druid.cli.Main index hadoop $spec_file", I also find that the indexing MapReduce job is run by LocalJobRunner and uses MR1.


I have already put core-site.xml, yarn-site.xml, and mapred-site.xml in the directory druid/config/_common, and I am sure these .xml files are included in the classpath.


So, my questions are:
    1. Does the Druid index task only support MR1, or can we submit a Druid index task to YARN?

Druid by default only supports MR2.
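
In practice, if the submitting JVM cannot see mapreduce.framework.name=yarn on its classpath, Hadoop silently falls back to LocalJobRunner, which is exactly what your log shows. Below is a sketch of a standalone run with the Hadoop conf dir explicitly on the classpath (same illustrative /etc/hadoop/conf assumption as above):

# Sketch: make the cluster's *-site.xml files visible to the standalone indexer
# so the job goes to YARN instead of LocalJobRunner.
java -cp "config/_common:/etc/hadoop/conf:lib/*" \
  io.druid.cli.Main index hadoop hadoop_indexer_config/mm_basicactive_raw_hadoop_local.json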

    2. How can one index task be distributed across several peons?

Please see the comments above. I think you may be misunderstanding how Druid Hadoop indexing works.

Hi Fangjin, thanks for your reply.

So, can we submit a Hadoop index task to an existing MR2 (YARN) cluster?

When I run 'java io.druid.cli.Main index hadoop $spec_file', I find it is always run by LocalJobRunner, and I don't see any MapReduce job on YARN.

Hi Jackson,

Can you check your mapred-site.xml and see if YARN is configured there?

e.g.,

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
- Jon

Yes, and my mapred-site.xml is in druid-0.8.1/config/_common; I am sure it is included in the classpath.
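
For reference, one quick way to double-check what that file actually sets (an illustrative command; it assumes the value sits on the line after the name, as in Jon's example):

# Sketch: print the framework setting from the mapred-site.xml Druid should be loading.
grep -A1 'mapreduce.framework.name' druid-0.8.1/config/_common/mapred-site.xml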

Actually, I am really confused about the index task.

When I submit an index task to the Overlord using a spec file that contains '"type" : "index_hadoop"', or when I just run 'java io.druid.cli.Main index hadoop $spec_file' (I tried both, but both used LocalJobRunner and nothing was submitted to YARN), it will run HadoopDruidIndexer, right?

When I submit an index task to the Overlord using a spec file that contains '"type" : "index"', it will use the Overlord-MiddleManager-Peon framework, right?