Pushing Segments to Google Storage from Hadoop Batch Indexer

Hello!
I’m successfully indexing data using Google Dataproc. However, the job fails when the indexer tries to push the newly created segments to Google Storage. I’ve tried for a while to get this to work, but maybe someone here can provide more insight?

I’m compiling a fat jar from source; I believe the source snapshot sits somewhere between 0.10.1 and 0.11.0.

The full stack trace is below.

Thanks in advance!

Peter

Peters-MacBook-Pro:~/repos/druid (master *% u-66)$ ./index_it.sh

Job [5bee1048-2ff8-49e2-82d5-d3eedb7c01be] submitted.

Waiting for job output…

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/tmp/5bee1048-2ff8-49e2-82d5-d3eedb7c01be/druid_build-assembly-0.1-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

18/01/19 18:14:54 INFO util.Version: HV000001: Hibernate Validator 5.1.3.Final

18/01/19 18:14:54 INFO guice.JsonConfigurator: Loaded class[class io.druid.guice.ExtensionsConfig] from props[druid.extensions.] as [ExtensionsConfig{searchCurrentClassloader=true, coordinates=, defaultVersion='0.1-SNAPSHOT', localRepository='/root/.m2/repository', remoteRepositories=[https://repo1.maven.org/maven2/, https://metamx.artifactoryonline.com/metamx/pub-libs-releases-local]}]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-client/2.3.0/hadoop-client-2.3.0.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-common/2.3.0/hadoop-common-2.3.0.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/com/google/guava/guava/11.0.2/guava-11.0.2.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/commons/commons-math3/3.1.1/commons-math3-3.1.1.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-codec/commons-codec/1.4/commons-codec-1.4.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-io/commons-io/2.4/commons-io-2.4.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-net/commons-net/3.1/commons-net-3.1.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-collections/commons-collections/3.2.1/commons-collections-3.2.1.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-logging/commons-logging/1.1.3/commons-logging-1.1.3.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/log4j/log4j/1.2.17/log4j-1.2.17.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-lang/commons-lang/2.6/commons-lang-2.6.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-beanutils/commons-beanutils-core/1.8.0/commons-beanutils-core-1.8.0.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/slf4j/slf4j-api/1.7.5/slf4j-api-1.7.5.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.7.5/slf4j-log4j12-1.7.5.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.8.8/jackson-core-asl-1.8.8.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/codehaus/jackson/jackson-mapper-asl/1.8.8/jackson-mapper-asl-1.8.8.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/avro/avro/1.7.4/avro-1.7.4.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/com/thoughtworks/paranamer/paranamer/2.3/paranamer-2.3.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/xerial/snappy/snappy-java/1.0.4.1/snappy-java-1.0.4.1.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-auth/2.3.0/hadoop-auth-2.3.0.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/httpcomponents/httpclient/4.2.5/httpclient-4.2.5.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/httpcomponents/httpcore/4.2.5/httpcore-4.2.5.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/com/google/code/findbugs/jsr305/1.3.9/jsr305-1.3.9.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/zookeeper/zookeeper/3.4.5/zookeeper-3.4.5.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/commons/commons-compress/1.4.1/commons-compress-1.4.1.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/tukaani/xz/1.0/xz-1.0.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.3.0/hadoop-hdfs-2.3.0.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/mortbay/jetty/jetty-util/6.1.26/jetty-util-6.1.26.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-app/2.3.0/hadoop-mapreduce-client-app-2.3.0.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-common/2.3.0/hadoop-mapreduce-client-common-2.3.0.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-yarn-client/2.3.0/hadoop-yarn-client-2.3.0.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-yarn-server-common/2.3.0/hadoop-yarn-server-common-2.3.0.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-shuffle/2.3.0/hadoop-mapreduce-client-shuffle-2.3.0.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-yarn-api/2.3.0/hadoop-yarn-api-2.3.0.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-core/2.3.0/hadoop-mapreduce-client-core-2.3.0.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-yarn-common/2.3.0/hadoop-yarn-common-2.3.0.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/javax/xml/bind/jaxb-api/2.2.2/jaxb-api-2.2.2.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/javax/xml/stream/stax-api/1.0-2/stax-api-1.0-2.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/javax/activation/activation/1.1/activation-1.1.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/javax/servlet/servlet-api/2.5/servlet-api-2.5.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/com/sun/jersey/jersey-core/1.9/jersey-core-1.9.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-jobclient/2.3.0/hadoop-mapreduce-client-jobclient-2.3.0.jar]

18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-annotations/2.3.0/hadoop-annotations-2.3.0.jar]

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/tmp/5bee1048-2ff8-49e2-82d5-d3eedb7c01be/druid_build-assembly-0.1-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.7.5/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

18/01/19 18:14:55 INFO util.Version: HV000001: Hibernate Validator 5.1.3.Final

18/01/19 18:14:56 INFO guice.JsonConfigurator: Loaded class[class io.druid.guice.ExtensionsConfig] from props[druid.extensions.] as [ExtensionsConfig{searchCurrentClassloader=true, coordinates=, defaultVersion='0.1-SNAPSHOT', localRepository='/root/.m2/repository', remoteRepositories=[https://repo1.maven.org/maven2/, https://metamx.artifactoryonline.com/metamx/pub-libs-releases-local]}]

18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.metadata.storage.mysql.MySQLMetadataStorageModule] for class[io.druid.initialization.DruidModule]

18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.storage.s3.S3StorageDruidModule] for class[io.druid.initialization.DruidModule]

18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.firehose.s3.S3FirehoseDruidModule] for class[io.druid.initialization.DruidModule]

18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.query.aggregation.histogram.ApproximateHistogramDruidModule] for class[io.druid.initialization.DruidModule]

18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.storage.hdfs.HdfsStorageDruidModule] for class[io.druid.initialization.DruidModule]

18/01/19 18:14:56 INFO guice.JsonConfigurator: Loaded class[class io.druid.guice.ExtensionsConfig] from props[druid.extensions.] as [ExtensionsConfig{searchCurrentClassloader=true, coordinates=, defaultVersion='0.1-SNAPSHOT', localRepository='/root/.m2/repository', remoteRepositories=[https://repo1.maven.org/maven2/, https://metamx.artifactoryonline.com/metamx/pub-libs-releases-local]}]

18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.metadata.storage.mysql.MySQLMetadataStorageModule] for class[io.druid.initialization.DruidModule]

18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.storage.s3.S3StorageDruidModule] for class[io.druid.initialization.DruidModule]

18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.firehose.s3.S3FirehoseDruidModule] for class[io.druid.initialization.DruidModule]

18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.query.aggregation.histogram.ApproximateHistogramDruidModule] for class[io.druid.initialization.DruidModule]

18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.storage.hdfs.HdfsStorageDruidModule] for class[io.druid.initialization.DruidModule]

18/01/19 18:14:57 INFO guice.JsonConfigurator: Loaded class[class io.druid.server.metrics.DruidMonitorSchedulerConfig] from props[druid.monitoring.] as [io.druid.server.metrics.DruidMonitorSchedulerConfig@47b1f99f]

18/01/19 18:14:57 INFO guice.JsonConfigurator: Loaded class[class io.druid.server.metrics.MonitorsConfig] from props[druid.monitoring.] as [MonitorsConfig{monitors=}]

18/01/19 18:14:57 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2

18/01/19 18:14:58 INFO config.ConfigurationObjectFactory: Using method itself for [druid.computation.buffer.size, ${base_path}.buffer.sizeBytes] on [io.druid.query.DruidProcessingConfig#intermediateComputeSizeBytes()]

18/01/19 18:14:58 INFO config.ConfigurationObjectFactory: Using method itself for [${base_path}.numThreads] on [io.druid.query.DruidProcessingConfig#getNumThreads()]

18/01/19 18:14:58 INFO config.ConfigurationObjectFactory: Using method itself for [${base_path}.columnCache.sizeBytes] on [io.druid.query.DruidProcessingConfig#columnCacheSizeBytes()]

18/01/19 18:14:58 INFO config.ConfigurationObjectFactory: Assigning default value [processing-%s] for [${base_path}.formatString] on [com.metamx.common.concurrent.ExecutorServiceConfig#getFormatString()]

18/01/19 18:14:58 INFO guice.JsonConfigurator: Loaded class[interface io.druid.segment.data.BitmapSerdeFactory] from props[druid.processing.bitmap.] as [ConciseBitmapSerdeFactory{}]

Jan 19, 2018 6:14:58 PM com.google.inject.servlet.GuiceFilter setPipeline

WARNING: Multiple Servlet injectors detected. This is a warning indicating that you have more than one GuiceFilter running in your web application. If this is deliberate, you may safely ignore this message. If this is NOT deliberate however, your application may not work as expected.

18/01/19 18:14:58 INFO guice.JsonConfigurator: Loaded class[class io.druid.server.metrics.DruidMonitorSchedulerConfig] from props[druid.monitoring.] as [io.druid.server.metrics.DruidMonitorSchedulerConfig@72f294f]

18/01/19 18:14:58 INFO guice.JsonConfigurator: Loaded class[class io.druid.server.metrics.MonitorsConfig] from props[druid.monitoring.] as [MonitorsConfig{monitors=}]

18/01/19 18:14:58 INFO guice.JsonConfigurator: Loaded class[class io.druid.server.DruidNode] from props[druid.] as [DruidNode{serviceName='druid/internal-hadoop-indexer', host='druid-indexer-1dot1-m.c.ad-veritas.internal', port=0}]

18/01/19 18:14:58 INFO guice.JsonConfigurator: Loaded class[class io.druid.metadata.MetadataStorageTablesConfig] from props[druid.metadata.storage.tables.] as [io.druid.metadata.MetadataStorageTablesConfig@452ccb54]

18/01/19 18:14:58 INFO mysql.MySQLConnector: Configured MySQL as metadata storage

18/01/19 18:14:58 INFO indexer.HadoopDruidIndexerConfig: Running with config:

{
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikiticker",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "column" : "time",
            "format" : "auto",
            "missingValue" : null
          },
          "dimensionsSpec" : {
            "dimensions" : [ "channel", "cityName", "comment", "countryIsoCode", "countryName", "isAnonymous", "isMinor", "isNew", "isRobot", "isUnpatrolled", "metroCode", "namespace", "page", "regionIsoCode", "regionName", "user" ],
            "dimensionExclusions" : [ "deleted", "added", "delta", "time" ],
            "spatialDimensions" : [ ]
          }
        }
      },
      "metricsSpec" : [ {
        "type" : "count",
        "name" : "count"
      }, {
        "type" : "longSum",
        "name" : "added",
        "fieldName" : "added"
      }, {
        "type" : "longSum",
        "name" : "deleted",
        "fieldName" : "deleted"
      }, {
        "type" : "longSum",
        "name" : "delta",
        "fieldName" : "delta"
      }, {
        "type" : "hyperUnique",
        "name" : "user_unique",
        "fieldName" : "user"
      } ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : {
          "type" : "none"
        },
        "intervals" : [ "2015-09-12T00:00:00.000Z/2015-09-13T00:00:00.000Z" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "gs://forpeter/wikiticker-2015-09-12-sampled.json.gz"
      },
      "metadataUpdateSpec" : {
        "type" : "mysql",
        "connectURI" : "jdbc:mysql://35.187.161.88:3306/druid",
        "user" : "druid",
        "password" : {
          "type" : "default",
          "password" : "diurd"
        },
        "segmentTable" : "druid_segments"
      },
      "segmentOutputPath" : "gs://forpeter/"
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "workingPath" : "/tmp/druid-indexing",
      "version" : "2018-01-19T18:14:58.573Z",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000,
        "maxPartitionSize" : 7500000,
        "assumeGrouped" : false,
        "numShards" : -1
      },
      "shardSpecs" : { },
      "indexSpec" : {
        "bitmap" : {
          "type" : "concise"
        },
        "dimensionCompression" : null,
        "metricCompression" : null
      },
      "leaveIntermediate" : false,
      "cleanupOnFailure" : true,
      "overwriteFiles" : false,
      "ignoreInvalidRows" : false,
      "jobProperties" : {
        "hadoop.mapreduce.job.user.classpath.first" : "true"
      },
      "combineText" : false,
      "persistInHeap" : false,
      "ingestOffheap" : false,
      "bufferSize" : 134217728,
      "aggregationBufferRatio" : 0.5,
      "useCombiner" : false,
      "rowFlushBoundary" : 80000
    }
  }
}

18/01/19 18:14:58 INFO path.StaticPathSpec: Adding paths[gs://forpeter/wikiticker-2015-09-12-sampled.json.gz]

18/01/19 18:14:59 INFO indexer.JobHelper: Uploading jar to path[/tmp/druid-indexing/classpath/druid_build-assembly-0.1-SNAPSHOT.jar]

18/01/19 18:15:00 INFO path.StaticPathSpec: Adding paths[gs://forpeter/wikiticker-2015-09-12-sampled.json.gz]

18/01/19 18:15:00 INFO client.RMProxy: Connecting to ResourceManager at druid-indexer-1dot1-m/10.132.0.40:8032

18/01/19 18:15:00 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.

18/01/19 18:15:00 WARN mapreduce.JobResourceUploader: No job jar file set. User classes may not be found. See Job or Job#setJar(String).

18/01/19 18:15:00 INFO input.FileInputFormat: Total input paths to process : 1

18/01/19 18:15:00 INFO mapreduce.JobSubmitter: number of splits:1

18/01/19 18:15:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1515600795023_0009

18/01/19 18:15:01 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.

18/01/19 18:15:01 INFO impl.YarnClientImpl: Submitted application application_1515600795023_0009

18/01/19 18:15:02 INFO mapreduce.Job: The url to track the job: http://druid-indexer-1dot1-m:8088/proxy/application_1515600795023_0009/

18/01/19 18:15:02 INFO indexer.DetermineHashedPartitionsJob: Job wikiticker-determine_partitions_hashed-Optional.of([2015-09-12T00:00:00.000Z/2015-09-13T00:00:00.000Z]) submitted, status available at: http://druid-indexer-1dot1-m:8088/proxy/application_1515600795023_0009/

18/01/19 18:15:02 INFO mapreduce.Job: Running job: job_1515600795023_0009

18/01/19 18:15:09 INFO mapreduce.Job: Job job_1515600795023_0009 running in uber mode : false

18/01/19 18:15:09 INFO mapreduce.Job: map 0% reduce 0%

18/01/19 18:15:20 INFO mapreduce.Job: map 100% reduce 0%

18/01/19 18:15:30 INFO mapreduce.Job: map 100% reduce 100%

18/01/19 18:15:30 INFO mapreduce.Job: Job job_1515600795023_0009 completed successfully

18/01/19 18:15:30 INFO mapreduce.Job: Counters: 54

File System Counters

FILE: Number of bytes read=1053

FILE: Number of bytes written=449763

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

GS: Number of bytes read=2366222

GS: Number of bytes written=0

GS: Number of read operations=0

GS: Number of large read operations=0

GS: Number of write operations=0

HDFS: Number of bytes read=298

HDFS: Number of bytes written=99

HDFS: Number of read operations=8

HDFS: Number of large read operations=0

HDFS: Number of write operations=3

Job Counters

Launched map tasks=1

Launched reduce tasks=1

Rack-local map tasks=1

Total time spent by all maps in occupied slots (ms)=26145

Total time spent by all reduces in occupied slots (ms)=37026

Total time spent by all map tasks (ms)=8715

Total time spent by all reduce tasks (ms)=6171

Total vcore-milliseconds taken by all map tasks=8715

Total vcore-milliseconds taken by all reduce tasks=12342

Total megabyte-milliseconds taken by all map tasks=26772480

Total megabyte-milliseconds taken by all reduce tasks=37914624

Map-Reduce Framework

Map input records=39244

Map output records=1

Map output bytes=1043

Map output materialized bytes=1053

Input split bytes=298

Combine input records=0

Combine output records=0

Reduce input groups=1

Reduce shuffle bytes=1053

Reduce input records=1

Reduce output records=0

Spilled Records=2

Shuffled Maps =1

Failed Shuffles=0

Merged Map outputs=1

GC time elapsed (ms)=437

CPU time spent (ms)=25090

Physical memory (bytes) snapshot=1078218752

Virtual memory (bytes) snapshot=11605217280

Total committed heap usage (bytes)=1108869120

Shuffle Errors

BAD_ID=0

CONNECTION=0

IO_ERROR=0

WRONG_LENGTH=0

WRONG_MAP=0

WRONG_REDUCE=0

File Input Format Counters

Bytes Read=0

File Output Format Counters

Bytes Written=94

18/01/19 18:15:30 INFO indexer.DetermineHashedPartitionsJob: Job completed, loading up partitions for intervals[Optional.of([2015-09-12T00:00:00.000Z/2015-09-13T00:00:00.000Z])].

18/01/19 18:15:30 INFO indexer.DetermineHashedPartitionsJob: Found approximately [40,337] rows in data.

18/01/19 18:15:30 INFO indexer.DetermineHashedPartitionsJob: Creating [1] shards

18/01/19 18:15:30 INFO indexer.DetermineHashedPartitionsJob: DetermineHashedPartitionsJob took 31286 millis

18/01/19 18:15:30 INFO indexer.JobHelper: Deleting path[/tmp/druid-indexing/wikiticker/2018-01-19T181458.573Z]

18/01/19 18:15:30 INFO path.StaticPathSpec: Adding paths[gs://forpeter/wikiticker-2015-09-12-sampled.json.gz]

18/01/19 18:15:30 INFO path.StaticPathSpec: Adding paths[gs://forpeter/wikiticker-2015-09-12-sampled.json.gz]

18/01/19 18:15:30 INFO client.RMProxy: Connecting to ResourceManager at druid-indexer-1dot1-m/10.132.0.40:8032

18/01/19 18:15:30 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.

18/01/19 18:15:30 WARN mapreduce.JobResourceUploader: No job jar file set. User classes may not be found. See Job or Job#setJar(String).

18/01/19 18:15:30 INFO input.FileInputFormat: Total input paths to process : 1

18/01/19 18:15:30 INFO mapreduce.JobSubmitter: number of splits:1

18/01/19 18:15:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1515600795023_0010

18/01/19 18:15:30 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.

18/01/19 18:15:30 INFO impl.YarnClientImpl: Submitted application application_1515600795023_0010

18/01/19 18:15:30 INFO mapreduce.Job: The url to track the job: http://druid-indexer-1dot1-m:8088/proxy/application_1515600795023_0010/

18/01/19 18:15:30 INFO indexer.IndexGeneratorJob: Job wikiticker-index-generator-Optional.of([2015-09-12T00:00:00.000Z/2015-09-13T00:00:00.000Z]) submitted, status available at http://druid-indexer-1dot1-m:8088/proxy/application_1515600795023_0010/

18/01/19 18:15:30 INFO mapreduce.Job: Running job: job_1515600795023_0010

18/01/19 18:15:42 INFO mapreduce.Job: Job job_1515600795023_0010 running in uber mode : false

18/01/19 18:15:42 INFO mapreduce.Job: map 0% reduce 0%

18/01/19 18:15:56 INFO mapreduce.Job: map 100% reduce 0%

18/01/19 18:16:07 INFO mapreduce.Job: map 100% reduce 100%

18/01/19 18:16:11 INFO mapreduce.Job: Task Id : attempt_1515600795023_0010_r_000000_0, Status : FAILED

Error: com.metamx.common.IAE: Unknown file system scheme [gs]

at io.druid.indexer.JobHelper.serializeOutIndex(JobHelper.java:274)

at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:621)

at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:462)

at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)

at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)

at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Container killed by the ApplicationMaster.

Container killed on request. Exit code is 143

Container exited with a non-zero exit code 143

18/01/19 18:16:12 INFO mapreduce.Job: map 100% reduce 0%

18/01/19 18:16:23 INFO mapreduce.Job: map 100% reduce 100%

18/01/19 18:16:26 INFO mapreduce.Job: Task Id : attempt_1515600795023_0010_r_000000_1, Status : FAILED

Error: com.metamx.common.IAE: Unknown file system scheme [gs]

at io.druid.indexer.JobHelper.serializeOutIndex(JobHelper.java:274)

at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:621)

at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:462)

at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)

at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)

at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Container killed by the ApplicationMaster.

Container killed on request. Exit code is 143

Container exited with a non-zero exit code 143

18/01/19 18:16:27 INFO mapreduce.Job: map 100% reduce 0%

18/01/19 18:16:38 INFO mapreduce.Job: map 100% reduce 100%

18/01/19 18:16:41 INFO mapreduce.Job: Task Id : attempt_1515600795023_0010_r_000000_2, Status : FAILED

Error: com.metamx.common.IAE: Unknown file system scheme [gs]

at io.druid.indexer.JobHelper.serializeOutIndex(JobHelper.java:274)

at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:621)

at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:462)

at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)

at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)

at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Container killed by the ApplicationMaster.

Container killed on request. Exit code is 143

Container exited with a non-zero exit code 143

18/01/19 18:16:42 INFO mapreduce.Job: map 100% reduce 0%

18/01/19 18:16:53 INFO mapreduce.Job: map 100% reduce 100%

18/01/19 18:16:57 INFO mapreduce.Job: Job job_1515600795023_0010 failed with state FAILED due to: Task failed task_1515600795023_0010_r_000000

Job failed as tasks failed. failedMaps:0 failedReduces:1

18/01/19 18:16:57 INFO mapreduce.Job: Counters: 42

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=22635343

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

GS: Number of bytes read=2366222

GS: Number of bytes written=0

GS: Number of read operations=0

GS: Number of large read operations=0

GS: Number of write operations=0

HDFS: Number of bytes read=281

HDFS: Number of bytes written=0

HDFS: Number of read operations=2

HDFS: Number of large read operations=0

HDFS: Number of write operations=0

Job Counters

Failed reduce tasks=4

Launched map tasks=1

Launched reduce tasks=4

Rack-local map tasks=1

Total time spent by all maps in occupied slots (ms)=30225

Total time spent by all reduces in occupied slots (ms)=305802

Total time spent by all map tasks (ms)=10075

Total time spent by all reduce tasks (ms)=50967

Total vcore-milliseconds taken by all map tasks=10075

Total vcore-milliseconds taken by all reduce tasks=101934

Total megabyte-milliseconds taken by all map tasks=30950400

Total megabyte-milliseconds taken by all reduce tasks=313141248

Map-Reduce Framework

Map input records=39244

Map output records=39244

Map output bytes=22253954

Map output materialized bytes=22410936

Input split bytes=281

Combine input records=0

Spilled Records=39244

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=281

CPU time spent (ms)=18460

Physical memory (bytes) snapshot=720211968

Virtual memory (bytes) snapshot=4434993152

Total committed heap usage (bytes)=671088640

File Input Format Counters

Bytes Read=0

18/01/19 18:16:57 INFO indexer.JobHelper: Deleting path[/tmp/druid-indexing/wikiticker/2018-01-19T181458.573Z]

18/01/19 18:16:57 ERROR cli.CliHadoopIndexer: failure!!!

java.lang.reflect.InvocationTargetException

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at io.druid.cli.CliHadoopIndexer.run(CliHadoopIndexer.java:120)

at io.druid.cli.Main.main(Main.java:91)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:221)

at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunJarShim.main(HadoopRunJarShim.java:12)

Caused by: com.metamx.common.ISE: Job[class io.druid.indexer.LegacyIndexGeneratorJob] failed!

at io.druid.indexer.JobHelper.runJobs(JobHelper.java:202)

at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:96)

at io.druid.indexer.JobHelper.runJobs(JobHelper.java:182)

at io.druid.cli.CliInternalHadoopIndexer.run(CliInternalHadoopIndexer.java:132)

at io.druid.cli.Main.main(Main.java:91)

… 13 more

ERROR: (gcloud.dataproc.jobs.submit.hadoop) Job [5bee1048-2ff8-49e2-82d5-d3eedb7c01be] entered state [ERROR] while waiting for [DONE].

Hi,
Have you added the GCS connector jar to the classpath, as recommended here? http://druid.io/docs/latest/development/extensions-core/hdfs.html
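In case it helps, here is a rough sketch of what that doc boils down to on a cluster node. The paths below are assumptions for a typical Dataproc image and Druid layout, so adjust them to wherever your gcs-connector jar and Druid install actually live:

# copy the GCS connector into Druid's classpath (connector path is an assumption)
cp /usr/lib/hadoop/lib/gcs-connector-*.jar $DRUID_HOME/lib/

# a core-site.xml mapping the gs scheme to the connector, i.e. setting
# fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem,
# must also be on the classpath
cp /etc/hadoop/conf/core-site.xml $DRUID_HOME/conf/druid/_common/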

Hello,

Yes, I’m reading data from Google Storage, and the index JSON specification is also stored in GS. However, the job fails only when it tries to push the segments back to Google Storage.

My hunch is that the Hadoop batch indexer is a separate component and needs to be configured for Google individually.

Here’s the script that I use to launch the job in Dataproc:

gcloud dataproc jobs submit hadoop --quiet --cluster $cluster --project=ad-veritas --jar gs://forpeter/druid_build-assembly-0.1-SNAPSHOT.jar --region=global -- io.druid.cli.Main index hadoop gs://forpeter/wikiticker-index.json

Here’s the index spec:
{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "gs://forpeter/wikiticker-2015-09-12-sampled.json.gz"
      },
      "metadataUpdateSpec" : {
        "type" : "mysql",
        "connectURI" : "jdbc:mysql://35.187.161.88:3306/druid",
        "password" : "diurd",
        "segmentTable" : "druid_segments",
        "user" : "druid"
      },
      "segmentOutputPath" : "gs://forpeter/"
    },
    "dataSchema" : {
      "dataSource" : "wikiticker",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : [ "2015-09-12/2015-09-13" ]
      },
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "channel",
              "cityName",
              "comment",
              "countryIsoCode",
              "countryName",
              "isAnonymous",
              "isMinor",
              "isNew",
              "isRobot",
              "isUnpatrolled",
              "metroCode",
              "namespace",
              "page",
              "regionIsoCode",
              "regionName",
              "user"
            ]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "time"
          }
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        },
        {
          "name" : "added",
          "type" : "longSum",
          "fieldName" : "added"
        },
        {
          "name" : "deleted",
          "type" : "longSum",
          "fieldName" : "deleted"
        },
        {
          "name" : "delta",
          "type" : "longSum",
          "fieldName" : "delta"
        },
        {
          "name" : "user_unique",
          "type" : "hyperUnique",
          "fieldName" : "user"
        }
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "workingPath" : "/tmp/druid-indexing",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "jobProperties" : {
        "hadoop.mapreduce.job.user.classpath.first" : "true"
      }
    }
  }
}

-pt

What does your Druid config look like? Are you loading the druid-google-extensions extension, and setting druid.storage.type=google, druid.google.bucket=your-bucket-name, and druid.google.prefix=yourprefix correctly?
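If not, here is a minimal sketch of the relevant common.runtime.properties entries (assuming the loadList-style extensions config; the bucket and prefix values are illustrative, not taken from your setup):

# load the Google deep storage extension
druid.extensions.loadList=["druid-google-extensions"]

# push segments to Google Cloud Storage
druid.storage.type=google
druid.google.bucket=your-bucket-name
druid.google.prefix=druid/segments

For the Hadoop batch indexer, these properties need to be visible to the indexer process itself, not just to the other Druid nodes.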