segmentOutputPath using Google Cloud Storage

Hi all, I'm new to Druid. I have a problem doing batch ingestion of the sample data at quickstart/wikiticker-index.json, using Google Cloud Storage as deep storage. The ingestion task fails with this error message:
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException

at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]

at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:204) ~[druid-indexing-service-0.9.2.jar:0.9.2]

at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:208) ~[druid-indexing-service-0.9.2.jar:0.9.2]

at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.2.jar:0.9.2]

at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.2.jar:0.9.2]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_111]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]

at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]

Caused by: java.lang.reflect.InvocationTargetException

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_111]

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_111]

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_111]

at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_111]

at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.2.jar:0.9.2]

… 7 more

Caused by: com.metamx.common.ISE: Job[class io.druid.indexer.IndexGeneratorJob] failed!

at io.druid.indexer.JobHelper.runJobs(JobHelper.java:369) ~[druid-indexing-hadoop-0.9.2.jar:0.9.2]

at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:94) ~[druid-indexing-hadoop-0.9.2.jar:0.9.2]

at io.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessing.runTask(HadoopIndexTask.java:261) ~[druid-indexing-service-0.9.2.jar:0.9.2]

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_111]

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_111]

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_111]

at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_111]

at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.2.jar:0.9.2]

… 7 more

2017-01-11T13:33:57,282 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_hadoop_wikiticker_2017-01-11T13:27:16.427Z] status changed to [FAILED].

2017-01-11T13:33:57,283 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {

"id" : "index_hadoop_wikiticker_2017-01-11T13:27:16.427Z",
"status" : "FAILED",
"duration" : 397064
}


After checking the request body, I found an improper segmentOutputPath in the ioConfig:
"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "static",
    "paths" : "quickstart/wikiticker-2015-09-12-sampled.json"
  },
  "metadataUpdateSpec" : null,
  "segmentOutputPath" : "file:/some/directory/druid-0.9.2/gs:/bucket-druid/druid/segments"
},


It should be gs://bucket-druid/druid/segments instead of file:/some/directory/druid-0.9.2/gs:/bucket-druid/druid/segments
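In other words, I would expect the generated ioConfig to contain a plain gs:// path, something like the sketch below (the exact value is produced by Druid, so this is only illustrative):

"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "static",
    "paths" : "quickstart/wikiticker-2015-09-12-sampled.json"
  },
  "metadataUpdateSpec" : null,
  "segmentOutputPath" : "gs://bucket-druid/druid/segments"
},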

This is my configuration in common.runtime.properties.
druid.extensions.loadList=["druid-hdfs-storage", "druid-histogram", "mysql-metadata-storage"]
druid.extensions.hadoopDependenciesDir=/some/directory/druid-0.9.2/hadoop-dependencies

druid.storage.type=hdfs

druid.storage.storageDirectory=gs://bucket-druid/druid/segments

druid.indexer.logs.type=hdfs

druid.indexer.logs.directory=gs://bucket-druid/druid/indexing-logs


How do I fix this issue? Is there any misconfiguration?

Thank you.

I see now the documentation is not accurate, since `druid.storage.storageDirectory` is not valid for Google storage.

You are also setting the type to hdfs right now.

Your correct config should be:

druid.storage.type=google
druid.google.bucket=bucket-druid
druid.google.prefix=druid/segments

druid.indexer.logs.type=
druid.indexer.logs.bucket=bucket-druid
druid.indexer.logs.prefix=druid/indexing-logs


Thanks for your answer. I've tried your config, but I still get an incorrect segmentOutputPath:

"ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "quickstart/wikiticker-2015-09-12-sampled.json"
      },
      "metadataUpdateSpec" : null,
      "segmentOutputPath" : "file:/tmp/druid/localStorage"

},


On Thursday, 12 January 2017 at 12:59:42 UTC+7, Erik Dubbelboer wrote:

Hi,
You also need to add druid-google-extensions to the Druid extensions folder, and also add it to druid.extensions.loadList.
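For example, assuming the extension directory is named druid-google-extensions and keeping the extensions you already load, the loadList would look something like:

druid.extensions.loadList=["druid-hdfs-storage", "druid-histogram", "mysql-metadata-storage", "druid-google-extensions"]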

Thanks for your answer. Where can I get druid-google-extensions? I've tried to get it with pull-deps, but it is not available.
java -classpath "/some/directory/druid-0.9.2/lib/*" io.druid.cli.Main tools pull-deps -c io.druid.extensions.contrib:druid-google-extensions


Is it correct?

I use Druid 0.9.2 with some extension libraries such as mysql-metadata-storage, Hadoop 2.7.3, and gcs-connector 1.6.0-hadoop2.

On Thursday, 12 January 2017 at 17:03:07 UTC+7, Nishant Bangarwa wrote:

I just compiled Druid from GitHub. druid-google-extensions is working and segmentOutputPath is now correct. Thanks!
But I have a new issue: after Druid runs MapReduce, there is a "No such file or directory" error.
2017-01-13T11:59:36,956 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 100% reduce 0%

2017-01-13T11:59:37,263 INFO [Thread-2902] org.apache.hadoop.mapred.LocalJobRunner - reduce task executor complete.

2017-01-13T11:59:37,484 WARN [Thread-2902] org.apache.hadoop.mapred.LocalJobRunner - job_local524433689_0002

java.lang.Exception: java.io.IOException: No such file or directory

at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) ~[hadoop-mapreduce-client-common-2.7.3.jar:?]

at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529) [hadoop-mapreduce-client-common-2.7.3.jar:?]

Caused by: java.io.IOException: No such file or directory

at java.io.UnixFileSystem.createFileExclusively(Native Method) ~[?:1.8.0_111]

at java.io.File.createTempFile(File.java:2024) ~[?:1.8.0_111]

at java.io.File.createTempFile(File.java:2070) ~[?:1.8.0_111]

at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:569) ~[druid-indexing-hadoop-0.9.3-SNAPSHOT.jar:0.9.3-SNAPSHOT]

at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:478) ~[druid-indexing-hadoop-0.9.3-SNAPSHOT.jar:0.9.3-SNAPSHOT]

at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171) ~[hadoop-mapreduce-client-core-2.7.3.jar:?]

at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627) ~[hadoop-mapreduce-client-core-2.7.3.jar:?]

at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389) ~[hadoop-mapreduce-client-core-2.7.3.jar:?]

at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319) ~[hadoop-mapreduce-client-common-2.7.3.jar:?]

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_111]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_111]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_111]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_111]

at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_111]

2017-01-13T11:59:37,957 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Job job_local524433689_0002 failed with state FAILED due to: NA

On Friday, 13 January 2017 at 10:04:11 UTC+7, Basith Zainurrohman wrote:

That is at https://github.com/druid-io/druid/blob/b0232b4e403b7e175f876d3224a0a36654974416/indexing-hadoop/src/main/java/io/druid/indexer/IndexGeneratorJob.java#L569, which is in the Hadoop reduce function. It can't create a temp file. I'm guessing your Hadoop cluster is not set up properly. Make sure /tmp, or wherever hadoop.tmp.dir points, is writable.

I've changed hadoop.tmp.dir to another path with writable permission in my core-site.xml, but it is still not working. This is my configuration in core-site.xml:

<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  <description>The FileSystem for gs: (GCS) uris.</description>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  <description>The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.</description>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>some-project-id</value>
</property>
<property>
  <name>fs.gs.system.bucket</name>
  <value>bucket-druid</value>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/some/directory/temporary</value>
</property>

On Monday, 16 January 2017 at 14:56:10 UTC+7, Erik Dubbelboer wrote:

I'm afraid I can't help much more with this. We use Google Dataproc to manage our Hadoop cluster for us, so I don't have much experience with configuring it correctly.

Thanks, Erik. I have found the source of the problem in middleManager/runtime.properties. Previously, I set druid.indexer.task.hadoopWorkingPath to gs://bucket-druid/tmp/druid-indexing and got that error. After I changed it to a local path like var/tmp/druid-indexing, it works. So, is it best practice to keep the working path local? Can we put it on Google Cloud Storage? Thank you.
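To be concrete, the change in middleManager/runtime.properties was the following (the commented line is the one that failed; both values are the ones mentioned above):

# failed: reducer dies with "No such file or directory"
# druid.indexer.task.hadoopWorkingPath=gs://bucket-druid/tmp/druid-indexing

# works
druid.indexer.task.hadoopWorkingPath=var/tmp/druid-indexing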

On Tuesday, 17 January 2017 at 15:40:00 UTC+7, Erik Dubbelboer wrote:

To be honest, I'm not sure why that doesn't work. It looks like that is all handled by Hadoop, so in theory it should just work. I don't think there is anything I can do about that from the Druid code.