Hadoop ingestion failing when trying to ingest with Google Dataproc

Hey everyone,

index_hadoop is failing with an exception.

I know the Cloud Storage connector comes with Google Dataproc by default, and running 'hadoop fs -ls gs://file-dumps' lists the wikipedia file. I also found out that the GoogleHadoopFileSystem class lives in the connector jar, so I have put that jar in /usr/lib/hadoop on all the Dataproc nodes.

I’ve never tried Druid with dataproc, but you could try putting the jar containing GoogleHadoopFileSystem in the druid classpath, e.g. in the druid-google-extensions directory.
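
For example, something like this (a sketch only; the jar name and paths are assumptions, adjust them to wherever your connector jar and Druid install actually live):

# copy the GCS connector jar into Druid's google extensions directory (paths are placeholders)
cp /usr/lib/hadoop/gcs-connector-hadoop1-latest.jar /opt/druid/extensions/druid-google-extensions/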

Hi,

I have tried putting that jar in the extensions directory, which solved the GoogleHadoopFileSystem exception (thanks Jonathan).

Now I am stuck with another problem:

Caused by: java.lang.RuntimeException: java.lang.RuntimeException: No buckets?? seems there is no data to index.

This indicates that the task didn’t see any valid input data, so you most likely don’t need to adjust hadoop jar versions or classpath configurations at this point (which is a good sign).

For that type of error, the most common cause is timestamp issues, so I would double-check your task interval, timezone settings, input timestamps, etc.

Also try setting -Duser.timezone=UTC -Dfile.encoding=UTF-8 on your Druid services and in druid.indexer.runner.javaOpts in your middle manager configs.
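
For example, in the middle manager's runtime.properties it would look something like this (the -server/-Xmx values are just placeholders; the timezone and encoding flags are the part that matters here):

# middleManager runtime.properties (heap size etc. are placeholders)
druid.indexer.runner.javaOpts=-server -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8

The same two flags would also go in the jvm.config of each Druid service.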

Hey Jonathan,

I am trying to ingest the wikipedia file itself and have tried all of the below:

-> checked the timestamps of the input file; they are all within the day '2015-09-12'

-> changed the intervals to ["2015-09-10/2015-09-21"] (a much wider interval) in the ingestion spec

-> timezone settings:

--> put '-Duser.timezone=UTC -Dfile.encoding=UTF-8' when running all the druid services

--> put the same timezone setting in 'druid.indexer.runner.javaOpts'

--> put the timezone setting in the ingestion spec.

The ingestion task is still failing with the same error.

Hadoop ingestion works fine with the Docker Hadoop setup (as in the Hadoop tutorial), so could the problem be with Dataproc's timezone?

wikipedia-index-hadoop.json (1.85 KB)

Hm, since you expanded the ingestion interval, and you’re using the wikipedia data, it doesn’t seem like the issue is with timezones.

Maybe there's an issue with reading the input file. A couple of things I can think of to try:

  • Try setting a totally invalid gs:// path (like a path to a bogus file) and see if the task fails in a different way (see the inputSpec sketch after this list)

  • Change the input source to a local file and put the wikipedia data on all the dataproc nodes, and see if druid+dataproc can ingest that wikipedia data in that case

  • Try putting the keyfile specified by google.cloud.auth.service.account.json.keyfile on the Druid middle manager nodes as well
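
For the first two bullets, the change would be in the ioConfig's inputSpec. A sketch (the gs:// path is the wikipedia file from your bucket; the local path mentioned below is just a placeholder, and the file would need to exist on every node):

"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "static",
    "paths": "gs://file-dumps/wikiticker-2015-09-12-sampled.json.gz"
  }
}

For the local-file test, swap the "paths" value for something like "/tmp/wikiticker-2015-09-12-sampled.json.gz"; for the bogus-path test, just point it at a gs:// object that doesn't exist.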

Hey Jonathan,

I have tried those pointers and here are the results:

1 -> Input path does not exist: gs://file-dumps/wikiticker-2015-09-12-sampled.json.gz1

2 -> failed with No buckets?? seems there is no data to index.

3 -> not sure how to do this

But I tried to ingest a somewhat bigger file; of course the task failed, but it took more time to fail (I ran the task 3 times to make sure).

Also, both tasks showed a different 'map 0% reduce 0%' pattern; the bigger file resulted in a few more 'map' steps.

Reading the input file is not a problem, right?

wikipedia index task (failed after ~85 seconds) - wikipedia.log

bigger file task (failed after ~100 seconds) - test.log

Thanks

test.log (181 KB)

wikipedia.log (183 KB)

Just a reminder about my problem: can anyone look into this, please?

2019-01-02T13:49:49,037 INFO [task-runner-0-priority-0] io.druid.indexer.DetermineHashedPartitionsJob - Path[var/druid/hadoop-tmp/wikipedia/2019-01-02T134823.668Z_63f313a9118343f7a293ee71269698da/20150918T000000.000Z_20150919T000000.000Z/partitions.json] didn’t exist!?

2019-01-02T13:48:26,609 INFO [main] io.druid.cli.CliPeon - * druid.indexer.task.hadoopWorkingPath: var/druid/hadoop-tmp

It looks like the task can read the input file and is able to run the initial "determine partitions" job, but the "index generator" job that runs afterwards can't see the output of the first job.

Try changing the hadoopWorkingPath to an absolute path that's writable in GCS, and maybe try putting a "gs://" prefix on that path as well.
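
Concretely, since your log shows the task picking up druid.indexer.task.hadoopWorkingPath from the middle manager config, that would mean something like this in the middle manager's runtime.properties (the bucket/path below is only an example):

# middleManager runtime.properties (the path under the bucket is a placeholder)
druid.indexer.task.hadoopWorkingPath=gs://file-dumps/druid-hadoop-tmp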

Hey Jonathan,

I have tried putting a gs:// path in hadoopWorkingPath; the task failed with:

{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "Insufficient Permission",
    "reason" : "insufficientPermissions"
  } ],
  "message" : "Insufficient Permission"
}

at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150) ~[gcs-connector-hadoop1-latest.jar:?]
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113) ~[gcs-connector-hadoop1-latest.jar:?]
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40) ~[gcs-connector-hadoop1-latest.jar:?]

How can I provide the google.cloud.auth.service.account.json.keyfile on the Druid middle manager nodes?

I haven’t tried this, but I’m guessing you need google.cloud.auth.service.account.enable and google.cloud.auth.service.account.json.keyfile in the core-site.xml that’s present in your middlemanager’s classpath, and the keyfile param needs to point to the location of the GCS keyfile on that middle manager’s local filesystem.
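
A sketch of what those two entries might look like in core-site.xml (the keyfile path is a placeholder for wherever the keyfile actually lives on the middle manager's local filesystem):

<!-- in the core-site.xml on the middle manager's classpath; keyfile path is a placeholder -->
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/to/gcs-keyfile.json</value>
</property>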

Hey,

I will try putting those configs in the core-site.xml.

But won't that cause an exception, since those configs will also be used to communicate with Dataproc's nodes? Maybe the Dataproc nodes will complain that they can't find the keyfile specified as the google.cloud.auth.service.account.json.keyfile value.

I can't find any related doc that explains the flow of ingestion when the task is asked to use a Dataproc cluster. Can you give me any reference to the documentation? Or can you just explain the high-level flow of ingestion? I just can't understand how Hadoop ingestion works.

I have put 'google.cloud.auth.service.account.json.keyfile' and 'google.cloud.auth.service.account.enable' in core-site.xml and tried ingesting again, which resulted in SUCCESS.

But then I removed those settings from core-site.xml and tried ingesting again, expecting the task to FAIL with the 403 error, but it didn't (I am not interested in this case for now).

Now, what could be the problem with hadoopWorkingPath on the Hadoop local file system?

I can't find any related doc that explains the flow of ingestion when the task is asked to use a Dataproc cluster. Can you give me any reference to the documentation?

There's no documentation for integrating with Dataproc; this is uncharted territory, so to speak.

Can you just explain the high-level flow of ingestion? I just can't understand how Hadoop ingestion works.

The high-level flow is something like this (a skeleton spec follows the list, showing which parts of the spec drive which phase):

  • Determine partitioning information

  • Build segments based on that partitioning information

  • After segments are built, upload them to deep storage (“push”)

  • After segments are pushed to deep storage, add their metadata to the metadata store (“publish”)

  • Wait for historical nodes to download the published segments (“handoff”)
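
To tie those phases to the spec: the granularitySpec (intervals and segment granularity) drives the partition-determination step, the ioConfig's inputSpec is what the segment-building jobs read, and the cluster's deep storage and metadata store settings control the push and publish steps. Here is a heavily abbreviated skeleton (a sketch based on the wikipedia tutorial spec, not your attached wikipedia-index-hadoop.json; the parser, dimensions, and metrics are omitted):

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "intervals": ["2015-09-12/2015-09-13"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "gs://file-dumps/wikiticker-2015-09-12-sampled.json.gz"
      }
    },
    "tuningConfig": {
      "type": "hadoop"
    }
  }
}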

Now, what could be the problem with hadoopWorkingPath on the Hadoop local file system?

hadoopWorkingPath is meant to be a distributed path accessible across your hadoop cluster (typically an HDFS path), so a local file system path would fail.

Hey Jonathan,

Sorry, I didn't ask my question clearly; I wanted to know the flow of ingestion when using Dataproc.

As for hadoopWorkingPath, when I leave the default value, i.e. var/druid/hadoop-tmp, as it is and try to ingest, I can see that the files are being created in HDFS and then deleted.

Additionally, I have tried uploading a random file from the Dataproc cluster to Google Storage, and that worked fine.

So Dataproc can create map/reduce intermediate files and can also upload to Google Storage; what could be the problem?

Sorry, I didn't ask my question clearly; I wanted to know the flow of ingestion when using Dataproc.

AFAIK none of the Druid developers have tried using Druid + Dataproc and worked out what needs to be set up there (or whether it will even work in the end).

I have put 'google.cloud.auth.service.account.json.keyfile' and 'google.cloud.auth.service.account.enable' in core-site.xml and tried ingesting again, which resulted in SUCCESS.

Since the task succeeded, I would stick with this working configuration.

At this point, if you want to dig deeper into how Druid is interacting with dataproc and understand what’s failing in your scenario, I would recommend looking into the hadoop ingestion code and working with it at the code level, putting in your own debug messages, etc.
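
Before modifying code, one lighter-weight option (a sketch, assuming the stock log4j2.xml in your _common config directory and its Console appender) is to turn on debug logging for the indexer package, so you can see more detail from the DetermineHashedPartitionsJob and the index generator job:

<!-- add inside the <Loggers> section of log4j2.xml; appender name assumes the default setup -->
<Logger name="io.druid.indexer" level="debug" additivity="false">
  <AppenderRef ref="Console"/>
</Logger>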