Run Druid jobs as the end user instead of the 'druid' user

Hi,

We are managing a multi-user cluster, and there are folders in HDFS that only the owner can read. When we launch a Druid job, it runs as the 'druid' user, which has no authorization to read the source folder. Is there any way to launch Druid jobs as the end user instead of the 'druid' user?

Thank you in advance.

Hi Mikel,

Druid doesn't do anything special to launch jobs as any particular user; it follows whatever setup you have given it through its client-side Hadoop configuration. By default, I think this uses the unix system user, so you could try changing the name of your unix user to whatever you want the Hadoop user to be. There might also be a way to do it through Hadoop configuration, but I'm not sure how.
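For what it's worth, in a non-kerberized setup the Hadoop client libraries honor the HADOOP_USER_NAME environment variable, so an untested sketch like this (the username and conf path are illustrative) would make Hadoop calls run under a different name:

# HADOOP_USER_NAME is honored by Hadoop's "simple" authentication only
export HADOOP_USER_NAME=myuser
# start the middleManager with the variable set so its Hadoop client picks it up
java -classpath "lib/*:/path/to/hadoop/conf" io.druid.cli.Main server middleManager

Under Kerberos the effective user comes from the Kerberos credentials instead, so this trick wouldn't apply there.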

Hi Gian,

Thank you for your response. The scenario is the following: we have a kerberized Hortonworks cluster with one instance of Druid (launched as the user 'druid'). When user 'A' submits a job to Spark, Spark executes the job on YARN impersonating user 'A', which provides access to that user's private folders on HDFS. However, when user 'A' launches an indexing task on Druid, Druid executes the job on YARN as the user 'druid', without access to the private folders of user 'A'. I'm submitting the Druid task as:

curl --negotiate -u:myuser -b ~/cookies.txt -c ~/cookies.txt -X 'POST' -H 'Content-Type: application/json' -d @my-load-task.json myhost:8090/druid/indexer/v1/task

And I get the following error:

Caused by: java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=druid, access=EXECUTE, inode="/my-data-folder/my-data.csv":hdfs:myuser:drwx------

Best regards,

I see - it sounds like you want Druid to do impersonation too. I don’t think it’s supported currently (unless it can be done through client-side Hadoop configs, which Druid does let you customize). Any Hadoop experts know what the situation is here?

There are some impersonation capabilities built into Hadoop, see https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Superusers.html .

That said, I haven’t tried it myself.
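For reference, the proxy-user settings from that page go in the cluster's core-site.xml and would look roughly like this, assuming the Druid processes run as the 'druid' OS user (the host and group values here are illustrative):

<!-- allow the 'druid' superuser to impersonate other users, but only from this host... -->
<property>
  <name>hadoop.proxyuser.druid.hosts</name>
  <value>druid-host.example.com</value>
</property>
<!-- ...and only on behalf of members of this group -->
<property>
  <name>hadoop.proxyuser.druid.groups</name>
  <value>users</value>
</property>

Note that this only authorizes impersonation; the submitting process still has to explicitly request it (via UserGroupInformation's doAs), which is the part Druid doesn't do today, as Gian mentioned.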

– Himanshu

Dear Gian and Himanshu,

I had success using the Druid CLI instead of the REST API:

java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Dhdp.version=2.6.4.0-91 -classpath "lib/*:/usr/hdp/current/hadoop-client/conf/" io.druid.cli.Main index hadoop …/my-load-task.json

Is there any limitation between using CLI or REST API?

Thank you very much!

Hi Mikel,

The main limitation is that the CLI indexer doesn't acquire locks, so it could in theory conflict with some other running job without knowing it. The REST API based indexing methods all acquire proper locks and play nicely with each other. Other than that, they do basically the same thing.

Thank you very much, Gian.

Hi Mikel, where are you specifying the proxied user?

Marvin

> Hi Mikel, where are you specifying the proxied user?

The job would be submitted to Hadoop as the user that executes the CLI Hadoop indexer.
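So in a kerberized cluster like yours, something along these lines (the principal and paths are illustrative) would run the indexing as the end user:

# authenticate as the end user first; the CLI then submits to YARN with these credentials
kinit myuser@EXAMPLE.COM
java -Xmx256m -Dhdp.version=2.6.4.0-91 -classpath "lib/*:/usr/hdp/current/hadoop-client/conf/" io.druid.cli.Main index hadoop my-load-task.json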

Thanks,

Jon