Druid unable to access Azure Blob storage when creating tables from Hive

Hi

We are trying to test Druid via Hive on Azure Blob storage.

We set up Druid with Azure Blob storage by following this link:

http://druid.io/docs/latest/development/extensions-contrib/azure.html

We are trying to create Hive tables using CTAS, but we get an error saying we are unable to access Blob storage and that the user is anonymous:

Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.hadoop.fs.azure.AzureException: org.apache.hadoop.fs.azure.AzureException: Unable to access container XXXXXXX in account XXXXXX.blob.windows.net using anonymous credentials, and no credentials found for them in the configuration.

Is there any property to set the user name and password in the Druid config? We have already set the Blob account name, container name, and the corresponding key.
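For reference, following that doc, our Druid common.runtime.properties has roughly the following (the account, key, and container values are redacted placeholders; other extensions in the load list are omitted):

druid.extensions.loadList=["druid-azure-extensions"]
druid.storage.type=azure
druid.azure.account=<storage account name>
druid.azure.key=<storage account key>
druid.azure.container=<container name>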

Thanks

Vinayak

Hi, can you add more context about the exception?

Where are you seeing it? Please also attach the entire stack trace.

Hi Slim

So we see the error when we use a Hive CREATE TABLE AS SELECT (CTAS) statement.

We have a Hive table which is stored in ORC format.

We are trying to index into Druid using Hive CTAS.

In the SELECT, we cast the time column to timestamp and all necessary dimensions to string.

The Tez CLI display shows all mappers and reducers as completed.

But it fails with the message mentioned in my previous post. I guess this is the stage where the data is moved from the temporary location to the actual location.

Info regarding our stack:

We are using Hortonworks Hadoop on an HDInsight cluster with HDFS-compatible Blob storage. Any Hive table we create ends up in the Blob store.

Thanks

Vinayak

Hi,

Thanks for the explanation.

I haven’t tested this out with Azure, so I am in the learning process :smiley:

Can you please let me know what this Hive conf is set to: hive.druid.storage.storageDirectory?
As you said, at the end of the CTAS the Hive client renames the data to make it readable at the location pointed to by the property I mention above.
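For instance, from beeline something like this should print the current value (assuming it is set at the session or hive-site level):

SET hive.druid.storage.storageDirectory;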

Also, can you attach the HiveServer2 log? It might have more info.

Hi Slim

Thanks for the quick response.

So, for that property:

Initially we had set it to /druid/segments.

When we ran the query it was successful, but the table was empty.

When I tried to create an external table on the same Druid segment name, it gave me an error that it had connected to Druid but couldn’t find the data source.

We guessed that we would have to give the absolute Azure Blob path for this property. On giving the absolute path, we get the issue mentioned earlier.

We kept the same absolute path in the Druid config as well.
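To illustrate, the two values we tried look roughly like this (the container and account names are placeholders for our environment):

hive.druid.storage.storageDirectory=/druid/segments
hive.druid.storage.storageDirectory=wasb://<container>@<account>.blob.core.windows.net/druid/segments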

Prior to this we had used HDFS batch indexing, following the wikiticker example. But since this is Blob storage rather than a real HDFS path, it just created the data on a local Unix path.

I have a feeling that Druid is internally looking at the local path instead of the Blob path, probably because we didn’t give the absolute path initially.

I will try to attach a cleansed version of the HiveServer2 logs sometime later today. I don’t have access to my system at the moment and it is 4 am in my time zone :smiley:

Thanks

Vinayak

hive.druid.storage.storageDirectory

Hi, thanks for the explanation.

So, to be clear, the ingestion of data from Hive to Druid is a two-step process.

Step One: Hive first creates the Druid indexes, called segments, and persists them to the location pointed to by the property hive.druid.storage.storageDirectory. Then Hive inserts into the Druid metadata storage the information about where the segments are and how to load them. At this point Hive is the main actor and Druid is not doing anything yet.

Step Two: the Druid Coordinator sees the metadata newly inserted by Hive about the new segments; this metadata has all the information about how to read and load the data.

That’s how the load process works.
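For example, after step one you should see one new row per segment in the druid_segments table of the metadata store; a quick check (assuming the default metadata schema, here in MySQL syntax) would be something like:

SELECT id, dataSource, `start`, `end`, used FROM druid_segments ORDER BY created_date DESC LIMIT 10;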

Now, I think your issue is the following.

Since the Druid segment data is created by Hive (more precisely, by the HDFS file system driver for Azure), Druid will need to use HDFS as deep storage instead of the Azure module.

Therefore, to debug this:

ONE: let's make sure that step one succeeded, meaning you can manually see the data in Azure at the location pointed to by hive.druid.storage.storageDirectory, and you can see the newly added entries in the Druid metadata store.

TWO (assuming ONE is all good): you need to use HDFS as deep storage and make sure that the Azure HDFS jar files are present on the classpath for the Druid Historical.
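A minimal sketch of the Druid side for TWO (the wasb URI is a placeholder and must match the location Hive writes to):

druid.extensions.loadList=["druid-hdfs-storage"]
druid.storage.type=hdfs
druid.storage.storageDirectory=wasb://<container>@<account>.blob.core.windows.net/druid/segments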

As you can see, ONE is a Hive issue, so please attach the Hive logs if possible. TWO is a Druid issue, so please attach the Coordinator and Historical logs if possible to help with debugging.

Thanks and good luck.

Hi Slim

I tried your recommendations and you are right.

Hive was able to create segments in the path specified by hive.druid.storage.storageDirectory. I also logged into MySQL and checked the metadata in the druid_segments table; it looks good. So the issue is not with Hive.

I also ensured the Azure HDFS jar is present in the classpath for the Historical.

In the Coordinator log, I see this particular error repeating for different tables:

com.metamx.common.ISE: /druid/loadQueue/cbs-w2.example.com:8083/hana_oracle_druid_datastore_test_vin_16_2017-07-01T00:00:00.000Z_2017-07-02T00:00:00.000Z_2017-11-16T13:17:26.786Z was never removed! Failing this operation!

Similarly, on the Historical node:

2017-11-16T14:23:43,368 WARN [ZkCoordinator-0] com.metamx.common.RetryUtils - Failed on try 2, retrying in 1,509ms.

java.io.FileNotFoundException: File /druid/segments/hana_oracle_druid_datastore_test_vin_15/20161101T000000.000Z_20161102T000000.000Z/2017-11-16T12_49_30.577Z/0/index.zip does not exist

I have attached the log snippets. It looks like the issue revolves around ZooKeeper. Do I need to log into the ZooKeeper node and do some cleanup?

Thanks

Vinayak

log_druid.txt (12.1 KB)

Hi, glad we are making progress.

Now the question is: does this file /druid/segments/hana_oracle_druid_datastore_test_vin_15/20161101T000000.000Z_20161102T000000.000Z/2017-11-16T12_49_30.577Z/0/index.zip exist in Azure?

Is it readable by the Druid Historicals? For instance, can you download that file using HDFS commands to the local box where the Historical is running?
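For instance, from the Historical box, something along these lines (container and account names are placeholders):

hadoop fs -get wasb://<container>@<account>.blob.core.windows.net/druid/segments/hana_oracle_druid_datastore_test_vin_15/20161101T000000.000Z_20161102T000000.000Z/2017-11-16T12_49_30.577Z/0/index.zip /tmp/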

Can you also make sure that the Historical has the proper Hadoop configs to access the Azure file system?

Hi Slim

So the file is present in Azure, but I am not really sure if the Historical can read the data. I logged in as the druid user and was able to do a hadoop fs -get; I was able to get the index file and unzip it.

I also verified that all the jar paths for HDFS are set properly.

But looking at the errors, I guess the issue revolves around ZooKeeper, the Coordinator, and the Historical. In the Historical logs (attached) I could see:

New request[LOAD: hana_oracle_druid_test_adk_10_2017-08-01T00:00:00.000Z_2017-09-01T00:00:00.000Z_2017-11-17T07:40:51.759Z] with zNode[/druid/loadQueue/cbs-w2.example.com:8083/hana_oracle_druid_test_adk_10_2017-08-01T00:00:00.000Z_2017-09-01T00:00:00.000Z_2017-11-17T07:40:51.759Z].

I went to the znode path /druid/loadQueue/cbs-w2.example.com:8083/hana_oracle_druid_test_adk_10_2017-08-01T00:00:00.000Z

But I don’t find the entry there; I find it under another znode under loadQueue called cbs-m1.example:8083.
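We inspected the znodes with the ZooKeeper CLI, roughly like this (assuming the default client port):

zkCli.sh -server <zk-host>:2181
ls /druid/loadQueue
ls /druid/loadQueue/cbs-w2.example.com:8083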

I have attached the Historical log. Currently we have three Historical nodes configured; I plan to reduce that number.

I also see druid.zk.service.host having three entries, including cbs-w2.example.com:8083 and cbs-m1.example:8083.

I was not involved in the installation of the cluster, so I am not really sure if this would cause an issue between the Historical and ZooKeeper.

Thanks

Vinayak

druid_historical.txt (13 KB)

Hi, I don’t think it is a Coordinator or ZK issue.

It is a Historical issue.

Looking at the following logs, I can see that the HDFS storage module is using the local file system driver (RawLocalFileSystem) to read the file instead of the Azure driver.

I am not 100% sure, but this might be a configuration issue: your Historical node is not getting the right core-site.xml / hdfs-site.xml, which is why it is using the wrong file system. Can you check whether that is the case?


java.io.FileNotFoundException: File /druid/segments/hana_oracle_druid_test_adk_10/20170801T000000.000Z_20170901T000000.000Z/2017-11-17T07_40_51.759Z/0/index.zip does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:624) ~[?:?]
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:850) ~[?:?]
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:614) ~[?:?]
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:422) ~[?:?]
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146) ~[?:?]
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:348) ~[?:?]
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:786) ~[?:?]
at io.druid.storage.hdfs.HdfsDataSegmentPuller$1.openInputStream(HdfsDataSegmentPuller.java:107) ~[?:?]
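One quick way to check (assuming a standard HDP layout where the client configs live under /etc/hadoop/conf) is to confirm that directory exists, contains a core-site.xml with the fs.azure.account.key.* entry, and appears on the Historical's classpath:

ls /etc/hadoop/conf/core-site.xml
ps -ef | grep 'io.druid.cli.Main server historical'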

The other possible issue is that the metadata has a relative path. Can you please attach the metadata entry (row) for that segment?
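For example, something like this against the metadata store should show it (the payload JSON contains the segment's loadSpec with the path Druid will try to read):

SELECT payload FROM druid_segments WHERE dataSource = 'hana_oracle_druid_test_adk_10';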

Hi Slim

We decided to reinstall Druid on our cluster. We reduced it to one Historical node and one Broker. Our earlier installation had core-site.xml and the related XML files in the Druid path; we used a symlink to the originals.

So when we try to start Druid, it gives us an error that it could not find a file system called WASB (Windows Azure Storage Blob). Our core-site.xml has the default FS set to a WASB path, along with the associated container and key for it to connect to the Blob store. But since we tried to use the Druid Hadoop extension, I am guessing it didn’t recognize the WASB path and was looking for an HDFS path.

When we remove core-site.xml and the other related XML files, Druid starts fine. But when we build an index manually through the Druid REST API, it uses local as the deep storage.

When we try through Hive, we see the data being created in the WASB path mentioned in hive.druid.storage.storageDirectory, which is the expected behaviour you described earlier. Since Hive is configured to run on WASB, Hive is working as expected.

The issue now looks like Druid is not able to access WASB. We are going to try the Azure extension. In Druid's extensions path I see separate folders for MySQL and S3, along with HDFS; I believe I would have to create one for Azure and add the associated jars. One confusing thing was that I found an Azure jar in the druid-hdfs-extension folder, along with the other Hadoop jars. I am not sure if there is any configuration for the HDFS extension to behave like WASB.
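For the Azure extension, we are planning to pull it with Druid's pull-deps tool, roughly like this (run from the Druid install directory; the version is a placeholder and has to match our Druid build):

java -classpath "lib/*" io.druid.cli.Main tools pull-deps -c io.druid.extensions.contrib:druid-azure-extensions:<druid-version>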

Will keep you posted on this; let me know if you have any suggestions.

Thanks

Vinayak

Hi Slim

We finally got it resolved. The Azure extension worked, but it didn’t pull the segments into the cache.

We ended up using the HDFS extension. We needed to add the azure-hadoop jar to the Druid HDFS extension folder.

The azure-hadoop jar comes along with the WASB installation.

Thanks for all your help. I have a better understanding of Druid and the Druid-Hive integration now.
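For anyone hitting the same issue, the final working setup looks roughly like this (jar locations, the extension folder name, and the wasb path are placeholders for our HDP/HDInsight environment; the WASB driver is the hadoop-azure jar, i.e. the azure-hadoop jar mentioned above):

# copy the WASB driver jar (and its azure-storage dependency, if not already present) next to the HDFS extension
cp /usr/hdp/current/hadoop-client/hadoop-azure*.jar <druid>/extensions/druid-hdfs-storage/
cp /usr/hdp/current/hadoop-client/lib/azure-storage*.jar <druid>/extensions/druid-hdfs-storage/
# common.runtime.properties
druid.extensions.loadList=["druid-hdfs-storage"]
druid.storage.type=hdfs
druid.storage.storageDirectory=wasb://<container>@<account>.blob.core.windows.net/druid/segments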

Thanks

Vinayak