Intake from HDFS so utterly frustrating

Hi,

I am still fighting with https://github.com/apache/incubator-druid/issues/8840. With all the dependency injection hiding everything, I just don’t see how to fix it, at least not without making the local paths on the historicals incompatible.

It got so complicated to debug that I even set up a Druid staging system (ZooKeeper, Postgres, coordinator/overlord, indexer, historical) all on one box. Works like a charm; I should have done that a long time ago.

Now that I have some freedom to test (and am stuck on finding where in the code things go wrong), I thought: well, maybe we can just use S3 instead, since it seems to be the most widely used deep storage. As we were forced onto HDFS a long time ago (S3 transfer costs were starting to exceed hardware costs), I set up MinIO instead. Quick and painless.

The indexers get this in their runtime properties:

druid.storage.type=s3
druid.storage.bucket=druid-deepstorage
druid.storage.baseKey=dev
druid.s3.accessKey=foo
druid.s3.secretKey=bar
druid.s3.protocol=http
druid.s3.enablePathStyleAccess=true
druid.s3.endpoint.signingRegion=us-east-1
druid.s3.endpoint.url=http://my.host:9000/
druid.storage.useS3aSchema=true
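
For completeness: I assume the relevant bit is also having the S3 extension on the loadList of every service that touches deep storage (indexer, coordinator, historical). Roughly like this, as a sketch rather than my exact file:

druid.extensions.loadList=["druid-s3-extensions", "druid-hdfs-storage", "postgresql-metadata-storage"]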

And in the “index_hadoop” task, I add:

"jobProperties": {
  "fs.s3a.access.key": "foo",
  "fs.s3a.secret.key": "bar",
  "fs.s3a.connection.ssl.enabled": false,
  "fs.s3a.endpoint": "http://my.host:9000/",
  "fs.s3a.path.style.access": true
},

And bam, I get a very nice segment in MinIO, generated by Hadoop.

HOWEVER, in the metadata database, the payload has "loadSpec":{"type":"local"} (the path itself is the correct one inside the bucket).
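
For comparison, this is what I would expect the S3 pusher to write, if I read the deep storage docs right (the key pattern is just my understanding of the layout, not copied from my metadata store):

"loadSpec":{"type":"s3_zip","bucket":"druid-deepstorage","key":"dev/<datasource>/<interval>/<version>/<partitionNum>/index.zip"}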

This is beyond frustrating and I would really appreciate some ideas. All I want is a new segment with a valid metadata entry, built from files in HDFS. Or is there any other task type that can source from HDFS?
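
In case it is relevant: the input side is just files sitting in HDFS; with the standard static inputSpec that part of the task looks something like this (a sketch, the path is a placeholder and not my real one):

"inputSpec": {
  "type": "static",
  "paths": "hdfs://namenode:8020/some/input/dir"
}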

I am this close to just pushing the data back into Kafka and using "index_kafka"; that's how desperate I am.

Thanks!

Hagen