How to use non-Amazon S3 for deep storage?

I am attempting to configure non-Amazon S3 deep storage for my Druid cluster. I’ve configured my common.runtime.properties file with the following properties for S3 storage and commented out the local storage properties:

  • druid.s3.accessKey: S3 access key. Must be set.
  • druid.s3.secretKey: S3 secret key. Must be set.
  • druid.storage.bucket: Bucket to store in. Must be set.
  • druid.storage.baseKey: Base key prefix to use, i.e. what directory. Must be set.
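For reference, a minimal sketch of how that might look in common.runtime.properties, with hypothetical placeholder values and the local-storage lines commented out (the local defaults shown here assume the quickstart layout):

# local deep storage, commented out (quickstart defaults assumed)
#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments

# S3 deep storage (placeholder values)
druid.storage.type=s3
druid.storage.bucket=your-bucket
druid.storage.baseKey=druid/segments
druid.s3.accessKey=your-access-key
druid.s3.secretKey=your-secret-key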

There is no mention of a property to specify for the S3 host, so based on searching past posts in the group I also added this property:

druid.extensions.coordinates=["io.druid.extensions:druid-s3-extensions:0.6.173"]

and created a file conf/druid/_common/jets3t.properties that contains the following:

# Uncomment to use s3 compatibility mode on GCE

s3service.s3-endpoint=mys3host.com

s3service.s3-endpoint-http-port=443

s3service.disable-dns-buckets=true

s3service.https-only=false

Despite all of this, after restarting all of the Druid processes, none of my data is being stored in S3 when I run my batch index job. Druid still appears to be using local storage, since I can successfully query the data after loading. I don’t receive any errors during the load.

Any suggestions on how to load data to non-Amazon S3 deep storage would be much appreciated. I am running Druid version 0.10.0.

Have the same issue - any feedback would be great.

Lots of experimentation on different configurations… no dice so far. I am able to write logs to S3, but can’t write index segments.

-g

Latest results are that I can get my log files to write to S3 but not the segments, which seems to be an issue others have faced. Just to be clear, I’m trying to connect to s3 on the endpoint “myendpoint.com” and load segments into the bucket “mybucket”.

I’ve tried a few options:

  • Add an entry to the jets3t.properties file for s3service.s3-endpoint=myendpoint.com. When I do this alone, my ingestion task uses an incorrect segmentOutputPath and the job fails:

"segmentOutputPath" : "s3n://mybucket/druid/segments"

java.lang.Exception: java.lang.IllegalArgumentException: Invalid hostname in URI s3n://mybucket/druid/segments

  • Don’t use jets3t.properties and just specify my S3 endpoint address as a prefix to my bucket in the common.runtime.properties file, like druid.storage.bucket=myendpoint.com/mybucket. If I do just this, it attempts to connect to Amazon rather than my local S3 and the connection fails:

2017-08-01T20:58:32,391 DEBUG [pool-20-thread-1] org.jets3t.service.utils.RestUtils$ThreadSafeConnManager - Get connection: {s}->https://myendpoint.com.s3.amazonaws.com:443, timeout = 0

2017-08-01T20:58:32,393 ERROR [pool-20-thread-1] io.druid.indexer.JobHelper - Exception in retry loop java.net.ConnectException: Connection refused (Connection refused)

  • Specify the endpoint in both locations listed above. If I do this, the job seems to make the most progress, but eventually it still dies because it ends up duplicating the hostname like this:

ERROR [pool-20-thread-1] io.druid.indexer.JobHelper - Exception in retry loop java.net.UnknownHostException: myendpoint.com.myendpoint.com: Name or service not known

At this point, I’m running out of ideas. Is anyone out there willing to share a sample configuration where they are successfully storing segments on a non-Amazon instance of S3?

I’m just about at the point of abandoning S3 as a backend data store and writing to Hadoop, but I am hopeful that someone out there has a working config they are willing to share.
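For comparison, the Hadoop fallback mentioned above would look roughly like this in common.runtime.properties. This is only a sketch, assuming the druid-hdfs-storage extension is added to the load list and using a hypothetical namenode address:

# deep storage on HDFS (sketch; hypothetical namenode address)
druid.extensions.loadList=["druid-hdfs-storage"]
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://namenode.example.com:8020/druid/segments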

Hi Brandon,
did you also set druid.storage.type=s3 in your common.runtime.properties?

Also, you should add “druid-s3-extensions” to druid.extensions.loadList in your common.runtime.properties.

Thanks,

Jihoon

On Wed, Aug 2, 2017 at 6:07 AM, Brandon Dean engrdean@gmail.com wrote:

Yes, I have set all of the below properties in the common.runtime.properties file:

druid.extensions.loadList=["druid-s3-extensions"]

druid.storage.type=s3

druid.storage.bucket=mybucket

druid.storage.baseKey=druid/segments

druid.s3n.accessKey=mykey

druid.s3.accessKey=mykey

druid.s3n.secretKey=mysecretkey

druid.s3.secretKey=mysecretkey

druid.indexer.logs.type=s3

druid.indexer.logs.s3Bucket=mybucket

druid.indexer.logs.s3Prefix=druid/indexing-logs

As I mentioned, the logs are being successfully stored in S3; it’s just the segments that I can’t get working.

Thanks,

Brandon

I got this to work using pithos + cassandra. Here were my configs:

conf/druid/_common/jets3t.properties

s3service.s3-endpoint=

s3service.s3-endpoint-http-port=

s3service.disable-dns-buckets=true

s3service.https-only=false

I haven’t yet configured HTTPS (this is an internal system), but I will still do that next. If your host supports HTTPS (it’s the default in s3cmd’s ~/.s3cfg file, for instance), then that should work.
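For illustration, with hypothetical values in place of the blanked-out endpoint and port, an HTTPS variant of that jets3t.properties might look like the following. Note that s3service.s3-endpoint-https-port is my assumption based on the jets3t defaults, not something taken from this thread:

# hypothetical internal endpoint, HTTPS enabled
s3service.s3-endpoint=s3.internal.example.com
s3service.s3-endpoint-https-port=443
s3service.disable-dns-buckets=true
s3service.https-only=true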

My common.runtime.properties is not much different from yours:

druid.storage.type=s3

druid.storage.bucket=

druid.storage.baseKey=druid/segments

druid.s3.accessKey=<access_key>

druid.s3.secretKey=<secret_key>

druid.indexer.logs.type=s3

druid.indexer.logs.s3Bucket=my-bucket

druid.indexer.logs.s3Prefix=druid/indexing-logs

I noticed that you have the s3n properties above. I do have those when I submit my batch jobs, but not here in the common config file.
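For what it’s worth, here is a sketch of where s3n settings typically live in a Hadoop batch indexing spec, i.e. as jobProperties under the task’s tuningConfig. The values are placeholders and this is an assumption about the general shape, not the exact spec used here:

"tuningConfig" : {
  "type" : "hadoop",
  "jobProperties" : {
    "fs.s3n.awsAccessKeyId" : "your-access-key",
    "fs.s3n.awsSecretAccessKey" : "your-secret-key",
    "fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem"
  }
}

The fs.s3n.* names are standard Hadoop settings; anything placed in jobProperties ends up in the job configuration that the indexing task hands to Hadoop.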

So all of that used to work as-is until yesterday, when I switched to Minio, and now Druid won’t store anything in S3. I need to get to the bottom of this one, but I thought I would let you guys know that I had a successful setup for the last 2 months.

Joel

https://github.com/apache/incubator-druid/issues/7875 was recently reported against master, but maybe the same issue exists in older versions.