Failing to configure data ingestion from MinIO object storage

Hi Druiders,

I’m looking for a configuration that allows me to ingest data from a MinIO object storage bucket.

I’m using the local single-server nano-quickstart for testing.

I found some material about setting up deep storage on MinIO, but that is not what I need. I want to set up ingestion.

I tinkered with the jets3t config and with the ioConfig part of the ingestion spec, but the latter does not allow setting a specific endpoint.
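For reference, here is roughly the ioConfig I was experimenting with (a minimal sketch; the bucket and object names are placeholders). As far as I can see there is no field in there to point the S3 client at a MinIO endpoint:

  "ioConfig": {
    "type": "index_parallel",
    "inputSource": {
      "type": "s3",
      "uris": ["s3://my-bucket/path/to/data.json"]
    },
    "inputFormat": {
      "type": "json"
    },
    "appendToExisting": false
  }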

I mainly hit two errors: first ‘host not found’ when it tried to use bucket-style DNS, which I got past by overriding that config, and then “Unable to find a region via the region provider”, even though MinIO does not require a region.

I’m lost…

Have you tried letting the UI create the ingestion spec for you? You would just use the S3 option but provide the min.io-specific details like host name, bucket, keys, etc.

To my knowledge, the UI does not allow specifying a custom S3 endpoint.

The UI is just a convenient way to build the ingestion spec automatically. You can use it as a starting point: build the spec as you would for an S3 source, add or modify the details pertaining to min.io, and submit the result as your ingestion spec to Druid.

I know that, in the end, the UI produces a JSON spec describing the ingestion. But I haven’t found how to set a custom S3 endpoint in that spec.

This should help: Native batch ingestion · Apache Druid

I know this page. It offers AWS S3, Azure Blob Storage and Google Cloud Storage - via the “type” property - but no custom S3 endpoint.

I am not sure what you mean by a custom S3 endpoint. Are you running MinIO locally?

I’m running MinIO on its own FQDN (https://mydomain.tld/), but even locally I would have to set the local endpoint (http://localhost:9000) somewhere, and it seems there is no way to do it.

You can set the endpoint using: druid.s3.endpoint.url

This can be found at S3-compatible · Apache Druid

If you want, I can try setting up a local MinIO instance and testing it out, but this will take some time.

This extension allows you to do 2 things:

  • Ingest data from files stored in S3.
  • Write segments to deep storage in S3.
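As a rough sketch, pointing the extension at MinIO looks something like this in common.runtime.properties (the endpoint, keys and bucket names below are placeholders for your own setup):

  # Make sure the S3 extension is in the load list
  druid.extensions.loadList=["druid-s3-extensions"]

  # Point the S3 client at MinIO instead of AWS (placeholder values)
  druid.s3.endpoint.url=http://localhost:9000
  # MinIO does not care about the region, but this silences the region-provider error
  druid.s3.endpoint.signingRegion=us-east-1
  druid.s3.enablePathStyleAccess=true
  druid.s3.accessKey=your-minio-access-key
  druid.s3.secretKey=your-minio-secret-key

  # Optionally use the same MinIO for deep storage as well
  druid.storage.type=s3
  druid.storage.bucket=druid-segments
  druid.storage.baseKey=segments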

It lets you set a custom S3 endpoint for ingestion and for deep storage, but once and for all. That seems like a strong assumption. What if I want to use on-prem MinIO for deep storage and ingest data from AWS?

One of my clients uses another S3-compatible COTS object storage. They have two endpoints per region, one for low cost and one for high performance. They will want to mix offers and regions.

Don’t bother. Now that I know where to look, I’ll use your tip as a workaround for my experimentation - thanks for that. But it will not make its way to production, unfortunately.

Deep storage and ingestion are completely separate things, and Druid allows them to be in either the same or different S3 locations. There is no requirement for them both to use the same endpoint.
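For example (bucket names below are placeholders), the deep storage location is pinned in the runtime properties, while each ingestion task picks its own location in its inputSource:

  # Deep storage location, fixed in common.runtime.properties
  druid.storage.type=s3
  druid.storage.bucket=druid-deep-storage
  druid.storage.baseKey=segments

  # Ingestion location, chosen per task in the ingestion spec's ioConfig, e.g.:
  #   "inputSource": { "type": "s3", "prefixes": ["s3://raw-data-bucket/events/"] }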

That’s good news.

But then there should be at least two places where I can set an S3 endpoint, one for deep storage and one for ingestion. I can’t find them.

Maybe this will help?

I wrote this a while ago about setting up Raspberry Pis, but I was using Min.io both for deep storage and to ingest data – it works nicely…

Thanks for the message.

The thing is, MinIO works fine.

Just: how can I set up Druid to have deep storage on MinIO and ingestion from AWS S3?

AHA sorry @setop … I think I understand: you basically want to mix and match – deep storage in MinIO but ingestion from S3? I’ve not seen any posts around that have the explicit config you’ll need… the closest I got in the docs is here:
S3-compatible · Apache Druid
Native batch ingestion · Apache Druid

If you find a workaround / solution please do post here as I’d love to know…

I’m surprised this is not a more common case. After all, deep storage (where segments are stored) and the ingestion source are totally uncorrelated; they just happen to use the same protocol (S3). There are tons of implementations of S3-compatible object storage. Assuming that the endpoint will be the same for deep storage and ingestion is a bit strong.

I have no solution right now, except putting an additional component between the object storage and Druid to process the data (that would solve my other issue too). I’m afraid this will have to go through a feature request to allow configuring the S3 endpoint in each ingestion spec.
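To make the feature request concrete, what I would like to be able to write is something along these lines - purely hypothetical, the per-inputSource endpoint block is my own invention and not something the current S3 inputSource accepts:

  "inputSource": {
    "type": "s3",
    "uris": ["s3://raw-data-bucket/events/2021-06-01.json"],
    "endpointConfig": {
      "url": "https://minio.mydomain.tld",
      "signingRegion": "us-east-1"
    }
  }

If something like that existed, deep storage could stay on the global druid.s3.* endpoint while each ingestion chose its own.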

I’ve run into this setup too. Basically my k8s cluster sits on CephFS as the default StorageClass, and mounting a PVC for deep storage is fine, but then I need to get some data from a local S3 into Druid. The issues this local setup faces are:

  1. Druid might need to ingest from different S3 endpoints, and druid-operator doesn’t seem to allow dynamic S3 setups. I totally get setop’s comment about the two being separate concerns.

  2. Also, the local S3 sits behind a self-signed SSL certificate on its ingress, so an endpoint like https://s3.localhost.dev/BUCKET/test.json produces a bunch of errors around the ‘default’ region plus that “cannot locate certificate” error from Java.

So I guess if anyone could try setting up a local k8s running druid-operator, with a local self-signed MinIO as the S3 endpoint, it would prove just how difficult this is. Any ideas?
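For what it’s worth, the closest I have to a mitigation is the global S3 config again (a sketch only; the endpoint and file names below are placeholders for my local setup): pin a dummy signing region so the AWS SDK stops complaining about ‘default’, and make the JVM running the Druid services trust the self-signed certificate.

  # common.runtime.properties - placeholder values for a self-signed local MinIO
  druid.s3.endpoint.url=https://s3.localhost.dev
  # MinIO ignores the region, but the AWS SDK insists on having one
  druid.s3.endpoint.signingRegion=us-east-1
  druid.s3.enablePathStyleAccess=true

  # The self-signed certificate also has to be trusted by the JVM running the
  # Druid services, e.g. by importing it into the Java truststore:
  #   keytool -importcert -alias local-s3 -file s3-localhost-dev.crt -cacerts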