Druid Docker cluster - issue with ingestion

Hi,

I have created a Docker cluster on Windows as described in: Docker · Apache Druid

The Docker cluster is running successfully and all the services are up.

I am submitting an ingestion task using the command below, and it completes successfully:

curl -X POST -H 'Content-Type: application/json' -d @druid_spec.json http://localhost:8081/druid/indexer/v1/task

But when I log in to the Druid console (http://localhost:8888), I see that segments and datasources are empty. I am also unable to query the datasource; it returns a message saying the datasource object does not exist.

Please let me know what I am doing wrong.

Regards,
Deepu.

Hi D_K,

Welcome to the Druid forum.

I’d first start by checking that you are submitting the task to the Overlord’s API and not the Coordinator’s.

Then I would check if the ingestion task actually runs. If it does, maybe there is a failure in the task. If not, you can check the Overlord logs to get an idea of what’s happening.
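For instance, a minimal sketch of those checks, assuming the Overlord is on its default port 8090 and using the task ID the submission returns (<taskId> below is a placeholder):

# Submit the spec directly to the Overlord's task endpoint
curl -X POST -H 'Content-Type: application/json' -d @druid_spec.json http://localhost:8090/druid/indexer/v1/task

# Check the status of the task ID that comes back
curl http://localhost:8090/druid/indexer/v1/task/<taskId>/status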

Thanks!

Thanks a lot @Vijeth_Sagar for your response.

I did the following:

I am submitting the ingestion file to the Overlord using the script below:

#!/usr/bin/env bash

# Defaults; override via environment variables if needed.
: ${INPUT_FILE:="yellow_tripdata-index.json"}
: ${DRUID_HOST:="localhost"}
: ${DRUID_PORT:=8090}
: ${API_TASK_PATH:="druid/indexer/v1/task"}
: ${PROTOCOL:="http"}

# Submit the ingestion spec to the Overlord task endpoint.
curl -X POST -H 'Content-Type: application/json' -d @"${INPUT_FILE}" "${PROTOCOL}://${DRUID_HOST}:${DRUID_PORT}/${API_TASK_PATH}"

  • I am getting the following response from the ingestion script above:

$ sh 03-load_to_druid.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3755    0    57  100  3698  10875   689k --:--:-- --:--:-- --:--:--  916k
{"task":"index_yellow_tripdata_2022-03-14T08:38:20.896Z"}

  • When I go to the console at localhost:8888, I see that the ingestion task failed with the error:
    Request failed with status code 404

  • Please note that the ingestion spec and the CSV file for the data load reside in a folder on my Windows host.

I am not sure why the ingestion job is failing with a 404 error. I would appreciate your help with this.

Alternatively, do you have any link or documentation that gives a step-by-step installation guide and ingestion steps for a Docker installation of Druid?

Thanks @Vijeth_Sagar for your response.

  • I am now using the docker-compose from anskarl/druid-docker-cluster and am able to bring up the cluster with the following services, including the Overlord:

  • When I submit an ingestion task to the Overlord as shown below, I get a task ID back in the response, but in the console (localhost:8888) I see the ingestion job failed with the error Request failed with status code 404

#!/usr/bin/env bash

# Defaults; override via environment variables if needed.
: ${INPUT_FILE:="yellow_tripdata-index.json"}
: ${DRUID_HOST:="localhost"}
: ${DRUID_PORT:=8090}
: ${API_TASK_PATH:="druid/indexer/v1/task"}
: ${PROTOCOL:="http"}

# Submit the ingestion spec to the Overlord task endpoint.
curl -X POST -H 'Content-Type: application/json' -d @"${INPUT_FILE}" "${PROTOCOL}://${DRUID_HOST}:${DRUID_PORT}/${API_TASK_PATH}"

$ sh 03-load_to_druid.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3755    0    57  100  3698  10875   689k --:--:-- --:--:-- --:--:--  916k
{"task":"index_yellow_tripdata_2022-03-14T08:38:20.896Z"}

  • Please note that ingestion_spec.json and the sample data for loading (in CSV format) are in the same Windows folder: dataset

Could you help me identify the reason for the 404 error in the ingestion task?

That would indicate to me that the processes cannot talk to one another, somehow? Or perhaps that you have another network issue, such as the processes not being able to access one of the Druid dependencies?

TBH when I’m just testing something in Druid, I just spin up an Ubuntu VM and run one of the quickstarts… but that’s probably because I’m OLD!!

Thanks @petermarshallio. I will try a different docker-compose file and see if it works.

By the way, we are planning to use Druid as a replacement for Power BI. Can you or anyone suggest a link or installation guide for setting up Druid in AWS?

Also, has anyone tried connecting Angular/Highcharts with Druid to build custom dashboards? Any suggestions on building a custom dashboard with Druid as the backend?

If you are trying to ingest a file on disk, it has to be accessible to the cluster. I have not tested Druid on Docker myself, but I’d try with the file inside the container to make sure.
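For example, a quick way to test that (a sketch only; the container name, the target paths, and the CSV file name are placeholders for whatever your docker-compose actually uses):

# Copy the spec and the data file into the MiddleManager container, then point
# the paths/baseDir in the ingestion spec at the in-container location
docker cp dataset/yellow_tripdata-index.json middlemanager:/opt/druid/
docker cp dataset/yellow_tripdata.csv middlemanager:/opt/data/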

You could follow the clustered setup tutorial here; I am not sure I know of anything tailored to AWS, though:

I think that what you are seeing is a network access problem. You would need to mount your local folder into the pods so they have local access to your file, or copy the file into the middle manager pod. Alternatively, you can put the file in some other accessible storage like S3.
You might also be interested in a Kubernetes/minikube-based deployment that sets up MinIO locally to provide an S3-like platform. Take a look at this blog: Clustered Apache Druid® on your Laptop - Easy! - Imply
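A minimal sketch of the mount approach for docker-compose (the service name and in-container path are placeholders; adjust them to your compose file, and make sure the ingestion spec points at the in-container path):

# docker-compose.yml fragment
services:
  middlemanager:
    volumes:
      - ./dataset:/opt/data   # local ./dataset folder appears as /opt/data inside the container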

Sure, I will try the suggestions. Thanks.

Has anyone tried using Superset and Druid in production? We have a customer-facing application and want to create dashboards in Superset and then make them available in the web application. The data for Superset will come from Druid.
Any idea how this can be achieved? Please note that the Superset dashboards will need to be embedded into the web application.

Yes, we do have customers working with Superset in production.

Superset has support for Apache Druid (from their docs): Apache Druid | Superset.
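For reference, the usual way to wire the two together is the pydruid SQLAlchemy driver (a sketch; the host name and port below are placeholders for your Broker or Router):

# Install the Druid SQLAlchemy dialect in the Superset environment
pip install pydruid

# SQLAlchemy URI to enter when adding the database connection in Superset
druid://broker:8082/druid/v2/sql/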


Hi,
I managed to run the Druid cluster in docker-compose but am now facing a new issue.

When I ingest a second datasource, it says 0.00% available.

Please note I am running Druid on Java 1.8.

Druid version: 0.22.1

Attached is the screenshot:

I had a similar issue when setting up a cluster on Kubernetes. In my case I had deep storage configured locally, which caused the MiddleManager to store the segment files on its local disk. The problem is that the Historicals don’t share the same local disk. Changing deep storage to S3, whether by using an actual S3 bucket or by using MinIO in Kubernetes, solved it.

My previous post on this thread includes the blog post that describes how to set up MinIO on Kubernetes. Perhaps that will help.
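For completeness, a minimal sketch of the common runtime properties involved when switching deep storage to S3 (the bucket, path, and credentials are placeholders):

# druid-s3-extensions must also be present in druid.extensions.loadList
druid.storage.type=s3
druid.storage.bucket=your-deep-storage-bucket
druid.storage.baseKey=druid/segments
druid.s3.accessKey=YOUR_ACCESS_KEY
druid.s3.secretKey=YOUR_SECRET_KEY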

Late to the party @D_K but you can also see a list of related videos and things here:
https://www.druidforum.org/tag/superset

Hi @petermarshallio @Vijeth_Sagar @Sergio_Ferragut

I am trying to use the multi-stage query (MSQ) ingestion feature of Druid (24.0.0) to create a denormalized table in Druid from multiple huge datasets (facts and dimensions) sitting in an AWS S3 bucket.

Could you share a sample query for this using EXTERN? Also, can someone help me locate the section in the Druid documentation on using AWS S3 with EXTERN?

Regards,

Hey @D_K,
You’ll need the S3 extension in your Druid extensions loadList:

druid.extensions.loadList=["druid-s3-extensions"]

Overview of SQL Ingestion
The Extern Function
The first parameter to EXTERN is an Input Source; here are the S3 Input Source docs, which have many examples.

Then I think the easiest way to set up an example is to use the Druid console in the Query view and click on “Connect external data” to the right of the query tabs.

  1. Select “Amazon S3” and either use a list of URIs to S3 files or a list of S3 prefixes (an S3 prefix is a URI to a folder that contains the files),
  2. provide an IAM Access key,
  3. select the Secret access key type as “default”, and
  4. provide the Secret access key.

The result will be a generated SQL statement with the appropriate input parameters to the EXTERN function.
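For reference, a rough sketch of the kind of statement that results when denormalizing a fact source with a dimension source (the table name, bucket prefixes, columns, and join key below are placeholders; the input format and row signatures must match your actual files):

REPLACE INTO "yellow_tripdata_denorm" OVERWRITE ALL
SELECT
  TIME_PARSE(f."pickup_datetime") AS "__time",
  f."passenger_count",
  d."zone_name"
FROM TABLE(
  EXTERN(
    '{"type":"s3","prefixes":["s3://your-bucket/facts/"]}',
    '{"type":"csv","findColumnsFromHeader":true}',
    '[{"name":"pickup_datetime","type":"string"},{"name":"passenger_count","type":"long"},{"name":"zone_id","type":"long"}]'
  )
) AS f
JOIN TABLE(
  EXTERN(
    '{"type":"s3","prefixes":["s3://your-bucket/dimensions/"]}',
    '{"type":"csv","findColumnsFromHeader":true}',
    '[{"name":"zone_id","type":"long"},{"name":"zone_name","type":"string"}]'
  )
) AS d ON f."zone_id" = d."zone_id"
PARTITIONED BY DAY

Note that MSQ in Druid 24.0 executes joins as broadcast joins, so the dimension side should be small enough to broadcast; S3 credentials are picked up from the services’ configuration or supplied through the console wizard as in steps 2–4 above.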


Hi @Sergio_Ferragut
Any idea why the error below is happening when I run a join query between two datasets?

Error: Resource limit exceeded
Subquery generated results beyond maximum[100000]
org.apache.druid.query.ResourceLimitExceededException
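For context, that limit corresponds to Druid’s maxSubqueryRows setting. A hedged sketch of raising it for a single query over the SQL API (placeholder query and example value; raising it increases Broker memory use):

curl -X POST -H 'Content-Type: application/json' http://localhost:8888/druid/v2/sql -d '{"query": "SELECT ...", "context": {"maxSubqueryRows": 500000}}'

# Or raise the Broker-wide default in the Broker's runtime.properties
druid.server.http.maxSubqueryRows=500000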