Queries are very slow

Currently, I'm working with an Apache Druid warehouse that stores nearly 300 million rows and has a size of 44 GB. We are developing a Flask API that uses Gunicorn and Celery to run SQL queries against Druid. There is a React app that generates multiple requests to the Flask API, and the API then requests data from Druid with the appropriate SQL query. Our problem is that Druid's responses take a long time. For example, when we send our stack of requests (46 requests in the worst case), it can take close to 1.3 minutes to return the last response when the results are not cached. We have worked a lot on our front end and on API optimization, and two cache systems were implemented (one in the API and one in Druid), but performance is still bad. We suspect that the problem is located in the Druid datasource.

Our Druid datasource has the following characteristics:

  1. Total Data Size: 44.01 GB
  2. Segment size (rows): minimum 1, average 0.151M, maximum 0.637M
  3. Segment Granularity: Day
  4. Total Rows: 295,465,723
  5. Avg. Row Size: 148 bytes
  6. Replicated Size: 44.01 GB
  7. Compaction: not enabled

We then ran a query over our datasource and found that the segment with the largest number of rows has 636,688 rows and a size of 80,859,007 bytes.

I think we need to run compaction on our datasource to increase the number of rows per segment, following the recommendations about segments in the Druid documentation. Before re-ingesting our datasource, I want to know whether compacting the segments will improve query performance, or whether we need to take a different approach to this issue.

As additional information, our queries use COUNT(DISTINCT expr) and GROUP BY; however, we don't know whether this kind of expression is what causes the bad performance.

Thanks a lot

OK. A few questions:

  1. Has this query ever performed well, or is this a new query over existing data?
  2. Are you using sketches?
  3. How many days of data is this looking at, and how many segments do you have?

The answer to #3 will tell you whether compacting will help much. If you have too many small segments, performance will degrade (there is one thread per segment, and opening too many segments takes a lot of CPU), and compaction should help.

If you are using sketches, the COUNT DISTINCT should be pretty snappy.
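For reference, a sketch-based (approximate) distinct count in Druid SQL is expressed with APPROX_COUNT_DISTINCT rather than an exact COUNT(DISTINCT ...); the datasource and column names below are hypothetical placeholders, just to show the shape of such a query:

-- Approximate distinct count; "events", "country", and "user_id" are placeholder names
SELECT country, APPROX_COUNT_DISTINCT(user_id) AS unique_users
FROM events
GROUP BY country
ORDER BY unique_users DESC
LIMIT 100

Note that, by default, Druid SQL already treats plain COUNT(DISTINCT ...) as approximate unless useApproximateCountDistinct is disabled.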

Let us know how this goes… @Mark_Herrera and I will be keeping an eye on this.

Thanks for your help,

Regarding your questions:

  1. No, this query has not performed well before. This is our first implementation using Druid.
  2. No, we are not using sketches.
  3. We have 1,958 segments and roughly 1,424 days of data.

As you can see, the segments are very small compared with the advice in the Apache Druid documentation (around 5M rows per segment).

We will look at changing the query syntax to use sketches, but I'm still thinking about the segmentation of the datasource.

Thanks a lot

Depending on your use case, you may want to ingest the data as a sketch, especially if the column you are doing a count distinct on is higher cardinality and a highly accurate approximate answer is OK. It will also improve rollup if the column is high cardinality. If the column is low cardinality, it might not make a difference. You can test the queries by taking out the count distinct to see whether it makes a difference.
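As a rough sketch of what the query side looks like if the column is ingested as a theta sketch (this requires the druid-datasketches extension; all names below are hypothetical):

-- Distinct count over a column ingested as a theta sketch at rollup time
-- (requires the druid-datasketches extension; names are placeholders)
SELECT provider, APPROX_COUNT_DISTINCT_DS_THETA(user_id_sketch) AS unique_users
FROM my_datasource
GROUP BY provider
ORDER BY unique_users DESC
LIMIT 100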

In terms of the segments, compacting will give you fewer segments, but they will still be pretty small. This post is long, but it is comprehensive on how and where you should look to improve performance:

Hi crojas,

A few questions:

  • Could you please run the query in the Druid UI and confirm that Druid is the bottleneck?
  • What is the cluster topology? How many nodes and what type of machines does the cluster have?
  • Are you applying a filter on __time, and what is the typical time range applied in queries?
  • groupBy queries are costlier than others, and we may have to tune your merge buffers.
  • Your segments are on the smaller side, but not alarmingly so. Compaction may help, but I will need the above answers to guide you there.

Hi Vijeth,

  1. I ran the following query, which is standard in our application.

SELECT prestador_nombre, COUNT(*) FROM RIPS WHERE edad>11 AND edad<19 GROUP BY (prestador_nombre) ORDER BY 2 DESC LIMIT 100

Running without cache, it takes 2.58 s.

  2. We do not use a cluster. Our Druid is hosted on a single machine with 300 GB of RAM and an NVMe disk, and it was set up using a set of Docker containers.

  3. No, we are not applying filters on the __time column. Most of our queries filter on a column that contains a user ID; however, the data distribution of that column is not uniform.

  4. Could you suggest any documentation about tuning the merge buffers?

Hi crojas,

Thanks for sending these answers.

  • Right off the bat, we are losing some of the parallelism, as well as the fault tolerance, built into Druid because we are not using a cluster. I would highly suggest using one.
  • Without knowing the exact setup, it is going to be difficult to be precise with any suggestions, but please make sure you are allocating the right resources to the different services needed to run Druid, as described here: Basic cluster tuning · Apache Druid. (This page also has information on tuning the merge buffers.)
  • Not using a filter on __time means the query will do a full table scan, which is not going to be performant. We always want to put some filter on this column so Druid can prune segments; see the example query after this list. If that is not possible, the next best option would be to use single-dim or range partitioning on that user ID column you mentioned.
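For example, a time-bounded variant of the query you shared would look something like this (the 30-day window is only an illustration; use whatever interval your application actually needs):

-- Restricting on __time lets Druid prune segments instead of scanning all of them
-- (the 30-day window is just an example value)
SELECT prestador_nombre, COUNT(*)
FROM RIPS
WHERE edad > 11 AND edad < 19
  AND __time >= CURRENT_TIMESTAMP - INTERVAL '30' DAY
GROUP BY prestador_nombre
ORDER BY 2 DESC
LIMIT 100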

Hope some of this info is useful. Thanks!
