[druid-user] Re: Retrieving large results from Druid throws error

This may be useful for you:

You may need to increase the query timeout periods in your config.
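For reference, here is a minimal sketch of passing a longer per-query timeout through the SQL query context (the host/port, datasource and column names are assumptions for illustration, not your setup):

    import requests

    # Hedged sketch: POST a Druid SQL query with an explicit per-query timeout.
    # Assumes a Router listening on the default port 8888 and a hypothetical
    # "transactions" datasource.
    DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql/"

    payload = {
        "query": "SELECT * FROM transactions WHERE department = 'finance'",
        "context": {"timeout": 300000},  # per-query timeout in milliseconds
    }

    resp = requests.post(DRUID_SQL_URL, json=payload)
    resp.raise_for_status()
    rows = resp.json()
    print(len(rows), "rows returned")

If I remember rightly, the Broker-side properties druid.server.http.defaultQueryTimeout and druid.server.http.maxQueryTimeout set the default and the upper bound for that context value, so check those too.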

That said, I’d (personally) like to know more about the use case, as any query that takes… well… longer than 5 seconds would make me think about using a different database anyway. Especially SELECT *…

Hi Peter,

We have the below business use case.

We have a pre-computed flat table with 100+ columns (2 TB in size) in Snowflake containing the transaction activity of various customers.

Users want to query this table with various filters (customer ID, transaction age, created date, department). The resulting data is typically 0.5 to 1 million records on average; users then feed this CSV data into their machines and proceed with analytics.

We handle around 100k query requests per day on this table.

While Snowflake handles the requests perfectly, the compute cost keeps climbing each month. To avoid that, we thought of pushing this data to Druid and having users query it from there.

The ingestion was perfectly fine, and querying for 0.1 to 0.3 million records holds up well. Beyond that, even if I pass the timeout as a parameter, it pulls only partial results. I see this is a known Druid bug:
https://github.com/apache/druid/issues/11422

Please help us understand how we can configure Druid for our business case. We went through the configuration below, but it did not help; we still face the same issue.

There’s an awesome video from @Gian_Merlino2 who talks about query execution in Druid – I think if you google the “beyond the basics” talk from the Virtual Druid Summit you will find it…

Querying with filters: make sure that your columns are properly indexed. And if there are columns that people will always use, or will use a lot, make sure that you have got the dimension order correct (most popular first) and that you have considered using the right sub-partitioning scheme.
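For illustration, the pieces of a native batch ingestion spec that control those two things might look roughly like this (written out as a Python dict for readability; the datasource, dimension names and row target are assumptions, not recommendations):

    # Hedged sketch of the dimension ordering and secondary-partitioning parts
    # of an ingestion spec. Names and sizes are placeholders.
    ingestion_spec_fragment = {
        "dataSchema": {
            "dataSource": "transactions",
            "dimensionsSpec": {
                # Most frequently filtered dimensions first.
                "dimensions": ["customer_id", "department", "transaction_age", "created_date"],
            },
        },
        "tuningConfig": {
            "type": "index_parallel",
            # Range partitioning on one dimension needs perfect rollup.
            "forceGuaranteedRollup": True,
            "partitionsSpec": {
                "type": "single_dim",
                "partitionDimension": "customer_id",
                "targetRowsPerSegment": 5000000,
            },
        },
    }

With segments range-partitioned on customer_id, a filter on that column only has to open the segments whose value range actually matches.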

Querying note: I didn’t see a time period in your query. Time is central to Druid. It is not just a timestamp at ingestion time (which drives the primary partitioning) but also matters at query time: the length of time that you pull out in a query is directly proportional to the number of segments that Druid must open and process. So if you don’t have enough cores for Druid to run through those segments in parallel within a time that suits your performance goals, you will not meet them.
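As a hedged example (dates, datasource and filter values are placeholders), that just means always bounding __time in the WHERE clause so Druid only has to touch the segments inside that interval:

    import requests

    # Sketch: constrain __time explicitly so segment pruning can kick in.
    sql = """
    SELECT *
    FROM transactions
    WHERE __time >= TIMESTAMP '2023-06-01 00:00:00'
      AND __time <  TIMESTAMP '2023-07-01 00:00:00'
      AND customer_id = 'C-12345'
    """

    resp = requests.post("http://localhost:8888/druid/v2/sql/", json={"query": sql})
    resp.raise_for_status()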

Result set size: Druid is a pyramid. Think map-reduce. The angle of those pyramid walls is REALLY important in Druid. If you scan 1m records and want to get 1m records back, you are not playing by Druid’s rules. But if you scan 1m records with a GROUP BY and get 10 records back, ORDERed and LIMITed, using DataSketches for high-cardinality data (all those kinds of tricks), you are well on the road to success.
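To make that concrete, here is a hedged sketch of the shape of query Druid rewards (aggregate, approximate, order and limit), assuming the druid-datasketches extension is loaded; the column names and interval are placeholders:

    import requests

    # Sketch: push the work into Druid with GROUP BY, sketches, ORDER BY and
    # LIMIT, so a scan over millions of rows comes back as a handful of rows.
    sql = """
    SELECT
      department,
      APPROX_COUNT_DISTINCT_DS_HLL(customer_id) AS unique_customers,
      COUNT(*) AS txn_count
    FROM transactions
    WHERE __time >= TIMESTAMP '2023-06-01 00:00:00'
      AND __time <  TIMESTAMP '2023-07-01 00:00:00'
    GROUP BY department
    ORDER BY txn_count DESC
    LIMIT 10
    """

    resp = requests.post("http://localhost:8888/druid/v2/sql/", json={"query": sql})
    resp.raise_for_status()
    print(resp.json())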

You can try to decode exactly what is happening in Druid by watching that video from Gian, and then looking carefully at the metrics for each process to discover exactly where the pressure is in the Druid fan-out / fan-in at query time. I suspect that this is at the Broker, so you may need to look at things like merge buffers and the like – but be cautious, as Druid is not a data warehouse and it is not a data lake…
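If it does turn out to be the Broker, the usual first knobs are its processing buffers; a hedged sketch of the relevant runtime.properties entries follows (shown as a Python dict for readability, with placeholder values rather than sizing advice):

    # Hedged sketch of Broker-side properties commonly tuned when groupBy
    # merging is the bottleneck. Values are placeholders, not recommendations.
    broker_runtime_properties = {
        "druid.processing.numMergeBuffers": 8,             # buffers used to merge groupBy results
        "druid.processing.buffer.sizeBytes": 500_000_000,  # size of each processing / merge buffer
    }

Metrics like query/time and query/node/time, emitted per process, are a reasonable way to see whether the time is going on the Historicals or in the merge at the Broker.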