Druid query returns partial result with APIs

Hi Team,

We started using Druid a few months ago. Our Druid cluster is quite small and simple, with only 1 data source and around 1 billion rows. A few of our queries match millions of rows when we run a count, but when we try to extract the results through Node/Python, only 5 to 10% of the rows come back (around 10K-15K). We are using Druid SQL to run the queries. I read that Druid SQL automatically converts SQL into “scan select” when the volume is high, so Druid does not require a large memory buffer. It seems like something is preventing Druid from returning the complete result.

Any idea which properties I should tune? Your reply will be appreciated.

Thanks

Hi Jignesh:

Please check out the information about the LIMIT clause in the Druid online docs: https://druid.apache.org/docs/latest/querying/sql.html

Hope it helps.

Thanks, Mark, for sharing your thoughts.

I have a requirement to export all the transaction details of a particular client, which has millions of rows. When I export through “select * from client_id=1”, it returns only partial results. Is this expected behaviour?

Best Regards,

Jignesh

Jignesh,

You may want to set up the query timeout so that Druid keeps servicing “in flight” queries. There are a bunch of parameters that should be tuned accordingly so that your query response is not truncated.

druid.server.http.defaultQueryTimeout

  • Default is 300000 milliseconds

druid.broker.http.readTimeout

  • Default is PT15M

druid.router.http.readTimeout

  • Default is PT8M

Also set druid.broker.http.maxQueuedBytes

  • Make sure it fits within your heap
  • This setting is per query, so usage could be up to druid.server.http.numThreads * druid.broker.http.maxQueuedBytes

If results are exceptionally large, also look into druid.query.groupBy.maxOnDiskStorage
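To make the "fits within your heap" check concrete, here is a rough sketch of the arithmetic in Python. The thread count, buffer size, heap sizes, and the `heap_fraction` safety margin are all made-up illustration values, not recommendations or Druid defaults.

```python
def worst_case_queued_bytes(num_threads: int, max_queued_bytes: int) -> int:
    """Upper bound on buffered bytes: every HTTP thread serving one query,
    each using its full per-query queue (numThreads * maxQueuedBytes)."""
    return num_threads * max_queued_bytes


def fits_in_heap(num_threads: int, max_queued_bytes: int,
                 heap_bytes: int, heap_fraction: float = 0.5) -> bool:
    """True if the worst case stays under a chosen share of the heap.

    heap_fraction is a hypothetical safety margin, not a Druid setting.
    """
    return worst_case_queued_bytes(num_threads, max_queued_bytes) <= heap_bytes * heap_fraction


# Illustration only: 60 threads with a 25 MB queue each is 1.5 GB worst case.
budget = worst_case_queued_bytes(60, 25_000_000)
ok_on_8g = fits_in_heap(60, 25_000_000, 8 * 1024**3)
ok_on_2g = fits_in_heap(60, 25_000_000, 2 * 1024**3)
```

If the check fails, either lower druid.broker.http.maxQueuedBytes or give the Broker more heap.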

“resultFormat” : “arrayLines” is better optimized for large result sets. It can be set at query time, in the body of the SQL request.
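As a rough sketch of what setting this at query time could look like from Python: the request body below asks for arrayLines and also passes a per-query timeout through the query context. The URL (a Router on localhost:8888), the table name `transactions`, and the 10 minute timeout are assumptions for illustration, so adjust them to your cluster.

```python
import json
import urllib.request

# Assumed address of the Router (or Broker); change to match your cluster.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql/"


def build_sql_payload(sql: str, timeout_ms: int) -> dict:
    """Body for the Druid SQL HTTP API, tuned for a large export."""
    return {
        "query": sql,
        "resultFormat": "arrayLines",        # one JSON array per line
        "header": True,                      # first line holds the column names
        "context": {"timeout": timeout_ms},  # per-query timeout in milliseconds
    }


def build_request(payload: dict, url: str = DRUID_SQL_URL) -> urllib.request.Request:
    """Wrap the payload in a POST request; nothing is sent until urlopen is called."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "transactions" is a hypothetical table name.
payload = build_sql_payload("SELECT * FROM transactions WHERE client_id = 1", 600_000)
request = build_request(payload)
```

Passing `request` to `urllib.request.urlopen` and reading the response line by line avoids holding the whole export in memory on the client side.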

Hope this helps.

Gaurav

I would also like to explain a few things. Every query in Druid has two parts.

  1. Compute

  2. Data Transfer

The compute part is super efficient, and its response time depends entirely on how much compute power the Druid cluster has. Data transfer has external dependencies, such as the network transfer rate between server and client. Druid sends the HTTP 200 response once compute is done and the response is initiated. From that point the data transfer may still take time, so you want to make sure the timeout is set properly and query responses complete within the timeout boundaries.

Druid will terminate the query response if it goes beyond the timeout, so it is up to the client app to check for a partial response.
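One possible way to do that check on the client, sketched in Python. It assumes the response was requested with the arrayLines result format and that, as I understand the Druid SQL API docs, a complete arrayLines response ends with one blank line as a trailer. `TruncatedResponseError` and `parse_array_lines` are names made up for this example.

```python
import json


class TruncatedResponseError(Exception):
    """Raised when the response ended before the blank-line trailer."""


def parse_array_lines(lines):
    """Parse arrayLines output given as an iterable of text lines.

    Returns the decoded rows, or raises TruncatedResponseError if the
    stream stopped mid-row or without the trailing blank line.
    """
    rows = []
    saw_trailer = False
    for raw in lines:
        line = raw.strip()
        if line == "":
            saw_trailer = True  # blank line marks a complete response
            break
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise TruncatedResponseError("response cut off mid-row") from exc
    if not saw_trailer:
        raise TruncatedResponseError("missing blank-line trailer")
    return rows


complete = '[1,"a"]\n[2,"b"]\n\n'
rows = parse_array_lines(complete.splitlines())
```

An HTTP 200 status alone does not prove the export is complete, which is why the check looks at the body rather than the status code.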

It works after increasing the timeout limit. Thanks Gaurav.

Sometimes we face an issue with segment availability: it never reaches 100%. If I increase the cache size to 1TB, it starts working, but this unnecessarily consumes 60% of the disk space. Is this the right solution, or is it expected behaviour?

Regards,

Jignesh