Druid is returning different results for the same query

Hi everyone,

We use Druid quite intensively and rely heavily on its results, but from time to time the same query returns different results. We have a cluster of ~30 instances, and sometimes we need to restart the historical process on some of them.

I noticed some odd behavior while restarting historical nodes: Druid returns a partial result until all nodes that hold segments for the requested data are up and running.

I understand why the result is partial in that situation, but I would expect an empty result or an error until the entire cluster is up and stable again.

In my investigation I forced Druid not to use the cache and fired the same query repeatedly while restarting all historical nodes. These are the results I got:

query_1

// If Druid is not stable should return an empty result

query_2
// If Druid is not stable should return an empty result

query_3
"version": "v1",
"timestamp": "2017-08-01T00:00:00.000Z",
"event": {
"requests": 1522346231, // Druid starts to return incomplete result
"imps": 3502031,
"clicks": 13745
}

query_4
"version": "v1",
"timestamp": "2017-08-01T00:00:00.000Z",
"event": {
"requests": 16895798204, // all metrics are increasing
"imps": 34952339,
"clicks": 128912
}

query_5
"version": "v1",
"timestamp": "2017-08-01T00:00:00.000Z",
"event": {
"requests": 17529400203, // still increasing
"imps": 36228979,
"clicks": 133302
}

query_6
"version": "v1",
"timestamp": "2017-08-01T00:00:00.000Z",
"event": {
"requests": 17529400203, // from now on Druid is stable and it returns the correct result
"imps": 36228979,
"clicks": 133302
}

query_7
"version": "v1",
"timestamp": "2017-08-01T00:00:00.000Z",
"event": {
"requests": 17529400203, // from now on Druid is stable and it returns the correct result
"imps": 36228979,
"clicks": 133302
}


Does anyone know why this is happening?

Is there a way to configure Druid to avoid that behavior?

There is no foolproof way, but there is a workaround.

Druid responses contain a header called "X-Druid-Response-Context", which is one way of reporting metadata back to the user. It is usually a serialized JSON map string. It may get truncated if the header value grows too large, in which case the string wouldn't be a valid JSON map (this may get fixed in the future, see https://github.com/druid-io/druid/pull/3319 ), but not much is reported inside that header for now, so it shouldn't be truncated.

Now, there is an undocumented query context flag that could potentially help you.

If you specify "uncoveredIntervalsLimit": 1 in the query context, the Druid broker returns an "uncoveredIntervals" key inside the map above whenever it could not find any segments for some time interval in the given query. You can use the presence of that key as a signal that partial results have been returned.

However, this only works if your data is ingested using a hash partitioning spec, where the number of shards in a partition set is known in advance.
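
As a rough sketch of the client-side check described above (the helper names are made up for illustration, and it assumes you already have the response headers as a plain mapping from whatever HTTP client you use), something like this could work:

```python
import json

# Header name and context keys as mentioned in this thread.
RESPONSE_CONTEXT_HEADER = "X-Druid-Response-Context"


def with_uncovered_intervals_limit(query, limit=1):
    """Return a copy of the query with uncoveredIntervalsLimit set in its context.

    Hypothetical helper; `query` is the usual Druid query as a dict.
    """
    context = dict(query.get("context", {}))
    context["uncoveredIntervalsLimit"] = limit
    updated = dict(query)
    updated["context"] = context
    return updated


def is_partial_result(headers):
    """Inspect the response context header for the uncoveredIntervals key.

    Returns True if the key is present, False if the header is missing or
    parses without the key, and None if the header could not be parsed
    (for example because it was truncated).
    """
    raw = headers.get(RESPONSE_CONTEXT_HEADER)
    if raw is None:
        return False
    try:
        response_context = json.loads(raw)
    except ValueError:
        # Truncated or otherwise invalid JSON: we cannot tell either way.
        return None
    if not isinstance(response_context, dict):
        return None
    return "uncoveredIntervals" in response_context
```

You would send the query returned by with_uncovered_intervals_limit(...) to the broker and then treat the result as unreliable (discard it, retry later, or raise an error) whenever is_partial_result(...) returns True or None. Note that a plain dict lookup is case-sensitive, so if your HTTP client does not give you a case-insensitive header mapping you may need to normalize the header names first.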

hth,
Himanshu

Thank you very much Himanshu, it was very helpful and solved the issue!

regards,

Filippo