I recently created a Druid query that uses two hours of clickstream data to compute so-called trending articles. The corresponding queries are attached. I wrote one version as a groupBy query and another as a topN query so that I could compare results and performance.
When comparing the results of the queries, I noticed several inconsistencies:
When querying older data (served by Historical nodes), both queries give the same results. I checked the results and they appear to be correct. When querying recent data (served by MiddleManager nodes), however, the two queries give different responses, and when checked, both are wrong.
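For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two responses can be diffed. It assumes the standard response shapes (topN returns a list of time buckets, each with a `result` array; groupBy v1 returns a list of rows, each with an `event` object). The dimension name `article` and the metric name `clicks` are placeholders, not the names used in the attached queries, and the sample data is made up purely for illustration.

```python
def flatten_topn(response, dimension, metric):
    """Sum the metric per dimension value across all topN time buckets."""
    totals = {}
    for bucket in response:
        for row in bucket["result"]:
            key = row[dimension]
            totals[key] = totals.get(key, 0) + row[metric]
    return totals


def flatten_groupby(response, dimension, metric):
    """Sum the metric per dimension value across all groupBy rows."""
    totals = {}
    for row in response:
        event = row["event"]
        key = event[dimension]
        totals[key] = totals.get(key, 0) + event[metric]
    return totals


def diff_results(topn_totals, groupby_totals):
    """Return {key: (topn_value, groupby_value)} for every key that disagrees.

    A value of None means the key is missing from that result set.
    """
    keys = set(topn_totals) | set(groupby_totals)
    return {
        key: (topn_totals.get(key), groupby_totals.get(key))
        for key in sorted(keys)
        if topn_totals.get(key) != groupby_totals.get(key)
    }


# Made-up sample responses; "article" and "clicks" are hypothetical names.
topn_response = [
    {
        "timestamp": "2016-01-01T00:00:00.000Z",
        "result": [
            {"article": "a", "clicks": 10},
            {"article": "b", "clicks": 7},
        ],
    }
]
groupby_response = [
    {"version": "v1", "timestamp": "2016-01-01T00:00:00.000Z",
     "event": {"article": "a", "clicks": 10}},
    {"version": "v1", "timestamp": "2016-01-01T00:00:00.000Z",
     "event": {"article": "b", "clicks": 5}},
    {"version": "v1", "timestamp": "2016-01-01T00:00:00.000Z",
     "event": {"article": "c", "clicks": 2}},
]

mismatches = diff_results(
    flatten_topn(topn_response, "article", "clicks"),
    flatten_groupby(groupby_response, "article", "clicks"),
)
for key, (topn_value, groupby_value) in mismatches.items():
    print(key, "topN:", topn_value, "groupBy:", groupby_value)
```

Keys that appear only in the groupBy output are expected when the topN threshold cuts them off; the interesting cases are keys present in both with different values.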
This raised a few questions:
Is there a difference in how queries are processed on MiddleManager nodes versus Historical nodes?
Considering both queries, how can the answers differ so much? The approximation used in topN queries should not cause this big a difference, right?
How can this be fixed?
trend-groupby (1.54 KB)
trend-topn (1.35 KB)