GroupBy v2 strategy not merging results correctly when queried through broker

Hey, I recently upgraded from 0.9.1.1 to 0.9.2 and wanted to try out the new groupBy strategy, but it doesn’t appear to be aggregating correctly when queries are sent through the broker.

If we send the query directly to the historical then it does return the correct result.

It sounds similar to another topic (https://groups.google.com/d/topic/druid-user/TVyS-B-QQ2E/discussion), but all of our nodes are running 0.9.2 and have been restarted several times, and the problem persists.

In a groupBy on two dimensions (key1, key2), multiple events with the same keys come back.

If I add a filter for a specific (key1, key2) pair, it aggregates correctly.
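For illustration, the query shape is roughly this (the datasource, interval, and aggregator names here are made up, not our real ones):

{
  "queryType": "groupBy",
  "dataSource": "my_datasource",
  "granularity": "all",
  "dimensions": ["key1", "key2"],
  "aggregations": [{ "type": "longSum", "name": "count", "fieldName": "count" }],
  "intervals": ["2016-12-01/2016-12-08"]
}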

When I query immediately after re-indexing, before any merge tasks on that datasource have run, I get more unmerged events than when I query after merge tasks have run for that datasource.

The datasource and the query are both pretty basic (not even a nested groupBy or anything), so I feel like it’s more likely that I’m missing something rather than running into a bug.

Any help would be appreciated.

Hey Daniel,

Could you attach the query you’re using and the results you’re getting?

Also, do you have any realtime stuff going on in your setup? If so: what kind (realtime node, tranquility, kafka indexing service)? And do you still have this problem if you exclude the realtime interval (query for older intervals only)?

Attached is the query and some of the results.

There isn’t any realtime stuff going on.

I believe each event has one other event it should have been merged with (I didn’t check all of them, but every one I did check showed up twice). When I queried before the merge tasks ran, I was seeing more than 5 events that should have been merged together.

groupby_results.json (1.09 KB)

groupBy_query.json (1.29 KB)

I see you don’t have “groupByStrategy” in your query context… are you setting it through runtime properties? If so, what property are you setting and are you setting it on the brokers, or historicals, or both?

I have the below in the common.runtime.properties:
druid.query.groupBy.defaultStrategy=v2

druid.processing.numMergeBuffers=4


If I specify groupByStrategy=v1 in the query context then it works as expected.
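For reference, the only thing I’m adding to the query is a context block like this (the rest of the query stays the same):

"context": {
  "groupByStrategy": "v1"
}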

The broker has 3 processing threads and the historicals have 2 (druid.processing.numThreads).

What happens if you put “groupByStrategy”: “v2” in the query context?

The same behavior as when I don’t specify anything in the context.
I believe I’ve also tested specifying v2 in the context without the defaultStrategy runtime property set, and saw the same behavior.

Hmm, your query is pretty straightforward, I don’t see any reason why it should be breaking.

Could you please attach your broker and historical runtime properties (and common properties)?

And are you totally sure everything is running 0.9.2? Not even any “unsupervised” daemons hanging around?

Everything appears to be running 0.9.2 and I don’t see any old druid processes hanging around. Also, looking around in zookeeper all of the announcements and listeners in there are pointing to the currently running hosts/ports.

The other MySQL and ZooKeeper properties are set from the command line.

broker.runtime.properties (542 Bytes)

historical.runtime.properties (448 Bytes)

common.runtime.properties (3.83 KB)

I see you have caching enabled; if you add “useCache”: false and “populateCache”: false to your query context then do you get the correct results? If so, then I bet there is some problem with the caching.
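That is, something like this in the query context (only the two cache flags are new, everything else unchanged):

"context": {
  "useCache": false,
  "populateCache": false
}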

Ah, yeah, that did it! It also makes sense why it was working correctly when querying the historical directly, since caching is off on the historicals.

Is there a need to flush the cache or something before using v2 or is caching itself just not playing nicely?

I know that groupBys are on the “uncacheable” list by default, but it looks like that’s because they were overwhelming the cache (https://github.com/druid-io/druid/pull/638), not because caching wasn’t functioning. Still, it’s interesting that v1 was working correctly.

I’m 99% sure this is a bug somewhere in the caching for groupBy and not anything you’re doing wrong. I’ll raise an issue for it in a bit and investigate.

Could you try doing the query with “populateCache”: true but “useCache”: false? What kind of results do you get then?

"populateCache": true, "useCache": false
and
"populateCache": false, "useCache": true

both show the incorrect behavior.

Thanks for taking a look. If there’s anything I can help with, just let me know.

Hey Daniel,

I just raised this bug that I think you’re hitting: https://github.com/druid-io/druid/issues/3820

I haven’t tested this yet, but I think that if you move caching from broker to historical then groupBy v2 should work fine. Historical caching tends to scale better in large clusters anyway (it allows historicals to handle some of the merging work) so you might actually prefer this. If you have a chance to try that then please let me know.

The way to do that would be to set these properties on the historicals:

druid.historical.cache.useCache=true

druid.historical.cache.populateCache=true

druid.historical.cache.unCacheable=

And then set useCache and populateCache to false on the broker.
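For reference, on the broker side that would be something like this (assuming you’re using the broker cache runtime properties rather than per-query overrides):

druid.broker.cache.useCache=false

druid.broker.cache.populateCache=false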

Thanks for reporting this issue.

Thanks!

I’m following the GitHub issue and was going to do some testing tonight to see whether {historical cache + v2} performs better than {broker cache + v1}.
I think when we first started we saw that caching on the brokers worked better for us than caching on the historicals, but maybe the boost from v2 is enough to offset that now.