OR filter performance degradation

I ran a bunch of load-testing queries and was baffled by some inconsistencies between almost identical queries. I narrowed the differences down to an OR filter over a single dimension. I would love to get some input on whether this is a common bottleneck or possibly a configuration issue. Let me know if you need more information about the setup.

Test Setup:

Druid 0.8.2

2 historical nodes on r3.8xlarge, 31 processing threads

1 broker node on r3.8xlarge, 31 processing threads
caching OFF

Query Setup:

The queries are not identical but are randomized to simulate expected production traffic; the only structural difference between the test cases is the size of the OR filter.

"filter": {
  "type": "and",
  "fields": [
    {
      "type": "or",
      "fields": [
        {
          "type": "selector",
          "dimension": "K",
          "value": "V1"
        },
        {
          "type": "selector",
          "dimension": "K",
          "value": "V2"
        },
        ...
      ]
    },
    ...
  ]
}

Results:

Note: The LoadTester wasn't built to spam or overload the system, so it's running at a measured pace rather than as fast as possible.

Control: ran without the OR selector at a pretty consistent 40 q/s, with ~0.3 sec response time.

Ran on a single historical without the OR selector: 39 q/s, 0.4 sec response time.

selectors | q/s | avg response | 50% broker qt | 80% broker qt | 95% broker qt | hist-0 segment wait | hist-1 segment wait
1 selector | 40 | .33 sec | 23 ms | 80 ms | 100 ms | 20 ms | 20 ms
2 selectors | 37 | .55 sec | 50 ms | 100 ms | 200 ms | 50 ms | 50 ms
3 selectors | 35 | .86 sec | 87 ms | 220 ms | 400 ms | 140 ms | 210 ms
4 selectors | 27 | 2.24 sec | 120 ms | 600 ms | 2,000 ms | 140 ms | 970 ms
5 selectors | 22 | 3.57 sec | 150 ms | 500 ms | 8,000 ms | 180 ms | 2,700 ms
6 selectors | 18 | 5.23 sec | 170 ms | 45,000 ms | 160,000 ms | 210 ms | 160,000 ms
10 selectors | 10 | 13.35 sec | 250 ms | 1,200 ms | 210,000 ms | 215 ms | 210,000 ms

Hoping someone can help explain/decipher these results for me based on the test cases. Once again let me know if you need more information or want me to run some different tests.

Thanks!

Michael

Hey Michael,

Is it possible this is just due to more rows needing to be scanned? As you add more OR filters, the number of rows that need to be scanned should go up, and query time should go up along with that.

Hey guys, I'm back with more specific and less varied results. I've added the queries I used along with row counts for the result sets being returned. I also ran the load-testing sequence twice, the second time adding an extra filter to narrow the result set down a little more. As you can see, the query without the OR filter didn't do amazingly, but it was still significantly faster than the 10-selector OR queries. Also, it didn't exhibit the cliff behavior, which is worrying. Once again, please let me know if you want any more data/information.

Query A:

{
  "queryType": "timeseries",
  "dataSource": "events",
  "intervals": [
    "2015-11-11/2015-12-01"
  ],
  "granularity": {
    "type": "period",
    "period": "P1D",
    "timeZone": "Etc/UTC"
  },
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "app_id",
        "value": "XXX"
      },
      {
        "type": "or",
        "fields": [
          {
            "type": "selector",
            "dimension": "attribution_day",
            "value": "2015-11-06T00:00:00+00:00"
          },
          {
            "type": "selector",
            "dimension": "attribution_day",
            "value": "2015-11-07T00:00:00+00:00"
          },
          {
            "type": "selector",
            "dimension": "attribution_day",
            "value": "2015-11-08T00:00:00+00:00"
          },
          {
            "type": "selector",
            "dimension": "attribution_day",
            "value": "2015-11-09T00:00:00+00:00"
          },
          {
            "type": "selector",
            "dimension": "attribution_day",
            "value": "2015-11-10T00:00:00+00:00"
          }
        ]
      },
      {
        "type": "selector",
        "dimension": "attribution_action",
        "value": "install"
      }
    ]
  },
  "aggregations": [
    {
      "name": "Count",
      "type": "count"
    },
    {
      "name": "Total",
      "type": "doubleSum",
      "fieldName": "count"
    }
  ]
}

Results:

selectors | rows | q/s | avg response | 50% broker qt | 80% broker qt | 95% broker qt | hist-0 segment wait | hist-1 segment wait
0 selectors | 43M | 14 | 8.0 sec | 135 ms | 870 ms | 8123 ms | 400 ms | 7K ms
1 selector | 132K | 41 | .2 sec | 35 ms | 400 ms | 400 ms | 5 ms | 5 ms
5 selectors | 580K | 16 | 5.1 sec | 120 ms | 800 ms | 170K ms | 219 ms | 170K ms
10 selectors | 1M | 9 | 16.2 sec | 200 ms | 850 ms | 245K ms | 622 ms | 245K ms

Query B:

{
  "queryType": "timeseries",
  "dataSource": "events",
  "intervals": [
    "2015-11-11/2015-12-01"
  ],
  "granularity": {
    "type": "period",
    "period": "P1D",
    "timeZone": "Etc/UTC"
  },
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "geo_country",
        "value": "CAN"
      },
      {
        "type": "selector",
        "dimension": "app_id",
        "value": "XXX"
      },
      {
        "type": "or",
        "fields": [
          {
            "type": "selector",
            "dimension": "attribution_day",
            "value": "2015-11-06T00:00:00+00:00"
          },
          {
            "type": "selector",
            "dimension": "attribution_day",
            "value": "2015-11-07T00:00:00+00:00"
          },
          {
            "type": "selector",
            "dimension": "attribution_day",
            "value": "2015-11-08T00:00:00+00:00"
          },
          {
            "type": "selector",
            "dimension": "attribution_day",
            "value": "2015-11-09T00:00:00+00:00"
          },
          {
            "type": "selector",
            "dimension": "attribution_day",
            "value": "2015-11-10T00:00:00+00:00"
          }
        ]
      },
      {
        "type": "selector",
        "dimension": "attribution_action",
        "value": "install"
      }
    ]
  },
  "aggregations": [
    {
      "name": "Count",
      "type": "count"
    },
    {
      "name": "Total",
      "type": "doubleSum",
      "fieldName": "count"
    }
  ]
}

Results:

selectors | rows | q/s | avg response | 50% broker qt | 80% broker qt | 95% broker qt | hist-0 segment wait | hist-1 segment wait
0 selectors | 4M | 30 | 1.78 sec | 200 ms | 825 ms | 2.5K ms | 350 ms | 350 ms
1 selector | 17K | 41 | .21 sec | 50 ms | 250 ms | 500 ms | 12 ms | 12 ms
5 selectors | 63K | 17 | 5.6 sec | 123 ms | 310 ms | 4K ms | 112 ms | 1200 ms
10 selectors | 120K | 9 | 15 sec | 150 ms | 650 ms | 29K ms | 200 ms | 3800 ms

Hey guys, here is some more compelling data from a couple of load tests I just ran. These tests are on literally the same data; the only difference is how the rows are selected. I've got two dimensions, attribution_day and attribution_hour, which both bucket a secondary timestamp so that it can be used in queries. As you can see, filtering on a single day vs. its 24 hours returns the same row count, but the performance is very different, just like in the other tests (the two filter shapes are sketched after the table below).

selector type | selectors | rows | q/s | avg response | 50% broker qt | 80% broker qt | 95% broker qt | 99% broker qt | hist-0 segment wait | hist-1 segment wait
1 DAY | 1 selector | 154,921 | 41 | .19 sec | 50 ms | 200 ms | 400 ms | 500 ms | 530 ms | 450 ms
24 HOURS | 24 selectors | 154,921 | 24 | 2.91 sec | 100 ms | 325 ms | 1.5K ms | 60K ms | 450 ms | 1.5K ms
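To illustrate the two filter shapes being compared (the attribution_hour values here are placeholders for illustration, not the exact values in the datasource):

1 DAY:

"filter": { "type": "selector", "dimension": "attribution_day", "value": "2015-11-06T00:00:00+00:00" }

24 HOURS:

"filter": {
  "type": "or",
  "fields": [
    { "type": "selector", "dimension": "attribution_hour", "value": "2015-11-06T00:00:00+00:00" },
    { "type": "selector", "dimension": "attribution_hour", "value": "2015-11-06T01:00:00+00:00" },
    ... (22 more hour selectors) ...
    { "type": "selector", "dimension": "attribution_hour", "value": "2015-11-06T23:00:00+00:00" }
  ]
}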

Thanks and let me know if you need more info,

Michael

Bump. Hoping for a little bit of input on this issue.

Do you notice time differences between the queries on your historical
nodes as well? The historicals emit a "query/segment/time" metric for
each individual segment; if you look at those, you can see how long
the historicals are spending while evaluating your query.

If that shows the exact same behavior, then my best guess would be
that somehow the dictionary lookup is slowing things down. If that is
the case, then I would expect to see the same performance cliff by
just creating a really large AND as well. Have you tried that?

--Eric

I don’t have the query/segment/time metrics readily available, but the query/segmentAndCache/time (caching is off so I’m assuming it’s the same metric) shows .5 ms avg for the DAY query and 5 ms avg for the HOUR query. So that’s a pretty significant difference that I didn’t pick up on before. Interestingly it doesn’t show the same difference between the two historicals that segment/wait/time showed. I’ll try a large AND query tomorrow and report back with some results.

Thanks,

Michael

.5 ms or .5 seconds?

.5 ms on a per-segment basis makes me wonder if your segments are very
small. What is the average size of your segment?

--Eric

The segments are on the small side. They’re between 1.5M-3.5M rows and between 150 MB-300 MB. You think that would cause this kind of issue?

Michael

With that size, no. If they were really tiny, maybe (it does the
dictionary lookup for the filter for each segment, so if there are a
lot of segments, the dictionary lookup could dominate the timing).

--Eric

I ran some load testing with a large AND filter (over the same segments, with a bunch of AND NOT selectors) and it did negatively affect query performance. However, it didn't fall off the same cliff as the OR selectors; it was a much shallower, more linear slowdown, which is far more manageable, especially considering that AND selectors typically shrink the result set while OR selectors expand it.

I also ran load testing on a different datasource (which is already much slower) with the same OR-selector variations returning the same rows. The HOUR OR-selector query performance fell off the same kind of cliff, with segment/wait/time and the 80th-percentile-and-above queries slowing to a crawl. The DAY selector, on the other hand, looked fine.

Michael

When you say you did "AND NOT" queries, I'm assuming you are doing the
equivalent OR query with DeMorgan's laws? I.e.

(a || b) == !(!a && !b)

Which should mean that as you add more things, it will still be
expanding scope instead of decreasing it.
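In Druid filter JSON, that transformation would look roughly like this (just a sketch, using two of the attribution_day values from your queries above):

{
  "type": "not",
  "field": {
    "type": "and",
    "fields": [
      { "type": "not", "field": { "type": "selector", "dimension": "attribution_day", "value": "2015-11-06T00:00:00+00:00" } },
      { "type": "not", "field": { "type": "selector", "dimension": "attribution_day", "value": "2015-11-07T00:00:00+00:00" } }
    ]
  }
}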

Are you specifying either roaring bitmap or CONCISE or just using the default?
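(The bitmap type is chosen at indexing time via the indexSpec inside the tuningConfig of your ingestion spec; roughly the fragment below, though check the docs for your Druid version to make sure indexSpec is supported there:)

"indexSpec": {
  "bitmap": { "type": "roaring" }
}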

Also, you didn't clarify before. You said "the
query/segmentAndCache/time shows .5 ms avg for the DAY query and 5 ms
avg for the HOUR query". Is that really 0.5ms? How selective is the
filter set that you are adding? The query numbers you are seeing of
even ~5ms per segment is rather perplexing with 3 million row
segments. Even our fastest benchmarks would say that it takes ~50ms
to scan over 3million rows, so something seems fishy there.

For example, if I take all of the information you have provided and
compare with the results you listed:

selectors | rows | q/s | avg response | 50% broker qt | 80% broker qt | 95% broker qt | hist-0 segment wait | hist-1 segment wait
0 selectors | 4M | 30 | 1.78 sec | 200 ms | 825 ms | 2.5K ms | 350 ms | 350 ms
1 selector | 17K | 41 | .21 sec | 50 ms | 250 ms | 500 ms | 12 ms | 12 ms
5 selectors | 63K | 17 | 5.6 sec | 123 ms | 310 ms | 4K ms | 112 ms | 1200 ms
10 selectors | 120K | 9 | 15 sec | 150 ms | 650 ms | 29K ms | 200 ms | 3800 ms

Just taking the "0 selector" case, we have

1) query scanned 4 million rows
2) segments are 1.5 to 3.5 million rows, so let's assume 1.5MM and say
that it scanned 3 segments.
3) per-segment scans take 5 ms, so there would've been 5 ms of
parallel processing time on 3 threads (one per segment) per historical
4) you have 2 historicals, each with 31 threads, which means we expect
60 (well, 62, but let's just take 60 for simplicity) threads to be
available for queries.
5) With 60 threads operating and 3 threads needed per query, we should
expect 20 concurrent queries. Each of which taking 5ms on the
historical nodes.
6) With 20 concurrent queries each taking 5ms on historical nodes, we
expect to be able to do 200 rounds of those per second, which means we
should achieve 200 * 20 = 4000 queries per second

Yet, you are saying that you saw 30 queries per second and somehow
achieved "query/wait" metrics of 350ms. "query/wait" happens when the
processing threads are backed up and work is waiting to be given to a
thread. But, given the 5ms values that you have reported and the low
query rate above, that level of "query/wait" should be impossible.

To achieve that level of "query/wait" one of the following must be true:

1) the query/segment/time (or query/segmentAndCache/time if caching is
turned off; you can make sure of this by adding "useCache": false and
"populateCache": false to the "context" of the query, as sketched
below) must be greater than 5ms
2) there are not actually 31 processing threads set on the machine
(the simplest way to verify this is to remove the -XX:MaxDirectMemorySize
setting and see what the error message tells you about memory usage)
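For reference, the context addition from (1) is just a couple of extra keys on the query object, roughly:

"context": {
  "useCache": false,
  "populateCache": false
}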

Just to simplify this, please run a new test with the following
parameters and then let's branch out:

1) 0 selectors, 4 million rows of data. Please report the number of
segments for this time period on this data source.
2) Bounce all processes
3) Run the single query 100 times as fast as possible, do *not*
collect the results. (This is done in order to give the JVM a chance
to do all of the JIT it's gonna do and to make sure the data is all
loaded into memory on all of the replicas)
4) Run the single query one time, report the results for the following
metrics as reported by Druid: "query/time" on the broker,
"query/node/time" per historical on broker, "query/time" on the
historicals, "query/segment/time" on the historicals, and "query/wait"
on the historicals.

I'm not sure how you are collecting metrics from your system, but if
you don't have an easy way to slice into the various dimensions of
the metrics and are just dumping them to a log file or something,
setting the "queryId" on the context before sending the final query
for reporting should let you grep out the metrics for the given
query. Otherwise, you can find the query in the broker's request
logs, yoink the id from there, and use that to grep stuff out.
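For example, something along these lines on the final query (the id value is just an example; use whatever is easy to grep for):

"context": {
  "queryId": "or-filter-baseline-001"
}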

These numbers will establish a baseline of performance without any
filters applied and also give us an understanding of any environmental
bottlenecks that might exist.

--Eric

I was not using DeMorgan's laws, which might have been a more effective test. I was just chaining a bunch of mutually exclusive AND NOTs together over a single dimension. We are currently just using the default bitmaps, but I would love to know more about the differences; I was under the impression that concise was the default.
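To illustrate, the chained filter I was testing looked roughly like this (placeholder values, not the exact ones used):

"filter": {
  "type": "and",
  "fields": [
    { "type": "not", "field": { "type": "selector", "dimension": "attribution_day", "value": "2015-11-01T00:00:00+00:00" } },
    { "type": "not", "field": { "type": "selector", "dimension": "attribution_day", "value": "2015-11-02T00:00:00+00:00" } },
    { "type": "not", "field": { "type": "selector", "dimension": "attribution_day", "value": "2015-11-03T00:00:00+00:00" } }
  ]
}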

The particular queries for the metrics you are referencing were actually over 154,921 rows spread across 20 days' worth of segments, so each query is scanning tiny pieces of 480 separate segments. I really should have passed this information on earlier, but I got loose with my documentation because I was focused on the differences between OR selectors rather than the big picture. Also, my strategy for testing load was based on earlier testing I'd done, in which I spun up a bunch of threads that each mimicked a user traversing the site, so part of the q/s being off is that there is some time spent waiting for specific threads to issue queries.

I’ll knock these tests out and get some results back to you but just to be clear, when you say 4M rows you’re asking for unfiltered queries over several segments?

Thanks!

Michael

Hey Eric, here are the results. Nothing too exciting:

Rows | Segments | q/segment/time | hist-0 q/time | broker q/time | hist-0 q/node/time | hist-0 q/wait/time
2991K | 1 | 184 ms | 186 ms | 206 ms | 206 ms | 0 ms

Let me know where you’d like me to go with it next. Had a hectic day but should be able to get results back to you much faster from here on out.

Thanks,

Michael

Cool, ok, that number makes a bit more sense. Assuming it's one of
the queries from above, it's a bit slower than I would expect, but ok.
From this we know that scanning a single segment takes ~180 ms and
there's about 20ms of overhead added by the broker.

Now, for this one query on this one segment, please add the filters
back and watch these same metrics. Let's see how they change.

--Eric

Alright, here are the two queries run once with just the secondary time-range filters:

Just to be clear, you ran the queries with the filters a couple of
times before doing those measurements, right? (Want to make sure the
dictionaries were primed in memory)

Assuming that they are primed, we see 6 milliseconds going to
essentially looking up the dictionary values for the HOUR. I'm
assuming you have 24 things in there, so I wonder if each 4 things is
adding one millisecond. If you drop it down to 12 does it tend to
only take an extra 3 milliseconds? (i.e. is the time increase linear
in the number of things on the OR?).

6 milliseconds for 24 lookups in memory seems slow to me. We might
need to look at optimizing that lookup (I've always thought that it
would be better represented as a trie instead of the sorted array that
it is today; I'm not sure if a trie would actually be a faster lookup,
though...)

--Eric

Unfortunately I don't have the secondary time bucketing to really test 12 hours, but I used a different DAY filter with a similar row count. I also primed all of these queries for 30 seconds before sending in the query with a query_id marker (including the ones from before that I've put in).

Selectors | Query | Rows | Segments | q/segment/time | hist-0 q/time | broker q/time | hist-0 q/node/time | hist-0 q/wait/time
1 | 1 DAY #1 | 20K | 1 | 11 ms | 13 ms | 14 ms | 12 ms | 0 ms
1 | 1 DAY #2 | 20K | 1 | 12 ms | 12 ms | 12 ms | 13 ms | 0 ms
1 | 1 DAY* | 10.3K | 1 | 11 ms | 11 ms | 12 ms | 12 ms | 0 ms
12 | 12 HOUR | 9.6K | 1 | 12 ms | 13 ms | 14 ms | 13 ms | 0 ms
24 | 24 HOUR #1 | 20K | 1 | 17 ms | 17 ms | 19 ms | 18 ms | 0 ms
24 | 24 HOUR #2 | 20K | 1 | 16 ms | 16 ms | 18 ms | 18 ms | 0 ms

*other day with fewer rows

Also, my main concern was less with the general selector slowdown and more with the OR-selector queries seeming to do much worse under heavy load. Judging by my percentiles, it looked like the slowest 95th-99th percentile queries started to time out and turned the whole system into molasses via q/wait/time. The drop in performance under load without the selectors was much more linear and predictable.

Right, so, I'm trying to get baselines for these things. Btw, how can
the data not be set up to take the 24-hour case down to 12? I thought
you were essentially doing "hour == 01 || hour == 02 || hour == 03 ||
hour == 04 || ... || hour == 24", which should make it pretty easy to
eliminate just 12 of those. You can then include an "|| day == theDay"
to get the same number of rows.

Either way, so we see some speed up, but definitely not linear...

Now, what happens if we take just this one query and send it under
load, but over a larger time period? I.e. let's scale the test up to
all 480 segments. Start by just having a single process running it
over and over again and see what performance you get. Hopefully it's
still 10ms per segment; with 62 total cores, that should be 480 / 62
rounds (~8 rounds) of 10ms, so hopefully each query takes roughly 100
to 200 ms. If you then add another client sending the same query as
fast as possible in serial, you will hopefully see the same thing
because you won't saturate the CPUs yet.

As you add more and more clients, we expect to saturate the CPUs,
which will result in query/wait and start to slow down some queries.
Hopefully we can do some timings with more and more clients until we
reproduce what you have experienced with query/wait slowing things
down greatly.

It will be interesting to see those same 5 timings from above for one
of the queries that runs while the system is backlogged.

Then please do the same thing with the DAY query and see how those
numbers change.

Fwiw, the expectation is that latencies just mathematically increase
with the saturation level of the CPUs. For the OR query, each segment
apparently adds ~6 milliseconds of overhead for the lookups; over 8
rounds of segment scans, that amounts to an expected 48 extra ms of
overhead.

--Eric