I’m seeing some interesting behavior in Druid query performance depending on which dimension filters I apply. I was hoping I might get some insight here. This is the test I ran:
-Queries are made against a Druid data source with 22 dimensions.
-I made two timeseries queries with all parameters the same except for the filters. useCache was also disabled in both queries.
-Query1 had a filter with 3 values or’d, applied to dimension A. Dimension A has cardinality 4.
-Query2 had a filter with 3 values or’d, applied to dimension B. Dimension B has cardinality 7.
-I ran the queries by curling the Druid broker, prefixing the curl command with “time” to measure wall-clock time.
-Query1 averaged around 0.7s real time; Query2 averaged around 0.35s real time.
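For reference, this is roughly the shape of the Query1 payload I POSTed to the broker. The data source name, interval, dimension, and values below are placeholders, not the real ones from my cluster:

```python
import json

# Sketch of the Query1 payload; "my_datasource", "dimA", the interval,
# and the values a1..a3 are hypothetical stand-ins.
query1 = {
    "queryType": "timeseries",
    "dataSource": "my_datasource",
    "granularity": "all",
    "intervals": ["2015-01-01/2015-01-08"],
    "filter": {
        "type": "or",
        "fields": [
            {"type": "selector", "dimension": "dimA", "value": v}
            for v in ("a1", "a2", "a3")  # 3 values or'd on dimension A
        ],
    },
    "aggregations": [{"type": "count", "name": "rows"}],
    "context": {"useCache": False, "populateCache": False},
}

print(json.dumps(query1, indent=2))
```

Query2 is identical except the filter targets dimension B instead of dimension A.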
My understanding up to this point was that each dimension is stored as a column in the segment, and that when filtering, Druid pulls the internal bitmap for each value named in the filter out of that column’s index and or’s the bitmaps together. Given that, I don’t see how there could be such a large (2x) difference in query performance simply from filtering on different columns. Can anyone shed some light on this? Is there something else about the dimensions that could affect query performance that I am missing?
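To make sure I’m not misunderstanding the model, here is a toy sketch of how I picture the OR filter being evaluated. This is illustration only: Druid uses compressed bitmaps (Concise/Roaring), not Python sets, and all the row/value names are made up:

```python
# Toy model of bitmap-index filtering: one bitmap per (dimension, value),
# and an OR filter is the union of the bitmaps for the named values.

rows = [
    {"dimA": "a1"}, {"dimA": "a2"}, {"dimA": "a1"}, {"dimA": "a4"},
]

def build_index(rows, dim):
    """Build a per-value 'bitmap' (here, a set of row ids) for one column."""
    index = {}
    for i, row in enumerate(rows):
        index.setdefault(row[dim], set()).add(i)
    return index

index = build_index(rows, "dimA")

# Filter: dimA IN ("a1", "a2") -> union of the two value bitmaps.
matches = set().union(*(index.get(v, set()) for v in ("a1", "a2")))
print(sorted(matches))  # -> [0, 1, 2]
```

Under this model, the cost of the OR itself should depend on bitmap sizes, not on which column the bitmaps came from, which is why the 2x gap surprises me.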
I’m not sure if I’ve described the problem well enough; please let me know if I can supply any additional information.