On many levels, Druid makes a strict distinction between dimension columns and metric columns. This distinction is most pronounced in how the two types of columns are stored in segment files, but it also permeates up to the tools built on top of Druid, such as facet.
Consider the following image, which is a dashboard built on top of facet:
- dimensions (categorical values): for filtering
- metrics (numerical values): for stats you care about
This scenario appears to be common in the ad-tech world, and Druid originated there, so it's easy to understand why Druid makes this distinction. In this setting, a table that contained only dimension columns or only metric columns would be useless.
However, Druid is not actually this limiting. With a topN or groupBy query, we can calculate how our result set is distributed across the values of a dimension. So a table that contains only dimension columns could be useful. Consider this dashboard, which Druid could power: except for time, all of its columns are categorical dimensions. So at present, here's how you should actually think about Druid's column types:
- dimensions (categorical values): for filtering, for stats you care about
- metrics (numerical values): _____________, for stats you care about
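As a rough illustration of the dimension-only case, here is a minimal sketch of a native topN query body, built in Python, that counts rows per value of a single dimension. The data source `page_views`, the dimension `country`, and the interval are made up for the example, and `build_topn_query` is just a hypothetical helper, not part of any Druid client.

```python
import json


def build_topn_query(data_source, dimension, interval, threshold=10):
    """Build a topN query body that ranks one dimension's values by row count."""
    return {
        "queryType": "topN",
        "dataSource": data_source,   # hypothetical data source name
        "dimension": dimension,      # the categorical column to break down
        "metric": "count",           # rank values by the aggregation below
        "threshold": threshold,      # how many values to return
        "granularity": "all",        # one result bucket for the whole interval
        "aggregations": [{"type": "count", "name": "count"}],
        "intervals": [interval],
    }


query = build_topn_query("page_views", "country", "2014-01-01/2014-02-01")
print(json.dumps(query, indent=2))
```

Note that no metric column is involved: the only aggregation is a row count. If the data was rolled up at ingestion time, the count aggregator counts stored rows rather than raw events.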
This points out a hole: why can't I filter on numeric values? This hole means that (as far as I know) Druid could not easily power this dashboard, this dashboard, or this dashboard. Notice how each of them has a histogram that you can select intervals on (drag over the histograms). If these dashboards were backed by Druid, they could render the histograms using the approximate histogram aggregator, but you would not be able to drag across the histograms to select a range of interest and add that to the rest of your filters.
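To make the missing piece concrete, here is a small sketch of what a dashboard would have to do when the user drags across a histogram: turn the selected buckets into a numeric range. The bucket boundaries and the helper name `selection_to_range` are assumptions for illustration only. The resulting range is exactly the thing that, as far as I can tell, has nowhere to go in a query against a metric column.

```python
def selection_to_range(breaks, first_bucket, last_bucket):
    """Map an inclusive run of selected bucket indices to a [low, high) value range.

    `breaks` holds the bucket boundaries, so n buckets have n + 1 breaks.
    """
    if not 0 <= first_bucket <= last_bucket < len(breaks) - 1:
        raise ValueError("selection is outside the histogram")
    return breaks[first_bucket], breaks[last_bucket + 1]


# Hypothetical boundaries for a 4-bucket histogram of some numeric metric.
breaks = [0.0, 10.0, 20.0, 30.0, 40.0]

# The user drags across the second and third buckets (indices 1 and 2).
low, high = selection_to_range(breaks, 1, 2)
print(low, high)
```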
What if we made it possible to have numeric dimensions? Or what if we had filterable metrics (not sure how you'd do this, maybe some approximate sketch magic)? Clearly, queries that involved such dimensions would be more expensive. Filtering on dimensions is fast today because we have specialized bitmap indexes that are useful when checking for exact matches. We couldn't use these for numeric dimensions or filterable metrics, but we could possibly include some other type of index. Or we could just not index these columns, and accept that filtering on metrics is slow and should be done with caution (similar to groupBy queries). If we could filter on metrics, Druid's column types would then look like:
- dimensions (categorical values): for filtering, stats you care about
- metrics (numerical values): for filtering, stats you care about
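The cost trade-off described above can be sketched in a few lines of Python. This is a toy model, not how Druid stores segments: plain Python sets stand in for compressed bitmaps, and the column data is invented. It only shows why an exact-match filter on an indexed dimension is a lookup, while a range filter on an unindexed numeric column has to scan every row.

```python
from collections import defaultdict


def build_bitmap_index(column):
    """Map each distinct value to the set of row ids holding it (a stand-in for a bitmap)."""
    index = defaultdict(set)
    for row_id, value in enumerate(column):
        index[value].add(row_id)
    return index


def filter_exact(index, value):
    """Exact-match filter: a single dictionary lookup, no row scan."""
    return index.get(value, set())


def filter_range_scan(column, low, high):
    """Range filter without an index: every row has to be inspected."""
    return {row_id for row_id, v in enumerate(column) if low <= v < high}


# Invented example data: one categorical dimension, one numeric metric.
countries = ["US", "DE", "US", "FR", "DE", "US"]
latencies = [12.5, 48.0, 31.2, 7.9, 22.4, 35.0]

country_index = build_bitmap_index(countries)
us_rows = filter_exact(country_index, "US")           # cheap lookup
mid_rows = filter_range_scan(latencies, 20.0, 40.0)   # full column scan
combined = us_rows & mid_rows                         # both filters ANDed together
print(sorted(combined))
```

Because both filters produce sets of row ids, they combine with a plain intersection, which is why an unindexed range filter could still sit alongside the existing dimension filters, just at a higher cost.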
This would blur the lines between these two column types. Categorical columns would be quicker for filtering, and slower for aggregating stats, whereas metrics would be slower for filtering, and faster for aggregating stats. But everything would be possible with every column type, which would be powerful.
To summarize, I'm not really proposing any specific enhancement, just pointing out what I see as a hole. Maybe this is a hole that Druid shouldn't fill (Druid is not a general-purpose database, after all), but if it weren't too hard to fill, it would make Druid a good bit more powerful. Or maybe someone here who is more familiar with Druid disagrees that this hole exists at all, and can show me how I can create my draggable histograms.