A case for "numeric dimensions" or "filterable metrics"

On many levels, Druid makes a strict distinction between dimension columns and metric columns. This distinction is most pronounced in how the two types of columns are stored in segment files, but it also permeates up to the tools built on top of Druid, such as Facet.

Consider the following image, which shows a dashboard built on top of Facet:

Basically:

  • dimensions (categorical values): for filtering
  • metrics (numerical values): for stats you care about

This scenario appears to be common in the ad-tech world, and Druid originated in the ad-tech world, so it’s easy to understand why Druid makes this distinction. In this setting, a table that contained only dimension columns or only metric columns would be useless.

However, Druid is not actually this limiting. With a topN or groupBy query, we can calculate the distribution of our result set on our dimensions. So a table that contains only dimension columns could be useful; consider this dashboard, which Druid could power. Except for time, all of its columns are categorical dimensions. So at present, here’s how you should actually think about Druid’s column types:

  • dimensions (categorical values): for filtering, for stats you care about
  • metrics (numerical values): _____________, for stats you care about

This points out a hole: why can’t I filter on numeric values? This hole means that (as far as I know) Druid could not easily power this dashboard, this dashboard, or this dashboard. Notice how each of them has a histogram that you can select intervals on (drag over the histograms). If these dashboards were backed by Druid, they could render the histograms using the approximate histogram aggregator, but you would not be able to drag across a histogram to select a range of interest and add that to the rest of your filters.

You might object and say that this is all possible with Druid right now. As far as I know, it would be possible, but quite painful and pretty hacky. For example, you could include a numeric column twice – once as an approximate histogram aggregator, and once as a discretized dimension – and try to hack together range selection based on those two columns, but I don’t think that would work very well. At ingest time you’d have to decide on the bins you use to discretize the numeric values, which is painful (because this one column might contain data from many customers, each with very different distributions). I can think of other hacky ways of backing this filterable histogram (using high cardinality columns and JavaScript filters, or groupBys, both of which would be expensive).
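To make the pain point concrete, here is a rough Python sketch of the discretization half of that workaround. The bin edges here are hypothetical; the point is that they must be fixed before ingestion, for all customers at once.

```python
import bisect

# Hypothetical bin edges, chosen once, before any data is ingested.
BIN_EDGES = [0, 10, 100, 1000]

def discretize(value):
    # Map a raw numeric value to a string bucket label that can be stored
    # as an ordinary (low cardinality) string dimension.
    i = bisect.bisect_right(BIN_EDGES, value) - 1
    if i < 0:
        return "<0"
    if i == len(BIN_EDGES) - 1:
        return ">=1000"
    return f"{BIN_EDGES[i]}-{BIN_EDGES[i + 1]}"

print([discretize(v) for v in (3, 42, 950, 5000)])
# ['0-10', '10-100', '100-1000', '>=1000']
```

A customer whose values all land between 0 and 10 ends up in a single bucket and loses all resolution, which is exactly the problem with committing to bins at ingest time.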

What if we made it possible to have numeric dimensions? Or what if we had filterable metrics (not sure how you’d do this, maybe some approximate sketch magic)? Clearly, queries that involved such dimensions would be more expensive. Filtering on dimensions now is fast because we have specialized bitmap indexes that are useful when checking for exact matches. We couldn’t use these for numeric dimensions/filterable metrics. But we could possibly include some other type of index. Or we could just not index these columns, and accept that filtering on metrics is slow and should be done with caution (similar to groupBy queries). If we could filter on metrics, Druid’s column types would then look like:

  • dimensions (categorical values): for filtering, stats you care about
  • metrics (numerical values): for filtering, stats you care about

This would blur the lines between these two column types. Categorical columns would be quicker for filtering, and slower for aggregating stats, whereas metrics would be slower for filtering, and faster for aggregating stats. But everything would be possible with every column type, which would be powerful.

To summarize, I’m not really proposing any specific enhancement, just pointing out what I see as a hole. Maybe this is a hole that Druid shouldn’t fill (Druid is not a general purpose database, after all), but if it weren’t too hard to fill, it would make Druid a good bit more powerful. Or maybe someone here who is more familiar with Druid disagrees that this hole exists at all, and can show me how I can create my draggable histograms :)

Hey Conrad,

Thanks for writing this up, the links to the example dashboards are especially helpful.

In my view ‘numeric dimensions’ make total sense, it’s just a question of figuring out how we’d want things to behave and then deciding how to implement it. I think what makes sense to me is that they should be part of the grouping key for ingestion (like string dimensions), you should be able to filter on them, they probably shouldn’t have bitmap indexes (people are probably going to be doing range filters rather than equality filters), and you should be able to aggregate them at query time, but you shouldn’t provide an aggregator for them at ingestion time.

So that points to the on-disk form looking like a numeric metric, but the ingestion behavior being more like a dimension.

It may be worth having min/max statistics on the columns or even on blocks of the columns to make range filtering faster. That might work out if the column is near the left hand side of the sort order. Otherwise range filters would be served by scanning the column, I guess? Is there anything better we could do?
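For what it’s worth, here is a toy Python sketch of what per-block min/max statistics could buy; this is only an illustration of the idea, not how a Druid column would implement it:

```python
def build_block_stats(values, block_size):
    # For each block of rows, remember where it starts and its min/max value.
    stats = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        stats.append((start, min(block), max(block)))
    return stats

def range_filter(values, stats, block_size, lo, hi):
    # Skip any block whose [min, max] cannot overlap [lo, hi];
    # only scan the remaining blocks row by row.
    rows = []
    for start, bmin, bmax in stats:
        if bmax < lo or bmin > hi:
            continue
        for row in range(start, min(start + block_size, len(values))):
            if lo <= values[row] <= hi:
                rows.append(row)
    return rows

values = [4, 7, 2, 9, 55, 60, 58, 70]
stats = build_block_stats(values, block_size=4)   # [(0, 2, 9), (4, 55, 70)]
print(range_filter(values, stats, 4, 50, 65))     # -> [4, 5, 6]
```

As noted above, this helps most when the column is correlated with the sort order; for a randomly ordered column most blocks will overlap most ranges and little gets skipped.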

Fwiw, there may still be some times where a discretized dimension is better than a full resolution numeric dimension. There are lots of data sets out there where discretizing an interesting metric will give you a good rollup ratio but storing the full resolution number will not. So the discretized version could be a huge space savings, and can be worth the loss of resolution.

It may be worth having min/max statistics on the columns or even on blocks of the columns to make range filtering faster. That might work out if the column is near the left hand side of the sort order. Otherwise range filters would be served by scanning the column, I guess? Is there anything better we could do?

There is indeed more that we could do. We could sort the values in such a way that we create an index that speeds up range queries. The idea here isn’t very hard to explain on a whiteboard, but a little awkward to explain conversationally…I’ll give it a shot.

Imagine that the raw values of your numeric column look like this: [4,7,2,9]. Implicitly, each of these values has a position in the array, so we could re-write this as tuples of (position, value) pairs: [(0, 4), (1, 7), (2, 2), (3, 9)].

We could then sort these pairs by the “value” member of each pair – e.g., sorted_pos_array = [(2,2), (0, 4), (1, 7), (3, 9)]

At query time we’d be presented with the range of values we want, [MIN, MAX]. We would then do a binary search on that array of pairs to find where MIN is located in our array of tuples – call this value min_position. We could quickly figure out which rows to include in our response by starting at sorted_pos_array[min_position], then reading in the positions of relevant rows by looping over the rest of sorted_pos_array until the values there no longer fall within the range of the query.

How much would all this cost? We’d have to pay the O( n log(n) ) price of sorting the column at ingest time. In terms of storage, we’d have to store another array with two values for each row to keep track of sorted_pos_array. At query time, the computational complexity of figuring out which rows to include in the result set is the log(n) price of looking up min_position via binary search, plus the cost of scanning over part of ‘sorted_pos_array’ – this cost grows linearly with the size of the result set. So if the size of the result set is much smaller than the size of your table, the savings will be significant.
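Here is a minimal Python sketch of the scheme described above (the names follow the description; this isn’t meant to reflect how Druid would actually lay the structure out on disk):

```python
import bisect

def build_sorted_pos_array(values):
    # (position, value) pairs, sorted by the value member of each pair.
    sorted_pos_array = sorted(enumerate(values), key=lambda pv: pv[1])
    # A parallel list of just the values, so we can binary search on it.
    sorted_values = [v for _, v in sorted_pos_array]
    return sorted_pos_array, sorted_values

def range_filter(sorted_pos_array, sorted_values, lo, hi):
    # O(log n) binary search for min_position, then a scan whose cost grows
    # with the size of the result set rather than the size of the column.
    min_position = bisect.bisect_left(sorted_values, lo)
    rows = []
    for pos, value in sorted_pos_array[min_position:]:
        if value > hi:
            break
        rows.append(pos)
    return rows

sorted_pos_array, sorted_values = build_sorted_pos_array([4, 7, 2, 9])
# sorted_pos_array == [(2, 2), (0, 4), (1, 7), (3, 9)]
print(range_filter(sorted_pos_array, sorted_values, 3, 8))   # -> [0, 1]
```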

Fwiw, there may still be some times where a discretized dimension is better than a full resolution numeric dimension. There are lots of data sets out there where discretizing an interesting metric will give you a good rollup ratio but storing the full resolution number will not. So the discretized version could be a huge space savings, and can be worth the loss of resolution.

Yes, if you know the bins you’re interested in in advance, then discretizing numeric dimensions into low cardinality dimensions will be cheaper than all this, both at indexing time and query time. For my particular use case, though, the relevant bins aren’t known in advance; in fact, they vary greatly depending on the query I carry out.

OK, so we’d be increasing the row size from 4 bytes to 12 bytes, while cutting the time to resolve a filter down from a full column scan to a binary search plus a scan of just the rows that match the filter (although the entries you are scanning are twice as big). That sounds worth it if you are doing a lot of range filters.

I think something like that would be great in Druid.

I wonder if we should start an official proposal thread on the changes, although I think there’s some work to do to get an understanding of the scope of changes required to support this use case.

In case this helps, some info on how Lucene does it.

See the “Discretely numerical” section for a summary.

http://blog.parsely.com/post/1691/lucene/

Start on slide 24

http://www.slideshare.net/VadimKirilchuk/numeric-rangequeries

Cheers,

Roger

Roger: thanks for the suggestions – if we do indeed propose to create numeric dimensions, then we should definitely look into Lucene’s tree-based solution.

Fangjin: To keep this idea alive and actionable we should indeed write up a proposal thread soon, but I don’t think we’ve explored the space of possible solutions enough yet to write a concrete proposal.

I’ll try to briefly frame the problem as well as some of the possible solutions. When we’ve settled on a solution, it’ll be time to come up with a proposal.

Problem: I can create an approximate histogram, but I can’t filter based on a numeric range visible in that histogram.

**Possible solutions**

  1. Require data to be pre-binned at ingest time (already supported)
  2. Create a new ‘approximate histogram’ dimension type similar to the approximate histogram metric, but which supports filtering.
  3. Create a general purpose ‘numeric dimension’ that supports numeric range queries. This solution may or may not involve some kind of specialized numeric index.
  4. Abolish the distinction between dimensions and metrics. Create a DSL that enables users to associate whatever data structures they want with each column—the choice of data structures (including string values, numeric values, inverse indexes, numeric indexes) will define what types of queries/aggregators are possible on a column.

I’ve ordered these from least change/lowest impact to most change/highest impact. Feel free to augment this list with other ways of solving the problem.

Just my two cents on this from a UI dev perspective. I always saw the separation between dimensions and measures as only meaningful in terms of the roll-up: dimensions get grouped while measures get aggregated. Roll-up is a key part of what makes Druid cost effective to run.

From a user’s perspective, I could not care less about what roll-up is.

This is why I went out of my way to abstract it out in Plywood (formerly known as Facet): https://github.com/implydata/plywood/blob/master/test/simulate/simulateDruid.mocha.coffee#L527

If configured to do so, Plywood can treat Druid dimensions as bucketed strings or as raw numbers; it then uses JavaScript tricks to make it work as expected.

Where I am going with this is that ideally you would be able to define a virtual attribute in Plywood where you say: “hey I have this thing called revenue, and it is composed of these metrics: min_revenue, max_revenue, sum_revenue and this dimension: revenue_buckets (which is bucketed to 0.05)”. Then you should be able to filter / aggregate revenue as your heart desires while still preserving a reasonable level of roll up.
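As a purely hypothetical sketch of how such a virtual attribute might be wired up (none of these structures are Plywood’s real API; the column names are the ones from the example above):

```python
# Hypothetical virtual attribute definition, not Plywood's real API.
REVENUE = {
    "bucketed_dimension": "revenue_buckets",  # string labels, bucketed to 0.05
    "aggregates": {"min": "min_revenue", "max": "max_revenue", "sum": "sum_revenue"},
}

def aggregate_column(attr, kind):
    # Aggregations on the virtual attribute are routed to the matching metric.
    return attr["aggregates"][kind]

def range_predicate(attr, lo, hi):
    # Range filters are routed to the bucketed dimension; the predicate could
    # be pushed down as a javascript-style filter over the bucket labels
    # (ignoring edge-bucket handling for simplicity).
    return lambda bucket_label: lo <= float(bucket_label) <= hi

pred = range_predicate(REVENUE, 0.10, 0.25)
print(aggregate_column(REVENUE, "sum"))   # 'sum_revenue'
print(pred("0.15"), pred("0.40"))         # True False
```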

I hope to be able to surface the above approach in Pivot (https://github.com/implydata/pivot) which, right now, exhibits the symptoms that you described of dimensions and metrics propagating up to the UI.

Thanks for your thoughts Vadim, I’m glad to hear that someone else is thinking along similar lines.

Hey Conrad,

#4 sounds great to me as long as we avoid generifying things so much that it hurts performance. The one caveat is that I don’t see how we can completely remove the distinctions between dimensions and metrics, since for pre-aggregation at ingestion time, one of them needs to be grouped on and the other needs to be aggregated. But that could potentially be a property of the ingestion rather than something that is really tightly linked to the storage format.

I think #3 is a good place to start. Developing it will require generifying two really pervasive assumptions: dimension values are strings, and dimensions have bitmap indexes. I think if those two assumptions are gone, we’d be in a world where implementing #4 seems a lot less daunting than it does now.

I agree that if/when we did #4 then segment files would get bigger for some use cases and smaller for others, and that’s okay. I think it’s also okay if pre-aggregation becomes impossible for some use cases, as in my mind it is one of the key features of Druid but is not the only one. There’s lots of people happily using Druid today for datasets that it cannot meaningfully pre-aggregate (like server metrics that have already been pre-aggregated at the host).

On your last question: I unfortunately do not have the bandwidth to do this work personally right now. But I do think it would make sense for the project.

+druid-development.

Thanks for your thoughts, Conrad. You touched on a lot of good points.

Most of the changes required to implement 3 or 4 are not so much at the storage level, which already supports a bunch of different types of columns. In fact, a dimension in Druid is technically just a combination of a dictionary, an integer column and a bitmap index, so we just need to generalize those types of constructs. I think bitmap-only columns, new types of indices (numeric), partial indices (similar to PostgreSQL), and dimensions without indices could be very useful as well. At the query level, however, the dimension/metric distinction is quite obvious, and the assumption that dimensions are strings runs pretty deep, so that will require some non-trivial work.
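A toy Python sketch of the “dictionary + integer column + bitmap index” combination just described (using plain sets in place of real bitmaps; this is an illustration of the structure, not Druid code):

```python
from collections import defaultdict

def encode_string_dimension(raw_values):
    # The three pieces a Druid string dimension boils down to:
    # a sorted value dictionary, an integer column of dictionary ids,
    # and one bitmap (here, just a set of row numbers) per dictionary id.
    dictionary = sorted(set(raw_values))
    ids = {value: i for i, value in enumerate(dictionary)}
    int_column = [ids[value] for value in raw_values]
    bitmaps = defaultdict(set)
    for row, value in enumerate(raw_values):
        bitmaps[ids[value]].add(row)
    return dictionary, int_column, dict(bitmaps)

dictionary, int_column, bitmaps = encode_string_dimension(["US", "FR", "US", "DE"])
# dictionary == ['DE', 'FR', 'US'], int_column == [2, 1, 2, 0]
# bitmaps == {0: {3}, 1: {1}, 2: {0, 2}}
```

Generalizing these constructs would mean allowing columns that keep some of these pieces and drop or swap out others (for example, a numeric column with no dictionary and no bitmaps, or with a numeric index instead).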

Removing the concept of metrics and dimensions altogether doesn’t seem crazy. That doesn’t necessarily mean we have to lose the ability to do aggregation at ingestion; we could just call it an optional on-the-fly group-by at ingestion time. It’ll be up to the user, however, to understand how the data was aggregated at ingestion time in order to make sense of it.

I think it makes sense for Druid to first lose the assumption that dimensions are strings. Doing so will solve the thornier query side issues, and also pave the way to remove the notion of metrics and dimensions altogether if we decide to go down that path.

The second step would be to generalize the storage side to accommodate a greater variety of column constructs. That would make the concept of dimension / metric not a property of the column itself, but a set of minimum characteristics a column has to satisfy in order to be used as such.

At that point, the notion of dimension or metric simply becomes a query concept, similar to how grouping columns and aggregates work in a SQL group-by statement. Removing dimensions and metrics altogether would then mean inventing new types of queries that don’t make those distinctions.

NB: Implementing index-less or index-only dimensions could be done independently of the other efforts, if needed. This would also add the ability to filter without an index, which would make range filters and some forms of numeric filtering possible.

Xavier

Hi

I am new to the open source community and to Druid. How can I install Druid?

How do I load a file into Druid and check the results? Please share any web links with information.

Thanks.

Hey Gayathri,

There is an official Druid quickstart at http://druid.io/docs/latest/tutorials/quickstart.html and we also have a quickstart for the Imply distribution at https://docs.imply.io/on-premise/quickstart.

Also, since you already necroed the thread, I’m at least happy to report that a year and a half later, the dream of “numeric dimensions” and even “filterable metrics” is a reality. What ended up happening is a partial solution 4. I still think a full solution 4 makes sense for the future.

This stuff exists now:

  • You can create long and float dimensions at ingestion time (see http://druid.io/docs/latest/ingestion/schema-design.html#numeric-dimensions); a small example spec fragment follows this list.

  • Long/float dimensions end up as long/float columns that use exactly the same storage format as a long/float metric.

  • Regardless of whether a long/float column was originally ingested as a dimension or metric, it’s always groupable, filterable, and aggregatable. This works because the storage format is the same anyway. It doesn’t always make sense (filtering on a metric that you pre-aggregated rarely makes sense) but it’s up to the user to make sure their queries are doing something useful in this regard.

  • Column capabilities do define what is possible as far as querying on the column, as suggested by Conrad.

  • Rollup at ingestion time is optional, as suggested by Xavier. If you disable rollup, Druid still syntactically wants a “dimensions” array at ingestion time, but it’s only used to hint column types and not for pre-aggregation.
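For illustration, a fragment of an ingestion-time dimensionsSpec with numeric dimensions might look like the following (shown as a Python dict for convenience; the column names are made up, and the authoritative syntax is on the docs page linked above):

```python
# Sketch of a dimensionsSpec fragment with numeric dimensions.
dimensions_spec = {
    "dimensions": [
        "page",                                      # ordinary string dimension
        {"type": "long", "name": "userLatencyMs"},   # numeric (long) dimension
        {"type": "float", "name": "price"},          # numeric (float) dimension
    ]
}
```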

And this stuff doesn’t exist:

  • Ability to add any kind of indexes (numeric or bitmap) to a long/float column

  • Ability to disable bitmap indexes for a string column

  • Some nicer syntax like the DSL that Conrad suggested

If anyone is interested in helping out with the latter three then that would be awesome.

Hi,

Is filtering on metrics possible now in Druid?