Hierarchical dimensions?

Does Druid support the notion of OLAP hierarchical dimensions?

For example:

World
- Americas
  - USA
  - Canada
- Europe
  - Germany
  - France
  - UK

In other words, data is held at the child (leaf) node level, and queries can be run to generate totals for Americas or Europe.

Not explicitly, although in your example you can get a similar result by storing “region” and “country” as two separate dimensions.

Gian

Doesn’t this double the storage requirements?

Odds are it won't be that bad, since the region dimension should compress very well compared to country.
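
A minimal sketch of the two-dimension approach described above (the datasource name, metric, interval, and broker URL here are placeholders): with "region" and "country" both stored as flat dimensions, a region-level total is just a groupBy on "region", and Druid aggregates across the underlying country rows at query time.

```python
import json
import requests  # assumes the requests library is available

# Hypothetical "sales" datasource ingested with both "region" and "country"
# as flat dimensions; a region-level total only groups on "region".
region_totals_query = {
    "queryType": "groupBy",
    "dataSource": "sales",            # placeholder datasource name
    "granularity": "all",
    "dimensions": ["region"],         # omit "country" to aggregate across it
    "aggregations": [
        {"type": "doubleSum", "name": "revenue", "fieldName": "revenue"}
    ],
    "intervals": ["2016-01-01/2017-01-01"]
}

# Default broker endpoint for native queries; adjust host/port for your cluster.
response = requests.post(
    "http://localhost:8082/druid/v2/",
    headers={"Content-Type": "application/json"},
    data=json.dumps(region_totals_query),
)
print(response.json())
```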

Would like to revisit the hierarchy question. I'm brand new to Druid, and I've been an OLAP guy for a long time using Essbase (Oracle). I would love to start using Druid as I see its potential, but without hierarchical dimensions it will be tough for it to replace Essbase, which is heavily used in financial analysis. Any thoughts on how easy it would be to get this working with proper hierarchies?

Pablo

Hi Pablo, did you ever get an answer to this question? Also considering Druid as an Essbase replacement. Thanks!

Andreas

We do exactly this type of query in our Druid instance using the lookup functionality. The fact data is stored at the lowest grain (in your example, country), and then we have lookup tables to compute higher levels (like region, continent, etc.) from the country dimension value. Our queries can then filter and group by whatever level we need using the appropriate lookup and extraction function on the dimension. We have hierarchies and other chains of dimensions at least 5 levels deep that run without problems.
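
A hedged sketch of the kind of query described above, assuming a lookup named country_to_region has already been registered on the cluster (datasource, metric, lookup name, and broker URL are all illustrative): the fact data keeps only "country", and a registeredLookup extraction function maps it to "region" for both grouping and filtering.

```python
import json
import requests

# Hypothetical example: roll country-grain rows up to region via a lookup.
# "country_to_region" is assumed to be a lookup registered on the cluster.
rollup_query = {
    "queryType": "groupBy",
    "dataSource": "sales",                    # placeholder datasource name
    "granularity": "all",
    "dimensions": [
        {
            "type": "extraction",
            "dimension": "country",           # fact data stored at country grain
            "outputName": "region",           # exposed as the higher-level value
            "extractionFn": {
                "type": "registeredLookup",
                "lookup": "country_to_region"
            }
        }
    ],
    # The same extraction function can filter at the higher level, e.g. Americas only:
    "filter": {
        "type": "selector",
        "dimension": "country",
        "value": "Americas",
        "extractionFn": {
            "type": "registeredLookup",
            "lookup": "country_to_region"
        }
    },
    "aggregations": [
        {"type": "doubleSum", "name": "revenue", "fieldName": "revenue"}
    ],
    "intervals": ["2016-01-01/2017-01-01"]
}

response = requests.post(
    "http://localhost:8082/druid/v2/",
    headers={"Content-Type": "application/json"},
    data=json.dumps(rollup_query),
)
print(response.json())
```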

The biggest challenge with this approach is handling your lookup tables. We ended up writing an HBase integration for our lookups, with a robust caching layer on top of it, and that solved our problems. Remember, depending on where it is specified in your query, a lookup may have to happen as often as once per row scanned on the historical node, so even lookups that execute in the single-digit-millisecond range will quickly add up over millions of rows. A robust caching layer can reduce this to micro- or nanoseconds per lookup. Once the vectorized query work is completed, allowing batching of rows during processing, this lookup process can probably be improved quite a bit more.
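
As an illustration of why the caching matters (this is not Druid's lookup code, just a toy sketch of the arithmetic): memoizing a remote lookup turns a per-row network round trip into an in-memory hash probe.

```python
import functools
import time

def fetch_region_from_store(country):
    """Stand-in for a remote lookup store (e.g. HBase); assume a few ms per call."""
    time.sleep(0.005)
    return {"USA": "Americas", "Canada": "Americas", "Germany": "Europe"}.get(country)

@functools.lru_cache(maxsize=100_000)
def cached_region(country):
    # After the first call per key, this is an in-memory hash lookup
    # (nanoseconds to microseconds) instead of a network round trip.
    return fetch_region_from_store(country)

rows = ["USA", "Canada", "USA", "Germany"] * 250_000  # a million scanned rows
start = time.time()
regions = [cached_region(c) for c in rows]
print(f"{len(rows)} lookups in {time.time() - start:.2f}s")
```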

Will