Multiple high cardinality dimensions

Hi.

I am new to Druid and just finished startup tutorials.

The problem I need to solve is quite clear.
Please help me check whether Druid is a good fit.

2 billion events per day come from Kafka.
There are 20 fields, and 10 of them are very high cardinality dimensions (each field can have up to 1 billion distinct values per day).

These are the fields and their cardinalities:

The second query would need to be something like:

select extra1, extra2, count(*), count(distinct user)
from datasource
where eventid = 1
  and extra3 = 'ooo'
  and __time = <yesterday>
group by extra1, extra2

Hey,

Nothing stands out in the first query, but you’ll potentially run into some issues with the second query depending on the combined cardinality of the (extra1, extra2) set after the filter is applied. You’ll either need enough memory to hold a sufficiently large hash map for each “bucket” within each segment, or enable spilling to disk. There are some details on the implementation of groupBys here -> http://druid.io/docs/0.9.2/querying/groupbyquery.html#implementation-details
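
For reference, your second query as a native groupBy would look roughly like this (untested sketch: the interval, the filter values and the maxOnDiskStorage number are placeholders, and the groupBy v2 context settings assume you’re on 0.9.2 or later; check the doc above for what’s available in your version):

{
  "queryType": "groupBy",
  "dataSource": "datasource",
  "granularity": "all",
  "intervals": ["2017-03-01/2017-03-02"],
  "dimensions": ["extra1", "extra2"],
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "eventid", "value": "1" },
      { "type": "selector", "dimension": "extra3", "value": "ooo" }
    ]
  },
  "aggregations": [
    { "type": "count", "name": "rows" }
  ],
  "context": {
    "groupByStrategy": "v2",
    "maxOnDiskStorage": 10000000000
  }
}

With groupBy v2, a non-zero maxOnDiskStorage lets each segment spill its merge buffers to disk instead of failing the query once the in-memory hash table fills up.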

Also, if you’re only interested in getting the number of distinct users, have a look at the hyperUnique aggregator -> http://druid.io/blog/2012/05/04/fast-cheap-and-98-right-cardinality-estimation-for-big-data.html
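
To get that into the query above, you’d ingest the user column as a hyperUnique metric and then aggregate on that metric at query time. A rough sketch, assuming the raw column is called user_id (adjust the names to your schema):

ingestion metricsSpec entry:
  { "type": "hyperUnique", "name": "user_unique", "fieldName": "user_id" }

query-time aggregation:
  { "type": "hyperUnique", "name": "distinct_users", "fieldName": "user_unique" }

It’s an approximate count (roughly 98% accurate, as the linked post explains), but it stores a small sketch instead of the raw values, which matters a lot at billion-distinct-value cardinality.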

Cheers,

Dylan

Hi Dylan,

For the second query, the cardinality of the extra1+extra2 combination after the filter is applied will not be so high, because we know the complexity of each eventid and the corresponding cardinality of the extra1,2,3,9 fields.

Do you mean it will be OK if the filter can narrow down the cardinality of the fields being grouped on?

Also, I will use HyperLogLog counting.

How about ingestion and rollup performance? At first I was actually worried about rollup and the 'curse of dimensionality': the number of possible combinations of extra1 x extra2 x extra3 x extra4 is 100000000 x 100000000 x 100000000 x 100000000. AFAIK Druid creates pre-calculated cubes; if I misunderstood, please let me know.

Hey,

Yeah, if your filter cuts down the set’s cardinality significantly, I wouldn’t foresee the issues I described.

I’m not so sure about how things will work out on the indexing side. Druid doesn’t pre-compute every possible dimension combination; rollup only merges rows whose dimension values actually co-occur in the data. Worst case, every event is unique and Druid generates 2 billion rows per day, which is perfectly manageable if you have a sufficient number of shards/tasks. You can also disable rollup through a parameter in the ingestion spec, which might be worth trying out.
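
That parameter lives in the granularitySpec of your dataSchema; a minimal sketch, assuming 0.9.2+ and daily segments (the granularity values here are just examples):

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "NONE",
  "rollup": false
}

With rollup off, Druid stores one row per ingested event; with it on, only rows sharing the truncated timestamp and identical dimension values get merged, so highly unique data simply rolls up less rather than exploding.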