Queries on high cardinality dimension

Hi Everyone,

– Please bear with me if I am not asking it right.

I would have no worries with that usage. I would recommend using semantic partitioning for your segments instead of hash partitioning, though. If there is some interdependence between the two dimensions, that’s even less of a worry then.

—Eric

Hey Eric,

Thanks for replying.

When you say semantic partitioning, do you mean writing a custom partitioner that splits segments on the basis of dimension values?

Also, about cardinality: when do you think it will become a problem? What if the cardinality of dimension B reaches 1B? Do you know what the upper limit is?

Another query pattern I am looking at is counting unique values. Will the above assumption hold in this case as well?

Example:

{
  "queryType": "timeseries",
  "dataSource": "D",
  "intervals": "2017-12-01T08Z/2018-01-01T08Z",
  "granularity": "all",
  "context": {
    "timeout": 60000
  },
  "aggregations": [
    {
      "name": "main.countDistinct_B-d4c",
      "type": "cardinality",
      "fields": ["B"]
    }
  ]
}

Thanks

If you use semantic partitioning (yes, the dimension shard spec; it already exists, so you don’t have to write it), then even 1B values wouldn’t really be a problem given the queries you’ve shown.
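Roughly, this is configured through the partitionsSpec in the tuningConfig of a batch ingestion spec; something along these lines, where the dimension name and target size are just placeholders for your setup:

  "tuningConfig": {
    "type": "hadoop",
    "partitionsSpec": {
      "type": "dimension",
      "partitionDimension": "A",
      "targetPartitionSize": 5000000
    }
  }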

If you were to run a query that expected a 1B-row result set, that could be somewhat slow. But the queries you’ve shown would tend to generate rather small result sets, so I don’t think it will be a problem.

—Eric

Thanks, Eric. I will try this out and report back.

Hey Eric,

Is the partitionsSpec configurable for real-time ingestion? All the examples I have seen so far are for Hadoop ingestion.

Even here the code is using HadoopIndexConfig by default.

https://github.com/druid-io/druid/blob/master/indexing-hadoop/src/main/java/io/druid/indexer/partitions/PartitionsSpec.java

Am I missing something?

Thanks

It’s really batch only. For real-time workloads you can still ingest using hash partitioning and then re-index the real-time segments using the dimension shard spec. With the kind of data you are dealing with, doing this will also generally shrink the total size of your segments, due to better data locality and the compression benefits that come with that.
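As a rough sketch of that re-indexing pass, a Hadoop index task can read an existing datasource back in through the "dataSource" inputSpec and write it out again using the dimension partitionsSpec shown earlier; the datasource name and interval below are placeholders:

  "ioConfig": {
    "type": "hadoop",
    "inputSpec": {
      "type": "dataSource",
      "ingestionSpec": {
        "dataSource": "D",
        "intervals": ["2017-12-01/2018-01-01"]
      }
    }
  }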

Also, when you list the dimensions in your ingestion spec, put the one you are most likely to filter by first. The data within a segment is sorted according to the order in which you list your dimensions, so putting the commonly filtered dimension first gives those queries very good data locality.
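For example, if the other dimension (call it A, just a placeholder name here) is the one you filter by most, a dimensionsSpec along these lines would sort each segment by A first and then B:

  "dimensionsSpec": {
    "dimensions": ["A", "B"]
  }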

—Eric