Hash-based vs Single-dimension partitioning

In our Druid deployment we currently use hash-based partitioning with segmentGranularity set to DAY, and the segment size is 30GB per day. To improve performance we wanted to partition the data on a single dimension, so we ran a Hadoop batch job with partitionDimensions set to a single dimension, again with segmentGranularity set to DAY. The segment size is now 9GB per day. We ran a few queries, such as total counts and topN queries on some dimensions, and everything seems normal.

We would like to understand the differences, performance, and consequences of the partition types below:

Hash-based partitioning (all dimensions) vs

"partitionsSpec": {
  "type": "hashed",
  "numShards": 10
}

Hash-based partitioning with a single dimension vs

"partitionsSpec": {
  "type": "hashed",
  "numShards": 10,
  "partitionDimensions": []
}

Single-dimension partitioning

"partitionsSpec": {
  "type": "dimension",
  "targetPartitionSize": 100000000,
  "partitionDimension": ""
}

How is the data segmented in each case?

Will the data be balanced across the shards?

Why is the segment size 30GB per day with hash-based partitioning (all dimensions) but 9GB per day with hash-based partitioning on a single dimension?

Hey Rahul,

All of these partitioning modes control how segments within a time chunk are partitioned (Druid always partitions first into time chunks based on segmentGranularity, then may also partition further after that). In a nutshell, this is what you get with the three modes:

  1. Hash-based partitioning (all dimensions): Segments are partitioned based on a hash of all dimensions (including timestamp). Usually this leads to segments with the most even sizes but the biggest footprint. The reason for the larger footprint is that this method does not do anything special to promote data locality, which is important for compression.

  2. Hash-based partitioning (single dimension): Segments are partitioned based on the hash of a single dimension. It generally has a smaller footprint than option [1] because data locality is improved: rows with the same value of that dimension end up in the same segment. This tends to improve compression, and it also tends to reduce the sizes of dictionaries and indexes. One downside is that if the single dimension is skewed (e.g. a single value appears in tens of millions of rows) then it will create unevenly sized segments, perhaps very much so.

  3. Single-dimension partitioning: Segments are partitioned based on ranges of a single dimension. This has similar footprint benefits to option [2], but is often a better choice because the ranges are chosen based on what will create the most evenly sized segments. It has the added benefit that the Druid Broker can use it to prune the list of segments to query, if you’re filtering on the single partition dimension.
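To make the three modes concrete, here is a toy Python model of how rows could be assigned to shards in each case. It is only a sketch under stated assumptions: crc32 stands in for whatever hash Druid really uses, the boundary picker is a naive quantile split rather than Druid's actual algorithm, and all function names are hypothetical.

```python
import zlib
from bisect import bisect_right
from collections import defaultdict

NUM_SHARDS = 4

def stable_hash(text):
    # crc32 is a stand-in for Druid's real hash function (assumption).
    return zlib.crc32(text.encode("utf-8"))

def hash_all_dims(row):
    # Mode 1: hash of every field in the row.
    return stable_hash("|".join(row)) % NUM_SHARDS

def hash_single_dim(row):
    # Mode 2: hash of the partition dimension only (first field).
    return stable_hash(row[0]) % NUM_SHARDS

def pick_boundaries(values, num_shards):
    # Mode 3 helper: split points at evenly spaced positions of the sorted values.
    ordered = sorted(values)
    step = len(ordered) // num_shards
    return sorted({ordered[step * i] for i in range(1, num_shards)})

def range_single_dim(row, boundaries):
    # Mode 3: shard i holds values in [boundaries[i-1], boundaries[i]).
    return bisect_right(boundaries, row[0])

def shards_per_value(rows, shard_of):
    # For each partition-dimension value, the set of shards it lands in.
    seen = defaultdict(set)
    for row in rows:
        seen[row[0]].add(shard_of(row))
    return seen

# Toy data: 8 countries x 50 users; the first field is the partition dimension.
countries = ["ar", "br", "de", "fr", "in", "jp", "uk", "us"]
rows = [(c, "user%d" % i) for c in countries for i in range(50)]

boundaries = pick_boundaries([r[0] for r in rows], NUM_SHARDS)
spread_all = shards_per_value(rows, hash_all_dims)
spread_single = shards_per_value(rows, hash_single_dim)
spread_range = shards_per_value(rows, lambda r: range_single_dim(r, boundaries))

# Pruning: a filter on country == "jp" only needs one range-partitioned shard.
shard_for_jp = bisect_right(boundaries, "jp")
```

In this model, modes [2] and [3] keep every country in exactly one shard, while mode [1] scatters a country's rows over several shards. Only mode [3] lets you work out from the boundaries alone which shard a filter value can live in, which is the idea behind the Broker-side pruning mentioned above.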

In general, if you have a particular dimension you are often filtering by and that ‘naturally’ partitions your queries, option [3] is going to be best. Even without that, it might still be a good choice if it improves your compression without creating too much of an unevenly-sized segment problem.

One critical tip with [2] and [3] is that you should generally put the partitioning dimension first in the dimensionsSpec at ingest time. This will cause Druid to sort the rows within each segment by that dimension, further improving compression.
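As a rough illustration of why sort order matters for compression, the sketch below compresses the same column of values in arrival order and in sorted order. It is only an analogy: zlib stands in for Druid's own column compression, and the data and the helper name are made up for the example.

```python
import random
import zlib

countries = ["ar", "br", "de", "fr", "in", "jp", "uk", "us"]
column = [c for c in countries for _ in range(50)]

# Simulate an arbitrary arrival order with a fixed seed for repeatability.
shuffled = column[:]
random.Random(0).shuffle(shuffled)

def compressed_size(values):
    # Size in bytes of the zlib-compressed, comma-joined column.
    return len(zlib.compress(",".join(values).encode("utf-8")))

size_unsorted = compressed_size(shuffled)
size_sorted = compressed_size(sorted(shuffled))

print("unsorted:", size_unsorted, "bytes; sorted:", size_sorted, "bytes")
```

The sorted column consists of a few long runs of identical values, which a general-purpose compressor handles much better than the same values in scattered order.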

Gian