partitionDimension - usage

Hi,

I was reading about secondary partitioning in this thread: https://groups.google.com/forum/#!searchin/druid-user/partitionDimensions|sort:date/druid-user/QxZLwtoehFI/G4U0L8NkBgAJ.

I have the following questions regarding secondary partitioning.

a. Single-dimension partitioning, as given below:

"partitionsSpec": {
  "type": "dimension",
  "targetPartitionSize": 100000000,
  "partitionDimension": ""
}

Is this functionality available only for Hadoop indexing? Is it possible to achieve similar functionality using native (index) indexing?

b. I was trying the following using native (index) indexing. The data source is a single zip file containing data for 31 days. segmentGranularity is set to 'DAY', so there are 31 "time" segments.

I have a dimension geography_desc with 6 values (North America, South America, EU, …). 90% of our queries filter on this dimension, so I am trying to partition on it.

"tuningConfig": {
  "type": "index",
  "partitionDimensions": ["geography_desc"],
  "maxRowsPerSegment": 50000,
  "maxTotalRows": 20000
}

Using the above config, the indexing task created thousands of partitions, as expected. We intentionally set maxRowsPerSegment to 50000 just to check how segments get created.

But if we change the spec to the following:

"tuningConfig": {
  "type": "index",
  "partitionDimensions": ["geography_desc"],
  "maxRowsPerSegment": 50000,
  "maxTotalRows": 20000,
  "forceGuaranteedRollup": true
}

This created <100 segments, each with at most 50000 rows. But when we do count(*), we get only 2000+ as the result. The unified console shows all the segments, but numRows is 0 for every segment except one (one day). Incidentally, that one day has <50000 rows, so only one segment was created for it. So we are trying to understand where the data from all the other segments went (the segments do show up in the unified console).

After looking at deep storage, we realized the partitions are numbered "1", "2", … but "0" is missing. Only for the one day with <50000 rows is the segment numbered "0", and only that segment's data is returned by queries. So I think the "0"th segment is missing for all the other days (I am guessing). Could this be a bug?

BTW, this was done on version 0.15.0.

Any comments/input is welcome.

Regards, Chari.

Hi Chari,

For your first question, currently native indexing does not support range (single-dimension) partitioning. This table gives a nice comparison of the various batch ingestion methods:

https://druid.apache.org/docs/latest/ingestion/hadoop-vs-native-batch.html

Thanks,

Chi

> This created <100 segments, each with at most 50000 rows. But when we do count(*), we get only 2000+ as the result.

With rollup enabled, count(*) will generally differ from the number of input rows, since some input rows get combined. If you want to check that the two cases match, add a count aggregator as a metric during ingestion and do a SUM() on it at query time to determine how many input rows contributed to the data.
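As a sketch, the ingestion spec would carry a count metric like this (the metric name "count" and the datasource name "mydatasource" below are just placeholders for illustration):

```json
"metricsSpec": [
  { "type": "count", "name": "count" }
]
```

Then at query time, SELECT SUM("count") FROM "mydatasource" gives the number of input rows, while SELECT COUNT(*) FROM "mydatasource" gives the number of rolled-up rows.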

> In the unified console, it shows all segments, but numRows is 0 for every segment except one (one day). Incidentally, that one day has <50000 rows, so only one segment was created for it. So we are trying to understand where the data from all the other segments went (the segments do show up in the unified console).

I believe numRows = 0 indicates that the segment hasn't been loaded onto the historicals yet; you may need to wait some time for those segments to be completely loaded from deep storage onto your historicals.
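One way to check this is to query the sys.segments metadata table via Druid SQL (available since 0.14; 'mydatasource' below is a placeholder for your datasource name):

```sql
-- Segments that are published to deep storage but not yet served by historicals
SELECT "segment_id", "num_rows", "is_published", "is_available"
FROM sys.segments
WHERE "datasource" = 'mydatasource'
  AND "is_available" = 0;
```

Once is_available turns to 1 for all segments, the query results should reflect the full data.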

> After looking at deep storage, we realized the partitions are numbered "1", "2", … but "0" is missing. Only for the one day with <50000 rows is the segment numbered "0", and only that segment's data is returned by queries. So I think the "0"th segment is missing for all the other days (I am guessing). Could this be a bug?

The "0th" partition doesn't get a partition-number suffix in the segment identifier, e.g.:


wikipediax_2016-06-27T00:00:00.000Z_2016-06-28T00:00:00.000Z_2019-09-13T01:48:08.405Z

wikipediax_2016-06-27T00:00:00.000Z_2016-06-28T00:00:00.000Z_2019-09-13T01:48:08.405Z_1

wikipediax_2016-06-27T00:00:00.000Z_2016-06-28T00:00:00.000Z_2019-09-13T01:48:08.405Z_2
