Help in dimensionsSpec specification

Hi,
Recently, in the "Mastering Data Layout" session at the Virtual Druid Summit, I learnt that the order in which dimensions are listed in the dimensionsSpec is the hierarchy in which indexes will be created. I have a few questions about this.

Currently, my Hadoop indexing JSON looks something like this:

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "my_data_source",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "tsv",
          "timestampSpec": {
            "column": "utc_hour_min",
            "format": "yyyyMMddHHmm"
          },
          "columns": [
            "utc_hour_min", "a", "b", "c", "d", "e", "f", "g", "h", "i"
          ],
          "delimiter": "\u0001",
          "listDelimiter": "\u0002",
          "dimensionsSpec": {
            "dimensions": [
              "a", "b", "utc_hour_min", "c", "d", "e", "f", "g"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        {
          "type": "longSum",
          "name": "h",
          "fieldName": "h"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "FIFTEEN_MINUTE",
        "queryGranularity": "FIFTEEN_MINUTE",
        "intervals": [
          "2020-09-12T00:00:00.000+0000/2020-09-12T00:15:00.000+0000"
        ]
      }
    }
  }
}

  • The order in which I have listed columns in the dimensionsSpec is the logical hierarchy in which we view our dimensions. But almost all my queries filter on dimensions d and e, and no queries at all filter on a and b. So I should place d and e first, right?
  • Do we also have to explicitly place utc_hour_min, my timestampSpec column, first in the dimensionsSpec, so that the dimensions list looks like the following? I ask because Druid always sorts by the timestamp column first.
    "dimensions": ["utc_hour_min", "d", "e", "a", "b", "c", "f", "g"]

Thanks,

Ashwin

Hi, I sent you a similar response on Slack, so read whichever you see first:

"But almost all my queries are filtered on dimension d and e."
This is the key bit, and it means that it would be best to put "d" and "e" first in the dimensions list in the dimensionsSpec (the order controls how rows are sorted within each segment, after the timestamp). It would also be good to consider partitioning by "d" or "e" as well: https://druid.apache.org/docs/latest/ingestion/index.html#partitioning
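
As a rough sketch of what partitioning by "d" could look like in a Hadoop ingestion spec: this fragment assumes single-dimension partitioning, would sit next to "dataSchema" inside "spec", and uses a targetRowsPerSegment value that is only a placeholder to tune for your data.

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "partitionsSpec": {
      "type": "single_dim",
      "partitionDimension": "d",
      "targetRowsPerSegment": 5000000
    }
  }
}
```

With 15-minute segment granularity, this will probably only make a difference if a single interval holds enough rows to be split into more than one segment.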

"Do we also have to explicitly place utc_hour_min column, my timestampSpec column, first in the dimensionsSpec, so that the dimensions listing in it will look like the following?"
Druid’s going to sort by timestamp regardless of what you do in your dimensionsSpec.
Btw, ordinarily, you wouldn’t include the timestamp column at all in the dimensionsSpec. I’m not sure why you decided to include it here. To properly answer the question, I’d need to understand what you were trying to accomplish. Can you give me some more info so I can help you out?
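
For illustration, if it turns out you don’t need utc_hour_min as a dimension, a dimensionsSpec with both points applied might look something like this. It is only a sketch based on your spec, and the order of the dimensions after "d" and "e" is just my guess at your hierarchy.

```json
{
  "dimensionsSpec": {
    "dimensions": ["d", "e", "a", "b", "c", "f", "g"]
  }
}
```

The timestamp itself is still read through the timestampSpec, so dropping it from the dimensions list does not remove it from the data.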

Thanks,
Matt