Indexing partitioned data from HDFS

Hey everyone,

I'm trying to index Parquet data from HDFS into Druid using index_parallel.

The data is created by Spark and partitioned by date. It looks like this:

.
└── mytable
    ├── date=2020-01-01
    │   └── parquet-data.snappy.parquet
    ├── date=2020-01-02
    │   └── parquet-data.snappy.parquet
    ├── date=2020-01-03
    │   └── parquet-data.snappy.parquet
    └── …etc

Is there any way I can specify only the root path (like /mytable) as the path in the ingestion spec?
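In other words, what I'd like to put in the ioConfig is roughly this (just a sketch, with the real path shortened):

"inputSource": {
  "type": "hdfs",
  "paths": "/mytable"
},
"inputFormat": {
  "type": "parquet"
}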

thank you.

Have you looked through this?

https://druid.apache.org/docs/latest/ingestion/data-formats.html#parquet
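That section also describes the parquet inputFormat's flattenSpec, which lets you pull out nested or derived columns. As a rough sketch (the field name and expression below are just placeholders):

"inputFormat": {
  "type": "parquet",
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      { "name": "someField", "type": "path", "expr": "$.some.nested.field" }
    ]
  }
}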

How have you defined your ingestion spec?

Cheers!

Oh, I didn't know that option was meant for this kind of case.
Here is my current spec:

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "hdfs",
        "paths": "/path/to/mytable"
      },
      "inputFormat": {
        "type": "parquet"
      }
    },
    "tuningConfig": {
      "type": "index_parallel"
    },
    "dataSchema": {
      "dataSource": "mydatasource",
      "granularitySpec": {
        "type": "uniform",
        "queryGranularity": "DAY",
        "rollup": false,
        "segmentGranularity": "DAY"
      },
      "timestampSpec": {
        "column": "date"
      },
      "dimensionsSpec": {
        "dimensions": [
          "field1",
          "field2",
          "field3"
        ]
      }
    }
  }
}

Following the docs, I then tried flattenSpec with every combination I could think of:

  • {
      "useFieldDiscovery": true
    }
  • {
      "fields": [
        { "name": "date", "type": "root" }
      ]
    }
  • {
      "useFieldDiscovery": true,
      "fields": [
        { "name": "date", "type": "path", "expr": "$.date" }
      ]
    }

but it still doesn't work.
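Each of these went inside the inputFormat, roughly like this (showing the last combination; everything else in the spec was unchanged):

"inputFormat": {
  "type": "parquet",
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      { "name": "date", "type": "path", "expr": "$.date" }
    ]
  }
}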

Meaning you still get an error?

The task succeeded, but it didn't load the data.