S3 Parquet Files Ingestion - Missing Dimensions

Hey,

I’m trying to load Parquet files from S3 using the Hadoop indexing task.

I’m loading ~15 dimensions, and while some are loading just fine, others seem to be completely empty (cardinality 0) even though they do have values most of the time.

I suspect that only dimensions that have no empty values in any row are loaded correctly (e.g. “hour” and “type” are always filled), but I haven’t verified that (see the sketch below the log excerpt).

Any thoughts?

```
INFO [pool-31-thread-1] io.druid.segment.StringDimensionMergerV9 - Completed dim[profileid] conversions with cardinality[0] in 23 millis.
INFO [pool-31-thread-1] io.druid.segment.StringDimensionMergerV9 - Completed dim[info] conversions with cardinality[0] in 0 millis.
INFO [pool-31-thread-1] io.druid.segment.StringDimensionMergerV9 - Completed dim[hour] conversions with cardinality[24] in 8 millis.
INFO [pool-31-thread-1] io.druid.segment.StringDimensionMergerV9 - Completed dim[isexit] conversions with cardinality[0] in 1 millis.
INFO [pool-31-thread-1] io.druid.segment.StringDimensionMergerV9 - Completed dim[infoappname] conversions with cardinality[0] in 1 millis.
INFO [pool-31-thread-1] io.druid.segment.StringDimensionMergerV9 - Completed dim[type] conversions with cardinality[2] in 1 millis.
```
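In case it helps with debugging, here is a minimal sketch (assuming pyarrow and a locally downloaded copy of one of the part files; the file name below is a placeholder) that counts non-null values per column, which is one way to test the "only always-filled dimensions survive" theory:

```python
# Minimal sketch: count non-null values per column in one Parquet part file.
# Assumes pyarrow is installed and one part file was copied locally from S3;
# the file name is a placeholder, not the real object key.
import pyarrow.parquet as pq

table = pq.read_table("part-00480-sample.snappy.parquet")

for name in table.schema.names:
    col = table.column(name)
    non_null = len(col) - col.null_count
    print(f"{name}: {non_null} non-null of {len(col)} rows")
```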


This is the ingestion spec:


```
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "sample_7",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "date",
            "format": "yyyyMMdd"
          },
          "dimensionsSpec": {
            "dimensions": [
              "profileid",
              "info",
              "hour",
              "isexit",
              "infoappname",
              "type"
            ],
            "dimensionExclusions": [
              ... quite a lot
            ],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        {
          "name": "count",
          "type": "count"
        },
        {
          "name": "unqiuesessions",
          "type": "hyperUnique",
          "fieldName": "uniquesessionid",
          "isInputHyperUnique": false,
          "round": false
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2018-07-21/2018-07-23"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "s3n://mypath/part-0048*,s3n://mypath/part-0049*"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "reportParseExceptions": true,
      "jobProperties": {
        "fs.s3n.awsAccessKeyId": "",
        "fs.s3n.awsSecretAccessKey": "",
        "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3.awsAccessKeyId": "",
        "fs.s3.awsSecretAccessKey": "",
        "fs.s3.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec",
        "mapreduce.map.java.opts": "-server -Xmx1536m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
        "mapreduce.reduce.java.opts": "-server -Xmx2560m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
      },
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      }
    }
  }
}
```

If anyone else is struggling with this issue: when loading data from Parquet files, make sure the dimension names in the spec use the exact casing of the columns in the Parquet schema. Unlike Hive, Druid's Parquet ingestion is case-sensitive, so a dimension whose name differs from the column only by case ends up empty (cardinality 0).
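For reference, the exact (case-sensitive) column names can be read straight from the Parquet footer and copied into dimensionsSpec verbatim. A minimal sketch, assuming pyarrow and a local copy of one part file (the file name is a placeholder):

```python
# Minimal sketch: print the column names exactly as stored in the Parquet schema,
# so dimensionsSpec can use the same casing. The file name is a placeholder.
import pyarrow.parquet as pq

schema = pq.read_schema("part-00480-sample.snappy.parquet")
print(schema.names)
```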