Issue when specifying dimensions: no columns are ingested into Druid, only metrics show up

Hi,

I have an ORC file that I need to ingest, and I'm able to ingest it successfully using the spec below.

I didn't have a column list, so I specified one in the spec.

Now I don't want every column to end up as a dimension, so I want to give dimensions a list containing only the columns I want in Druid.

I change "dimensions": in the spec below to "dimensions": ["a","b"].

However, whenever I do this, the ingestion runs but only shows timestamp, count, and d, i.e. the timestamp and the metrics, with no dimensions at all.

What is happening here, and how can I correct it?

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "test",
      "parser": {
        "type": "orc",
        "parseSpec": {
          "format": "orc",
          "timestampSpec": {
            "column": "timestamp",
            "format": "millis",
            "missingValue": "2019-11-28T12:00:00Z"
          },
          "dimensionsSpec": {
            "dimensions": []
          }
        },
        "hasHeaderRow": false,
        "listDelimiter": ",",
        "columns": ["a", "b", "c", "d", "timestamp"]
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        },
        {
          "type": "longSum",
          "name": "d",
          "fieldName": "d",
          "expression": null
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "MINUTE",
        "queryGranularity": "MINUTE",
        "rollup": true,
        "intervals": ["2019-11-28/2019-11-29"]
      },
      "transformSpec": {
        "filter": null,
        "transforms": []
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat",
        "paths": "hdfs://path"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "mapreduce.job.classloader.system.classes": "java., javax.accessibility., javax.activation., javax.activity., javax.annotation., javax.annotation.processing., javax.crypto., javax.imageio., javax.jws., javax.lang.model., -javax.management.j2ee., javax.management., javax.naming., javax.net., javax.print., javax.rmi., javax.script., -javax.security.auth.message., javax.security.auth., javax.security.cert., javax.security.sasl., javax.sound., javax.sql., javax.swing., javax.tools., javax.transaction., -javax.xml.registry., -javax.xml.rpc., javax.xml., org.w3c.dom., org.xml.sax., org.apache.commons.logging., org.apache.log4j., -org.apache.hadoop.hbase., -org.apache.hadoop.hive., org.apache.hadoop., core-default.xml, hdfs-default.xml, mapred-default.xml, yarn-default.xml",
        "mapreduce.job.classloader": "true"
      },
      "maxRowsInMemory": 100000,
      "useCombiner": true,
      "indexSpec": {
        "bitmap": {
          "type": "concise"
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs"
      },
      "numBackgroundPersistThreads": 1
    }
  }
}
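To be concrete, the failing variant differs from the spec above only in the dimensionsSpec, which I change to:

"dimensionsSpec": {
  "dimensions": ["a", "b"]
}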

It should look something like this, where dimensions includes only the columns you want:

{
  "spec": {
    "dataSchema": {
      "dataSource": "hdfs_orc",
      "parser": {
        "type": "orc",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "time",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "serviceapprovaldate",
              "claimnumber",
              "claimtypecode",
              "paidamount"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        },

        "typeString": "struct<time:string,name:string>"

      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": {
          "type": "none"
        },
        "rollup": false,
        "intervals": [
          "2016-11-22T00:00:00.000Z/2016-12-20T00:00:00.000Z"
        ]
      },
      "transformSpec": {
        "filter": null,
        "transforms": []
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat",
        "paths": "/apps/hive/warehouse/gen3c1_staging.db/blue_fact_claimpatient_complex_imply"
      },
      "metadataUpdateSpec": null,
      "segmentOutputPath": null
    },
    "tuningConfig": {
      "type": "hadoop",
      "workingPath": null,
      "version": "2018-10-10T17:49:37.285Z",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000,
        "maxPartitionSize": 7500000,
        "assumeGrouped": false,
        "numShards": -1,
        "partitionDimensions": []
      },
      "shardSpecs": {},
      "indexSpec": {
        "bitmap": {
          "type": "concise"
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs"
      },
      "maxRowsInMemory": 75000,
      "leaveIntermediate": true,
      "cleanupOnFailure": true,
      "overwriteFiles": false,
      "ignoreInvalidRows": false,
      "jobProperties": {
        "mapreduce.job.user.classpath.first": "true",
        "hdp.version": "2.6.4.0-91",
        "mapreduce.job.classloader.system.classes": "-javax.validation.,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop."
      },
      "combineText": false,
      "useCombiner": false,
      "buildV9Directly": true,
      "numBackgroundPersistThreads": 0,
      "forceExtendableShardSpecs": false,
      "useExplicitVersion": false,
      "allowedHadoopPrefix": [],
      "logParseExceptions": false,
      "maxParseExceptions": 0
    },
    "uniqueId": "0019020ec9f04a4a93300cffc8be81fa"
  },
  "hadoopDependencyCoordinates": [
    "org.apache.hadoop:hadoop-client:2.7.3.2.6.4.0-91"
  ],
  "type": "index_hadoop"
}

Eric Graham

Solutions Engineer - Imply

cell: 303-589-4581

email: eric.graham@imply.io

www.imply.io

Hey Eric,

But the thing is, I don't have a header row in my ORC file. So I should pass a column list for Druid to know the names of all the columns, and then specify which of those should be used as dimensions.

Is that correct, or am I missing something?
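In other words, something like the snippet below, where columns lists every field in the file and dimensions picks only the subset I want (using the same placeholder column names as in my spec above):

"parseSpec": {
  "format": "orc",
  "timestampSpec": { "column": "timestamp", "format": "millis" },
  "dimensionsSpec": { "dimensions": ["a", "b"] }
},
"hasHeaderRow": false,
"columns": ["a", "b", "c", "d", "timestamp"]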

Finally figured it out. DRUID IS CASE SENSITIVE.
I don't know why, but it was treating all the columns as upper case. I think this has something to do with how the ORC file was created in the first place. I found the issue when I removed missingValue from the timestampSpec: it failed with an unidentified column for timestamp and showed me the list of columns, all of which were in uppercase.
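So the fix in my case was simply to list the names with the same case the ORC schema actually uses; assuming the file really stores the columns in upper case, that means something like:

"timestampSpec": { "column": "TIMESTAMP", "format": "millis" },
"dimensionsSpec": { "dimensions": ["A", "B"] },
"columns": ["A", "B", "C", "D", "TIMESTAMP"]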
This should have been mentioned somewhere in the docs. Anyway, thanks.