Error while loading Parquet files from a remote HDFS cluster using the Druid console

Hi,

We are trying to test ingestion of Parquet files stored in a remote HDFS cluster into Druid using the Apache Parquet extension. Druid 0.16.0 is running on a 3-node cluster. Connectivity between the Druid cluster and the HDFS master and data nodes is configured, and the necessary ports are allowed in the inbound traffic configuration:

- HDFS master node port: 8020

- HDFS data node port: 50010

Steps followed to load data:

  1. Launched the Druid console (http://<ROUTER_IP>:8888)

  2. From the web UI, selected Load data -> HDFS -> Submit task

  3. Specified the ingestion spec and submitted it

When I do this, I see a strange warning message in the coordinator-overlord and middleManager logs:

```
2019-11-28T13:44:58,249 WARN [TaskMonitorCache-0] org.apache.druid.segment.indexing.DataSchema - No metricsSpec has been specified. Are you sure this is what you want?
```

But my ingestion spec has a proper metricsSpec defined. Why does this warning appear?

Ingestion spec:

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "/user/hive/druid_test1/"
      }
    },
    "dataSchema": {
      "dataSource": "druid_test",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "_time",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "account_id",
              "division",
              "department",
              "company",
              "co",
              "url",
              "url_host",
              "web_category",
              "transaction_result_code",
              "avc_app_type",
              "avc_app_behavior",
              "acl_action",
              "response_type",
              "server_contact_mode",
              "http_response_code",
              "event_date",
              "hour",
              "index",
              "source_type",
              "category",
              "analysis_period",
              "display_name",
              "description",
              "year",
              "month",
              "day",
              "week_day",
              "week_day_offset",
              "start_time",
              "end_time",
              "event_count",
              "web_cat"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          },
          "metricsSpec": [
            {
              "type": "count",
              "name": "event_count"
            },
            {
              "type": "doubleSum",
              "name": "download_bytes"
            },
            {
              "type": "doubleSum",
              "name": "upload_bytes"
            }
          ],
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "DAY",
            "queryGranularity": "HOUR",
            "intervals": "uniform",
            "rollup": true
          }
        }
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "mapreduce.job.classloader": "true"
      },
      "partitionsSpec": {
        "type": "single_dim",
        "targetPartitionSize": 500000,
        "partitionDimension": "url"
      }
    }
  }
}
```

Attaching the logs as well:

coordinator-overlord.log (188 KB)

middleManager.log (20.5 KB)

Hi Manu,

The metricsSpec is part of dataSchema, but you have placed it inside the parser. Move the metricsSpec out of the parser to the dataSchema level (the granularitySpec belongs there as well), and that should solve your problem; see the sketch below.
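For reference, a minimal sketch of the corrected dataSchema, with the dimension list truncated. The intervals value here is a placeholder (the "uniform" string in your spec is not a valid interval list; it should be a list of ISO-8601 intervals), and the fieldName values added to the doubleSum aggregators assume the Parquet columns share the metric names:

```json
"dataSchema": {
  "dataSource": "druid_test",
  "parser": {
    "type": "parquet",
    "parseSpec": {
      "format": "timeAndDims",
      "timestampSpec": {
        "column": "_time",
        "format": "auto"
      },
      "dimensionsSpec": {
        "dimensions": ["account_id", "division", "department", "..."],
        "dimensionExclusions": [],
        "spatialDimensions": []
      }
    }
  },
  "metricsSpec": [
    { "type": "count", "name": "event_count" },
    { "type": "doubleSum", "name": "download_bytes", "fieldName": "download_bytes" },
    { "type": "doubleSum", "name": "upload_bytes", "fieldName": "upload_bytes" }
  ],
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "HOUR",
    "intervals": ["2019-11-01/2019-12-01"],
    "rollup": true
  }
}
```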

Awesome. Made the change and it works like a charm. Thanks again, Shubham!!