help with Hadoop indexer: cardinality issue

Hi,

From what I can tell, the indexing job is reducing rows when it should not be.

I have distilled it down to the scenario explained below.

Basically, I indexed 5000 rows whose dimension values are unique (there are 5000 distinct combinations).

I indexed these with a queryGranularity of "none".

I was expecting the Druid index to have 5000 rows, but it ends up with 3815.

Any guidance/help on this? I'm happy to provide data/files to reproduce; the dataset is very small. I am using druid-0.8.0.

regards,

Harish Butani

I have created a dataset containing the following columns:

"o_orderkey",
"o_custkey",
"o_orderdate",
"l_partkey",
"l_suppkey",
"ps_partkey",
"ps_suppkey"

The input dataset has 5000 rows, and I have validated that the rows are distinct.

(BTW, this is the TPC-H dataset; I formed a dataset of keys, and the combination of keys is unique.)
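Each input row looks something like this (the values here are made up for illustration; the real rows come from the TPC-H tables, and o_orderdate is the ISO date string that the timestampSpec below parses):

{"o_orderkey": "1", "o_custkey": "370", "o_orderdate": "1996-01-02", "l_partkey": "1552", "l_suppkey": "93", "ps_partkey": "1552", "ps_suppkey": "93"}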

  1. My indexing spec has the following parts:
  • note that queryGranularity is "none"

"dataSchema": {
  "dataSource": "tpchSmall",
  "parser": {
    "type": "string",
    "parseSpec": {
      "format": "json",
      "timestampSpec": {
        "column": "o_orderdate",
        "format": "iso"
      },
      "columns": [
        "o_orderkey",
        "o_custkey",
        "o_orderdate",
        "l_partkey",
        "l_suppkey",
        "ps_partkey",
        "ps_suppkey"
      ],
      "delimiter": "|",
      "dimensionsSpec": {
        "dimensions": [
          "o_orderkey",
          "o_custkey",
          "l_partkey",
          "l_suppkey",
          "ps_partkey",
          "ps_suppkey"
        ],
        "dimensionExclusions": [],
        "spatialDimensions": []
      }
    }
  },
  "metricsSpec": [
    {
      "type": "count",
      "name": "count"
    }
  ],
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "YEAR",
    "queryGranularity": "NONE",
    "intervals": [
      "1993-01-01/1997-12-31"
    ]
  }
}

  2. When I run the indexing job, here is the output from the second job:

Map-Reduce Framework
    Map input records=5000
    Map output records=3815
    Map output bytes=845500
    Map output materialized bytes=857005
    Input split bytes=226
    Combine input records=0
    Combine output records=0

  3. And if I run the following query:

{
  "queryType": "groupBy",
  "dataSource": "tpchSmall",
  "dimensions": [],
  "granularity": "all",
  "aggregations": [ {
    "jsonClass": "FunctionAggregationSpec",
    "type": "count",
    "name": "alias-1",
    "fieldName": "count"
  } ],
  "intervals": [ "1992-12-31T16:00:00.000-08:00/1997-12-31T16:00:00.000-08:00" ]
}

The result is: 3815

Hi, have you had a chance to read:
http://druid.io/docs/latest/ingestion/faq.html

Specifically, "Not all of my events were ingested".

You are using the wrong query aggregator.
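For example, something along these lines, replacing the query-time count with a longSum over the ingested "count" metric (a sketch based on your spec above; adjust names as needed):

{
  "queryType": "groupBy",
  "dataSource": "tpchSmall",
  "dimensions": [],
  "granularity": "all",
  "aggregations": [ {
    "type": "longSum",
    "name": "alias-1",
    "fieldName": "count"
  } ],
  "intervals": [ "1992-12-31T16:00:00.000-08:00/1997-12-31T16:00:00.000-08:00" ]
}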

Hi,

Thank you for pointing me to the link. My issue was an incorrect interval specification: it needs to be 1992-1998; I had cut and pasted from an earlier ingestion spec that was for a shorter time span.
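For anyone hitting the same thing, the fixed granularitySpec looks something like this (assuming the data spans 1992 through 1997):

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "YEAR",
  "queryGranularity": "NONE",
  "intervals": [
    "1992-01-01/1998-01-01"
  ]
}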

I wondered about using longSum vs. count when queryGranularity was "none"; I had tried both, and it didn't seem to matter. But now I know the issue was the wrong interval spec.

regards,

Harish.