Druid 0.9 : Arrays parsed as [Ljava.lang.Comparable;

Hi all,

We have been facing a bug since we upgraded Druid to version 0.9 (0.9.1, actually).

One of our columns is an array type; it is now parsed partly in the right way (the values are exploded), but we also get some [Ljava.lang.Comparable;… values.

This only happens in Hadoop indexing mode; realtime ingestion through Tranquility is fine.

We checked our log files and they are correct.

Do you think it's a misconfiguration that has an impact in 0.9.x, or a bug in Druid?

Mehdi.

Hey Mehdi,

What format is your raw data in (tsv/csv/json/etc)? Could you attach your indexing spec and perhaps a sample row of your data?

Hi Gian, sorry for the delay, I missed your answer.

Our data is in JSON format, and it looks like this:

{"timestamp":1467941894270, "id1":6, "id2":1136, "id3":2399, "categories":[9,8], "valid":1}


And the problem occurs on the column categories.

The spec for the Hadoop indexing job:

{
  "type": "index_hadoop",
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.6.0", "org.apache.hadoop:hadoop-aws:2.6.0"],
  "spec": {
    "dataSchema": {
      "dataSource": "reporting",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "timestamp"
          },
          "dimensionsSpec": {
            "dimensions": [
              "id1",
              "id2",
              "id3",
              "categories"
            ],
            "dimensionExclusions": [
              "valid"
            ],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [{
        "type": "longSum",
        "name": "valid",
        "fieldName": "valid"
      }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "hour",
        "queryGranularity": "hour",
        "intervals": ["INTERVALS"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "S3FILES"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "useCombiner": "true",
      "indexSpec": {
        "bitmap": {
          "type": "roaring"
        }
      }
    }
  }
}

I tried parsing the above dimensionsSpec and it worked fine for me.
Could you share the complete stack trace of the error you got?

It might just be that one or more of your rows are not parseable, which might be causing this.

Have you tried setting ignoreInvalidRows in the tuningConfig?
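
For reference, a minimal sketch of where that flag would sit, assuming the rest of your tuningConfig stays as it is in your spec above:

"tuningConfig": {
  "type": "hadoop",
  "ignoreInvalidRows": true,
  "useCombiner": "true",
  "indexSpec": {
    "bitmap": {
      "type": "roaring"
    }
  }
}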

I have the same issue.
The Hadoop indexer job doesn't seem to ingest multi-valued dimension data properly: some of the values come out as Comparable@XXXXX string values.
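
For what it's worth, that pattern looks like what Java prints when an object array is stringified via toString() instead of being expanded into its elements. A minimal standalone illustration (plain Java, not Druid code):

import java.util.Arrays;

public class ArrayToStringDemo {
  public static void main(String[] args) {
    // A multi-valued dimension held as an array of Comparables, e.g. categories [9, 8]
    Comparable[] categories = new Comparable[]{9, 8};

    // toString() on the array itself yields "[Ljava.lang.Comparable;@<hashcode>",
    // which matches the trashy values showing up in the segments.
    System.out.println(categories.toString());

    // Expanding the elements gives the expected rendering: [9, 8]
    System.out.println(Arrays.toString(categories));
  }
}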

When ingesting parquet data, it seems that the issue can be overcome by putting the following into the jobProperties section of the ingestion spec:

"jobProperties": {
   "parquet.avro.add-list-element-records" : "false"
}
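
For context, the jobProperties map sits inside the tuningConfig of the Hadoop indexing spec, so (assuming the rest of the tuningConfig is left unchanged) it would look roughly like:

"tuningConfig": {
  "type": "hadoop",
  "jobProperties": {
    "parquet.avro.add-list-element-records": "false"
  }
}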

However, I'm ingesting gzipped json data at the moment and always observe this strange behaviour. I have a multivalued dimension that only has perhaps 100 distinct combinations, but the Hadoop Indexer tells me that the multi-valued dim has a cardinality of 3.8 million.

I'll try out ignoreInvalidRows and report back if this changes anything.

I have already sanitized the input for Druid such that a multi-valued dimension field is never null, because Druid fails to ingest the data otherwise. I also thought that the ingestion process might not be dealing with empty arrays properly, so I made sure to always have at least one entry in the multi-valued dim, but I'm still seeing the Comparable entries.

I've now tried out setting ignoreInvalidRows=true, and although it reduces the cardinality of the multi-valued dim from 3.8 million to 120k, the data volume is still too large, as is the cardinality. I also still see the Comparable@XXXXX data.

I tried upgrading to the latest Druid release, but try as I may I couldn't get Druid to work with EMR so far. Classpath hell. I tried everything on the Druid recommendations page, but so far no luck.

update: I ingested the same data with the Indexer Task instead of the Hadoop Indexer Task and am getting valid results.

for the gzipped json input, I’ve been using a combiner. Next, I’ll try whether disabling the combiner will work

it’s the combiner.

I tested with gzipped json input and with parquet input and in both cases, disabling the combiner removed the trashy records.

By disabling the combiner, I mean setting the following in the ingestion spec:

"tuningConfig": {
  "useCombiner": false,
  ...
}

The combiner is engaged in class IndexGeneratorJob:

if (config.getSchema().getTuningConfig().getUseCombiner()) {
  job.setCombinerClass(IndexGeneratorCombiner.class);
  job.setCombinerKeyGroupingComparatorClass(BytesWritable.Comparator.class);
}

Perhaps there's something wrong in BytesWritable.Comparator, which is a Hadoop class?

@Mehdi: I see that your ingestion spec has the combiner enabled. If you like, perhaps you could try with the combiner disabled and report back whether this removes the trashy records on your end as well.


so, it's the combination of using a combiner and multi-valued dimensions that leads to occurrences of these [LComparable… entries.

we have one dimension with a cardinality of 100000 and the amount of trashy records generated did not seem to have any noticeable effect on the resulting data volume.

we introduced another dimension with cardinality 100 and ingesting this blew up the data volume by 4x.

I don't understand yet in which cases the combiner fails, because there are always records that are not trashy too, and with lower cardinality the combiner seems to generate more trash.

I filed the following issue https://github.com/druid-io/druid/issues/4547

short update:

the bug occurs with Druid 0.9.1 and one has to disable the combiner to get rid of it.
After upgrading to Druid 0.9.2 + hadoop 2.7.2 the bug is gone even when the combiner is enabled.