Guava library exception in indexer with cardinality aggregator

Hi,
I am trying to run an index_hadoop job with a “cardinality” aggregator in the “metricsSpec”.
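For reference, the relevant part of my spec looks roughly like this (metric and field names are made up for this post):

"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "cardinality", "name": "distinct_users", "fieldNames": ["user_id"] }
]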

We are using CDH 5.1.0 with Hadoop 2.3.0.

We get the following exception in the “index-generator” job, which one of the peons fires on the YARN cluster.

The job classpath has the guava-16 library. I suspect this is coming from the guava-11 library that Hadoop/YARN ships with.

Anyone else having this issue? Thanks in advance.

Vinay, you should use the “hyperUnique” aggregator in your ingestion spec in order to query for the cardinalities of dimensions. The cardinality aggregator is not really designed to be used at ingestion time, and we should document that better. I don’t know whether that will actually bypass the class of error you are seeing, though, since it is a Guava dependency problem. Others have been successful with recompiling Druid against a downgraded version of Guava. I would try the “hyperUnique” aggregator first.
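As a rough sketch, with hypothetical metric and column names, the ingestion-time spec would look something like:

"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "hyperUnique", "name": "unique_users", "fieldName": "user_id" }
]

At query time you would then reference the “unique_users” metric with a hyperUnique aggregator instead of running cardinality over the raw dimension.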

Thanks, Fangjin.
Yes, hyperUnique seems to work (at least it got past the indexing part), although, as you said, I am not sure whether it will fail at some other step.

Anyway, I will also try recompiling with Guava 11.0.1.

Thanks again!

-Vinay

Sorry to revive an older thread, but I just hit the same issue when trying to do some Hadoop ingestion.
Unfortunately I don’t think I can use the hyperUnique aggregator, since I want to run the cardinality aggregator over multiple fieldNames.

In any case, I was wondering whether specifying the cardinality aggregator at ingestion time (either batch or realtime) actually improves performance when querying with that same cardinality aggregator.

Right now I specify the same cardinality aggregator in both the ingestion spec and the query spec, so will it use the pre-calculated values, or will it aggregate on the fly at query time?
In other words, is there any use in specifying the cardinality aggregator at ingestion time?

Greetings,

Maarten

Hi Maarten, see inline.

Sorry to revive an older thread, but I just hit the same issue when trying to do some Hadoop ingestion.
Unfortunately I don’t think I can use the hyperUnique aggregator, since I want to run the cardinality aggregator over multiple fieldNames.

In any case, I was wondering whether specifying the cardinality aggregator at ingestion time (either batch or realtime) actually improves performance when querying with that same cardinality aggregator.

The best query performance is achieved by using the hyperUnique aggregator at both ingestion time and query time. The cardinality aggregator at query time will be slower.
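For example, assuming a “unique_users” hyperUnique metric was built at ingestion time, the query-time aggregator would simply be:

{ "type": "hyperUnique", "name": "unique_users", "fieldName": "unique_users" }

Note that “fieldName” here points at the pre-aggregated metric, not at a raw dimension.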

Right now I specify the same cardinality aggregator in both the ingestion spec and the query spec, so will it use the pre-calculated values, or will it aggregate on the fly at query time?
In other words, is there any use in specifying the cardinality aggregator at ingestion time?

The cardinality aggregator works by scanning a set of strings and computing the result on the fly, and should only be used at query time. The hyperUnique aggregator can be used at ingestion time to build HyperLogLog objects, where the individual strings are discarded and the cardinality information is stored in a specialized data structure.
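For your multi-dimension case, a query-time cardinality aggregator along these lines (dimension names are hypothetical) is the intended usage:

{
  "type": "cardinality",
  "name": "distinct_combinations",
  "fieldNames": ["country", "device"],
  "byRow": true
}

With “byRow” set to true it counts distinct combinations of the listed dimensions; with false it counts distinct values across them.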