HyperUnique problem

Hey guys,

We have a problem with HyperUnique aggregator.

Our use case is the following:

A dataSource named visits with all dimensions and metrics, and dataSources named visits_ with all metrics but only one dimension each.

We create the first dataSource from s3 files and the second from the first.

So we expected the number of uniqueVisitors to stay the same for a given period, but it seems like it doesn’t.
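For context, counting unique visitors in Druid is typically done with a hyperUnique metric in the ingestion spec’s metricsSpec. A minimal sketch (the field name visitorId is an assumption, not from this thread):

```json
"metricsSpec": [
  { "type": "hyperUnique", "name": "uniqueVisitors", "fieldName": "visitorId" }
]
```

Both the detailed dataSource and the single-dimension one would carry this metric, and the expectation above is that querying it over the same period returns the same estimate.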

Any suggestions or explanations?

Thanks,

Ben

Hi Benjamin,

What is the difference in uniqueVisitors you are seeing?

Also, is it possible for you to write a unit test or share a sample dataset with ingestion spec files that can reproduce what you are seeing?

Cheers,

Nishant

Hey Nishant,

I can’t share anything, sorry. But we are in touch with Fangjin to find a solution. It seems the problem is that HyperLogLog doesn’t fit our use case.

See the attached schema for an explanation of the problem. I’m investigating HLL with other technologies (like Spark) to confirm that HLL doesn’t answer our problem regardless of the implementation.

Thanks,

Ben

OK, just curious: do the numbers not match only for Case 2, where the smaller DS is created from a detailed Druid DS, or for both Cases 1 & 2?

They don’t match in either case. And this is our problem: consistency is really important for customers.

Hey Ben,

The differences could be due to the fact that we’re using a variant of HyperLogLog with a space optimization that can cause clipping. In our variant, each register cannot actually hold the full range of possible values – they’re storing deltas from a global “base”. So errors can be sensitive to the order in which data points are added (what gets clipped depends on when the base offset gets incremented, which depends on the order in which things are added).

This blog post has details about the variant of HLL in Druid: http://druid.io/blog/2014/02/18/hyperloglog-optimizations-for-real-world-systems.html
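The clipping effect Gian describes can be illustrated with a toy sketch. This is not Druid’s actual HLLCV1 code – the register width, bucket count, and base-raising rule here are simplified assumptions made purely to show how small register values can be lost when the base rises:

```java
// Toy illustration of delta-from-base registers (NOT Druid's real HLLCV1).
// Each register stores a 4-bit delta from a shared base; when a new value
// exceeds base + 15, the base is raised and existing small deltas are
// clipped toward zero, losing information.
public class DeltaRegisterSketch {
  static final int MAX_DELTA = 15; // 4-bit register can hold 0..15

  int base = 0;
  int[] registers = new int[16];

  // Record a rank-style value (e.g. leading-zero count) in a bucket.
  void offer(int bucket, int value) {
    if (value - base > MAX_DELTA) {
      int raise = value - base - MAX_DELTA;
      base += raise;
      // Clipping: deltas smaller than the raise are floored to zero,
      // so those buckets now read back as `base`, not their true value.
      for (int i = 0; i < registers.length; i++) {
        registers[i] = Math.max(0, registers[i] - raise);
      }
    }
    int delta = Math.max(0, value - base);
    registers[bucket] = Math.max(registers[bucket], delta);
  }

  // Reconstruct the stored value for a bucket.
  int read(int bucket) {
    return base + registers[bucket];
  }
}
```

Because which values survive depends on when the base is raised, two streams containing the same data points in different orders (or merged from differently-built collectors) can end up with different register contents, which matches the order-sensitivity described above.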

Thanks Gian,

Yeah, it seems like HLLCV1 is causing the bug.

It would be great to use HLLCV0 without the optimization, but even the first version of the HyperLogLogCollector file on GitHub already uses HLLCV1.

Ben

Hello, I am working with Ben on the HLLCV1 problem explained above.

I wanted to try the HLLCV0 implementation, so I modified the HyperLogLogCollector class.

But many tests no longer pass. I can’t find any older version of Druid with examples of HLLCV0 usage.

Can you point me to an older version of Druid, or am I doing something wrong?

Thank you

Julien

PS: Here is the modified HyperLogLogCollector code

// Methods to build the latest HLLC
public static HyperLogLogCollector makeLatestCollector()
{
  //return new HLLCV1();
  return new HLLCV0();
}

public static HyperLogLogCollector makeCollector(ByteBuffer buffer)
{
  //int remaining = buffer.remaining();
  //return (remaining % 3 == 0 || remaining == 1027) ? new HLLCV0(buffer) : new HLLCV1(buffer);
  return new HLLCV0(buffer);
}

Believe the unit tests unless there is really strong evidence not to.

Just as a warning: without really, really well documented and tested changes, it will probably be hard to get a change to HLL accepted into Druid. It is always possible, however, to have an extension aggregator (factory) which consumes HLL complex objects and does whatever you want with them. If it turns out to be better than the stock HLL aggregator (for certain measures of better: faster, or more accurate with a space tradeoff, for example), then it would make a great addition to Druid. But not all users are concerned with 100% consistency between two runs of the HLL algorithm; some are more concerned about speed and memory pressure (e.g. us at MMX).

You might find a better audience on the developer’s mailing list for such changes.

OK, thank you Charles. I created a post in the dev group: https://groups.google.com/forum/#!topic/druid-development/9U5u-Sc_oB4

Julien