Weird HyperLogLog output

Hi!

When running a topN query that asks for a HyperLogLog (hyperUnique) metric, we sometimes get values like this:

1.7976931348623157E308

The weird thing is that we get that exact number for many different dimension values (and they should differ in normal cases). Is this an error in our environment? Or maybe a wrong output format? Can we specify the format in which we want the metric returned (integer, float, etc.)?
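For reference, the query is shaped roughly like the following Python sketch (the datasource, dimension, and column names are made up for illustration, and the broker host/port is just the default):

import json
import requests  # assumes the requests library is available

# Hypothetical topN query with a count metric and a hyperUnique (HyperLogLog) metric.
query = {
    "queryType": "topN",
    "dataSource": "events",          # illustrative datasource name
    "dimension": "site",             # illustrative dimension
    "metric": "count",
    "threshold": 10,
    "granularity": "all",
    "intervals": ["2017-01-01/2017-01-08"],
    "aggregations": [
        {"type": "count", "name": "count"},
        {"type": "hyperUnique", "name": "unique", "fieldName": "user_unique"}
    ]
}

# Default broker endpoint; adjust host and port for your cluster.
response = requests.post(
    "http://localhost:8082/druid/v2/?pretty",
    data=json.dumps(query),
    headers={"Content-Type": "application/json"}
)
print(json.dumps(response.json(), indent=2))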

Thanks,

This is Double.MAX_VALUE and probably means that the estimate was too large to correct via the algorithm we use. Is this reproducible? And do you think your “real” unique counts are larger than about 10^19?
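(For anyone wondering where that exact number comes from: it is simply the largest finite double value. Quick check, in Python for illustration:)

import sys
print(sys.float_info.max)  # 1.7976931348623157e+308, i.e. Java's Double.MAX_VALUE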

Wow very weird, since it’s impossible that the value is above 10^19.

The count metric is always above the HyperLogLog value, and count is way below that number. For example:

"count" : 10387841,
"unique" : 1.7976931348623157E308,

"count" : 10387837,
"unique" : 1.7976931348623157E308,

When it works OK the result is something like this:

"count" : 5860320,
"user_unique" : 5170144.904699262,

Yes, this is reproducible: every time I run this simple query with the count and HyperLogLog metrics, it returns those weird values.

Do you think this is a problem with HyperLogLog (in our environment)? If not, how can I track the problem down?

Thanks Gian!

It might be interesting to set "finalize": false in your query context to see the actual HLL objects returned by this query, rather than the cardinality estimates. That'd possibly provide some clues as to how they ended up the way they did.
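It's just a query-context flag; with the Python sketch from earlier in the thread, it would look something like this:

# Disable finalization so the query returns the serialized HLL objects
# rather than the finalized cardinality estimates.
query["context"] = {"finalize": False}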

Hi Gian,

I did that, and here is the output:

  • For the wrong numbers, the same HLL object was returned every time:

+zUCceNCa8lTe0RFUjQhEyQSRkSDNEIxQ0Q1JCJ0MmMmVCNDRhVDNUN0UYpEcTEjJUc1RBQ0QyNTWlYhMzREhTIyEzRENEEiIzRXEjZDJEFEgjIydCMzNkEhRDQqQmRVI1IzNDIzM0I5ZIFEVSUUJVYRUiMhU1JCUjNCWTI1MjQ0MiQkYyVEIyNVIjMxRURjFhQjMxMUN4cUMUJCZiklVDQjIBEzFANCdEMDMSQjRTKAQkQzRCMiJDJCZlVUYTEjYjIjY1IlZ1NVUwUhNFdUNUUiVSEUNyIyIyIzIxOkREMyJTVCViJjMSJCNkI0MiI0IiUxQzJkRFI0RoNDYyIlEyQ5IyMnSEdlEXU0QlQiQkJhRFX2VBQ0YiFCaDIwYzRVNyMkNkUzckQnMVQzNEc4RjNCUUJCRDIoUyM4MyM2RCQkRDOGgjIjciZUQkRWJEKDJUU2NWUzcyQjQmkmMiJHNEQkRTY0JDQyMmumMiQTNCUxNjUzkxMiJBVRVDI3UzYxNzJDI3MWM1QzJUZDQkJGFENFRDI0U0MzRHJEJDJVlDQkQ2VSMyVLJCYxJEJGUiQWJDI1N5NTYjdEE6NDNCFGMyEjRSMzMxZDMCQiMzdTAzM0MyU2JlN0Q8NWMyISFUZCMmdEIjMUIiRCRURTQmRDQzR0E1QTQjImNTQjVhIiREUjUlM3E1MjRjJUNHIzYiNWMzMVVSMTITEyFDJlEzVURTI0QiIodBY2FzNENDEiVSOLOTMVEjQzdREkJSUhNFQhIjRCRjMzYjM1YzMzSBREM0ODJRMlxSIVUiM6MUZDFCRFMiR5IyNTQ1EUUwMzMjKFIyYoE2IhMlMyMyV0R5JFNJRCOEGzRVFSZDQzMzITQ0FSVEUiN2QiBSZyMzRxNhJjQiYlJzVkZCNGJSVVIjYzQ0RBoiNjInMUNDYlVFMmIUFEZEhCYiUTMhNjJDNlJDItRSIyOSJzNEUlMzNDQjJDJUREMyIkRyIyQmoUZVNFJUJzQxQiQhIkZHJCQ5YzMyJFQiUjYkMjMxYnVUIyFUViNCM0MzASJIN0NBNCo3UVZWIyJlRUUkJXNUN2Q1NFQkRTQjZDEjgSU1ciFTFjRyVFdCYzV0QiM0NjIzQzVjIyImN3OFQ0IjQzESSTQlOBcyFDIUNGJjQiNkViVDWCEUc1RFV6JFNCYXQzNCQ0RCIyRyMzRlEzKBYpNnU1M2YyMjJCRGMVY0VTIgJENVJhEUJUQ1NEQ4IhNjJCkzazYUETU1IjdEQTMlKCFDtzJhMyV3R0EyMhNjZTIXcVYiRmNSUTAiVCVEQTUlSDhCMyUjMxQTNTNjMzA4IzVBRDI5MhMzUidkNFUkJUNjMUMkdRNDYXJCJZUkRFMzQ=

Then good results returned values like this:

AQAB5AAAAAAQAAAAAAABUzAQQAARARACAAIAACAAABAAECAAAAIiAAAAAAQAAAAAAAAAAAAAAQAAAAIAAAACAAAIADAAAAAAMAEgACAAAAAAAQAAMAMAAEAQAAIAAAABARABABAAAAIQAiAxAAAAEAADEAMgBAAAEAAiAAATEwABAAAAAQAAABAAARAAAAMAMQAQAAAAAABQAAAAMAQAEAEZMAICAADBAAAgEAMDEAABAAAQAAAQAAAEAAAAAgAAAAEQIAIAAQAQAAAAAQAQIAIAADAAUAAQABAAAAAAAAAAAAcQAAACAAAAAAAAEAAQAAIAAAABAwAAAwIQAAAAAAAAAAAAQAMAMAAAAAAwAwAAAAEAAQAAAQAgAAAAAAABQAAAAAAAMAESAAABQAAAAwASIBABABAQEAIAAAAAQAEAIAAAAAAAAAABEAABAAAAEAAQAAAAAAAAAAQBAgRQEQAAAAAwAAAhABADACEAAAADAQASECAQAAABAAYQEAAAAFAAABAAEAAAYABAAAAEAAAAAgEgACEAAAAQEQAAIAEAAgABAAEAUAAAEQAAAAEQEAAFAAAAAAAAAAEAABACAAAwAAAAEAACAAEAAgAAAAABABAAIwAQAAAAAAAAAAAAAQAgAAABIAAAAAAAABAQADARBQADAAAQABAAAQEAAAIBEQAzAFEwATEDACACEAEAAiEAAAAQABAAAAEAAIIAAAAAAAIBACABAAAAIABAAAAAAwAAAAAgAAAAUGAAAAAAQAAgABABAAAAAAMCAAAAAAIAAAABAEAAACAAAAEBAgAAAAAQAAAAAAAAFwAAExAAAGAAAAEAAAACADAhAQABAQEBAAACAAAAAgAAFAABAAACAAAAACAgAAEhEAADAAAAIAAAIAAAFjACAAAAAAAAAQAAAAIAACAQAQEBABABAAMQAAAAADAABwAAEAAQAEMAEAAAAAAAAAABADEAAAADAAEAEAQwABAQAAAAASAAABAAAQEAAAJAMBAAABAAEAAAIAAQAQACCQAwAQAAEQAQBDAAABAVAAEQMAAAYQAAAQAAAwAAAAMQASEAAAAAARAAAAABABEAABAAAAAAAmAAABAyIAAREAQAECAAEAAAAAAAABAAACAAEAMBMAAFABAAUBEAAAAwAEEAEAIBACAAADAAABAAAAAAAAAAIBIAAAABAwAAAwAAARAAAAIAABBAAABAEAAAAAAAAFEgAAAAAAEgQAAAAAAABAIBAVEAAAEAAAAAAAEAAAAhAAYAEwQAABAAAAA0ABAAIAAAAAAAABAAAAQAATARAAAwAQAAAgEAIQAAIAAAABAAAAAAAAAQUAIAEAIgAgAAASAAAEACEAAAIAAAAAA=

Does this help?

Thanks,

Sometimes Druid queries return the following error; do you think it could be related to these weird HyperLogLog values?

java.lang.IndexOutOfBoundsException
	at java.nio.Buffer.checkIndex(Buffer.java:540) ~[?:1.8.0_111]
	at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139) ~[?:1.8.0_111]
	at io.druid.query.aggregation.hyperloglog.HyperLogLogCollector.mergeAndStoreByteRegister(HyperLogLogCollector.java:681) ~[druid-processing-0.9.2.jar:0.9.2]
	at io.druid.query.aggregation.hyperloglog.HyperLogLogCollector.fold(HyperLogLogCollector.java:399) ~[druid-processing-0.9.2.jar:0.9.2]
	at io.druid.query.aggregation.hyperloglog.HyperUniquesAggregator.aggregate(HyperUniquesAggregator.java:48) ~[druid-processing-0.9.2.jar:0.9.2]
	at io.druid.query.timeseries.TimeseriesQueryEngine$1.apply(TimeseriesQueryEngine.java:73) ~[druid-processing-0.9.2.jar:0.9.2]
	at io.druid.query.timeseries.TimeseriesQueryEngine$1.apply(TimeseriesQueryEngine.java:57) ~[druid-processing-0.9.2.jar:0.9.2]
	at io.druid.query.QueryRunnerHelper$1.apply(QueryRunnerHelper.java:80) ~[druid-processing-0.9.2.jar:0.9.2]
	at io.druid.query.QueryRunnerHelper$1.apply(QueryRunnerHelper.java:75) ~[druid-processing-0.9.2.jar:0.9.2]
	at com.metamx.common.guava.MappingAccumulator.accumulate(MappingAccumulator.java:39) ~[java-util-0.27.10.jar:?]
	at com.metamx.common.guava.FilteringAccumulator.accumulate(FilteringAccumulator.java:40) ~[java-util-0.27.10.jar:?]
	at com.metamx.common.guava.MappingAccumulator.accumulate(MappingAccumulator.java:39) ~[java-util-0.27.10.jar:?]
	at com.metamx.common.guava.BaseSequence.accumulate(BaseSequence.java:67) ~[java-util-0.27.10.jar:?]

Thanks!

I loaded up your “bad” HLL and it does have some weird values in it. Did you ever deploy a pre-release version of 0.9.2 built from master, or one of the earlier RCs? Some of them had a bug that could cause corrupt HLLs on disk. If you have that kind of corruption, then even if you upgraded to the final 0.9.2 release, you should still go and reindex your data to fix the segments on disk.

Hi Gian!

Yes, we were using 0.9.1.1 before upgrading to 0.9.2, and we have data that was ingested with real-time nodes running 0.9.1.1, so as you said this could be the reason. We will re-index the data and let you know if we still see weird values.

Thanks for the help!

0.9.1.1 is fine – actually, no released versions of Druid had this bug. The only buggy versions would have been snapshots built from master at some points between 0.9.1.1 and 0.9.2, and the earlier 0.9.2 RCs. So if you only ever had 0.9.1.1 and 0.9.2 installed then I think you’re hitting something else.

It might be some on-disk data corruption caused by something else, like bad hardware?

That could be possible. It would also explain why the error is so intermittent: after a couple of executions of the same query, it succeeds. And it is really random, no pattern at all, not tied to segment intervals or anything like that.

I will run a hardware check on all servers and get back to you! Thanks for everything Gian :smiley:

Hi there Gian,

I'm going mad. I ran a hardware check on every instance and got nothing. All disks are OK (smartctl shows good values, between 90 and 100 on remaining lifetime, not even close to going bad), and dmesg doesn't show any disk errors at all.

Also, to check that all segments are OK, I ran segment-dump and also downloaded every segment from HDFS and compared CRC-32 values (the zip's CRC-32, the extracted file's CRC-32, and the cached copy's CRC-32). Everything matched, all of them had the same CRC-32, so the segments are fine. My guess is that the error is happening during real-time indexing. Should I try re-indexing all the data and see how it goes?
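In case it's useful to anyone else, the check I did is roughly the following (paths are illustrative; adapt them to your deep storage and segment-cache layout):

import os
import zipfile
import zlib

def crc32_of_file(path):
    # CRC-32 of a file on disk, computed in chunks.
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

def check_segment(zip_path, cache_dir):
    # Compare, for each entry in the segment zip from deep storage:
    #   1. the CRC-32 recorded in the zip header,
    #   2. the CRC-32 of the bytes extracted from the zip,
    #   3. the CRC-32 of the copy in the historical's segment cache.
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            stored = info.CRC
            extracted = zlib.crc32(zf.read(info.filename)) & 0xFFFFFFFF
            cached_path = os.path.join(cache_dir, info.filename)
            cached = crc32_of_file(cached_path) if os.path.exists(cached_path) else None
            if not (stored == extracted == cached):
                print("MISMATCH", zip_path, info.filename, stored, extracted, cached)

# Example with hypothetical paths:
# check_segment("/tmp/segments/index.zip", "/var/druid/segment-cache/events/.../0")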

Also, these errors are intermittent, same as the wrong HyperLogLog values, and this is bad because we put a lot of effort into switching our stack to Druid, but we can't push it to production because of these random failures (some errors attached).

I don't know what else to try. Do you have any suggestions?

It would be great to track down which segment produced the failure, but it seems the logs don't show that. I've already started looking at the source code to see if I can find anything…

Thanks,

random_failures.log (13.2 KB)

Hi, can anyone help us with these problems? Any kind of hint would be much appreciated!

Thanks,

Hi Federico,
Can you confirm whether you see the issue only on realtime nodes, or on historicals as well?

If on historical nodes, then there might be some issue with the segments, and re-indexing might help.

If only on realtime nodes, I wonder if there is some issue with realtime segments that might have caused this.

Hi Nishant, how are you?

I can confirm that the error is on historical servers, but I can’t say if it’s happening on real-time as well.

I will try mass re-indexing; it's the only thing left to check. The error is still happening.

Thanks for your reply!

Hi Federico,

I started having this issue too. It began after moving to EC2 i3 instances: segments loaded from S3, the cluster stabilized, and now, as you say, we're getting these exact same errors on Druid 0.9.2… It is absolutely non-deterministic, which makes it very hard to resolve.

Have you made any progress?

Another interesting thing is that when this happens, the Coordinator somehow loses track of this historical node… and sees only the other ones…

If I restart the historical node, it announces itself to the Coordinator again and everything works again…

Hi Jakub,

We didn't find the exact cause. We don't even know if it's gone for good or not. But we made some changes that improved things, and today it's really, really hard to reproduce that weird output.

The first thing I would recommend is to re-index your data (and as often as possible). For example, if you ingest data via real-time nodes, have at least a nightly job that re-indexes the data. We are even considering re-indexing every hour to catch possible real-time ingestion errors faster, but right now the nightly batch job works well.
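As a sketch, the nightly job just submits an indexing task spec (for example, one whose firehose re-reads the previous day's segments from the same datasource) to the overlord's task endpoint; the file name, host, and port below are illustrative:

import json
import requests

# Hypothetical task spec file prepared elsewhere (e.g. a re-indexing task
# covering yesterday's interval).
with open("reindex-task.json") as f:
    task_spec = json.load(f)

# Default overlord endpoint; adjust host and port for your cluster.
response = requests.post(
    "http://localhost:8090/druid/indexer/v1/task",
    data=json.dumps(task_spec),
    headers={"Content-Type": "application/json"}
)
print(response.status_code, response.text)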

The other thing, and I believe the most important: on the Druid coordinator console (default port 8081), check the size of your segments. Druid's documentation recommends a segment size between 400 and 700 MB. We sometimes had segments of 2 GB, so we changed the indexing granularity and our segments are now within that range.
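If you prefer to check this programmatically rather than through the console, something along these lines works against the coordinator API (the datasource name and host are illustrative):

import requests

# List full segment metadata for one datasource and flag segments outside
# the 400-700 MB range mentioned above.
url = "http://localhost:8081/druid/coordinator/v1/datasources/events/segments?full"
for segment in requests.get(url).json():
    size_mb = segment["size"] / (1024.0 * 1024.0)
    if size_mb < 400 or size_mb > 700:
        print(segment["identifier"], "%.0f MB" % size_mb)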

After these changes, the errors happen much less often (though they are not completely gone).

Hope that helps, and if you have any other questions I'd be more than happy to help.

Regards,

Hi,

Thank you for the suggestions. We have small 30-60 MB segments and we don't use real-time indexing, so re-indexing shouldn't be necessary at all :-/

What instances are you using? Aren't they i3 by any chance, with those NVMe SSDs?

Don't you get any logs on the historical server when it drops out of the cluster?

No, we don't use AWS; we use physical dedicated servers with Samsung SSDs.

Also, I shared a script that checks segment integrity between the segments in deep storage and the segments present on the historicals. I wrote it against HDFS as deep storage, so in your case you would need to adapt it to work with S3, but it could be useful to double-check that the segments are not corrupted.

I wonder if this is a similar issue to https://github.com/druid-io/druid/issues/4199. Are you all ever seeing exceptions too, or just weird results?