Compute average values of nullable metrics

Hi,
I have a data schema containing a metric “response_time” of type “doubleSum”. The problem is response_time in JSON input could be null, which is indicated by a value of -1. I’d like to compute the average response time, and obviously, null values should be excluded.

Here is an approach I’ve come up with:

Before feeding the data into Druid, preprocess the JSON and add a dimension “response_time_valid”, which is set to 1 for rows where response_time >= 0. And then we define some filtered aggregator or custom JavaScript aggregator to compute the sum of valid rows. The averages can be obtained using post-aggregation like sumOfValidRows / countOfValidRows.

Are there any convenient ways to achieve this?

Could you just use a pair of standard filtered aggregators, with the filter on response_time? i.e. filter not response_time = -1

I have read the document and it seems filtered aggregators only work on dimensions rather than metrics?

I guess if you have a unique dimension, you could use a groupBy subquery on that dimension and use HAVING clause to filter out the -1 response_times - but it’s probably much faster to include your ‘valid’ dimension at ingestion time.

It seems like javascript aggregator is the only way

Wonderful question. I’ve been musing about how to do this for a while too and would like to hear more on this subject.
On our end, we also need to have two metrics for each average value that we want to compute, one metric for the numerator, one for the denominator. But of course, this doubles the data volume and the segment scan times, so I’d love to have a good way to do this differently.

Whenever we can, we try to get rid of the extra metric for the denominator. If a value is always set, one can divide by the number or rows ingested which is the sum over the implicit count metric. In this case it is important to use the right expression. In Pivot for example there is an average() expression which wrongly divides by the cuboid-count (the count over the implicit count metric instead of the sum over the implicit count metric) which yields the number of records inside Druid after aggregation, but the true number needed is the number of records ingested.

If a value is not always present, we try to find a closed-form rule for when it is present based on dimensions that we have anyway.

I was asking myself if one would use a multi-valued dimension to keep track of when a metric contains a value and do this across metrics. This saves metrics but might hurt the rollup.

Since Druid needs to aggregate over the numerator metric anyways, I don’t understand why it cannot count the number of times it has done that and have this as an implicit metric. Would love to hear what the best way to have average is. We have so many of them and it hurts performance to have two metrics each time.

I think in current releases you could use the javascript aggregator to compute a filtered “sumValid” and “countValid” at ingestion time. And in the future, Druid 0.9.2 will also allow regular filtered aggregators on both strings and numbers at both query and ingestion time.

Hi Gian, As 0.9.2 is released now - Do you know if this feature is available ?

I have metric data coming in continuously and some of the metrics can come as NULL and i want to filter them(all those events/values) at the time of ingestion so that it calculates the correct value of metrics for AVERAGE and other metrics.

Regards,

Arpan Khagram

+91 8308993200

Hi Arpan,

Support for filtering on long-typed columns was added in 0.9.2, the upcoming 0.10.0 release will add support for filtering on floats as well.

Thanks,

Jon