Handling dynamically generated metrics

Hi,

We have a use case that I’m struggling to figure out how to handle. We want to allow our customers to define custom metrics using a rule engine, then display those metrics in various aggregated forms on a dashboard. We have the rule engine, and our data pipeline can generate appropriate metric values for our events - that is not the problem. The problem is how to ingest and query for these metrics when we don’t know what they are ahead of time and they can change over time.

So for example, Customer A (a fast food franchise) might set up rules that generate metrics labeled “French Fry Upsell %” and “Milkshake Upsell %”. Customer B (a movie theater franchise) wants custom metrics labeled “Popcorn Upsell %” and “Candy Upsell %”.

The basic use case is to display these metrics aggregated by average over a date range, maybe with various dimensional filters and so forth. So Customer A’s dashboard shows “French Fry Upsell - 50%”, “Milkshake Upsell - 22%”. Customer B’s dashboard shows “Popcorn Upsell - 34%”, “Candy Upsell - 76%”.

A more advanced use case is to group by some dimension (e.g., store or city) to display top and bottom performers. For that use case, the output might show “French Fry Upsell (by Store) - Avg 50%, Best (Store #43) 79%, Worst (Store #22) 1%”.

I’d understand pretty well how to do this if the set of metrics were known ahead of time. But how do we handle it when customers can add or remove metrics at any time? For example, after a month of generating data based on the rules above, say Customer A drops the “French Fry Upsell” metric and adds a new “Kids Meal Girl Toy %” metric.

Here are two approaches I’ve thought of that might work but seem pretty shaky.

First approach: ingest each custom metric using a custom (dynamically generated) ingestion spec and a metric name based on the customer-provided label. Following our example, we might ingest Customer A’s events with aggregators whose name and fieldName are both “customFrench_Fry_Upsell” and “customMilkshake_Upsell”, while Customer B would have metrics with name/fieldName “customPopcorn_Upsell” and “customCandy_Upsell”. We would have to generate an ingestion spec that included all four of these aggregators, which would be a bit annoying but doable. Not all events would contain all the targeted fields, but I think that’s okay (I know it’s okay with dimensions, less sure about metrics - can they be null?). On the ingestion side, at least, this scheme handles changes to the desired metrics pretty well. When Customer A deletes “French Fry” and adds “Kids Meal”, the ingested metrics shift accordingly - we just stop ingesting “customFrench_Fry_Upsell” and start ingesting “customKids_Meal_Girl_Toy”.
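
For what it’s worth, here’s a rough sketch of how I picture generating the metricsSpec for this first approach (Python just for readability; the slugify helper and the exact label-to-field-name convention are made up for illustration):

    # Hypothetical sketch of generating the metricsSpec for the first approach.
    # slugify() stands in for whatever label-to-field-name normalization we
    # would actually use; the labels are the ones from the example above.
    def slugify(label):
        return "custom" + label.replace(" ", "_")

    def build_metrics_spec(active_labels):
        spec = [{"type": "count", "name": "count"}]
        for label in active_labels:
            field = slugify(label)  # e.g. "customFrench_Fry_Upsell"
            spec.append({"type": "doubleSum", "name": field, "fieldName": field})
        return spec

    metrics_spec = build_metrics_spec([
        "French Fry Upsell", "Milkshake Upsell",  # Customer A
        "Popcorn Upsell", "Candy Upsell",         # Customer B
    ])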

The difficulty is in how to query against this ingested data. We will filter by customer ID, of course. And TopN queries against each metric are just what we need to fill our chart. But how can we know that for Customer A we need TopNs against “customFrench_Fry_Upsell” and “customMilkshake_Upsell”, but different metrics for Customer B? We could track them externally in our settings database, but in addition to opening the door to data-inconsistency bugs, that gets really messy when you consider the set of metrics changing over time. If we query a time range that crosses the point where “French Fry” was dropped and “Kids Toy” was added, we have to know to query for both. I thought of including the names of the metrics that are in play as a multi-valued dimension. Then we could do an initial TopN on the target time range with the target filters just to pull all those names back, and use them to generate subsequent queries. I think that would work, but… it seems pretty nasty. Also, in this approach our ingestion schema could potentially have dozens (eventually maybe over a hundred at scale) of custom metrics, but each actual event would only be providing data for five or ten of them. Is that a problem? Maybe not, but it sure seems inefficient.
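
To make that “discover the names first, then query them” idea concrete, the two steps might look roughly like this (the “activeCustomMetrics” and “storeId” dimension names, the datasource, the interval, and the assumption that each metric value is a 0/1 flag so sum/count gives the percentage are all just illustrative):

    # Hypothetical sketch: discover which custom metric fields are in play,
    # then issue one TopN per discovered field.
    def discovery_query(interval, customer_id):
        return {
            "queryType": "topN",
            "dataSource": "events",
            "dimension": "activeCustomMetrics",
            "metric": "count",
            "threshold": 100,
            "granularity": "all",
            "intervals": [interval],
            "filter": {"type": "selector", "dimension": "customerId", "value": customer_id},
            "aggregations": [{"type": "count", "name": "count"}],
        }

    def per_store_query(field, interval, customer_id):
        # One follow-up TopN per discovered metric field (e.g.
        # "customFrench_Fry_Upsell"), ranked by the average per store.
        return {
            "queryType": "topN",
            "dataSource": "events",
            "dimension": "storeId",
            "metric": "avg",
            "threshold": 10,
            "granularity": "all",
            "intervals": [interval],
            "filter": {"type": "selector", "dimension": "customerId", "value": customer_id},
            "aggregations": [
                {"type": "doubleSum", "name": "sum", "fieldName": field},
                {"type": "count", "name": "rows"},
            ],
            "postAggregations": [{
                "type": "arithmetic", "name": "avg", "fn": "/",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "sum"},
                    {"type": "fieldAccess", "fieldName": "rows"},
                ],
            }],
        }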

Second approach: ingest the custom metrics using generic sequenced names, so the ingestion schema has field names like “customMetric1”, “customMetric2”, up to “customMetricN”. Then we map custom metric slot to custom metric label in the aggregators of our queries - our service layer knows to generate queries for Customer A with { name: French Fry Upsell, fieldName: customMetric1 } and for Customer B with { name: Popcorn Upsell, fieldName: customMetric1 }.
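
A rough sketch of what that looks like on the query-generation side (the mapping itself would live in our settings database; the structure and names below are purely illustrative):

    # Hypothetical slot map: generic metric slot -> customer-provided label.
    SLOT_MAP = {
        "customerA": {"customMetric1": "French Fry Upsell", "customMetric2": "Milkshake Upsell"},
        "customerB": {"customMetric1": "Popcorn Upsell", "customMetric2": "Candy Upsell"},
    }

    def aggregators_for(customer_id):
        # The customer-facing label becomes the aggregator name; the generic
        # slot becomes the fieldName.
        return [
            {"type": "doubleSum", "name": label, "fieldName": slot}
            for slot, label in SLOT_MAP[customer_id].items()
        ]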

But now the difficulty is maintaining that mapping over time. When “Kids Toy” replaces “French Fry”, does it take the “customMetric1” slot? If so, it seems we would need to split into multiple queries when the target date range crosses the point where the change happens. Really messy, and maybe not realistically possible when you consider date ranges that contain multiple metric changes - how thin can we slice those queries? It would also be hard to pick the exact point in time at which to slice, since there would be a pipeline delay between when the configuration changed and when the event data reflects that change. Do we retire “customMetric1” and put “Kids Toy” in “customMetric3”? Now we’re facing the prospect of having to allow an unlimited number of these columns, or capping them at some number past which no further changes are possible. And it seems like we’re still in the business of figuring out which of the custom metrics are relevant at particular times; it’s just shifted a little bit.

Any suggestions on how to approach this would be greatly appreciated.

  • Ryan

Here’s a thought we had that seems workable, but I’m still wondering if it could be better.

The way I described my use case cut a few corners. What we actually have is a set of custom-defined boolean signals (did you upsell fries T/F? did you upsell a milkshake T/F?) that we want to display expressed as a percentage. But we realized that these don’t have to be represented as metrics in Druid. Instead, we can represent them as two multi-valued dimensions, e.g.:

{ allCustomSignals: ["French Fry", "Milkshake"], trueCustomSignals: ["French Fry"] }

Then we can run two GroupBy or TopN queries, one on each of the two dimensions, and divide the counts in our service layer to generate a percentage for display.
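
Sketched out, the two queries and the service-layer division look something like this (the datasource name, the filter, and the run_query helper that POSTs a native query to the broker are all placeholders):

    # Hypothetical sketch of the two-query-plus-division idea.
    def signal_counts(dimension, interval, customer_filter):
        # groupBy on a multi-valued dimension yields one row per signal value.
        return {
            "queryType": "groupBy",
            "dataSource": "events",
            "dimensions": [dimension],
            "granularity": "all",
            "intervals": [interval],
            "filter": customer_filter,
            "aggregations": [{"type": "count", "name": "rows"}],
        }

    def upsell_percentages(run_query, interval, customer_filter):
        all_q = signal_counts("allCustomSignals", interval, customer_filter)
        true_q = signal_counts("trueCustomSignals", interval, customer_filter)
        # Native groupBy results carry the grouped values under "event".
        alls = {r["event"]["allCustomSignals"]: r["event"]["rows"] for r in run_query(all_q)}
        trues = {r["event"]["trueCustomSignals"]: r["event"]["rows"] for r in run_query(true_q)}
        return {name: 100.0 * trues.get(name, 0) / total
                for name, total in alls.items() if total}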

That seems fairly clean. It requires two queries, but they can be run in parallel, and it handles changes in the set of signals over time pretty naturally. But it’s a little more work in our service layer than we typically like to do. Would there be any way to do the percentage calculation in Druid as part of the query? It seems like trying to do it with aggregators and/or post-aggregators will run into the same issue of needing to know what dimension values are possible up front. (Or I bet it could be done with JavaScript, but that would likely not be performant.)

Is there some clever way to do it in a single query?

I’ve played around with this a bit. I found that the approach of ingesting “all” and “true” dimensions, then joining and dividing in our service layer, is workable. It looks a little less attractive in cases where we want to group by another dimension, e.g. show me the % of French Fry upsells grouped by restaurant. You have to group by both the custom dimension and the secondary dimension, and if the cardinality of that second dimension is high, the result sets to be joined start looking a bit scary.
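
For reference, the per-restaurant variant just adds the secondary dimension to each of the two groupBy queries, and the service layer then joins the two result sets on (signal, store) before dividing; something like this (dimension and datasource names as above, purely illustrative):

    # Per-store variant of the "all" query; the "true" query is identical
    # except for the signal dimension. The two result sets are joined on
    # (signal value, storeId) before dividing.
    per_store_all = {
        "queryType": "groupBy",
        "dataSource": "events",
        "dimensions": ["allCustomSignals", "storeId"],
        "granularity": "all",
        "intervals": ["2016-01-01/2016-02-01"],
        "filter": {"type": "selector", "dimension": "customerId", "value": "customerA"},
        "aggregations": [{"type": "count", "name": "rows"}],
    }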

I also stumbled across this thread: https://groups.google.com/forum/#!topic/druid-user/KralBuLp7iM - it suggests another approach: do an initial TopN to fetch the “all” values that are in play for a given date range and set of filters, then use that to generate a second query that includes filtered aggregators targeting the values learned from the first query. E.g., a metric “french_fry_max” filtered on allCustomSignals == “French Fry”, and a parallel metric “french_fry_count” filtered on trueCustomSignals == “French Fry”. These metrics can then be converted to a percentage via post-aggregators. This approach also seems workable - the two queries have to be performed in sequence, but the first one ought to be very quick and could sometimes be cached. And this one is better for the secondary-dimension case, since you only have to group by the single dimension - you can do it with a TopN.
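
Roughly, the second query in that scheme would carry a pair of filtered aggregators per signal discovered by the first query, plus an arithmetic post-aggregator for the percentage; something along these lines (the names, the timeseries wrapper, and the slug convention are illustrative, and the same aggregations/postAggregations could sit inside a TopN on the store dimension for the grouped case):

    # Hypothetical sketch of the filtered-aggregator approach. The signal
    # names come from the initial discovery TopN on "allCustomSignals".
    def filtered_count(name, dimension, value):
        return {
            "type": "filtered",
            "filter": {"type": "selector", "dimension": dimension, "value": value},
            "aggregator": {"type": "count", "name": name},
        }

    def percentage_query(signal_names, interval, customer_filter):
        aggs, post_aggs = [], []
        for signal in signal_names:
            slug = signal.lower().replace(" ", "_")  # e.g. "french_fry"
            aggs.append(filtered_count(slug + "_all", "allCustomSignals", signal))
            aggs.append(filtered_count(slug + "_true", "trueCustomSignals", signal))
            post_aggs.append({
                "type": "arithmetic", "name": slug + "_pct", "fn": "/",
                "fields": [
                    {"type": "fieldAccess", "fieldName": slug + "_true"},
                    {"type": "fieldAccess", "fieldName": slug + "_all"},
                ],
            })
        return {
            "queryType": "timeseries",
            "dataSource": "events",
            "granularity": "all",
            "intervals": [interval],
            "filter": customer_filter,
            "aggregations": aggs,
            "postAggregations": post_aggs,
        }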

It’s hard to say what the performance will be like until we start ingesting data. We might try both approaches and compare performance once we have data in production.

Hey Ryan,

Sorry to necro the thread, but I just read it for the first time (cue joke about being behind on email), and that is a pretty cool idea. Did you find out what ended up working best?