Rollup Ingestion on weighted metric

Hi guys! I might have a silly question, but can’t fina an answer…

I have data in csv file similar like this:

date       | ad_network_id | ads_impressions| ads_revenue | ecpm
----------------------------------------------------------------
2021-12-20 |            10 |           2343 |      543.15 | 5.4
2021-12-20 |            10 |           3453 |      343.12 | 3.1
2021-12-20 |            10 |          12543 |       353.9 | 8.1

date is timestamp
ad_network_id is long

and then we have metrics:
ads_impressions as longSum
ads_revenue as doubleSum

My problem is ecpm… how ecpm can be a metric if I want it to be weighted by impressions?
Like: ecpm = ecpm*ads_impressios/total_impressions

After rollup I want my data to look like this:

__time     | ad_network_id | ads_impressions| ads_revenue | ecpm
----------------------------------------------------------------
2021-12-20 |            10 |          18339 |     1240.17 | 6.8136

In this example ecpm formula:

ecpm = 5.4*2343/18339 + 3.1*3453 /18339 + 8.1*12543 /18339

How can I make this kinda metric rollup? Any advice would be appreciated :slight_smile:

Thank you

Here is a first approach to get the result at query time, if this helps at all:

WITH r AS (
  SELECT __time, ad_network_id, SUM(ads_impressions) AS sum_impressions
  FROM ecpm
  GROUP BY 1, 2
)
SELECT t.__time, t.ad_network_id, SUM(ecpm * t.ads_impressions / r.sum_impressions)
FROM ecpm t JOIN r ON (t.__time = r.__time) AND (t.ad_network_id = r.ad_network_id)
GROUP BY 1, 2

Depending on the size of the tables involved, this might need some tuning to work.

I doubt that you can make this work at ingestion time. What happens if your rollup is not perfect? All your numbers would be off because you would be dividing by a partial sum only.

1 Like

Been googling all day :smiley: could not find any way to do it at ingestion time…

Know for sure making ecpm a doubleSum metric does not work… that only sums ecpm, which is not what I need…

Tried different transforms, but there is just no way to get full sum of impressions (or at least not as I know right now…)

Yes, the only option is just leave ecpm as a dimension & do it at query time. Your query @Hellmar_Becker does work, but I thought I can reduce storage footprint by reducing those three rows into one row… (in production its not gonna be 3 rows, but thousands…)

Is there any way to get summed impressions at ingestion time? cuz then I could just do transform:

"transforms": [
    {
        "type": "expression",
        "name": "ecpm",
        "expression": "cast(ecpm, 'DOUBLE') * cast(ads_impressions, 'LONG') / cast(ads_impressions_summed, 'LONG')"
    }
]

ads_impressions_summed would that 18,339 number (all summed impressions)

If you enable Javascript, you can write a metric definition like this:

        {
          "name": "custom_agg",
          "type": "javascript",
          "fieldNames": [
            "ads_impressions",
            "ecpm"
          ],
          "fnAggregate": "function(current, a, b) { return current + a * b; }",
          "fnCombine": "function(partialA, partialB) { return partialA + partialB; }",
          "fnReset": "function() { return 0.0; }"
        }

which sums up (ads_impressions * ecpm). You would still have to keep the sum of impressions as a separate metric and do the division at query time, because (as I understand) the Javascript aggregator can handle only single floating point return types. But that way you can do the rollup.

I will continue digging.

Come to think of it, would this do:

Define a transform to compute the total spend per row as (impressions * ecpm) and sum this one up

      "metricsSpec": [
        {
          "name": "count",
          "type": "count"
        },
        {
          "name": "sum_ad_network_id",
          "type": "longSum",
          "fieldName": "ad_network_id"
        },
        {
          "name": "sum_ads_impressions",
          "type": "longSum",
          "fieldName": "ads_impressions"
        },
        {
          "name": "sum_ads_revenue",
          "type": "doubleSum",
          "fieldName": "ads_revenue"
        },
        {
          "name": "sum_ecpm",
          "type": "doubleSum",
          "fieldName": "ecpm"
        },
        {
          "type": "doubleSum",
          "name": "sum_ads_spend",
          "fieldName": "ads_spend"
        }
      ],
      "transformSpec": {
        "transforms": [
          {
            "type": "expression",
            "name": "ads_spend",
            "expression": "1.0 * ads_impressions * ecpm"
          }
        ]
      }

then you compute the overall eCPM as (ads_spend / sum_ads_impressions) at query time?

1 Like

Yes man! It worked like Magic! Thank you very much :slight_smile:

Now storage footprint is more than 10 times smaller.
From my previous example, data in csv file look like this:

date       | ad_network_id | ads_impressions| ads_revenue | ecpm
----------------------------------------------------------------
2021-12-20 |            10 |           2343 |      543.15 | 5.4
2021-12-20 |            10 |           3453 |      343.12 | 3.1
2021-12-20 |            10 |          12543 |       353.9 | 8.1

After ingestion data turns out like this:

__time     | ad_network_id | sum_ads_impressions| sum_ads_revenue | sum_ecpm_imp
---------------------------------------------------------------------------------
2021-12-20 |            10 |              18339 |         1240.17 |     124954.8

Dimensions & metrics:

"metricsSpec": [
  {
    "type": "longSum",
    "name": "sum_ads_impressions",
    "fieldName": "ads_impressions"
  },
  {
    "type": "doubleSum",
    "name": "sum_ads_revenue",
    "fieldName": "ads_revenue"
  },
  {
    "type": "doubleSum",
    "name": "sum_ecpm_imp",
    "fieldName": "ecpm_imp"
  }
],
"dimensionsSpec": {
  "dimensions": [
    {
      "name": "ad_network_id",
      "type": "long"
    }
  ]
}

Using your transformation:

"transforms": [
  {
    "type": "expression",
    "name": "ecpm_imp",
    "expression": "1.0 * ads_impressions * ecpm"
  }
]

After everything is done, using this query to get results:

SELECT 
  __time, 
  ad_network_id, 
  SUM(imp) AS impressions,
  SUM(rev) AS revenue,
  SUM(sum_ecpm/imp) AS ecpm
FROM (
  SELECT 
    __time,
    ad_network_id,
    SUM(sum_ads_impressions) AS imp, 
    SUM(sum_ads_revenue) AS rev,
    SUM(sum_ecpm_imp) AS sum_ecpm
  FROM my_testing_datasource
  GROUP BY __time, ad_network_id
)
GROUP BY __time, ad_network_id

Not sure if WITH or nested select is faster…

Results from query:

__time     | ad_network_id |     impressions |      revenue |   ecpm
-------------------------------------------------------------------
2021-12-20 |            10 |          18339  |     1240.17  | 6.8136
1 Like

I don’t even think you need a sub-SELECT. This should be fine:

SELECT 
  __time, 
  ad_network_id, 
  SUM(sum_ads_impressions) AS impressions,
  SUM(sum_ads_revenue) AS revenue,
  SUM(sum_ecpm_imp) / SUM(sum_ads_impressions) AS ecpm
FROM my_testing_datasource
GROUP BY __time, ad_network_id

Basically, I think you use the inner SELECT just to sum up the impressions in order to get the common denominator for all ecpm values. But you can just aggregate this on the fly in the main query.

That’s even simplier! Thank you