In order to speed up the indexing of some datasources, we are considering using existing datasources as the input for new indexing tasks.
The general idea is this:
We have a big datasource A, which is used by only a few use cases.
We have smaller datasources B_1, …, B_n that are essentially limited views over the big one. Most of our use cases can rely on these datasources (for instance: total revenue for a publisher, campaign spend, etc.).
By “big” and “small” I mean the size of the segments produced by the indexer (a small datasource is <10% the size of the big one).
Currently we index the B_* datasources from the same log data used for A, but we are considering switching their source to A's data segments. I expect this to be much faster, as the A segments are already aggregated.
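To make the idea concrete, here is a sketch of the kind of re-indexing task we have in mind, assuming Druid's ingestSegment firehose (the datasource names and the interval are placeholders, and the dataSchema is elided):

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "B_1"
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "A",
        "interval": "2016-01-01/2016-02-01"
      }
    }
  }
}
```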
The problem we are facing relates to unique counters. Some of the A metrics are unique counters (hyperUnique) over dimensions that are not stored in the generated segments. Let me explain it better. The logs have the following dimensions:
A has a subset of these dimensions that does not include user_id and page_id. Instead, it has the following metrics:
The B_* datasources have the same metrics, but I don't actually know how to define them, since user_id and page_id no longer exist in the A segments.
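One option we considered (and we are unsure whether it is correct, hence this question) is pointing a hyperUnique aggregator at the pre-computed sketch column itself, on the assumption that the segments store HLL sketches that can be merged rather than raw values. The metric names below are placeholders:

```json
"metricsSpec": [
  { "type": "hyperUnique", "name": "unique_users", "fieldName": "unique_users" },
  { "type": "hyperUnique", "name": "unique_pages", "fieldName": "unique_pages" }
]
```

But we don't know whether this actually merges the existing sketches during re-indexing, or whether it would double-count.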
My questions are:
Does it make sense to speed up the generation of datasources from a datasource that has a superset of the required dimensions/metrics?
How can we actually re-index the unique metrics, given the problem stated above?