Reindex from segments

Hi all,

in order to speed up indexing of some datasources, we were wondering whether we could take advantage of indexing from existing datasources.

The general idea is this:

  1. We have a big datasource A. This is used by only a few use cases.

  2. We have smaller datasources B_1, …, B_n that are basically limited views over the big one. Most of our use cases can rely on these datasources. (for instance: total revenue for a publisher, campaign spend, etc…)

By “big” and “small” I mean the size of the segments produced by the indexer (a small datasource is <10% the size of the big one).

Currently we index B_* from the same log data used for A, but we are wondering about switching the source to A’s data segments. I expect it to be much faster, as A’s segments are already aggregated.
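
For reference, here is a minimal sketch of how the input side of such a re-indexing task could look, assuming the ingestSegment firehose of the native index task (the interval is just an illustrative placeholder; check the exact spec for your Druid version):

    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "A",
        "interval": "2016-01-01/2016-02-01"
      }
    }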

The problem we are facing is related to unique counters. Some of A’s metrics are unique counters (hyperUnique) over dimensions that are not saved into the generated data segments. To explain it better: the logs have the following dimensions:

  • user_id

  • page_id

A has a subset of these dimensions that does not include user_id or page_id. Instead it has the following metrics (a metricsSpec sketch follows the list):

  • unique_users: hyperUnique(user_id)

  • unique_pages: hyperUnique(page_id)
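
In other words, the metricsSpec of A over the raw logs looks roughly like this (a sketch using the names above):

    "metricsSpec": [
      { "type": "hyperUnique", "name": "unique_users", "fieldName": "user_id" },
      { "type": "hyperUnique", "name": "unique_pages", "fieldName": "page_id" }
    ]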

The B_* datasources have the same metrics, but I don’t actually know how to define them, as user_id and page_id no longer exist in A’s segments.

My questions are:

  1. Does it make sense to speed up generation of datasources from a datasource that has a superset of the required dimensions/metrics?

  2. How can we actually reindex the unique metrics given the problem stated above?

thanks,

Maurizio

OK, replying to myself:

  1. it’s very fast.

  2. just use hyperUnique against the hyperUnique field of the origin datasource (see the sketch below).
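
That is, when re-indexing from A’s segments the aggregator’s fieldName points at the already-built hyperUnique column instead of the raw dimension, and the existing sketches are folded together. A minimal metricsSpec sketch for B_* (verify against your Druid version):

    "metricsSpec": [
      { "type": "hyperUnique", "name": "unique_users", "fieldName": "unique_users" },
      { "type": "hyperUnique", "name": "unique_pages", "fieldName": "unique_pages" }
    ]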

Maurizio

Thanks, Maurizio, for the update. Your assumption is correct: indexing from already existing segments is faster than indexing raw data, and one major factor in the speedup is the rollup ratio. Druid segments contain already rolled-up data, so if you have a good rollup ratio, the number of rows to be processed is much smaller than for the raw data.