Currently we have a datasource that is built by batch indexing a day-sized extract from our data warehouse. This datasource contains unique ids that we need to drive one feature of our product, which displays individual events (calls) using a Select query.
Our other use case is a dashboard showing a number of aggregated views based on group-by queries. We realized that the datasource as we had it was not well optimized for that use case, for two reasons:
Though we don’t need the unique ids for this use case, their presence in the datasource was preventing pre-aggregation/rollup.
The “dimension” we effectively want to group by in these queries was not represented as a metric, or even as a single actual dimension in Druid. Instead we had to group by two dimensions and then do a post-query mapping to derive the “synthetic” dimension we want to display in our graphs.
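To make the post-query mapping concrete, here is a minimal sketch of that step. The dimension names, mapping table, and values are placeholders for illustration, not our actual schema:

```python
# Hypothetical sketch of the post-query mapping we currently do in the app.
# Dimension names and the mapping table below are placeholders.
from collections import defaultdict

# Maps (dim_a, dim_b) pairs to the synthetic dimension we actually chart.
SYNTHETIC_MAP = {
    ("inbound", "answered"): "handled",
    ("inbound", "missed"): "abandoned",
    ("outbound", "answered"): "outbound_connected",
}

def to_synthetic(group_by_rows):
    """Collapse two-dimension groupBy result rows into synthetic-dimension counts."""
    counts = defaultdict(int)
    for row in group_by_rows:
        synthetic = SYNTHETIC_MAP.get((row["dim_a"], row["dim_b"]), "other")
        counts[synthetic] += row["count"]
    return dict(counts)

rows = [
    {"dim_a": "inbound", "dim_b": "answered", "count": 10},
    {"dim_a": "inbound", "dim_b": "missed", "count": 4},
    {"dim_a": "inbound", "dim_b": "answered", "count": 3},
]
print(to_synthetic(rows))  # {'handled': 13, 'abandoned': 4}
```

Doing this per query means every dashboard request pays for the wider groupBy plus this client-side pass.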
We built a prototype datasource to power the dashboards that addressed these problems:
It does not contain the unique ids, so it achieves a rollup ratio of about 2:1.
An ETL process maps the two source dimensions to the synthetic dimension before ingestion. It also represents each possible value of the group-by dimension as a count metric. (There are only six possible values, so six metrics.) As a result, we can pull the counts we need from the metrics and drop two dimensions from our queries; in one case this meant switching to a timeseries query, since there were no dimensions left to group by.
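A rough sketch of that pre-ingestion ETL step, assuming hypothetical synthetic values (the six names below are placeholders, not our real taxonomy). Each row gets one count column per synthetic value, so Druid can sum them as longSum metrics at rollup time:

```python
# Hedged sketch of the pre-ingestion enrichment step.
# The six synthetic values are placeholders for illustration.
SYNTHETIC_VALUES = [
    "handled", "abandoned", "voicemail",
    "outbound_connected", "outbound_failed", "internal",
]

def enrich_row(row, synthetic):
    """Add one count column per synthetic value (one-hot), so each becomes
    a summable metric at ingestion time."""
    out = dict(row)
    for value in SYNTHETIC_VALUES:
        out[f"count_{value}"] = 1 if value == synthetic else 0
    return out

row = {"timestamp": "2018-01-01T00:00:00Z", "region": "us-east"}
enriched = enrich_row(row, "handled")
print(enriched["count_handled"], enriched["count_abandoned"])  # 1 0
```

With the counts baked in as metrics, the dashboard queries no longer need the two source dimensions at all.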
The prototype datasource shows a massive performance improvement over the original: about 8x on some of the worst-case queries, which also involve high-cardinality query-time lookups.
So now we need to figure out a way to ingest both datasources in a scalable manner. As I mentioned, our current indexer capacity can barely keep up with ingestion of the original datasource. We are admittedly trying to do things a bit on the cheap here, and maybe we just have to suck it up and pay for more indexing capacity. But I’m looking for ways to increase our indexing efficiency first. One idea was to use the IngestSegmentFirehose to build the dashboard datasource from the primary one, in the hope that since the data is already ingested, it might be faster to re-ingest that way than to pull the original source file down from S3 all over again. If for no other reason than that the segments in the ingested datasource are 50%-70% the size of the original extract.
This is getting a little far afield, but another thought I had was to put the new metrics into the primary datasource instead of creating a new one. We’d lose the ingestion rollup, but my guess is that it’s mainly being able to drop two group-by dimensions that provides the performance boost; I haven’t had a chance to test that yet, though. Even so, sticking with a single datasource would seem to make reingestion/backfill scenarios more awkward in some cases. Also, in the future we’ll be adding more dashboards for which we may want to craft similarly optimized datasources. Maybe the metrics for those cases could just be added to the existing datasource, or to a single dashboard datasource, but it would be nice to have enough indexing headroom to be able to create additional specialized datasources if we want.