[0.9.1.1] performance of IngestSegmentFirehose

Hi,

It’s looking like we could benefit from setting up multiple datasources: two to start with, but possibly more in the future.

We’re still creaking along with batch file indexing and only just barely keeping up with ingestion for our primary datasource. For the secondary datasource I was thinking about using an IngestSegmentFirehose to ingest it from the primary one, since the primary contains (or could contain) all the data the secondary needs. (The secondary source is a subset of the primary with some dimensions removed to allow better rollups.)
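
Roughly what I’m picturing, for what it’s worth: a normal index task whose ioConfig reads from the primary datasource via the ingestSegment firehose instead of from S3. This is only a sketch (written as Python building the JSON); the datasource name, interval, and column lists are placeholders, not our real schema.

    import json

    # Only a sketch (Druid 0.9.x "ingestSegment" firehose). This dict would sit inside
    # the ioConfig of an otherwise ordinary "index" task spec; names are placeholders.
    reindex_io_config = {
        "type": "index",
        "firehose": {
            "type": "ingestSegment",
            "dataSource": "calls",                     # hypothetical primary datasource
            "interval": "2016-09-01/2016-09-02",       # one day at a time, like our daily extracts
            "dimensions": ["region", "product_line"],  # keep only what the secondary datasource needs
            "metrics": ["count", "duration_sum"],      # the unique-id dimension is simply left out
            "filter": None                             # no row filtering; we want every row
        }
    }

    print(json.dumps(reindex_io_config, indent=2))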

Can we expect better performance from that firehose compared to ingestion using the same static-S3 firehose that we use for the primary one? A couple of quick tests suggest no; it looks about the same. Perhaps it is just reading the segment files out of S3 deep storage, so it works out roughly the same? Even if that’s the case, the ingested segment files are a bit smaller than the original source files, so I was hoping it might be at least a little faster…

Of course we can add more indexer capacity to keep up, but we do have to watch our operating costs. We don’t have any in-house Hadoop experience, so we’ve been reluctant to switch over to Hadoop-based batch indexing, but would that be our best bet if we want to keep scaling up without shelling out for a lot of hardware?

  • Ryan

Hi Ryan, what are you trying to do?

Currently we have a datasource built by batch indexing a day-sized extract from our data warehouse. It contains unique ids, which we need for one feature of our product that displays individual events (calls) using a Select query.
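
For context, this is the shape of the Select query I mean; the datasource, dimension names, values, and interval are placeholders (the real query is built by our app).

    import json

    select_query = {
        "queryType": "select",
        "dataSource": "calls",                         # hypothetical primary datasource
        "intervals": ["2016-09-01/2016-09-02"],
        "granularity": "all",
        "filter": {"type": "selector", "dimension": "call_id", "value": "abc-123"},
        "dimensions": [],                              # empty lists mean "return everything"
        "metrics": [],
        "pagingSpec": {"pagingIdentifiers": {}, "threshold": 50}
    }

    print(json.dumps(select_query, indent=2))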

Our other use case is a dashboard showing a number of aggregated views based on groupBy queries. We realized that the datasource as we had it was not well optimized for that use case, for two reasons:

  1. Though we don’t need the unique ids for this use case, their presence in the datasource was blocking pre-aggregation/rollups.

  2. The “dimension” we effectively want to group by for these queries was not represented as a metric, or even as a single actual dimension in Druid. Instead we had to group by two dimensions and then do a post-query mapping to get the “synthetic” dimension we want to display in our graphs (roughly as in the sketch below).
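
Here is roughly what those dashboard queries look like today, with placeholder names throughout, including the client-side mapping step:

    import json

    # The point is the shape: group by two real dimensions, then collapse each
    # (dim_a, dim_b) pair into the synthetic dimension on the client side.
    group_by_query = {
        "queryType": "groupBy",
        "dataSource": "calls",                         # hypothetical original datasource
        "intervals": ["2016-09-01/2016-09-02"],
        "granularity": "day",
        "dimensions": ["disposition", "direction"],    # the two dimensions we combine after the fact
        "aggregations": [{"type": "longSum", "name": "count", "fieldName": "count"}]
    }

    # Post-query mapping to the "synthetic" dimension the graphs actually display.
    SYNTHETIC = {
        ("answered", "inbound"): "handled",
        ("missed", "inbound"): "abandoned",
        # ...the remaining combinations
    }

    def synthetic_value(result_row):
        event = result_row["event"]                    # groupBy results wrap values in an "event" map
        return SYNTHETIC.get((event["disposition"], event["direction"]), "other")

    print(json.dumps(group_by_query, indent=2))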

We built a prototype datasource to power the dashboards, which addresses both of these problems:

  1. It does not contain the unique ids, so it achieves a rollup ratio of about 2:1.

  2. An ETL process does the mapping of the two source dimensions to the synthetic dimension pre-ingestion. It also represents every possible value of the group-by dimension as a count metric. (There are only six possible values, so six metrics.) That let us pull the counts we need from the metrics and drop two dimensions from our queries; in one case this meant switching to a timeseries query, since there were no dimensions left to group by (see the sketch after this list).
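
To make the prototype concrete, here is a cut-down sketch of the idea, again with placeholder names (there are six values and six metrics in reality): the ETL writes one indicator column per synthetic-dimension value, the metricsSpec sums them at ingestion, and the dashboard reads them back with a plain timeseries query.

    import json

    metrics_spec = [
        {"type": "longSum", "name": "handled_count",   "fieldName": "handled"},
        {"type": "longSum", "name": "abandoned_count", "fieldName": "abandoned"},
        # ...four more, one per possible value of the synthetic dimension
    ]

    timeseries_query = {
        "queryType": "timeseries",                     # nothing left to group by
        "dataSource": "calls_dashboard",               # hypothetical prototype datasource
        "intervals": ["2016-09-01/2016-09-02"],
        "granularity": "hour",
        "aggregations": [
            {"type": "longSum", "name": "handled",   "fieldName": "handled_count"},
            {"type": "longSum", "name": "abandoned", "fieldName": "abandoned_count"},
            # ...and the other four
        ]
    }

    print(json.dumps({"metricsSpec": metrics_spec, "query": timeseries_query}, indent=2))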

The prototype datasource gives a massive performance improvement over the original: about 8x on some of the worst-case queries, which also involve high-cardinality query-time lookups.

So what we want to do is figure out a way to ingest both datasources in a scalable way. As I mentioned, our current indexer capacity can barely keep up with ingestion of the original datasource. We are admittedly trying to do things a bit on the cheap here, and maybe we just have to suck it up and pay for more indexing capacity, but I’m looking for ways to increase our indexing efficiency. One idea was to use the IngestSegmentFirehose to ingest the dashboard datasource from the primary one, in the hope that, since the data has already been ingested once, it might be faster than pulling the original source files down from S3 all over again, if for no other reason than that the segments in the ingested datasource are 50%-70% of the size of the original extract.
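
Operationally, I imagine that would just be a small daily wrapper that builds an index task around an ingestSegment firehose like the one I sketched above and posts it to the Overlord. The hostname and helper below are made up, just to show the shape I have in mind:

    import json
    import requests  # third-party HTTP client (pip install requests)

    OVERLORD = "http://overlord.example.com:8090"      # placeholder host/port

    def submit_reindex_task(task_spec):
        """POST an indexing task spec (a dict) to the Overlord; returns the task id."""
        resp = requests.post(
            OVERLORD + "/druid/indexer/v1/task",
            data=json.dumps(task_spec),
            headers={"Content-Type": "application/json"},
        )
        resp.raise_for_status()
        return resp.json()["task"]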

This is getting a little far afield… but another thought I had was to go ahead and put the new metrics into the primary datasource instead of creating a new one. Although we’d lose the ingestion rollup, my guess is that it’s mainly being able to drop two group-by dimensions that gives the performance boost; I haven’t had a chance to test that yet, though. Even if so, sticking with a single datasource would seem to make reingestion/backfill scenarios more awkward in some cases. Also, in the future we’ll be adding more dashboards for which we may want to craft similarly optimized datasources. Maybe the metrics for those cases could just be added to the existing datasource, or to a single dashboard datasource, but it’d be nice to have enough indexing headroom to be able to choose to build additional specialized datasources if we want.