When defining a schema for Druid, how critical is the metricsSpec section? Is it required that we know ahead of time all dimensions that we might want to have aggregation on in order to perform max/min/sum queries on them? Based on the documentation, I believe it is required when using Aggregations in queries. There doesn’t appear to be a way to do max/min/sum without having these metrics defined in the schema, but I wanted to confirm that as it doesn’t appear to be explicitly called out in the documentation for Schemas.
You do need to set up your metrics at ingestion time because of the rollup Druid does during ingestion. It needs to know how to combine two events that share the same set of dimensions. See “Roll-Up” here for some more details: http://druid.io/docs/latest/design/index.html
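To make the rollup point concrete, here is a minimal Python sketch of the idea. It is not Druid's actual implementation: the `rollup` function, the event fields, and the hourly bucket are all made up for illustration. It only shows why an aggregator has to be chosen up front: once two events with the same truncated timestamp and dimension values are combined, only the combined value is kept.

```python
from collections import defaultdict

def rollup(events, dimensions, metric, granularity_s=3600):
    """Sum `metric` over events that share a time bucket and dimension values.

    Hypothetical helper for illustration only; this is not Druid code.
    """
    totals = defaultdict(float)
    for event in events:
        # Truncate the timestamp to the start of its bucket.
        bucket = event["timestamp"] - event["timestamp"] % granularity_s
        key = (bucket,) + tuple(event[d] for d in dimensions)
        totals[key] += event[metric]
    return dict(totals)

events = [
    {"timestamp": 10, "page": "a", "duration": 3.0},
    {"timestamp": 20, "page": "a", "duration": 5.0},
    {"timestamp": 30, "page": "b", "duration": 7.0},
]

# The two "a" events fall in the same bucket and collapse into a single row.
rolled = rollup(events, ["page"], "duration")
print(rolled)
```

Three input events become two stored rows, and the individual values 3.0 and 5.0 are no longer available afterwards.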
Thanks Gian, that makes sense. So, I guess my next question would be: how do you add a new metric to already existing segments?
Segments that have already been created are immutable, but you can regenerate them with new columns by reloading the data in batch. Generally the Hadoop indexer is the easiest way to do this. You can provide it with an “intervals” parameter and it will do a batch-replacement of that interval.
That seems to imply that you would need the original, uningested dataset. We’re pulling data in real-time, so we don’t keep a separate copy to batch ingest. Is there a way to reindex from existing segment data?
Druid can do reingestion, but only to drop columns or coarsen the granularity. Druid doesn’t actually keep a raw copy of your data; it only keeps the columns you tell it to ingest. For this reason, if you think you might ever need to reprocess your data, we recommend saving a copy of your raw data to S3 or HDFS in addition to sending it to Druid.
I understand that Druid doesn’t keep everything, but I’d like to apply a new metric to an existing dimension. Is that possible with reingestion?
Not in general. When you index a metric, you pick SUM, MIN, MAX, and so on as the ingestion aggregator that gets applied during rollup, and you can’t take a bunch of values that were summed and figure out what their max was. It may be possible in some specific cases, though (for example, if you did not actually get any rollup).
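A tiny Python illustration of why a stored sum can't be turned back into a max. The numbers are invented; the point is only that two different sets of raw values can roll up to the same sum while having different maxima.

```python
# Two hypothetical groups of raw values that would each roll up into one row.
group_a = [1, 9]
group_b = [5, 5]

# What a SUM aggregator would have stored at ingestion time.
sum_a, sum_b = sum(group_a), sum(group_b)

# What a MAX aggregator would have stored, had it been configured.
max_a, max_b = max(group_a), max(group_b)

print(sum_a == sum_b)  # the stored sums are indistinguishable
print(max_a == max_b)  # but the true maxima are not the same
```

If no rollup happened, each stored row corresponds to one original event, which is the special case where the original value is still recoverable.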
I think my case is one of those special ones. The new metric would be applied to an existing dimension so the base value should (if I understand the ingestion process properly) be stored for every record created in each segment.
In the particular case I’m dealing with, we have a dimension called ‘duration’ that holds a number. In theory, it should be possible to read in the segment, determine the max value, and then generate a new segment from it with the new metric inside of it.
Does that mean it’s possible to reindex our dataset, and if so is there an example out there on how to do it?
The thing to look at would be the “dataSource” pathSpec for batch ingestion: http://druid.io/docs/latest/ingestion/batch-ingestion.html
It basically reads rows from existing segments and reindexes them with a new schema.
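As a rough sketch of what that could look like, the Python below builds the two fragments of a Hadoop batch spec that matter here: an inputSpec of type “dataSource” and a metricsSpec entry for the new metric. Treat the details as assumptions to check against the batch-ingestion docs for your Druid version: the exact field names under `ingestionSpec` (for example `intervals` versus `interval`), the `doubleMax` aggregator name, and whether a dimension column such as `duration` can be used as the aggregator's input when reindexing. The datasource name, interval, and metric name are placeholders, and the rest of the spec (parser, granularitySpec, tuningConfig) is omitted.

```python
import json

def reindex_fragments(data_source, interval):
    """Build the reindexing-related parts of a batch spec (illustrative only)."""
    io_config = {
        "type": "hadoop",
        "inputSpec": {
            # Read rows from existing segments instead of raw files.
            "type": "dataSource",
            "ingestionSpec": {
                "dataSource": data_source,
                # Assumed field name; verify for your Druid version.
                "intervals": [interval],
            },
        },
    }
    metrics_spec = [
        # Hypothetical new metric computed from the existing "duration" column.
        {"type": "doubleMax", "name": "duration_max", "fieldName": "duration"},
    ]
    return {"ioConfig": io_config, "metricsSpec": metrics_spec}

fragments = reindex_fragments("my_datasource", "2015-09-01/2015-10-01")
print(json.dumps(fragments, indent=2))
```

The printed JSON would still need to be merged into a complete indexing task before it could be submitted.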
Awesome, I’ll give that a shot! Any advice for doing this on a live system with an external Hadoop cluster? It’s not super-critical but I would like it to remain online as much as possible while the reindexing takes place.
While re-indexing happens, the Druid cluster will stay online and queries will continue to work with the old data.
By the way, if you are using the “dataSource” pathSpec for reingestion, make sure you patch druid-0.8.1 with https://github.com/druid-io/druid/pull/1797 or wait for druid-0.8.2-rc, which should be released very soon.