Retrospectively adding data

Hey everybody,

I’m thinking of using Druid in a scenario where I want to run an algorithm with different parameter values in order to study the effect of changing this parameter. The algorithm takes a time series and a parameter as input and outputs a time series. For each parameter value of interest, I would batch-index the output produced by the algorithm, so I can subsequently use something like Caravel to slice-and-dice the results.

I know from posts on this forum (for example, from “Load two batch files with the same data”) that reindexing an interval within the same datasource overwrites the existing data. Can this be changed so that instead of overwriting, the data gets added?

Otherwise, it looks as if indexing the output for each parameter value as a separate datasource and using union datasource queries would be best. Does that sound right?

Thanks for any feedback!


To append data events, you can use the delta ingestion as mentioned here -
Having separate datasources and doing a union query also sounds good. It will also help you in case you want to drop the data for a set of parameters in the future.
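For illustration, here is a minimal sketch of what the `ioConfig` of a Hadoop batch task using delta ingestion might look like, built as a Python dict. The `multi` input spec combines a `dataSource` child (the existing segments) with a `static` child (the new files). The datasource name, interval and path are made up for this example, and the exact fields should be checked against the docs for your Druid version.

```python
import json


def delta_io_config(datasource, interval, new_paths):
    """Build an ioConfig that reads existing segments plus new input files."""
    return {
        "type": "hadoop",
        "inputSpec": {
            "type": "multi",
            "children": [
                # Existing data: re-read the segments already in Druid.
                {
                    "type": "dataSource",
                    "ingestionSpec": {
                        "dataSource": datasource,
                        "intervals": [interval],
                    },
                },
                # New data: files to be added for the same interval.
                {"type": "static", "paths": new_paths},
            ],
        },
    }


# Hypothetical names, for illustration only.
io_config = delta_io_config(
    "algo_output", "2016-01-01/2016-02-01", "/data/algo_output/param_0_5.json"
)
print(json.dumps(io_config, indent=2))
```

Note that the task still writes a fresh set of segments for the interval; the existing rows are simply fed back in as input alongside the new files.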

Isn’t delta ingestion also replacing all existing segment shards with new ones?

I think one could perhaps also use the new Kafka indexing task for this use case. It would ingest the newly arriving data and create segments that reside alongside the existing ones. The existing segments don’t get merged with the newly arriving data but remain as they are. If at some point the time is right to compact all the segments that arrived over time into merged segments to increase query performance, a reindex task can be issued.


You wrote:

Having separate datasources and doing a union query also sounds good. It will also help you in case you want to drop the data for a set of parameters in the future.

Could you elaborate on this? I’m interested to learn which usecases the datasource-union can be used for.
How would it help when wanting to drop data?

What other usecases is it good for?

The documentation states that the schemas must be the same. I wonder how strict this requirement is (e.g. are different segment granularities OK?) and which use cases this feature is meant to solve. That’s a general issue I have: the documentation describes all the features that exist, but sometimes I don’t know which use cases they were designed to tackle.

We want to enrich our main datasource with billing information that we receive days after the fact, preferably without needing to reprocess all the data.
Lookups don’t seem to be the way to go because they translate dimension values, whereas the enriched information would be prices, so a metric, not a dimension.

Another thing I’m looking into is “datasource-delegation”, having different views/cuboids of the same data and picking the datasource that could serve up a query the fastest.
Would the union be useful in this context? For instance to union recent data with hourly granularity with older data that is also available at daily granularity to speed up a query?

Union datasources are intended to make it easier to manage ingestion of a dataset that comes from a variety of different datasources, where you might want to manage ingestion of each one separately, but you generally want to query them together. Under the hood a union datasource works by querying all segments for all named datasources as if they were part of a single datasource.
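As a rough sketch of what that looks like from the client side, the snippet below builds a timeseries query whose `dataSource` is a union of several datasources. The datasource names, the `value` metric and the hourly granularity are assumptions made up for this example.

```python
import json


def union_timeseries_query(datasources, interval, metric):
    """Build a timeseries query that spans several datasources via a union."""
    return {
        "queryType": "timeseries",
        # The union datasource lists every datasource to be queried together.
        "dataSource": {"type": "union", "dataSources": list(datasources)},
        "granularity": "hour",
        "intervals": [interval],
        "aggregations": [
            {"type": "doubleSum", "name": metric, "fieldName": metric}
        ],
    }


# Hypothetical datasources: one per parameter value of the algorithm.
query = union_timeseries_query(
    ["algo_output_param_0_1", "algo_output_param_0_5"],
    "2016-01-01/2016-02-01",
    "value",
)
print(json.dumps(query, indent=2))
```

The resulting JSON would be POSTed to the broker like any other native query.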

Union datasources make it easier to drop data because dropping a time range of a datasource is O(1) but dropping rows matching a filter is O(N) in the number of rows that exist in the datasource, because you need to reindex.
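To make the cheap case concrete, here is a small sketch of the kill task that removes the segments of one datasource for a time range. The datasource name and interval are hypothetical. As far as I know, a kill task only deletes segments that have already been marked unused (for example by disabling the datasource through the coordinator), so treat this as an outline rather than a complete procedure.

```python
import json


def kill_task_spec(datasource, interval):
    """Build a kill task that deletes unused segments in the given interval."""
    return {
        "type": "kill",
        "dataSource": datasource,
        "interval": interval,
    }


# Hypothetical: drop everything indexed for one parameter value.
spec = kill_task_spec("algo_output_param_0_5", "2016-01-01/2017-01-01")
print(json.dumps(spec, indent=2))
```

No rows are read or rewritten here, which is why keeping each parameter value in its own datasource makes dropping it cheap. Removing the same rows from a shared datasource would instead mean reindexing with a filter.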

Thanks a lot Gian. Very helpful explanation. Sounds like a very useful feature.

I’d like to work on datasource-delegation in Druid, and at the same time I’m seeing work on virtual columns and learning about the union datasource feature. These features seem connected in that it is often desirable to keep things transparent to a client. I would like to be able to configure a column to be either a derived column materialized by on-the-fly computation or an actual column based on raw data, without the client knowing which. Likewise, I want to be able to have peer datasources, created by re-indexing a master datasource with only a subset of dimensions retained, to speed up queries, with the broker always directing a query to the datasource that can serve it fastest. There is no need for a client to know about that: it should just fire a query at one virtual datasource name and not have to care how Druid pieces the result set together from specialized datasources.
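To make the delegation idea a bit more tangible, here is a toy sketch of the routing decision. It is purely hypothetical and not an existing Druid feature or API: the function name, the datasource names, and the heuristic "fewest dimensions that still cover the query" are all assumptions for illustration.

```python
def choose_datasource(candidates, query_dimensions):
    """Pick the narrowest datasource that contains every queried dimension.

    candidates maps a datasource name to the list of dimensions it retains.
    """
    needed = set(query_dimensions)
    covering = [
        (name, dims) for name, dims in candidates.items() if needed <= set(dims)
    ]
    if not covering:
        raise ValueError("no datasource covers dimensions: %s" % sorted(needed))
    # Fewer retained dimensions is used as a stand-in for "faster to query";
    # ties are broken by name so the choice is deterministic.
    return min(covering, key=lambda item: (len(item[1]), item[0]))[0]


# Hypothetical master datasource and one reduced peer datasource.
candidates = {
    "events_master": ["country", "device", "campaign"],
    "events_by_country": ["country"],
}
print(choose_datasource(candidates, ["country"]))
print(choose_datasource(candidates, ["device"]))
```

A real implementation would also have to take time ranges, granularities and available metrics into account.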

The union query seems to fit right into this paradigm, as I might want to keep certain subsets of data in a separate datasource so that I can painlessly delete that data after a while, but this should not force every client to know about my data strategy and play along. So again, being able to make this transparent to a client querying Druid seems like a good thing.

As I mentioned, I would like to see the datasource-delegation feature in Druid, and it is on our near-term roadmap to implement it. It would be nice to know whether there is interest in that feature on the committers’ end as well. If so, maybe we could discuss possible designs in the dev channel?