How to handle data source refresh


We have a use case where the dimensions of a data source are changed and the entire data source needs to be re-ingested to account for the new dimensions. While re-ingestion is running for a data source, the data source itself is offline until the re-ingestion is complete.

What’s the best way to handle this? I’d like to keep the data source online while the re-ingestion is happening. Are there ingestion features that I can use to keep the data source online during reindexing?

I’ve worked a bit on view support in Druid SQL. In the SQL case, we can solve this by registering a view that references different versions of the underlying data source (e.g., create view data_src as select * from data_src_v1). Could we similarly look at implementing ‘alias’ support for data source names, so that we have a comparable solution for JSON queries as well? What do folks here think?
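
To make the alias idea concrete, here is a rough client-side sketch in Python. It is not an existing Druid feature: the `ALIASES` table and the `resolve_alias` function are hypothetical names, and the only thing assumed about Druid is that a native JSON query names its data source in the `dataSource` field, either as a plain string or as a `{"type": "table", "name": ...}` object.

```python
import copy

# Hypothetical alias table: logical name -> physical data source currently serving queries.
ALIASES = {"data_src": "data_src_v1"}


def resolve_alias(query, aliases=ALIASES):
    """Return a copy of a native JSON query with its dataSource alias resolved."""
    resolved = copy.deepcopy(query)  # leave the caller's query untouched
    ds = resolved.get("dataSource")
    if isinstance(ds, str):
        # Plain string form: "dataSource": "data_src"
        resolved["dataSource"] = aliases.get(ds, ds)
    elif isinstance(ds, dict) and ds.get("type") == "table":
        # Object form: "dataSource": {"type": "table", "name": "data_src"}
        ds["name"] = aliases.get(ds["name"], ds["name"])
    return resolved


query = {
    "queryType": "timeseries",
    "dataSource": "data_src",
    "granularity": "all",
    "intervals": ["2016-01-01/2016-02-01"],
    "aggregations": [{"type": "count", "name": "rows"}],
}
print(resolve_alias(query)["dataSource"])
```

Switching the alias entry to `data_src_v2` once the new version is fully ingested would then redirect all queries in one step, which is the same effect the SQL view gives.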



Not sure what you mean by the datasource going offline.
FWIW, you can reindex data using batch ingestion and still run queries against the existing older segments at the same time. Once the batch ingestion completes, new segments are created and loaded on the historical nodes. Older segments are dropped only after the newer-version segments are loaded, so at any point during ingestion you can query your old data.
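
As a rough illustration of that behaviour, here is a simplified Python model, not Druid's actual implementation. It assumes each segment is described by an interval, a version string, and a flag saying whether it is loaded, and it ignores partial interval overlaps and multi-partition segments. The function name `visible_versions` and the sample data are made up for the example.

```python
def visible_versions(segments):
    """Map each interval to the highest version that is already loaded."""
    visible = {}
    for seg in segments:
        if not seg["loaded"]:
            continue  # a segment that is not loaded yet cannot replace anything
        current = visible.get(seg["interval"])
        # Versions are compared as strings in this simplified model.
        if current is None or seg["version"] > current:
            visible[seg["interval"]] = seg["version"]
    return visible


segments = [
    {"interval": "2016-01-01/2016-01-02", "version": "v1", "loaded": True},
    {"interval": "2016-01-01/2016-01-02", "version": "v2", "loaded": False},
    {"interval": "2016-01-02/2016-01-03", "version": "v1", "loaded": True},
    {"interval": "2016-01-02/2016-01-03", "version": "v2", "loaded": True},
]
print(visible_versions(segments))
```

In this model the first interval keeps serving the old version until the new one is loaded, while the second has already switched over.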

Thanks Nishant.

I spoke to our internal user - here is a better explanation of the scenario:

  • We are changing dimension sets, value bucketing, or other aggregate features, and so need to re-ingest the entire data set.

  • The input data for this data source is very large (~17 billion rows). A single Hadoop batch ingestion for the entire dataset does not complete successfully; the job fails after running for more than 24 hours. Therefore we break it up into a few chunks by time window and submit one Hadoop ingestion job for each window.

  • Prior to re-ingestion, we have to delete the datasource first (by issuing an HTTP DELETE on the datasource REST object, which I believe removes the segment metadata). The reason is that if we don’t, and only some of the Hadoop ingestion jobs above succeed while others fail, we end up in a weird intermediate state with old and new data mixed.
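
For reference, the time-window chunking described above can be sketched roughly like this in Python. The function name `split_interval`, the date range, and the window size are placeholders rather than our real setup; the only assumption about Druid is that intervals are written as ISO 8601 `start/end` strings. Each resulting string would be the interval for one ingestion job.

```python
from datetime import datetime, timedelta


def split_interval(start, end, days):
    """Split [start, end) into consecutive windows of at most `days` days."""
    windows = []
    step = timedelta(days=days)
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)  # the last window may be shorter
        windows.append(f"{cursor.isoformat()}/{upper.isoformat()}")
        cursor = upper
    return windows


# Placeholder range and window size, purely for illustration.
for window in split_interval(datetime(2016, 1, 1), datetime(2016, 1, 10), days=4):
    print(window)
```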

Are there other ways that I am not aware of to address this scenario while still keeping the datasource online?