Is there a way to rename data sources?

We do batch ingestion in our cluster, and currently we drop the whole datasource and re-ingest every day. Is there a way to have a hot-swap kind of feature? Say I have a datasource called "visits"; can I ingest new data into "visits_new" and basically swap them? This way I can keep the downstream applications afloat at all times, without any downtime.

Any suggestions/recommendations to an alternate workflow are welcome too. Thanks in advance.

Hi Karthik,

Why is there a need to drop the datasource daily?
If you do the reingestion in batch mode, Druid’s atomic update mechanism means that queries will flip seamlessly from the old data to the new data.

Could you please explain the use case in more detail, in case I am missing something here?

Thanks and Regards,

Vaibhav

I think whether my new data will replace the existing segments or not will depend on the “appendToExisting” flag, right? And if I keep it as false, only the segments that already exist will be replaced, right?

Hi Karthik,

That’s right: if you have appendToExisting: false, then the new segments will have the same partition numbers as the old ones when they fall in the same interval (segment granularity).

This will overwrite the old segments. If it’s true, then the new segments will get higher partition numbers.
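For illustration only (this spec is not from the thread): here is a trimmed native batch “index_parallel” spec showing where appendToExisting sits. The datasource name, input location, and schema are placeholders, and the exact layout can vary between Druid versions.

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "visits",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["page", "user"] },
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "none"
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "local", "baseDir": "/data/visits", "filter": "*.json" },
      "inputFormat": { "type": "json" },
      "appendToExisting": false
    }
  }
}
```

With appendToExisting: false, segments for any interval covered by the new input replace the old segments for that interval; with true, the new segments are appended alongside the existing ones.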

Hope this helps.

Thanks,

Hemanth

Yeah, I get that. Here is my problem. Suppose my segment granularity is “day”, I ingest data for a month, and my initial load contained data only for the Mondays.

Now, if I submit another job, this time with data for all Sundays… With appendToExisting = false, after my second set is loaded, my datasource will contain data for both Mondays and Tuesdays… rightfully so, since there was no segment overlap… (had the new data contained data for both Mondays and Tuesdays, the Monday data would have been replaced and the Tuesday data created afresh)

I simply want the datasource to contain only the Tuesdays’ data after my second load…

The above is a very simple example, but our use case is a little more complex in the near term (in the long run, we are working towards ingesting event-like data so we won’t have to deal with segment manipulation).

When I said Sunday, I actually meant Tuesday. Sorry about that, I hadn’t had my coffee yet when I typed that.

??

Hi Karthik,

Since you need only Tuesday’s [T] data after your second load, while the datasource currently holds both Monday’s [M] and Tuesday’s [T] data, you may try the below:

  1. Filter your input data at the source itself so that you ingest only the data you want (i.e. T’s data).

  2. As you don’t want the M data any more, run a kill task for all the M segments, which will eventually DROP all the M segments (a sample kill task spec is sketched after the link below).

https://druid.apache.org/docs/latest/tutorials/tutorial-delete-data.html
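For illustration (not from the original post): a minimal kill task spec for one Monday’s interval might look like the following. The datasource name and interval are placeholders, and, as the linked tutorial explains, the targeted segments generally need to be marked as unused before a kill task will permanently remove them from deep storage and the metadata store.

```json
{
  "type": "kill",
  "dataSource": "visits",
  "interval": "2019-06-03/2019-06-04"
}
```

You would submit one such task (or one with a spanning interval) covering the Monday segments you want removed.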

Thanks,

Vaibhav

I think you could imagine modifying Druid’s batch ingestion so it behaves the way you describe (“holes” in the input data lead to drops from Druid, rather than retaining the existing Druid data). But, today, it doesn’t. And by the way, even if it did, the atomic replace would be day-by-day, not whole-datasource.

One interesting feature might be ‘datasource aliases’. Imagine a pointer that you could re-point to a different datasource. I think that would solve your problem nicely and be pretty straightforward to implement (store a list of aliases in the metadata store somewhere, cache them on brokers, and include a query runner on brokers that rewrites the query if an alias datasource is involved). If you are interested please raise a GitHub issue or even get your hands dirty with a patch.
