Is there a multi-ingest equivalent for native batch ingestion?

Hi there

I'm trying to re-ingest data that has already been loaded but has since changed for whatever reason.

I'm using "native batch" ingestion and unfortunately I have found no easy way to ingest changed data. Of course, I could drop the segments and load all the files again. But when only a couple of rows across several segments have changed, I don't want to reload everything.

Hadoop-based ingestion has the inputSpec type "multi", which is exactly the kind of delta ingestion I'm looking for.
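
For reference, the Hadoop delta ingestion I mean looks roughly like this (a sketch based on the docs; the interval and file paths are just placeholders): the "dataSource" child re-reads what is already in Druid, and the "static" child adds the delta files on top.

    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "multi",
        "children": [
          {
            "type": "dataSource",
            "ingestionSpec": {
              "dataSource": "FactCallSession",
              "intervals": ["2019-01-01/2019-02-01"]
            }
          },
          {
            "type": "static",
            "paths": "/path/to/deltafile1.json,/path/to/deltafile2.json"
          }
        ]
      }
    }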

Hence my question: is there an equivalent for native batch? And if not, how do you solve this issue?

I assume everyone runs into this scenario where the data, or the logic upstream of it, changes and these changes then need to be delta-loaded.

Thank you very much for your help or any tips,

Best Simon

Hey,

According to Druid's docs, native batch ingestion supports both append and overwrite modes (https://druid.apache.org/docs/latest/ingestion/index.html#batch).

I think you can achieve that using the IngestSegmentFirehose (https://druid.apache.org/docs/latest/ingestion/native-batch.html#ingestsegmentfirehose), so there's no need to keep all the ingested files (i.e. the input data) around.
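
Something along these lines (an untested sketch; "your_datasource" and the interval are placeholders, and the dataSchema would be whatever you used for the original ingestion): a re-index task that reads the existing segments back through the ingestSegment firehose, with appendToExisting left at false so the interval gets overwritten:

    {
      "type": "index",
      "spec": {
        "dataSchema": { "...": "same dataSchema as the original ingestion" },
        "ioConfig": {
          "type": "index",
          "firehose": {
            "type": "ingestSegment",
            "dataSource": "your_datasource",
            "interval": "2019-01-01/2019-02-01"
          },
          "appendToExisting": false
        },
        "tuningConfig": { "type": "index" }
      }
    }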

Am I missing something?

Hi Itai

Thank you so much for your help.
I tried this, but the problem is of course: if a dimension or metric value of a row has changed and I re-ingest it with the "combining" firehose, Druid doesn't know that the row already exists (since a dimension or metric has changed, and there is no primary key as such).
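
To be concrete, by that I mean a setup roughly like this (simplified; the interval and paths are placeholders), where the old segments and the changed files are read together:

    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "combining",
        "delegates": [
          {
            "type": "ingestSegment",
            "dataSource": "FactCallSession",
            "interval": "2019-01-01/2019-02-01"
          },
          {
            "type": "local",
            "baseDir": "/path/to/changed/files",
            "filter": "*.json"
          }
        ]
      },
      "appendToExisting": false
    }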

Example:

In the attached example, I added 3 rows with the first ingestion task.

The second ingestion task adds only one row, which has changed (this is the use case we have when, for example, some of the data changes). I changed the value of one existing metric in that row (factid=11062) and ingested with appendToExisting=false.

What I want to achieve is that Druid kicks out the existing row with factid=11062 (as all dimensions are the same) and re-ingests it with the new, changed metric.

But what happens is that Druid does nothing with this row: all 3 rows stay as before, and I'm not sure what Druid is doing internally.
When I look at the segments, it also appears that they are untouched.

Questions:
My understanding is that this case of changing rows is not what Druid is made for. Is that correct?

Is the correct approach always to drop the full segment and load all the data again? That way, I would only have the one new version of the changed rows.
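
In other words, a full overwrite like this (a sketch; the path is a placeholder), with the intervals set in the granularitySpec so that everything in them is replaced. The input then has to contain all rows for the interval, not only the changed ones:

    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/path/to/all/files",
        "filter": "*.json"
      },
      "appendToExisting": false
    }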

But of course the disadvantage I was trying to avoid is that I then need to reload all files again. I cannot, for example, set up Change Data Capture and ingest only the changed rows.

I also still need to keep the ingested files around so that, in case of changes, I can re-ingest all of them.

General handling of data changes

A follow-up question in the same direction: how do you handle changes in already ingested data? Probably as explained above?

But let's say you need to change a dimension from one value to another across the whole ingested data set; that would mean reloading all the data again, correct? Or how does Druid handle data changes?
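
For instance, I imagine a re-index with an expression transform could do such a renaming while re-reading the segments. A sketch of what I mean (the dimension "channel" and its values are made up), which would go into the dataSchema of a task using the ingestSegment firehose:

    "transformSpec": {
      "transforms": [
        {
          "type": "expression",
          "name": "channel",
          "expression": "if(channel == 'web_old', 'web', channel)"
        }
      ]
    }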

I hope I explained it understandably; otherwise, let me know.
Thanks so much for your advice and help.
Best Simon

FactCallSession_reindex_1.json (2.34 KB)

FactCallSession_reindex_2.json (1.75 KB)