Multiple batch ingestions for the same time interval replace existing data.

Hello,
I want to do batch ingestion into Druid multiple times over the same time interval, but every time I ingest data, the existing data gets replaced. It’s really freaking me out. Is there any way to change the replace-existing-data policy for batch ingestion? Help me.
Thanks
Rajnandini

Hi

That sounds like the desired behavior? If you re-index the same interval, that means you want to replace the old data, right?
Am I missing something?

Hello,
I don’t want to re-index the data. I want the data in each batch ingestion to be indexed without replacing the existing data (targeted at the same interval). What can I do in such a scenario?
Thanks
Rajnandini

Looks like you have multiple sources of data (multiple files) that you can’t ingest at the same time (assuming new data arrives periodically).

If all you need is to append new data, then use a delta ingestion job.

With delta ingestion, the job reads the data already present in the segments for the interval and adds the new data to it.

Read this page: http://druid.io/docs/latest/ingestion/update-existing-data.html
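A rough sketch of what such a delta ingestion spec can look like, assuming a Hadoop batch job; the datasource name, dimensions, interval, and input path below are all made up for illustration:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "mydata",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {"column": "timestamp", "format": "auto"},
          "dimensionsSpec": {"dimensions": ["tenant", "page"]}
        }
      },
      "metricsSpec": [{"type": "count", "name": "count"}],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2016-06-27/2016-06-28"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "multi",
        "children": [
          {
            "type": "dataSource",
            "ingestionSpec": {
              "dataSource": "mydata",
              "intervals": ["2016-06-27/2016-06-28"]
            }
          },
          {
            "type": "static",
            "paths": "/path/to/new/data.json"
          }
        ]
      }
    }
  }
}
```

The "dataSource" child reads back what is already stored for the interval, and the "static" child adds the new files, so the re-indexed segments end up containing both.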

Let me know if you have further questions.

Hello,
Delta ingestion may allow me to append new data, but it’s not feasible for me, as I have to provide the segment descriptions in the inputSpec (the overhead of one more REST call). Isn’t there any other way to append data to existing segments?
Thanks
Rajnandini

Hi,
I guess there are two options you can use:

  1. Ingest the separate sources of data as different datasources and use union queries to get an aggregated view over them (see the query sketch below).

  2. Use the new kafka-indexing-service for data ingestion, which allows you to ingest/append old data.
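For option 1, a union query can look roughly like this (the datasource names, interval, and aggregator are hypothetical; the unioned datasources need to have the same schema):

```json
{
  "queryType": "timeseries",
  "dataSource": {
    "type": "union",
    "dataSources": ["tenant_a_data", "tenant_b_data"]
  },
  "granularity": "hour",
  "intervals": ["2016-06-01/2016-06-02"],
  "aggregations": [
    {"type": "longSum", "name": "events", "fieldName": "count"}
  ]
}
```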

Does either of them work for you?

Hello,
I tried both of the options. The first option is not feasible, as I’m querying Druid using Caravel, which does not support unions of datasources. For the second option, we need to provide a rejection policy (I used the message-time rejection policy) to hand off data to deep storage. In this case, if data arrives from another tenant and falls into a segment that has already been handed off, the records are rejected (as segments are immutable).
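For reference, this is roughly the realtime tuningConfig I mean (the windowPeriod value here is just an example):

```json
{
  "type": "realtime",
  "windowPeriod": "PT10M",
  "rejectionPolicy": {"type": "messageTime"}
}
```

With messageTime, the window is measured against event timestamps rather than server time, but records that fall into an already-handed-off segment are still dropped.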
Thanks,
Rajnandini

Hi,
FWIW, the new kafka-indexing-service allows you to ingest delayed data, and there is no need to specify a windowPeriod, so it might help your use case.

Refer http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html for more info.

There’s also a tutorial here for trying the new Kafka indexing service: https://imply.io/docs/latest/tutorial-kafka-indexing-service.html
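A minimal sketch of a supervisor spec, assuming a hypothetical datasource, topic, and broker address; note that there is no windowPeriod or rejectionPolicy anywhere in it:

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "mydata",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {"column": "timestamp", "format": "auto"},
        "dimensionsSpec": {"dimensions": ["tenant", "page"]}
      }
    },
    "metricsSpec": [{"type": "count", "name": "count"}],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "NONE"
    }
  },
  "ioConfig": {
    "topic": "mytopic",
    "consumerProperties": {"bootstrap.servers": "localhost:9092"},
    "taskCount": 1,
    "replicas": 1,
    "taskDuration": "PT1H"
  }
}
```

You submit this spec to the overlord (see the docs above for the exact endpoint), and the supervisor then manages Kafka indexing tasks that can create segments even for intervals whose earlier segments have already been handed off.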

Hello,
Actually, I’m less concerned about the window period. All I want to know about is the rejectionPolicy, since the rejectionPolicy decides when a segment is handed off to deep storage. I think my requirement is not feasible, as it effectively asks for mutable segments.
Thanks,
Rajnandini

Hi,
For the Kafka indexing service, the supervisor, when stopped, starts publishing its segments. What does that mean? Is it that the segments will be handed off to deep storage and become immutable?
Thanks,
Rajnandini

Hi Rajnandini,

The Kafka indexing service meets your requirement, and you don’t need to specify anything special for that.

You can ingest data for intervals whose segments have already been handed off to deep storage.

The documentation for the Kafka indexing service is self-explanatory, but if you still have any questions, let me know: http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html

Regards,

Arpan Khagram
