Batch Ingestion Data Loss?

Hi Team,

I tried more batch ingestion from Hadoop yesterday and there seemed to be some data loss, even though the Overlord console page shows all the tasks succeeded. I fired 10 tasks all at once, each with about 1 GB of data to process, and they were all ingesting into the same datasource. I’m not sure if that caused the issue.

I’m splitting the work up like this because a single task fails if I try to ingest the whole dataset at once; the dataset is too large.

Is this a known issue?


Hi Qi, how are you verifying that there is a loss of data?

Do the tasks overlap in their intervals at all? Batch ingestion tasks are replace-by-interval, so if you run two tasks for the same interval, then the later one will replace data from the earlier one.

Fwiw, it’s fine to ingest a lot of data at once with the Hadoop indexing task. One task can scale out to use as much capacity as you have in your Hadoop cluster.
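To illustrate the replace-by-interval point, here is a minimal sketch of the relevant part of a Hadoop indexing task spec. The datasource name and dates are placeholders; the idea is that each of the 10 tasks should cover a distinct, non-overlapping interval in its `granularitySpec`, otherwise a later task clobbers an earlier one's segments for the shared interval.

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "intervals": ["2015-01-01/2015-01-02"]
      }
    }
  }
}
```

A second task with `"intervals": ["2015-01-02/2015-01-03"]` would be safe to run alongside this one; a second task that also declared `2015-01-01/2015-01-02` would replace the first task's data for that day, which would look exactly like data loss even though both tasks report success.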

Hi Fangjin,

Yes, the intervals do overlap. I think what Gian said is exactly the cause.


Oh, I didn’t know that. I think that’s exactly the reason! Thanks!