I want to ingest 10 TB of data using Hadoop batch ingestion.
I have 1 TB of data and I want to duplicate it 10 times to reach 10 TB by using Hadoop batch ingestion.
However, Druid overwrites the existing data in the data source (using “appendToExisting”:“true” does not work).
(1) How can I duplicate my data in Druid?
(2) Can I duplicate segments in Druid?
Check out the “multi” inputSpec type at http://druid.io/docs/0.10.1/ingestion/update-existing-data.html.
It allows you to combine an existing segment with additional source data. Using this, you could iteratively perform delta aggregation of your source data set over and over again.
Someone else also posted an example of a combining firehose, which should be another way to do this: https://groups.google.com/d/msg/druid-user/1l5s5ZXM6oM/zlbR0AKdBgAJ
Thank you Kyle,
How can I do it with Hadoop Batch ingestion without Firehose?
Is it the same, or do I have to modify more parameters?
My mistake, I think firehoses can only be used in realtime nodes – you’ll have to experiment with the functionality defined on the “update-existing-data” documentation page.
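As a starting point for that experiment, here is a rough sketch of what the Hadoop "ioConfig" with a "multi" inputSpec might look like, written as a small Python helper that prints the JSON. The shape (a "dataSource" child for the existing segments plus a "static" child for the raw files) follows my reading of the "update-existing-data" page. The helper name, the data source name, the interval, and the HDFS path are all made up for illustration, so check the result against the documentation before using it.

```python
import json


def build_multi_io_config(datasource, intervals, new_data_paths):
    """Hypothetical helper: build a Hadoop ioConfig that reads the existing
    segments of `datasource` and re-reads the raw files in one job."""
    return {
        "type": "hadoop",
        "inputSpec": {
            "type": "multi",
            "children": [
                {
                    # Existing Druid segments for the given intervals.
                    "type": "dataSource",
                    "ingestionSpec": {
                        "dataSource": datasource,
                        "intervals": intervals,
                    },
                },
                {
                    # Raw source files to combine with those segments.
                    "type": "static",
                    "paths": new_data_paths,
                },
            ],
        },
    }


if __name__ == "__main__":
    # All values below are placeholders, not taken from the thread.
    io_config = build_multi_io_config(
        datasource="my_datasource",
        intervals=["2017-01-01T00:00:00.000Z/2017-02-01T00:00:00.000Z"],
        new_data_paths="hdfs:///path/to/raw/data/",
    )
    print(json.dumps(io_config, indent=2))
```

The printed object would go in the "ioConfig" section of the Hadoop index task spec. One caveat to keep in mind: if rollup is enabled, rows with the same timestamp and dimension values will probably be merged and their metrics aggregated, so re-ingesting identical data this way may not multiply the stored row count.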