I have to load a year's worth of data, spanning 1500+ files, into Druid. I was using Hadoop batch ingestion for this, and after querying my data I realized that the newly ingested segments were replacing the previously ingested segments for the same interval. After reading previous posts on this group about appending segments instead of replacing them, I came across two methods I can use:
- Ingestion through kafka-indexing-service: Most of the answers on this group recommended this option. I tried it and it works on a single machine.
Q 1a: Does the kafka-indexing-service provide the same ingestion performance as Hadoop batch ingestion?
Q 1b: Are there any configs I can change to make the Kafka ingestion faster?
Q 1c: The command I use to load data onto a Kafka topic basically pipes in a local file. Can I use the cat command to pipe a file from a GCS bucket instead (see the sketch right after these questions)? Is this the best way to load data into Kafka from a GCS bucket? Any suggestions are welcome.
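For reference, this is roughly the pipe I have in mind; the bucket path, broker address, and topic name below are just placeholders for my setup:

    # stream a file from the GCS bucket straight into the Kafka topic,
    # without downloading it to local disk first
    gsutil cat gs://my-bucket/2017/part-0001.json \
      | ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic druid-events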
- Delta ingestion: I am a little confused about this method. The example shown at http://druid.io/docs/latest/ingestion/update-existing-data.html refers to reading data that already exists inside Druid. I am not sure how to use this option when reading data from files for the first time. Basically, I want the result of a proper Hadoop batch ingestion, but appending to the previous segments instead of replacing them.
Q 2a: I would really appreciate it if someone could share their setup for delta ingestion that reads from files instead of datasources (my rough guess at an ioConfig is sketched below).
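To make Q 2a concrete, here is my rough understanding of how the multi inputSpec from that doc page would be combined with file paths; the datasource name, interval, and gs:// path are placeholders, and I am not sure this is even the right way to use it when the files are being read for the first time:

    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "multi",
        "children": [
          {
            "type": "dataSource",
            "ingestionSpec": {
              "dataSource": "my_datasource",
              "intervals": ["2017-01-01/2017-02-01"]
            }
          },
          {
            "type": "static",
            "paths": "gs://my-bucket/2017/january/*.json"
          }
        ]
      }
    }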
Q 3: I am assuming that all the segment intervals that got overwritten during Hadoop batch ingestion are inaccessible now. That is to say, there is no config change I can make on the Historical or any other node to read the entire data instead of only the last updated version. Am I wrong?
Q 4: How are you ingesting large amounts of historical data into Druid?