Previous segments get overwritten during Hadoop batch ingestion. Need help with options

I have to load a year's worth of data, spanning 1500+ files, into Druid. I was using Hadoop batch ingestion for this, and after querying my data I realized that newly ingested segments were replacing the previously ingested segments for the same interval. After reading previous posts on this group about appending segments instead of replacing them, I came across two methods I can use:

  1. Ingestion through the Kafka indexing service: Most of the answers on this group recommended this option. I tried it and it works on a single machine.

Q 1a: Does the Kafka indexing service provide the same ingestion performance as Hadoop batch ingestion?

Q 1b: Are there any configs I can change to help Kafka ingest more data, faster?

Q 1c: The command I use to load data onto a Kafka topic basically pipes in a local file. Can I use a cat-style command to pipe a file from a GCS bucket? Is this the best way to load data into Kafka from a GCS bucket? Any suggestions are welcome.
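For reference, what I had in mind is roughly the following (the bucket, file, and topic names are just placeholders, and it assumes gsutil and the Kafka console producer script are available on the machine):

    gsutil cat gs://my-bucket/2017/file-0001.json | ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-topic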

  2. Delta ingestion: I am a little confused about this method. The example shown at http://druid.io/docs/latest/ingestion/update-existing-data.html refers to reading data that already exists inside Druid. I am not sure how to use this option when reading data from files for the first time. Basically, I want to achieve the results of a proper Hadoop batch ingestion, but by appending to the previous segments instead of replacing them.

Q 2a: I would really appreciate it if someone could share their setup for delta ingestion that reads from files instead of a datasource.

Q 3: I am assuming that all the segment intervals that got overwritten during Hadoop batch ingestion are inaccessible now. That is to say, there is no config change I can make on the Historical or any other node to read the entire data set instead of only the last update. Am I wrong?

Q 4: How are you ingesting large amounts of historical data into Druid?

Thanks!

Hey Ankit,

Responding at a high level:

  • If you’re doing a one-shot load of the 1500 files into a new (no previous data) datasource AND you already have a Hadoop cluster to work with, I would suggest just running a single Hadoop job that handles the whole batch at once, which will avoid any overwrite issues (there’s a rough spec sketch for this after the list).

  • If the situation is the same as above but you’re ingesting into an existing datasource with previous data, you would follow the instructions here: http://druid.io/docs/latest/ingestion/update-existing-data.html and use a ‘multi’ inputSpec with one ‘dataSource’ inputSpec and another ‘static’ (or other file-type) inputSpec that reads your 1500 new files. This will create a new set of segments that replaces the existing ones, but the new segments will contain all the previous data plus the new data. The example on that page is helpful for understanding how to read existing segments and new files in a single indexing job (see the second sketch after the list).

  • If, after the initial load, you’re going to periodically have additional files that you want added to the datasource, I’d recommend looking into native batch ingestion running in append mode (setting ‘appendToExisting’ to true in the ioConfig). This will add the new data to the datasource without replacing the existing data (see the third sketch after the list).

  • From what you’ve described, using Kafka indexing doesn’t seem necessary; however, if you are working with real-time data or getting frequent updates that you’d like reflected quickly in Druid, it would be something interesting to consider. I’ll defer responding to your specific questions on Kafka indexing unless it’s something you decide you’d like to pursue instead of (or in addition to) the above methods.
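To make the first option concrete, here is a rough sketch of an ‘index_hadoop’ task spec that covers the whole year in a single job. The datasource name, paths, timestamp column, interval, and granularities are all placeholders, and reading straight from GCS assumes the GCS connector is on your Hadoop classpath:

    {
      "type": "index_hadoop",
      "spec": {
        "dataSchema": {
          "dataSource": "my_datasource",
          "parser": {
            "type": "hadoopyString",
            "parseSpec": {
              "format": "json",
              "timestampSpec": { "column": "timestamp", "format": "auto" },
              "dimensionsSpec": { "dimensions": [] }
            }
          },
          "metricsSpec": [ { "type": "count", "name": "count" } ],
          "granularitySpec": {
            "segmentGranularity": "DAY",
            "queryGranularity": "NONE",
            "intervals": [ "2017-01-01/2018-01-01" ]
          }
        },
        "ioConfig": {
          "type": "hadoop",
          "inputSpec": {
            "type": "static",
            "paths": "gs://my-bucket/year-data/part-0001.json,gs://my-bucket/year-data/part-0002.json"
          }
        },
        "tuningConfig": { "type": "hadoop" }
      }
    }

Because one job covers the entire year’s interval, all the segments for that interval are written together and nothing gets overwritten. Note that ‘paths’ is a comma-separated list of files, so you may need to generate that list from your 1500 files.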
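For the second option (existing datasource plus new files), the interesting part is the ioConfig; the rest of the spec looks like the one above. Again just a sketch with placeholder names, following the update-existing-data doc:

    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "multi",
        "children": [
          {
            "type": "dataSource",
            "ingestionSpec": {
              "dataSource": "my_datasource",
              "intervals": [ "2017-01-01/2018-01-01" ]
            }
          },
          {
            "type": "static",
            "paths": "gs://my-bucket/new-files/new-0001.json,gs://my-bucket/new-files/new-0002.json"
          }
        ]
      }
    }

The ‘dataSource’ child reads the segments Druid already has for that interval, the ‘static’ child reads the new files, and the job writes a fresh set of segments containing both.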
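And for the third option, append mode with the native batch (‘index’) task would look roughly like this in the ioConfig (again only a sketch; the local firehose is just one example of where the new files could live):

    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/path/to/new/files",
        "filter": "*.json"
      },
      "appendToExisting": true
    }

With ‘appendToExisting’ set to true, the new data is added alongside the existing segments for the same intervals instead of replacing them.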

Hope this helps!

David

Thanks a lot for clearing things up.