I’m setting up Druid on an EMR cluster and successfully ingesting Parquet files from an S3 bucket using Hadoop batch ingestion.
Now I want to automate this ingestion so it runs every day for a different file (in the same S3 bucket) with the same configuration. Is this possible, or would I need to do it manually every day?
You can periodically POST task specs to the Overlord API to submit ingestion tasks, for example from a cron job: https://druid.apache.org/docs/latest/operations/api-reference.html#overlord
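As a minimal sketch of that approach: the script below builds a Hadoop batch ingestion spec whose input path is parameterized by date, and POSTs it to the Overlord's task endpoint (`/druid/indexer/v1/task`, default port 8090). The bucket name and the date-partitioned folder layout (`s3://my-bucket/events/YYYY-MM-DD/`) are hypothetical; substitute your own paths, and copy the `dataSchema`/`tuningConfig` from the spec you are already using.

```python
import json
from datetime import date, timedelta

# Assumed Overlord endpoint; adjust host/port for your EMR cluster.
OVERLORD_URL = "http://localhost:8090/druid/indexer/v1/task"


def build_task_spec(day):
    """Build a Hadoop batch ingestion spec pointing at one day's folder."""
    # Hypothetical date-partitioned bucket layout.
    path = f"s3://my-bucket/events/{day.isoformat()}/"
    return {
        "type": "index_hadoop",
        "spec": {
            "ioConfig": {
                "type": "hadoop",
                "inputSpec": {"type": "static", "paths": path},
            },
            # dataSchema / tuningConfig omitted for brevity:
            # reuse the ones from your existing working spec.
        },
    }


def submit(day):
    """POST the spec to the Overlord; needs `requests` and a reachable cluster."""
    import requests

    resp = requests.post(OVERLORD_URL, json=build_task_spec(day))
    resp.raise_for_status()
    return resp.json()["task"]  # the Overlord returns the task id


if __name__ == "__main__":
    # e.g. run from cron shortly after midnight to ingest yesterday's folder
    print(json.dumps(build_task_spec(date.today() - timedelta(days=1)), indent=2))
```

A crontab entry like `15 0 * * * python /opt/druid/submit_daily.py` would then submit yesterday's folder every night, assuming the files for a day are complete by then.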
The only catch is that you need a way to tell Druid which files in the folder are new, since you probably don't want to re-ingest the old files over and over again.
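If the bucket is not already partitioned by date, one simple option is to keep a small local state file of keys you have already submitted and diff each S3 listing against it. The sketch below takes the listing as a plain argument so the logic is testable; in practice you would feed it the keys from a boto3 `list_objects_v2` call. The state-file path is a hypothetical choice.

```python
import json
from pathlib import Path

# Hypothetical local state file recording keys already submitted to Druid.
STATE_FILE = Path("ingested_keys.json")


def new_keys(all_keys, state_file=STATE_FILE):
    """Return keys not yet ingested, and record them as done.

    `all_keys` is the current listing of the S3 folder (e.g. the `Key`
    fields from a boto3 list_objects_v2 response), injected here so the
    diffing logic can be exercised without AWS access.
    """
    seen = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    fresh = sorted(k for k in all_keys if k not in seen)
    # Persist the union so the next run skips everything seen so far.
    state_file.write_text(json.dumps(sorted(seen | set(fresh))))
    return fresh
```

Each daily run would then list the folder, call `new_keys`, and submit one ingestion task covering just the fresh files. Keeping the state in a durable place (S3 itself, or a small database) rather than on a single host would survive instance replacement.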