Continuous Batch Ingestion from S3

We have many scenarios where we want to batch ingest from an S3 path every time a new partition is created. A partition (a folder path in S3) is created once an hour by Spark jobs. We thought of creating an ingestion spec and submitting it to Druid every hour, but I am wondering if there is a better way to do this. I saw a blog post that mentioned using Apache Beam, but I don't know the details. Ideally I would turn the S3 batch ingestion into streaming ingestion, mirroring the streaming we already do from Kafka. Any suggestions on architecting a better batch ingestion solution, which tooling to use, etc.?
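For concreteness, here is a rough sketch of what we had in mind submitting each hour: a native batch (`index_parallel`) spec pointed at the new hour's S3 prefix, POSTed to the Overlord task API. The bucket, datasource, column names, and Overlord host below are placeholders.

```python
# Sketch: build an index_parallel spec for one hourly S3 partition and submit
# it to the Druid Overlord. Bucket, datasource, and column names are placeholders.
import requests

OVERLORD_TASK_URL = "http://overlord:8090/druid/indexer/v1/task"  # placeholder host

def hourly_spec(hour: str) -> dict:
    """hour is e.g. '2024-05-01-13', matching the folder the Spark job wrote."""
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": "events",
                "timestampSpec": {"column": "ts", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["user_id", "country"]},
                "granularitySpec": {
                    "segmentGranularity": "hour",
                    "queryGranularity": "none",
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {
                    "type": "s3",
                    "prefixes": [f"s3://my-bucket/events/dt={hour}/"],
                },
                "inputFormat": {"type": "parquet"},
                "appendToExisting": False,
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }

resp = requests.post(OVERLORD_TASK_URL, json=hourly_spec("2024-05-01-13"), timeout=30)
resp.raise_for_status()
print(resp.json())  # returns the task id, e.g. {'task': 'index_parallel_events_...'}
```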

Thanks,

Alon Becker

Hi Alon,

Submitting an hourly ingestion spec will work and is a perfectly reasonable solution for this scenario. A better option, though, is to have your Spark jobs write their results to Kafka and connect that Kafka topic to Druid. That removes the extra hourly submission step entirely and makes your data available in near real time.
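With that approach the Druid side becomes a one-time Kafka supervisor spec rather than an hourly task. A minimal sketch (topic, broker, datasource, and column names are placeholders):

```python
# Sketch: submit a Kafka supervisor spec to the Druid Overlord once; the
# supervisor then ingests continuously from the topic the Spark jobs write to.
# Topic, broker, datasource, and column names are placeholders.
import requests

OVERLORD_SUPERVISOR_URL = "http://overlord:8090/druid/indexer/v1/supervisor"  # placeholder host

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "events",
            "timestampSpec": {"column": "ts", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "country"]},
            "granularitySpec": {"segmentGranularity": "hour", "queryGranularity": "none"},
        },
        "ioConfig": {
            "type": "kafka",
            "topic": "events",
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
            "inputFormat": {"type": "json"},
            "useEarliestOffset": False,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

resp = requests.post(OVERLORD_SUPERVISOR_URL, json=supervisor_spec, timeout=30)
resp.raise_for_status()
```

On the Spark side, the hourly jobs would then write their output to that topic (for example with Spark's Kafka sink, `df.write.format("kafka")`) instead of, or in addition to, S3.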


Hey @Alon_Becker, the Community Stories on Airflow might help:
https://www.druidforum.org/tag/airflow

You should also be able to go from there into the Spark topics to see how other people are doing it…
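If you do stay with hourly batch, the common pattern in those threads is an Airflow DAG that submits the spec on a schedule. A rough sketch, assuming Airflow 2.x (the DAG id, Overlord URL, bucket, and spec details are placeholders):

```python
# Rough sketch of an hourly Airflow DAG (Airflow 2.x assumed) that submits a
# Druid batch ingestion task for the latest S3 partition. Overlord URL, bucket,
# datasource, and column names are placeholders.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

OVERLORD_TASK_URL = "http://overlord:8090/druid/indexer/v1/task"  # placeholder host


def submit_hourly_spec(**context):
    # data_interval_start marks the hour the upstream Spark job produced.
    hour = context["data_interval_start"].strftime("%Y-%m-%d-%H")
    spec = {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": "events",
                "timestampSpec": {"column": "ts", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["user_id", "country"]},
                "granularitySpec": {"segmentGranularity": "hour", "queryGranularity": "none"},
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {"type": "s3", "prefixes": [f"s3://my-bucket/events/dt={hour}/"]},
                "inputFormat": {"type": "parquet"},
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }
    resp = requests.post(OVERLORD_TASK_URL, json=spec, timeout=30)
    resp.raise_for_status()


with DAG(
    dag_id="druid_hourly_s3_ingest",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="submit_ingestion_spec", python_callable=submit_hourly_spec)
```

(The Airflow Druid provider also ships an operator for submitting indexing specs, if you'd rather not call the Overlord API by hand.)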


I do like this solution better. It also simplifies ingestion spec management, since we would only manage Kafka ingestions, and it allows for simple GitOps management of the ingestion specs.

Thanks! I will take a look at the community stories using Airflow.
