Configuration for large data ingestion from S3


Does anyone ingest large amounts of data from AWS S3 into Druid offline, on a daily/hourly/… schedule?

If anyone can share how they're doing it, that would be great!

(I tried doing it with EMR, but it runs very slowly and the cluster stays up for a very long time.)

Can you check the interval for each task? If the intervals overlap, the ingestion tasks will run sequentially. I had the same issue, and I got better results after I made the intervals more granular (hourly instead of daily).
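
To illustrate the idea of granular, non-overlapping intervals, here is a minimal Python sketch that splits one day into 24 hourly intervals, one per ingestion task. The function name split_interval and the example date are just placeholders I made up, and the strings use the usual ISO 8601 "start/end" form; adjust it to however you actually generate your task specs.

```python
from datetime import datetime, timedelta


def split_interval(start, end, step=timedelta(hours=1)):
    """Split [start, end) into consecutive, non-overlapping "start/end" strings.

    Hypothetical helper for illustration only; not part of Druid.
    """
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    intervals = []
    cursor = start
    while cursor < end:
        # Each interval ends exactly where the next one begins, so none overlap.
        upper = min(cursor + step, end)
        intervals.append(f"{cursor.isoformat()}/{upper.isoformat()}")
        cursor = upper
    return intervals


if __name__ == "__main__":
    day = datetime(2020, 1, 1)  # example date, an assumption
    for interval in split_interval(day, day + timedelta(days=1)):
        print(interval)
```

Each printed string could then go into the intervals list of a separate task, so that no two tasks cover the same time range.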

You can also improve performance by explicitly setting numShards in your partition spec. (