Does anyone ingest large amounts of data from AWS S3 into Druid offline (daily/hourly/…)?
If anyone can share how they're doing it, that would be great!
(I tried doing it with EMR, but it runs very slowly and the cluster stays up for a very long time.)
Can you check the interval for each task? If the intervals overlap, the ingestion tasks will run sequentially. I had the same issue; I got better results after I made the intervals more granular (hourly instead of daily).
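For reference, the granularity is controlled by the `granularitySpec` in the ingestion spec. A minimal sketch of an hourly setup (the datasource name and interval below are placeholders, not from the original post):

```json
{
  "dataSchema": {
    "dataSource": "my_datasource",
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "NONE",
      "intervals": ["2017-06-01T00:00:00Z/2017-06-01T01:00:00Z"]
    }
  }
}
```

Submitting one task per non-overlapping hourly interval lets the tasks run in parallel instead of queuing behind each other.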
You can also improve performance by explicitly setting `numShards` in your `partitionsSpec` (http://druid.io/docs/latest/ingestion/batch-ingestion.html).
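Setting `numShards` skips the pass that determines partitions from the data, at the cost of choosing the shard count yourself. A sketch of what that looks like in the `tuningConfig` (the value 8 is just an example; tune it to your data volume):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "partitionsSpec": {
      "type": "hashed",
      "numShards": 8
    }
  }
}
```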