I'm planning to ingest historical/batch data from S3. I have to ingest/index about one year's worth of data, at roughly 1.5 million events/day and ~100 KB/event. From the documentation and the group archives, my understanding is that there are two ways of doing this, but I have questions about creating segments.

1. Hadoop index task - Does this require a running EMR or Hadoop cluster? If so, I don't intend to spin up a separate cluster just to create segments.

2. Index task - The documentation says it isn't scalable. I presume index tasks are distributed, so scaling out the MiddleManagers / Overlord should make this scalable?

A few questions with respect to operations and monitoring:

a. What are the best practices for scheduling a recurring task that loads new data as it arrives in S3, e.g. "ingest every day just after midnight"?

b. I understand that the Coordinator console provides some cluster-wide information. Is this the only console for monitoring index tasks?

c. What fault-tolerance / recovery options are available for failed segment-creation tasks?
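For context, here is roughly what I was imagining for the recurring daily load: a small script (run from cron after midnight) that builds a native-batch ingestion spec for the previous day's S3 prefix and POSTs it to the Overlord. This is just a sketch of my understanding, not a tested setup; the data source name, bucket layout, dimension names, and Overlord URL below are all placeholders, and I'm assuming the `index_parallel` task type and the `POST /druid/indexer/v1/task` endpoint from the Druid docs.

```python
import json

def make_index_spec(day: str) -> dict:
    """Build a Druid native-batch (index_parallel) spec for one day's S3 data.

    `day` is an ISO date like "2023-01-01"; the spec's interval covers that
    single day. Bucket path, dataSource, and dimensions are hypothetical.
    """
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": "events",  # placeholder dataSource name
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                # Hypothetical dimensions; replace with the real schema.
                "dimensionsSpec": {"dimensions": ["user_id", "event_type"]},
                "granularitySpec": {
                    "segmentGranularity": "DAY",
                    "queryGranularity": "HOUR",
                    "intervals": [f"{day}/P1D"],
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {
                    "type": "s3",
                    # Assumes data is laid out one prefix per day.
                    "prefixes": [f"s3://my-bucket/events/{day}/"],
                },
                "inputFormat": {"type": "json"},
            },
            "tuningConfig": {
                "type": "index_parallel",
                # Fan out segment creation across MiddleManager task slots.
                "maxNumConcurrentSubTasks": 4,
            },
        },
    }

if __name__ == "__main__":
    spec = make_index_spec("2023-01-01")
    # Write the spec out; a cron job could then submit it, e.g.:
    #   curl -X POST -H 'Content-Type: application/json' \
    #        -d @spec.json http://overlord:8090/druid/indexer/v1/task
    print(json.dumps(spec, indent=2))
```

Does this kind of cron-driven spec submission match what people do in practice, or is there a more idiomatic way to schedule recurring batch loads?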