Indexing S3 Data with hadoop-task - what does it need?

Hi Everyone,

I'm planning to ingest historical (batch) data from S3. I need to index about one year's worth of data at roughly 1.5 million events/day, with each event around 100KB. From the documentation and from going through the groups, my understanding is that there are two ways of doing this. However, I have some questions about creating segments.

1. hadoop task - Does this have to have an EMR or Hadoop cluster running? If so, I don't intend to spawn another cluster to create segments.
2. Index-Task - The documentation says it's not scalable. I presume index tasks are distributed, so scaling out the middle managers / overlord should make this scalable?
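
For a sense of scale, here is a rough back-of-the-envelope calculation of the volumes described above. It is only a sketch: it assumes 1 KB = 1,000 bytes and a 365-day year (neither is stated in the post), and it measures raw input size, not the size of the resulting Druid segments.

```python
# Rough sizing of the raw input described above.
# Assumptions (not from the original post): 1 KB = 1,000 bytes, 365 days/year.
EVENTS_PER_DAY = 1_500_000
BYTES_PER_EVENT = 100 * 1000
DAYS = 365


def raw_bytes_per_day():
    """Raw input volume for a single day, in bytes."""
    return EVENTS_PER_DAY * BYTES_PER_EVENT


def raw_bytes_total():
    """Raw input volume for the whole backfill, in bytes."""
    return raw_bytes_per_day() * DAYS


if __name__ == "__main__":
    print(f"per day:  {raw_bytes_per_day() / 1e9:.0f} GB")    # per day:  150 GB
    print(f"per year: {raw_bytes_total() / 1e12:.2f} TB")     # per year: 54.75 TB
```

So the backfill is on the order of 150 GB of raw input per day, or roughly 55 TB for the year, which is why the scalability of the indexing path matters here.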

A few questions about operations and monitoring:
a.  What are the best practices for scheduling a recurring task that loads newer data as it arrives in S3? Say, 'ingest every day after midnight'.
b.  I understand that the coordinator console provides some cluster-wide information. Is this the only console for monitoring index tasks?
c.  What level of fault tolerance is provided, and what recovery options are available, for failed segment creation tasks?

Best Regards
Varaga

1. hadoop task - Does this have to have an EMR or Hadoop cluster running? If so, I don't intend to spawn another cluster to create segments.

Yes, this is intended for deployments that have an external Hadoop cluster running.

2. Index-Task - The documentation says it's not scalable. I presume index tasks are distributed, so scaling out the middle managers / overlord should make this scalable?

Currently, this “local index task” is not distributed; it runs in a single worker task. Work is underway to build a distributed local indexing task, though (https://github.com/druid-io/druid/pull/5492).

b.  I understand that the coordinator console provides some cluster-wide information. Is this the only console for monitoring index tasks?

The overlord console (on port 8090 by default) is the only built-in console for monitoring index tasks.
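
If you want to check on tasks from a script rather than the console, the overlord also exposes task status over HTTP. Below is a minimal sketch, not a definitive client: the helper names are made up for illustration, the host is whatever your overlord runs on, and the response shape (a nested "status" object whose "status" field holds a state such as RUNNING, SUCCESS or FAILED) is an assumption that should be checked against the docs for your Druid version.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def task_status_url(overlord_base, task_id):
    """Build the overlord URL for a single task's status."""
    base = overlord_base.rstrip("/")
    return f"{base}/druid/indexer/v1/task/{quote(task_id, safe='')}/status"


def extract_state(payload):
    """Read the task state from an already-parsed status response.

    Assumes a body like {"task": ..., "status": {"status": "SUCCESS", ...}};
    returns None if that field is missing.
    """
    return payload.get("status", {}).get("status")


def fetch_task_state(overlord_base, task_id, timeout=10):
    """Query the overlord over HTTP and return the task state, if present."""
    with urlopen(task_status_url(overlord_base, task_id), timeout=timeout) as resp:
        return extract_state(json.load(resp))
```

For example, fetch_task_state("http://localhost:8090", some_task_id) would be called with the task ID that the overlord returned when the task was submitted.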

Perhaps someone else can chime in on questions A and C.

Thanks,

Jon