I have a question about Kafka indexing task failover. Assume taskCount=2, replica=1, and these two tasks run on two different machines (A and B). If machine B crashes, which makes task B go away, would a new task B be restarted immediately on machine A before the end of taskDuration?
As per the capacity planning section in https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion,
one more parameter also plays a role here - druid.worker.capacity.
If enough capacity is available on machine A, the task should, in theory, be restarted there.
Are you seeing any issues with respect to that?
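To make the capacity planning point concrete, here is a rough sketch (plain Python, not Druid code) of the usual rule of thumb from the capacity planning docs: during a task rollover, reading tasks and publishing tasks can overlap, so you should plan for roughly twice replicas * taskCount worker slots across your middleManagers. The function name and the 2x factor are an illustrative assumption here; check the docs for your exact completionTimeout/taskDuration trade-off.

```python
# Hedged sketch: estimating worker slots needed for a Kafka supervisor.
# During rollover, reading tasks and publishing tasks may run at the
# same time, so plan for ~2 * taskCount * replicas total slots.
def required_worker_capacity(task_count: int, replicas: int) -> int:
    reading_tasks = task_count * replicas
    publishing_tasks = task_count * replicas  # worst case during rollover
    return reading_tasks + publishing_tasks

# taskCount=2, replica=1 as in the question above
print(required_worker_capacity(2, 1))  # 4 slots across all middleManagers
```

With only machine A left after a crash, task B restarts there only if A's remaining druid.worker.capacity can absorb it.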
I am trying to use an AWS auto-scaling group for the middleManager nodes. When it scales down and forces a running task to be killed, I am concerned about missing data. But as you said, the killed task can be restarted if there's sufficient
worker capacity, so it's safe to adopt AWS auto-scaling.
The reason I don't want to use Druid's built-in auto-scaling is that it is task based. Since taskCount is fixed in the spec file, I figure it couldn't handle spikes in throughput. Please correct me if I am wrong.
Just to re-confirm: the failed task could, in theory, be restarted soon, even with replica=1, before the taskDuration timeout.
You can continue to use AWS auto-scaling group. No issues with that.
Yes, the killed task could be restarted if there’s sufficient work capacity.
In Kafka ingestion, the druid_dataSource metadata table stores the offset up to which the stream has been read.
A restarted task continues reading from the next offset, based on the entry in the druid_dataSource table.
Ideally, the data should not have gaps.
Hope this helps.