Documentation on alerts is too brief

I'm working on monitoring a Druid cluster in my production environment. There seem to be two relevant event types, ServiceMetricEvent and AlertEvent. I could find detailed documentation on metrics (Link: http://druid.io/docs/latest/operations/metrics.html). But for alerts, the documentation is too brief (Link: http://druid.io/docs/latest/operations/alerts.html). Could you provide details on alerts, including all types and the conditions that trigger them? Thanks.

That's exactly what I am concerned about. Unfortunately, I don't have any good advice either. I hope some expert can help!

On Thursday, May 18, 2017 at 5:02:13 PM UTC+8, Liz Lin wrote:

Unfortunately the alerts haven't been documented. We also historically haven't treated them as a 'stable' API, so the specific wording or fields might change between versions. You might have better luck just sending them all to an email list or something like that. If you're working on automated monitoring, the HTTP info endpoints (coordinator/broker/overlord) are probably more useful. They'll help you figure out whether your data is fully loaded, whether any ingestion tasks are failing, that kind of thing. They're also stable from release to release, so they would work well with automation.
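
As a rough sketch of that kind of automation, the Python below polls the coordinator for load status and the overlord for completed tasks. The host names and helper function names are made up for illustration. The two paths and the response shapes (a datasource-to-percentage map, and task entries with "id" and "statusCode" fields) are assumptions that should be checked against the API documentation for your Druid version.

```python
import json
from urllib.request import urlopen

# Assumed host:port values; replace with your own coordinator and overlord.
COORDINATOR = "http://coordinator.example.com:8081"
OVERLORD = "http://overlord.example.com:8090"


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def not_fully_loaded(load_status):
    """Return the datasources whose loaded percentage is below 100.

    Assumes load_status maps datasource name -> percent loaded.
    """
    return sorted(ds for ds, pct in load_status.items() if pct < 100)


def failed_task_ids(complete_tasks):
    """Return the ids of completed tasks reported as FAILED.

    Assumes each entry has "id" and "statusCode" fields; the field
    names may differ between Druid versions.
    """
    return [t["id"] for t in complete_tasks if t.get("statusCode") == "FAILED"]


def check_cluster():
    """Poll both services and return (lagging datasources, failed task ids)."""
    load = fetch_json(COORDINATOR + "/druid/coordinator/v1/loadstatus")
    tasks = fetch_json(OVERLORD + "/druid/indexer/v1/completeTasks")
    return not_fully_loaded(load), failed_task_ids(tasks)
```

Calling check_cluster() from a cron job or a monitoring agent and alerting when either list is non-empty would be one way to use it.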

Ah, that is very useful information. We are working on better alerting and monitoring as well.

On that note: the overlord endpoints know the task states RUNNING, SUCCESS, and FAILED.
I would like to see a distinction made between a task that actually failed and one that didn't finish because it was explicitly cancelled, meaning I'd love to see a CANCELLED status as well.

The reason is that there is no automatic retry mechanism for failed batch ingestion tasks, so I need to build some automation that monitors the tasks and retries each failed task once.
However, an external process cannot tell whether a task failed because a person manually cancelled it for a good reason or because of some incident. I don't want to automatically reschedule tasks that were cancelled. Is it possible to distinguish between the two?
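
To make the automation I have in mind concrete, here is a minimal sketch of the retry-once decision. It assumes the set of cancelled task ids is recorded somewhere outside Druid (for example by a wrapper script that people use to kill tasks), which is exactly the part I cannot get from the overlord today. The function name and all three inputs are hypothetical.

```python
def tasks_to_retry(failed_ids, cancelled_ids, already_retried):
    """Pick the failed tasks that should be resubmitted.

    failed_ids      -- ids the overlord reports as FAILED
    cancelled_ids   -- ids known (from an external record, an assumption)
                       to have been cancelled on purpose
    already_retried -- ids that were resubmitted once before

    A task is retried only if it was not cancelled and has not been
    retried yet. Input order is preserved and duplicates are dropped.
    """
    skip = set(cancelled_ids) | set(already_retried)
    result = []
    for task_id in failed_ids:
        if task_id not in skip:
            result.append(task_id)
            skip.add(task_id)  # guard against duplicate ids in the input
    return result
```

With a real CANCELLED status, the cancelled_ids argument would not be needed, because the filter could be applied to the overlord response directly.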