We’d like to monitor directly for Kafka ingestion task failures. (We’ve had problems both with parse exceptions, which we’re catching by setting maxParseExceptions to 0, and with OOMs in some custom InputRowParser code.) Do any of the available metrics (http://druid.io/docs/latest/operations/metrics.html) tell me this? I don’t see anything direct.
OK, maybe this is related to alerting? http://druid.io/docs/latest/operations/alerts.html
I can’t find docs on how to configure alerts or a list of what alerts might show up, though.
I don’t think there’s any document currently that shows all the possible alerts; to find what will trigger alerts, you’ll need to search the code for EmitterLogger.makeAlert() calls.
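If it helps, here is a rough sketch of that code search, assuming you have a local checkout of the Druid source. The function name find_alert_calls and the checkout path are made up for illustration; the only thing taken from the thread is the makeAlert call name. It only does a line-by-line text match, so it will miss calls split across lines.

```python
import re
from pathlib import Path

# Matches "makeAlert(" with optional whitespace before the parenthesis.
ALERT_CALL = re.compile(r"\bmakeAlert\s*\(")


def find_alert_calls(source_root):
    """Return (path, line_number, stripped_line) for each line in a .java
    file under source_root that contains a makeAlert( call."""
    hits = []
    for path in sorted(Path(source_root).rglob("*.java")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if ALERT_CALL.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # "./druid" is a hypothetical path to a local source checkout.
    for path, lineno, line in find_alert_calls("./druid"):
        print(f"{path}:{lineno}: {line}")
```

The message string passed to each call is usually enough to tell what condition raises the alert.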
For task failures, it looks like private ListenableFuture<TaskStatus> attachCallbacks(final Task task, final ListenableFuture<TaskStatus> statusFuture) in TaskQueue may generate some useful metrics. When a task finishes, task/run/time is emitted, which will contain the task status code (SUCCESS or FAILURE) as a dimension. This is a regular metrics event though, and not an alert.
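Since it is a plain metrics event, one way to get failure monitoring out of it is to filter the emitted events yourself. Below is a minimal sketch, assuming the events reach you as JSON objects (for example from an HTTP emitter endpoint or a log file). The field names "metric", "taskStatus" and "taskType", and the sample events, are assumptions for illustration; check them against what your emitter actually produces. The exact failure string is not confirmed in this thread, so the sketch accepts both FAILED and FAILURE.

```python
# Status strings treated as failures. Which one Druid actually emits is an
# assumption here, so both spellings are accepted.
FAILURE_STATUSES = {"FAILED", "FAILURE"}


def failed_task_events(events, status_field="taskStatus"):
    """Return the task/run/time events whose status dimension marks a failure.

    `events` is an iterable of dicts (already-decoded JSON metric events).
    `status_field` is the assumed name of the status dimension.
    """
    failed = []
    for event in events:
        if event.get("metric") != "task/run/time":
            continue
        status = str(event.get(status_field, "")).upper()
        if status in FAILURE_STATUSES:
            failed.append(event)
    return failed


if __name__ == "__main__":
    # Made-up sample events, not real Druid output.
    sample = [
        {"metric": "task/run/time", "taskType": "index_kafka",
         "taskStatus": "SUCCESS", "value": 3600000},
        {"metric": "task/run/time", "taskType": "index_kafka",
         "taskStatus": "FAILED", "value": 42000},
        {"metric": "query/time", "value": 12},
    ]
    for event in failed_task_events(sample):
        print("task failure:", event)
```

You could then alert on any non-empty result, optionally narrowing to Kafka tasks by also checking the task type dimension.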