Alert or Metric for Failed Tasks

Hi there!

Are there any metrics or alerts for failed tasks? I want to create Datadog monitoring over Druid tasks to see when they have failed or succeeded but I don’t currently see anything in the docs. Can anyone point me to the right resources? :slight_smile:

I don’t think there is currently a general metric or alert that’s generated upon task failure. I think there are alerts for some specific task failure modes (you can find them by searching for EmittingLogger.makeAlert() calls in the codebase), but not a general one.

One option could be to periodically fetch the list of completed tasks using http://overlord:overlordPort/druid/indexer/v1/completeTasks and check for task failures that way. In 0.12.0 and later there is an “n” parameter that allows you to fetch the most recent N task statuses.
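A rough sketch of that polling approach in Python, in case it helps. The host and port (overlord:8090) are placeholders, and the "id" and "statusCode" field names, along with the "SUCCESS"/"FAILED" values, are assumptions about the response shape that should be checked against what your Druid version actually returns:

```python
import json
import urllib.request

# Assumed host and port; replace with your own Overlord address.
OVERLORD = "http://overlord:8090"


def fetch_complete_tasks(base_url=OVERLORD, timeout=10):
    """GET the completed-task list from the Overlord and parse the JSON array."""
    url = base_url + "/druid/indexer/v1/completeTasks"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def split_by_outcome(statuses):
    """Group task IDs by outcome.

    Assumes each entry is a dict with "id" and "statusCode" keys, and that
    "statusCode" is "SUCCESS" or "FAILED" for completed tasks. Verify this
    against the response from your Druid version.
    """
    failed = [t["id"] for t in statuses if t.get("statusCode") == "FAILED"]
    succeeded = [t["id"] for t in statuses if t.get("statusCode") == "SUCCESS"]
    return failed, succeeded
```

Something like `failed, succeeded = split_by_outcome(fetch_complete_tasks())` could then run on a schedule and feed the counts into whatever monitoring you use. A real poller would probably also want to remember which task IDs it has already seen, so the same failure isn't reported twice.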

Thanks Jonathan for the response!

Looks like we’ll have to just go with the second option!

Actually, one more quick question: how do I grab the N most recent task statuses through that endpoint? I couldn’t find anything in the API docs about how to pass n to this endpoint :frowning:

I believe this is an internal API (undocumented), so I would trust Jon if he suggests using it, but perhaps you could file a GitHub issue to request documentation, so any future change would be announced properly.

So the URL would currently be: http://overlord:overlordPort/druid/indexer/v1/completeTasks?n=maxTaskStatuses

Where maxTaskStatuses is a number.
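To make the URL shape concrete, here is a small Python helper that builds it. The function name and the example host and port are made up for illustration; only the path and the "n" query parameter come from this thread:

```python
from urllib.parse import urlencode


def complete_tasks_url(base_url, n=None):
    """Build the completeTasks URL, optionally limited to the latest n statuses."""
    url = base_url.rstrip("/") + "/druid/indexer/v1/completeTasks"
    if n is not None:
        n = int(n)
        if n <= 0:
            raise ValueError("n must be a positive integer")
        # "n" is the query parameter mentioned above (0.12.0 and later).
        url += "?" + urlencode({"n": n})
    return url


print(complete_tasks_url("http://overlord:8090", 25))
# http://overlord:8090/druid/indexer/v1/completeTasks?n=25
```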

Thanks Eyal, it should look like that URL.

The “completeTasks” API (and related APIs like “runningTasks”) should be documented, IMO. I don’t believe it’s meant to be hidden; it’s just missing because of a documentation oversight.

Also worth mentioning, in 0.13.0-incubating, task information is also exposed via Druid SQL: http://druid.io/docs/0.13.0-incubating/querying/sql.html#system-schema
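For the SQL route, a hedged sketch of what a query for failed tasks might look like from Python. It assumes the SQL endpoint is reachable at /druid/v2/sql on a broker (broker:8082 is a placeholder), and that sys.tasks exposes task_id, datasource, status and error_msg columns with 'FAILED' as a status value; please check the system schema docs linked above for your version before relying on any of these names:

```python
import json
import urllib.request

# Column names and the 'FAILED' value are assumptions; check the sys.tasks docs.
FAILED_TASKS_SQL = (
    "SELECT task_id, datasource, error_msg "
    "FROM sys.tasks WHERE status = 'FAILED'"
)


def build_sql_request(broker_url, sql=FAILED_TASKS_SQL):
    """Build a POST request carrying the SQL query as a JSON body."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        broker_url.rstrip("/") + "/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(broker_url, sql=FAILED_TASKS_SQL, timeout=10):
    """Send the query and return the parsed JSON result."""
    request = build_sql_request(broker_url, sql)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `run_sql("http://broker:8082")` would then return the parsed rows, which could be counted or forwarded to a monitor in the same way as the completeTasks results.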