Problem: dead historical node still used

I have a dead historical node that the brokers keep trying to use. Queries fail with:

```
"errorMessage": "Failure getting results from[http://<dead_server_hostname>:8083/druid/v2/]
because of [org.jboss.netty.channel.ChannelException: Faulty channel in resource pool]"
```

When debugging queries with "druid/v2/candidates" POST requests, the dead historical node shows up in the candidate lists.
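For reference, a minimal sketch of that candidates check in Python, following the usage described above; the broker host, datasource, and interval are placeholders, and the exact request shape and response fields may differ by Druid version:

```python
import requests

# Sketch: POST the native query to the broker's /druid/v2/candidates endpoint
# and check whether the dead historical appears among the returned server
# locations. Broker host, datasource, and interval are placeholders.
BROKER = "http://<broker_hostname>:8082"
DEAD_NODE = "<dead_server_hostname>:8083"

query = {
    "queryType": "timeseries",
    "dataSource": "my_datasource",
    "granularity": "all",
    "intervals": ["2023-01-01/2023-02-01"],
    "aggregations": [{"type": "count", "name": "rows"}],
}

resp = requests.post(f"{BROKER}/druid/v2/candidates", json=query, timeout=30)
resp.raise_for_status()

# Scan the raw response rather than assuming exact field names (they vary by
# Druid version); the dead node should not be listed if the broker's view is
# up to date.
print("dead node listed:", DEAD_NODE in resp.text)
```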

The broker logs have this error:

```
Error parsing response context from url [http://<dead_server_hostname>:8083/druid/v2/]
com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input: was expecting closing quote for a string value
 at [Source: {"missingSegments":[{"itvl":"…","ver":"…","part":2},{"itvl":"…","ver":"…","part":2} …
 line: 1, column: 14337]
```

Could it be that the (string representation of the) list of intervals assigned to the dead historical node exceeds some internal communication limit? If so, how do I fix it?

Thanks,

  • Tomas

How dead is the dead historical node? If it was dead enough to stop updating its inventory in zookeeper, but still alive enough to heartbeat, you might get this sort of behavior. The broker logs indicate that something that looks like a historical node is responding on <dead_server_hostname>:8083.

Thanks for the reply; you are right, the server is not quite dead. It looks like I have two separate (but perhaps related?) problems: one is the error message above; the other is a different, truly dead server that is the one showing up in the candidates list (the names were similar, hence the mix-up).

So the question remains: how do I get around the JsonParseException error above?

That missingSegments list is how Druid historical nodes tell the broker which segments they were queried for but are no longer serving. It's delivered in an HTTP response header, so there is a maximum size imposed by the HTTP client/server; beyond that it gets truncated. It's not normal for it to be long enough for that to happen, unless the broker's view of what segments the historical is serving has gotten badly out of sync. Your two problems might have a common cause: they could both be explained if your broker is having trouble syncing up with zookeeper.
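To illustrate why a truncated response context produces exactly that JsonParseException, here is a small Python sketch; the field names mirror the log above, and the data and cut-off point are made up:

```python
import json

# Simulate the response-context JSON a historical sends back to the broker,
# listing segments it was queried for but no longer serves.
missing = [
    {"itvl": f"2023-01-{d:02d}/2023-01-{d + 1:02d}", "ver": "v1", "part": 2}
    for d in range(1, 20)
]
context = json.dumps({"missingSegments": missing})

# If the header value is cut off at some maximum size, the JSON is no longer
# well-formed and parsing fails mid-string, analogous to the broker's
# "Unexpected end-of-input: was expecting closing quote" error.
truncated = context[:200]
try:
    json.loads(truncated)
except json.JSONDecodeError as e:
    print(f"parse failed: {e}")
```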

Thanks again… there are no obvious zookeeper problems.

How do I convince Druid that the dead historical node is really dead — what state or services should be reset?

I still see the dead historical under /druid/loadQueue/, /druid/servedSegments/, and
/druid/segments/ in zookeeper (though the last two are empty).
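A minimal way to script that check, assuming the kazoo Python client; the zookeeper hosts and node name are placeholders, and the paths are the ones listed above:

```python
from kazoo.client import KazooClient

ZK_HOSTS = "<zookeeper_hostname>:2181"   # placeholder
DEAD_NODE = "<dead_server_hostname>:8083"  # placeholder

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()
try:
    for base in ("/druid/loadQueue", "/druid/servedSegments", "/druid/segments"):
        # Is the dead historical still announced under this path?
        children = zk.get_children(base) if zk.exists(base) else []
        present = DEAD_NODE in children
        # If so, does its container actually hold anything, or is it empty?
        contents = zk.get_children(f"{base}/{DEAD_NODE}") if present else []
        print(f"{base}: listed={present}, entries under it={len(contents)}")
finally:
    zk.stop()
```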

  • Tomas

If the last two are empty, then zookeeper is aware that the historical is gone (the empty containers aren't cleaned up automatically, but with no contents under them the node is treated as serving nothing). So the odds are that the broker is not able to sync from zookeeper. This is not really common and might indicate some kind of broker -> zk communication problem. You could try investigating that, or try restarting the broker.
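One way to confirm the broker has re-synced after a restart is its load-status endpoint; a sketch in Python, assuming the broker listens at <broker_hostname>:8082 (the endpoint and response body may vary across Druid versions):

```python
import requests

BROKER = "http://<broker_hostname>:8082"  # placeholder

# /druid/broker/v1/loadstatus reports whether the broker has finished syncing
# its view of served segments from the cluster.
resp = requests.get(f"{BROKER}/druid/broker/v1/loadstatus", timeout=10)
resp.raise_for_status()
print(resp.json())  # e.g. {"inventoryInitialized": true}
```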

I was about to post an update: restarting the brokers seems to have done the trick. Thanks!