A single historical node hitting 100% CPU causes the entire cluster to be TOTALLY inoperable

Druid gurus,

We have a Druid setup with:

  • 8 historical nodes

  • 1 coordinator

  • 1 broker

  • 4 middle managers

  • memcached for query and segment caching

  • MySQL for metadata

  • S3 for deep storage

  • ZooKeeper

Druid version 10.

For some reason a single node hits 100% CPU on all cores (it is a 32-core machine, so 3200% CPU). Since many of our cubes are replicated 2x, I would expect the cluster to carry on. But from the standpoint of Pivot, the cluster is dead.

The broker just keeps hitting timeouts:

2019-01-01T13:00:59.322Z 10.0.4.210 {"queryType":"groupBy","dataSource":"redacted","descending":false} {"query/time":308796,"query/bytes":-1,"success":false,"exception":"io.druid.java.util.common.RE: Failure getting results for query[ff7268e6-627e-4b40-8b20-b01324616134] url[http://historical5.druid.prod.ourgreatco.net:8080/druid/v2/] because of [org.jboss.netty.handler.timeout.ReadTimeoutException]"}

Is there any way to configure the Druid broker so it does not CONSTANTLY send requests to nodes that are timing out?

I'm thinking of writing a cron job to kill off nodes that hit 100%, but come on, can't this thing heartbeat itself?

By default the broker sends queries randomly down to data servers like historicals, on the grounds that this is a reasonable way of balancing load. Do you have one historical that is at 100% CPU while the others are idle (or much lower)? Could it be that your data isn't well balanced and you have a hot spot on that server as a result?
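One quick way to check is to ask the coordinator how much segment data each historical is serving. Here is a rough sketch, assuming your coordinator is reachable at the host/port below and exposes the standard /druid/coordinator/v1/servers?simple endpoint (adjust both for your setup):

# Rough sketch: compare how much segment data each historical is serving,
# to see whether one node carries a disproportionate share (a hot spot).
import json
import urllib.request

# assumption: replace with your coordinator's host and port
COORDINATOR_URL = "http://coordinator.druid.prod:8081"

with urllib.request.urlopen(COORDINATOR_URL + "/druid/coordinator/v1/servers?simple") as resp:
    servers = json.load(resp)

# Keep only historicals and sort by how much data they have loaded.
historicals = [s for s in servers if s.get("type") == "historical"]
for s in sorted(historicals, key=lambda s: s["currSize"], reverse=True):
    pct = 100.0 * s["currSize"] / s["maxSize"] if s["maxSize"] else 0.0
    print("%-45s %8.1f GB loaded (%.0f%% of capacity)" % (s["host"], s["currSize"] / 1e9, pct))

If one historical is loaded much more heavily than the others, that points at a balancing problem rather than a bug.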

I do not think it is about data load or locality. Random historical nodes hit 100% at random times, and the system never comes out of it. Anecdotally, I have seen systems hit 100% CPU on all cores like this when a software bug gets caught in an infinite loop. How can I troubleshoot?

Try running 'jstack -l [pid]' on the historical process: it will show the current call stack for each Java thread.
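If you want something more systematic than eyeballing one dump, here is a rough sketch (assuming jstack is on the PATH and you substitute the historical's real pid) that takes a few samples and counts the most common top-of-stack frames among RUNNABLE threads; a thread spinning in a tight loop will dominate the counts:

# Rough sketch: sample jstack a few times and count the top-of-stack frames
# of RUNNABLE threads, to spot a thread that is spinning in the same place.
import collections
import re
import subprocess
import time

PID = 12345      # assumption: replace with the historical JVM's actual pid
SAMPLES = 5

top_frames = collections.Counter()
for _ in range(SAMPLES):
    dump = subprocess.run(["jstack", "-l", str(PID)],
                          capture_output=True, text=True, check=True).stdout
    # Thread sections are separated by blank lines; only RUNNABLE ones are interesting here.
    for block in dump.split("\n\n"):
        if "java.lang.Thread.State: RUNNABLE" not in block:
            continue
        frames = re.findall(r"^\s+at ([\w.$]+)\(", block, flags=re.MULTILINE)
        if frames:
            top_frames[frames[0]] += 1   # frames[0] is the top of the stack
    time.sleep(2)

for frame, count in top_frames.most_common(10):
    print(f"{count:4d}  {frame}")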

Hi Eddie,
as Giant pointed out, the problem may be a hot spot on that instance. 100% CPU load by itself is OK in my opinion, so I assume you may be experiencing one of the following:

  • You have a lot of segments to load. How many segments do you have per day?

  • The query interval is large, so it takes a long time to serve the data. What time period are you trying to query?

  • The segments are not properly sized, so they take longer to load and serve. What is the size of the segments on the historical nodes?

  • Disk I/O may be slow. What is the disk type and capacity?

I also suggest looking at http://druid.io/docs/latest/operations/metrics.html, which describes plenty of useful metrics for debugging this kind of issue.
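You can also use the broker request log lines like the one you pasted to see whether the timeouts are concentrated on a single historical or spread across the cluster. A rough sketch, assuming the request log lives at the path below (adjust for your setup):

# Rough sketch: count ReadTimeoutException occurrences per historical host,
# extracted from broker request log lines like the one pasted above.
import collections
import re

LOG_PATH = "/var/log/druid/broker-requests.log"  # assumption: your broker request log location

host_re = re.compile(r"url\[http://([^:/\]]+)")  # pulls the host out of url[http://...]
timeouts = collections.Counter()

with open(LOG_PATH) as f:
    for line in f:
        if "ReadTimeoutException" not in line:
            continue
        m = host_re.search(line)
        if m:
            timeouts[m.group(1)] += 1

for host, count in timeouts.most_common():
    print(f"{count:6d}  {host}")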

Artiom