Strange query from historical to broker that crashed the cluster

Hi,

We have found a strange query from a historical node to our broker node, which crashed the broker.

Here’s the relevant log from just before the OutOfMemory Java heap error:

2016-02-18T06:17:37,500 INFO [HttpClient-Netty-Worker-2] LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-02-18T06:17:37.500Z","service":"druid/broker","host":"<broker_ip_address>:8082","metric":"query/node/time","value":227190,"dataSource":"firehose","dimension":"url","duration":"PT1126800S","hasFilters":"true","id":"a51fa98f-0b9b-431d-868d-9178eb7d6f03","interval":["2016-02-04T06:00:00.000Z/2016-02-08T03:00:00.000Z","2016-02-08T04:00:00.000Z/2016-02-08T05:00:00.000Z","2016-02-08T06:00:00.000Z/2016-02-08T09:00:00.000Z","2016-02-08T10:00:00.000Z/2016-02-08T18:00:00.000Z","2016-02-08T19:00:00.000Z/2016-02-09T01:00:00.000Z","2016-02-09T02:00:00.000Z/2016-02-09T03:00:00.000Z","2016-02-09T04:00:00.000Z/2016-02-09T06:00:00.000Z","2016-02-09T07:00:00.000Z/2016-02-09T11:00:00.000Z","2016-02-09T12:00:00.000Z/2016-02-09T22:00:00.000Z","2016-02-09T23:00:00.000Z/2016-02-11T22:00:00.000Z","2016-02-12T00:00:00.000Z/2016-02-12T01:00:00.000Z","2016-02-12T02:00:00.000Z/2016-02-12T14:00:00.000Z","2016-02-12T15:00:00.000Z/2016-02-12T22:00:00.000Z","2016-02-12T23:00:00.000Z/2016-02-13T16:00:00.000Z","2016-02-13T17:00:00.000Z/2016-02-14T11:00:00.000Z","2016-02-14T12:00:00.000Z/2016-02-14T16:00:00.000Z","2016-02-14T17:00:00.000Z/2016-02-14T22:00:00.000Z","2016-02-14T23:00:00.000Z/2016-02-15T04:00:00.000Z","2016-02-15T05:00:00.000Z/2016-02-15T16:00:00.000Z","2016-02-15T17:00:00.000Z/2016-02-16T20:00:00.000Z","2016-02-16T21:00:00.000Z/2016-02-18T04:00:00.000Z"],"numComplexMetrics":"0","numMetrics":"1","server":"<historial_ip_address>:8083","threshold":"1000","type":"topN"}]

We got a bunch of those before the machine ran out of memory.

java.lang.OutOfMemoryError: Java heap space
at com.metamx.http.client.NettyHttpClient.go(NettyHttpClient.java:148) ~[http-client-1.0.4.jar:?]
at com.metamx.http.client.AbstractHttpClient.go(AbstractHttpClient.java:14) ~[http-client-1.0.4.jar:?]
at io.druid.client.DirectDruidClient.run(DirectDruidClient.java:322) ~[druid-server-0.8.3.jar:0.8.3]
at io.druid.client.CachingClusteredClient$2.addSequencesFromServer(CachingClusteredClient.java:374) ~[druid-server-0.8.3.jar:0.8.3]

Hey Noppanit,

It’s possible that your broker doesn’t have enough memory to merge the query results. You could try the following:

  • Give the broker more heap.

  • Disable broker caching (set druid.broker.cache.useCache=false, druid.broker.cache.populateCache=false on the broker). This will cause your historical nodes to merge results before sending them to the broker. (See the sketch after this list for where these settings and the heap size go.)

  • If you have a large number of small segments, try indexing your data into larger segments.
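
For reference, a rough sketch of what the first two suggestions could look like on the broker (the heap sizes here are only illustrative values, not recommendations; adjust them to your hardware):

    # broker runtime.properties -- disable broker-side caching so historicals merge results
    druid.broker.cache.useCache=false
    druid.broker.cache.populateCache=false

    # broker JVM arguments (wherever you launch the broker) -- illustrative heap sizes
    -Xms8g
    -Xmx8g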

Our segments are really small, ~2MB each.

And this is our granularitySpec:

          "granularitySpec" : {
            "type" : "uniform",
            "segmentGranularity" : "HOUR",
            "queryGranularity" : "MINUTE"
          }

How do I merge into larger segments? Sorry if I'm asking a stupid question, I'm new to Druid.

Hey Noppanit, you could change segmentGranularity to “DAY” and that would produce larger segments.
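
For example, the granularitySpec above would become something like this (same spec, just with a coarser segmentGranularity):

          "granularitySpec" : {
            "type" : "uniform",
            "segmentGranularity" : "DAY",
            "queryGranularity" : "MINUTE"
          }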

May I ask what the impact of setting segmentGranularity to “DAY” would be? Would it just hold on to the index for a day and then write it to a file? And could I still query the results by “MIN” or “HOUR”?

Thanks.

Yes, the segment intervals would be day-aligned, but you could still query at minute or hour granularity (since you have “queryGranularity” set to “minute”). It could also be worth trying the larger heap and disabling caching first, as those are simpler changes to make.
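
To illustrate, a query can ask for hourly buckets regardless of how the segments are sized. Here is a rough sketch of a topN query along the lines of the one in your log (the “count” metric/aggregator and the interval here are only assumptions for illustration; substitute your real metric and time range):

    {
      "queryType": "topN",
      "dataSource": "firehose",
      "dimension": "url",
      "metric": "count",
      "threshold": 1000,
      "granularity": "hour",
      "intervals": ["2016-02-16T00:00:00.000Z/2016-02-18T00:00:00.000Z"],
      "aggregations": [
        { "type": "longSum", "name": "count", "fieldName": "count" }
      ]
    }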

Hi @gian

After changing the segment granularity to day, I can see that segments are now created by day. However, I'm still seeing this query:

[{"feed":"metrics","timestamp":"2016-03-02T23:15:51.729Z","service":"druid/historical","host":"historical:8083","metric":"query/time","value":84,"context":"{"bySegment":true,"finalize":false,"intermediate":true,"populateCache":false,"priority":0,"queryId":"9dc6ba6b-3909-4fd8-a68a-5f1d05c87d04","timeout":300000}","dataSource":"firehose-web","duration":"PT356400S","hasFilters":"true","id":"9dc6ba6b-3909-4fd8-a68a-5f1d05c87d04","interval":["2016-02-10T17:00:00.000Z/2016-02-10T18:00:00.000Z","2016-02-11T17:00:00.000Z/2016-02-11T18:00:00.000Z","2016-02-11T20:00:00.000Z/2016-02-11T22:00:00.000Z","2016-02-12T10:00:00.000Z/2016-02-12T11:00:00.000Z","2016-02-12T12:00:00.000Z/2016-02-12T13:00:00.000Z","2016-02-12T14:00:00.000Z/2016-02-13T03:00:00.000Z","2016-02-13T04:00:00.000Z/2016-02-13T05:00:00.000Z","2016-02-13T06:00:00.000Z/2016-02-13T09:00:00.000Z","2016-02-13T10:00:00.000Z/2016-02-13T22:00:00.000Z","2016-02-13T23:00:00.000Z/2016-02-15T13:00:00.000Z","2016-02-15T14:00:00.000Z/2016-02-15T20:00:00.000Z","2016-02-15T21:00:00.000Z/2016-02-16T02:00:00.000Z","2016-02-16T03:00:00.000Z/2016-02-16T04:00:00.000Z","2016-02-16T05:00:00.000Z/2016-02-16T06:00:00.000Z","2016-02-16T11:00:00.000Z/2016-02-16T16:00:00.000Z","2016-02-16T17:00:00.000Z/2016-02-16T19:00:00.000Z","2016-02-16T21:00:00.000Z/2016-02-16T22:00:00.000Z","2016-02-16T23:00:00.000Z/2016-02-17T00:00:00.000Z","2016-02-17T02:00:00.000Z/2016-02-17T03:00:00.000Z","2016-02-17T14:00:00.000Z/2016-02-17T15:00:00.000Z","2016-02-18T02:00:00.000Z/2016-02-18T03:00:00.000Z","2016-02-18T12:00:00.000Z/2016-02-18T13:00:00.000Z"],"remoteAddress":"broker","type":"topN"}]

These logs are just metrics about query performance; they are emitted for every query and don't indicate a problem.