Strange performance problem

This is my first time performance-testing Druid on our data, and I am seeing some strange numbers that I cannot explain.

Here is part of the log from my broker:

2016-06-21T06:49:28,469 INFO [HttpClient-Netty-Worker-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-06-21T06:49:28.469Z","service":"druid/broker","host":"ip-10-0-0-205.eu-west-1.compute.internal:8082","metric":"query/node/ttfb","value":10714,"dataSource":"ProblemsView","dimension":"serviceId","duration":"PT86400S","hasFilters":"true","id":"277c08ad-d411-45ac-8245-300322fef308","interval":["2016-05-01T06:26:10.754Z/2016-05-02T00:00:00.000Z","2016-05-31T00:00:00.000Z/2016-05-31T06:26:10.754Z"],"numComplexMetrics":"0","numMetrics":"4","server":"ip-10-0-0-93.eu-west-1.compute.internal:8083","threshold":"1000","type":"topN","version":"0.9.0"}]

2016-06-21T06:49:28,469 INFO [HttpClient-Netty-Worker-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-06-21T06:49:28.469Z","service":"druid/broker","host":"ip-10-0-0-205.eu-west-1.compute.internal:8082","metric":"query/node/time","value":10714,"dataSource":"ProblemsView","dimension":"serviceId","duration":"PT86400S","hasFilters":"true","id":"277c08ad-d411-45ac-8245-300322fef308","interval":["2016-05-01T06:26:10.754Z/2016-05-02T00:00:00.000Z","2016-05-31T00:00:00.000Z/2016-05-31T06:26:10.754Z"],"numComplexMetrics":"0","numMetrics":"4","server":"ip-10-0-0-93.eu-west-1.compute.internal:8083","threshold":"1000","type":"topN","version":"0.9.0"}]

2016-06-21T06:49:28,469 INFO [HttpClient-Netty-Worker-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-06-21T06:49:28.469Z","service":"druid/broker","host":"ip-10-0-0-205.eu-west-1.compute.internal:8082","metric":"query/node/bytes","value":2488,"dataSource":"ProblemsView","dimension":"serviceId","duration":"PT86400S","hasFilters":"true","id":"277c08ad-d411-45ac-8245-300322fef308","interval":["2016-05-01T06:26:10.754Z/2016-05-02T00:00:00.000Z","2016-05-31T00:00:00.000Z/2016-05-31T06:26:10.754Z"],"numComplexMetrics":"0","numMetrics":"4","server":"ip-10-0-0-93.eu-west-1.compute.internal:8083","threshold":"1000","type":"topN","version":"0.9.0"}]

2016-06-21T06:49:28,476 INFO [qtp1956341516-34] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-06-21T06:49:28.476Z","service":"druid/broker","host":"ip-10-0-0-205.eu-west-1.compute.internal:8082","metric":"query/cpu/time","value":3,"dataSource":"ProblemsView","dimension":"serviceId","duration":"PT2592000S","hasFilters":"true","id":"277c08ad-d411-45ac-8245-300322fef308","interval":["2016-05-01T06:26:10.754Z/2016-05-31T06:26:10.754Z"],"numComplexMetrics":"0","numMetrics":"4","threshold":"10","type":"topN","version":"0.9.0"}]

2016-06-21T06:49:28,477 INFO [qtp1956341516-34] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-06-21T06:49:28.477Z","service":"druid/broker","host":"ip-10-0-0-205.eu-west-1.compute.internal:8082","metric":"query/time","value":10722,"context":"{\"queryId\":\"277c08ad-d411-45ac-8245-300322fef308\",\"timeout\":300000}","dataSource":"ProblemsView","duration":"PT2592000S","hasFilters":"true","id":"277c08ad-d411-45ac-8245-300322fef308","interval":["2016-05-01T06:26:10.754Z/2016-05-31T06:26:10.754Z"],"remoteAddress":"88.196.254.174","success":"true","type":"topN","version":"0.9.0"}]

2016-06-21T06:49:28,477 INFO [qtp1956341516-34] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-06-21T06:49:28.477Z","service":"druid/broker","host":"ip-10-0-0-205.eu-west-1.compute.internal:8082","metric":"query/bytes","value":1573,"context":"{\"queryId\":\"277c08ad-d411-45ac-8245-300322fef308\",\"timeout\":300000}","dataSource":"ProblemsView","duration":"PT2592000S","hasFilters":"true","id":"277c08ad-d411-45ac-8245-300322fef308","interval":["2016-05-01T06:26:10.754Z/2016-05-31T06:26:10.754Z"],"remoteAddress":"88.196.254.174","type":"topN","version":"0.9.0"}]

The relevant information, I believe, is that the time to first byte (query/node/ttfb) is 10714 ms. So I assume the downstream historical node took more than 10 seconds to answer the query. But my historical node says:

2016-06-21T06:49:39,842 INFO [qtp701476543-16] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-06-21T06:49:39.842Z","service":"druid/historical","host":"ip-10-0-0-93.eu-west-1.compute.internal:8083","metric":"query/time","value":1272,"context":"{\"bySegment\":true,\"finalize\":false,\"intermediate\":true,\"populateCache\":false,\"priority\":0,\"queryId\":\"277c08ad-d411-45ac-8245-300322fef308\",\"timeout\":300000}","dataSource":"ProblemsView","duration":"PT86400S","hasFilters":"true","id":"277c08ad-d411-45ac-8245-300322fef308","interval":["2016-05-01T06:26:10.754Z/2016-05-02T00:00:00.000Z","2016-05-31T00:00:00.000Z/2016-05-31T06:26:10.754Z"],"remoteAddress":"10.0.0.205","success":"true","type":"topN","version":"0.9.0"}]

2016-06-21T06:49:39,842 INFO [qtp701476543-16] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-06-21T06:49:39.842Z","service":"druid/historical","host":"ip-10-0-0-93.eu-west-1.compute.internal:8083","metric":"query/bytes","value":2488,"context":"{\"bySegment\":true,\"finalize\":false,\"intermediate\":true,\"populateCache\":false,\"priority\":0,\"queryId\":\"277c08ad-d411-45ac-8245-300322fef308\",\"timeout\":300000}","dataSource":"ProblemsView","duration":"PT86400S","hasFilters":"true","id":"277c08ad-d411-45ac-8245-300322fef308","interval":["2016-05-01T06:26:10.754Z/2016-05-02T00:00:00.000Z","2016-05-31T00:00:00.000Z/2016-05-31T06:26:10.754Z"],"remoteAddress":"10.0.0.205","type":"topN","version":"0.9.0"}]

So here we see that the query finished much faster on the historical (query/time of 1272 ms). How can I troubleshoot where the remaining time was spent? More than 9 seconds are lost somewhere…
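For reference, here is a minimal sketch of how these numbers can be pulled out of the two logs side by side (broker.log and historical.log are placeholder paths; the regexes just match the LoggingEmitter lines shown above):

```python
import re

# Sketch: extract LoggingEmitter metric events for one query ID from a Druid
# log and print when each metric was emitted, by which service, and its value.
# "broker.log" and "historical.log" below are placeholder paths.

FIELD_RE = {
    "timestamp": re.compile(r'"timestamp":"([^"]+)"'),
    "service":   re.compile(r'"service":"([^"]+)"'),
    "metric":    re.compile(r'"metric":"([^"]+)"'),
    "value":     re.compile(r'"value":(\d+)'),
}

def metrics_for_query(log_path, query_id):
    with open(log_path) as f:
        for line in f:
            if query_id not in line or '"metric"' not in line:
                continue
            event = {}
            for name, rx in FIELD_RE.items():
                m = rx.search(line)
                event[name] = m.group(1) if m else None
            yield event

query_id = "277c08ad-d411-45ac-8245-300322fef308"
for path in ("broker.log", "historical.log"):   # placeholder paths
    print("===", path)
    for e in metrics_for_query(path, query_id):
        print(e["timestamp"], e["service"], e["metric"], e["value"])
```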

Thank you in advance,

Nikita

P.S. By the way, does anybody offer paid performance consulting for Druid?

And one more, even stranger case:

2016-06-21T08:08:50,674 INFO [HttpClient-Netty-Worker-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-06-21T08:08:50.674Z","service":"druid/broker","host":"ip-10-0-0-205.eu-west-1.compute.internal:8082","metric":"query/node/ttfb","value":1,"dataSource":"ProblemsView","duration":"PT29147.278S","hasFilters":"true","id":"4930c316-112f-4837-a979-1ea523be535d","interval":["2016-05-31T00:00:00.000Z/2016-05-31T08:05:47.278Z"],"numComplexMetrics":"0","numDimensions":"1","numMetrics":"4","server":"ip-10-0-0-94.eu-west-1.compute.internal:8083","type":"groupBy","version":"0.9.0"}]

2016-06-21T08:08:50,675 INFO [HttpClient-Netty-Worker-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-06-21T08:08:50.675Z","service":"druid/broker","host":"ip-10-0-0-205.eu-west-1.compute.internal:8082","metric":"query/node/time","value":2,"dataSource":"ProblemsView","duration":"PT29147.278S","hasFilters":"true","id":"4930c316-112f-4837-a979-1ea523be535d","interval":["2016-05-31T00:00:00.000Z/2016-05-31T08:05:47.278Z"],"numComplexMetrics":"0","numDimensions":"1","numMetrics":"4","server":"ip-10-0-0-94.eu-west-1.compute.internal:8083","type":"groupBy","version":"0.9.0"}]

2016-06-21T08:08:50,675 INFO [HttpClient-Netty-Worker-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-06-21T08:08:50.675Z","service":"druid/broker","host":"ip-10-0-0-205.eu-west-1.compute.internal:8082","metric":"query/node/bytes","value":218,"dataSource":"ProblemsView","duration":"PT29147.278S","hasFilters":"true","id":"4930c316-112f-4837-a979-1ea523be535d","interval":["2016-05-31T00:00:00.000Z/2016-05-31T08:05:47.278Z"],"numComplexMetrics":"0","numDimensions":"1","numMetrics":"4","server":"ip-10-0-0-94.eu-west-1.compute.internal:8083","type":"groupBy","version":"0.9.0"}]

2016-06-21T08:08:50,737 INFO [HttpClient-Netty-Worker-3] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-06-21T08:08:50.737Z","service":"druid/broker","host":"ip-10-0-0-205.eu-west-1.compute.internal:8082","metric":"query/node/ttfb","value":64,"dataSource":"ProblemsView","duration":"PT57252.722S","hasFilters":"true","id":"4930c316-112f-4837-a979-1ea523be535d","interval":["2016-05-01T08:05:47.278Z/2016-05-02T00:00:00.000Z"],"numComplexMetrics":"0","numDimensions":"1","numMetrics":"4","server":"ip-10-0-0-93.eu-west-1.compute.internal:8083","type":"groupBy","version":"0.9.0"}]

2016-06-21T08:08:50,737 INFO [HttpClient-Netty-Worker-3] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-06-21T08:08:50.737Z","service":"druid/broker","host":"ip-10-0-0-205.eu-west-1.compute.internal:8082","metric":"query/node/time","value":64,"dataSource":"ProblemsView","duration":"PT57252.722S","hasFilters":"true","id":"4930c316-112f-4837-a979-1ea523be535d","interval":["2016-05-01T08:05:47.278Z/2016-05-02T00:00:00.000Z"],"numComplexMetrics":"0","numDimensions":"1","numMetrics":"4","server":"ip-10-0-0-93.eu-west-1.compute.internal:8083","type":"groupBy","version":"0.9.0"}]

2016-06-21T08:08:50,737 INFO [HttpClient-Netty-Worker-3] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-06-21T08:08:50.737Z","service":"druid/broker","host":"ip-10-0-0-205.eu-west-1.compute.internal:8082","metric":"query/node/bytes","value":5204,"dataSource":"ProblemsView","duration":"PT57252.722S","hasFilters":"true","id":"4930c316-112f-4837-a979-1ea523be535d","interval":["2016-05-01T08:05:47.278Z/2016-05-02T00:00:00.000Z"],"numComplexMetrics":"0","numDimensions":"1","numMetrics":"4","server":"ip-10-0-0-93.eu-west-1.compute.internal:8083","type":"groupBy","version":"0.9.0"}]

2016-06-21T08:10:03,467 INFO [qtp735834306-43] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-06-21T08:10:03.467Z","service":"druid/broker","host":"ip-10-0-0-205.eu-west-1.compute.internal:8082","metric":"query/cpu/time","value":15,"dataSource":"ProblemsView","duration":"PT2592000S","hasFilters":"true","id":"4930c316-112f-4837-a979-1ea523be535d","interval":["2016-05-01T08:05:47.278Z/2016-05-31T08:05:47.278Z"],"numComplexMetrics":"0","numDimensions":"1","numMetrics":"4","type":"groupBy","version":"0.9.0"}]

2016-06-21T08:10:03,468 INFO [qtp735834306-43] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-06-21T08:10:03.468Z","service":"druid/broker","host":"ip-10-0-0-205.eu-west-1.compute.internal:8082","metric":"query/time","value":72796,"context":"{\"queryId\":\"4930c316-112f-4837-a979-1ea523be535d\",\"timeout\":300000}","dataSource":"ProblemsView","duration":"PT2592000S","hasFilters":"true","id":"4930c316-112f-4837-a979-1ea523be535d","interval":["2016-05-01T08:05:47.278Z/2016-05-31T08:05:47.278Z"],"remoteAddress":"88.196.254.174","success":"true","type":"groupBy","version":"0.9.0"}]

2016-06-21T08:10:03,468 INFO [qtp735834306-43] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-06-21T08:10:03.468Z","service":"druid/broker","host":"ip-10-0-0-205.eu-west-1.compute.internal:8082","metric":"query/bytes","value":2302,"context":"{\"queryId\":\"4930c316-112f-4837-a979-1ea523be535d\",\"timeout\":300000}","dataSource":"ProblemsView","duration":"PT2592000S","hasFilters":"true","id":"4930c316-112f-4837-a979-1ea523be535d","interval":["2016-05-01T08:05:47.278Z/2016-05-31T08:05:47.278Z"],"remoteAddress":"88.196.254.174","type":"groupBy","version":"0.9.0"}]

If you look at it, there is a strange gap between 08:08:50 and 08:10:03: the results from the downstream nodes had already been received, but nothing happened for over a minute. And the query CPU time is a mere 15 microseconds.

Here, groupBy queries are cached as well, in case that is important.
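To put rough numbers on that gap, here is a back-of-the-envelope calculation from the events above (assuming query/node/time and query/time are reported in milliseconds and query/cpu/time in microseconds, as in the Druid metrics docs):

```python
# Back-of-the-envelope accounting for query 4930c316-112f-4837-a979-1ea523be535d,
# using the values from the broker events above.
broker_query_time_ms = 72796        # broker query/time
node_times_ms        = [2, 64]      # broker query/node/time, one per historical
broker_cpu_time_us   = 15           # broker query/cpu/time (microseconds)

unaccounted_ms = broker_query_time_ms - sum(node_times_ms)
print("not spent waiting on historicals: ~%d s" % (unaccounted_ms // 1000))  # ~72 s
print("broker cpu time: %.3f ms" % (broker_cpu_time_us / 1000.0))            # 0.015 ms
# The ~72 s matches the 08:08:50 -> 08:10:03 gap in the log timestamps, i.e.
# almost all of the query time passed on the broker after the node responses
# had already arrived.
```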

Hi Nikita, those numbers are automatically generated query IDs. The first step in understanding performance is always figuring out where the bottleneck is: at the segment-scanning level, on the historical node itself, or at merge time.
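A rough sketch of how to tell those apart from the historical's own metrics (assuming the LoggingEmitter output shown above; query/segment/time roughly corresponds to segment scanning, query/wait/time to time spent queued on the historical, and query/time to the historical's total, while the broker's query/time minus its query/node/time covers merging and everything else broker-side; historical.log is a placeholder path):

```python
import re
from collections import defaultdict

# Sketch: sum each metric emitted by a historical for one query ID, so the
# segment-scan time (query/segment/time), queue time (query/wait/time) and
# total time (query/time) can be compared. "historical.log" is a placeholder.

METRIC_RE = re.compile(r'"metric":"([^"]+)","value":(\d+)')

def tally(log_path, query_id):
    totals = defaultdict(int)
    with open(log_path) as f:
        for line in f:
            if query_id not in line:
                continue
            m = METRIC_RE.search(line)
            if m:
                totals[m.group(1)] += int(m.group(2))
    return totals

for metric, total in sorted(tally("historical.log",
                                  "277c08ad-d411-45ac-8245-300322fef308").items()):
    print(metric, total)
```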

We provide paid performance consulting at Imply (http://imply.io/). Feel free to email me to discuss.

I do understand what these IDs mean. The highlighting is just from my terminal when I grep for this query.

My question was about the big difference between the broker-reported times and the historical-node-reported times.

Nikita

OK, this case seems to be the GC kicking in. Question closed for now; I can handle the GC myself 🙂
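For anyone who hits the same pattern, a minimal sketch of the kind of check that confirms it, assuming GC logging has been enabled on the broker JVM (e.g. with -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps) and that pause durations show up in the log as "N.NNNNNNN secs"; the broker-gc.log path and the 1-second threshold are placeholders:

```python
import re

# Rough heuristic: scan a HotSpot GC log for pauses longer than a threshold.
# The "N.NNNNNNN secs" pattern depends on the GC logging flags in use, so
# treat this as a starting point rather than a full parser.

SECS_RE = re.compile(r"(\d+\.\d+) secs")

with open("broker-gc.log") as f:           # placeholder path
    for line in f:
        for secs in SECS_RE.findall(line):
            if float(secs) > 1.0:          # placeholder threshold
                print(secs, "sec pause:", line.rstrip())
                break
```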