Debugging topN query timeouts

Hi Everyone,

I'm setting up a new Druid cluster, but before that I want to make sure I understand all the variables involved in the queries we will run. While querying a test datasource (which already has more than 150M events) through Pivot, we sometimes get timeout errors, mostly on topN queries like this one:

Sending query 1225:

{
  "queryType": "topN",
  "dataSource": "eventData",
  "intervals": "2016-08-03T13Z/2016-08-10T13Z",
  "granularity": "all",
  "filter": {
    "type": "selector",
    "dimension": "event_type",
    "value": "event1"
  },
  "dimension": {
    "type": "default",
    "dimension": "url",
    "outputName": "url"
  },
  "aggregations": [
    {
      "name": "count",
      "type": "doubleSum",
      "fieldName": "count"
    },
    {
      "name": "!Theta_user_sketch",
      "type": "thetaSketch",
      "fieldName": "user_sketch"
    },
    {
      "name": "user_unique",
      "type": "hyperUnique",
      "fieldName": "user_unique"
    }
  ],
  "postAggregations": [
    {
      "type": "thetaSketchEstimate",
      "name": "user_sketch",
      "field": {
        "type": "fieldAccess",
        "fieldName": "!Theta_user_sketch"
      }
    }
  ],
  "metric": "count",
  "threshold": 50
}


Got error in query 1225: timeout (in 40001ms)

^^^^^^^^^^^^^^^^^^^^^^^^^^

error: timeout

Error: timeout

at /usr/local/lib/node_modules/imply-pivot/node_modules/plywood-druid-requester/build/druidRequester.js:190:23

at _rejected (/usr/local/lib/node_modules/imply-pivot/node_modules/q/q.js:844:24)

at /usr/local/lib/node_modules/imply-pivot/node_modules/q/q.js:870:30

at Promise.when (/usr/local/lib/node_modules/imply-pivot/node_modules/q/q.js:1122:31)

at Promise.promise.promiseDispatch (/usr/local/lib/node_modules/imply-pivot/node_modules/q/q.js:788:41)

at /usr/local/lib/node_modules/imply-pivot/node_modules/q/q.js:604:44

at runSingle (/usr/local/lib/node_modules/imply-pivot/node_modules/q/q.js:137:13)

at flush (/usr/local/lib/node_modules/imply-pivot/node_modules/q/q.js:125:13)

at process._tickCallback (node.js:415:13)
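From the stack trace it looks like the 40 s limit is enforced by Pivot's Druid requester (plywood-druid-requester), and the broker log below shows the same 40000 ms arriving in the query context, so Druid itself cancels the query at that point as well. If I'm reading the Pivot docs correctly, there is a timeout property on the Druid cluster definition in Pivot's config.yaml (apparently defaulting to 40000 ms) that I could raise just to see whether these queries ever finish, but I haven't confirmed that yet.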

Historical logs

2016-08-11T17:22:08,901 INFO [topN_eventData_[2016-08-07T15:00:00.000Z/2016-08-07T16:00:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-11T17:22:08.901Z","service":"druid/historical","host":"some.host.com:8083","metric":"query/segmentAndCache/time","value":15529,"dataSource":"eventData","dimension":"url","duration":"PT3600S","hasFilters":"true","id":"aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b","interval":["2016-08-07T15:00:00.000Z/2016-08-07T16:00:00.000Z"],"numComplexMetrics":"2","numMetrics":"3","segment":"eventData_2016-08-07T15:00:00.000Z_2016-08-07T16:00:00.000Z_2016-08-08T15:53:02.665Z","threshold":"1000","type":"topN","version":"0.9.1.1"}]

2016-08-11T17:22:08,901 INFO [topN_eventData_[2016-08-07T15:00:00.000Z/2016-08-07T16:00:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-11T17:22:08.901Z","service":"druid/historical","host":"some.host.com:8083","metric":"query/wait/time","value":24364,"dataSource":"eventData","dimension":"url","duration":"PT3600S","hasFilters":"true","id":"aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b","interval":["2016-08-07T15:00:00.000Z/2016-08-07T16:00:00.000Z"],"numComplexMetrics":"2","numMetrics":"3","segment":"eventData_2016-08-07T15:00:00.000Z_2016-08-07T16:00:00.000Z_2016-08-08T15:53:02.665Z","threshold":"1000","type":"topN","version":"0.9.1.1"}]

2016-08-11T17:22:09,010 INFO [qtp806588571-19[topN_eventData_aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b]] io.druid.query.ChainedExecutionQueryRunner - Query timeout, cancelling pending results for query id [aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b]

2016-08-11T17:22:09,010 INFO [qtp806588571-19[topN_eventData_aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b]] io.druid.server.QueryResource - Query timeout [aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b]

2016-08-11T17:22:09,010 INFO [topN_eventData_[2016-08-06T04:00:00.000Z/2016-08-06T05:00:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-11T17:22:09.010Z","service":"druid/historical","host":"some.host.com:8083","metric":"query/segment/time","value":5687,"dataSource":"eventData","dimension":"url","duration":"PT3600S","hasFilters":"true","id":"aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b","interval":["2016-08-06T04:00:00.000Z/2016-08-06T05:00:00.000Z"],"numComplexMetrics":"2","numMetrics":"3","segment":"eventData_2016-08-06T04:00:00.000Z_2016-08-06T05:00:00.000Z_2016-08-06T16:51:25.061Z","status":"failed","threshold":"1000","type":"topN","version":"0.9.1.1"}]

2016-08-11T17:22:09,010 INFO [topN_eventData_[2016-08-06T14:00:00.000Z/2016-08-06T15:00:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-11T17:22:09.010Z","service":"druid/historical","host":"some.host.com:8083","metric":"query/segment/time","value":9232,"dataSource":"eventData","dimension":"url","duration":"PT3600S","hasFilters":"true","id":"aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b","interval":["2016-08-06T14:00:00.000Z/2016-08-06T15:00:00.000Z"],"numComplexMetrics":"2","numMetrics":"3","segment":"eventData_2016-08-06T14:00:00.000Z_2016-08-06T15:00:00.000Z_2016-08-06T16:51:25.061Z","status":"failed","threshold":"1000","type":"topN","version":"0.9.1.1"}]

2016-08-11T17:22:09,010 INFO [topN_eventData_[2016-08-07T16:00:00.000Z/2016-08-07T17:00:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-11T17:22:09.010Z","service":"druid/historical","host":"some.host.com:8083","metric":"query/segment/time","value":15772,"dataSource":"eventData","dimension":"url","duration":"PT3600S","hasFilters":"true","id":"aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b","interval":["2016-08-07T16:00:00.000Z/2016-08-07T17:00:00.000Z"],"numComplexMetrics":"2","numMetrics":"3","segment":"eventData_2016-08-07T16:00:00.000Z_2016-08-07T17:00:00.000Z_2016-08-08T15:53:02.665Z","status":"failed","threshold":"1000","type":"topN","version":"0.9.1.1"}]

Broker logs

2016-08-11T17:22:09,011 INFO [HttpClient-Netty-Worker-21] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-11T17:22:09.011Z","service":"druid/broker","host":"some.host.com:8082","metric":"query/node/bytes","value":26,"dataSource":"eventData","dimension":"url","duration":"PT388800S","hasFilters":"true","id":"aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b","interval":["2016-08-06T00:00:00.000Z/2016-08-06T01:00:00.000Z","2016-08-06T02:00:00.000Z/2016-08-10T13:00:00.000Z"],"numComplexMetrics":"2","numMetrics":"3","server":"some.host.com:8083","threshold":"1000","type":"topN","version":"0.9.1.1"}]

2016-08-11T17:22:09,011 INFO [qtp783141366-52[topN_eventData_aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b]] io.druid.server.QueryResource - Query timeout [aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b]

2016-08-11T17:22:09,011 INFO [qtp783141366-52[topN_eventData_aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-11T17:22:09.011Z","service":"druid/broker","host":"some.host.com:8082","metric":"query/time","value":40009,"context":"{"queryId":"aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b","timeout":40000}","dataSource":"eventData","duration":"PT604800S","hasFilters":"true","id":"aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b","interval":["2016-08-03T13:00:00.000Z/2016-08-10T13:00:00.000Z"],"remoteAddress":"127.0.0.1","success":"false","type":"topN","version":"0.9.1.1"}]
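For what it's worth, the broker metric above confirms that the limit comes in through the query context ("timeout":40000). If I query the broker directly (outside Pivot), I assume I can raise the per-query limit through the query context, for example with a cut-down version of the query above that keeps only the count aggregator (the 120000 ms value is just an arbitrary number for testing, not a recommendation):

{
  "queryType": "topN",
  "dataSource": "eventData",
  "intervals": "2016-08-03T13Z/2016-08-10T13Z",
  "granularity": "all",
  "dimension": "url",
  "metric": "count",
  "threshold": 50,
  "aggregations": [
    { "name": "count", "type": "doubleSum", "fieldName": "count" }
  ],
  "context": { "timeout": 120000 }
}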

Please note that this is using the quickstart test configurations, with the following changes:

  • on historical/run.properties:

# HTTP server threads
druid.server.http.numThreads=11

# Processing threads and buffers
druid.processing.buffer.sizeBytes=1073741824
druid.processing.numThreads=11

  • on broker/run.properties:

# HTTP server threads
druid.server.http.numThreads=11

# Processing threads and buffers
druid.processing.buffer.sizeBytes=1073741824
druid.processing.numThreads=11

Segment sizes are between 50 MB and 400 MB each at hourly granularity (only one partition per segment, with a target of 5 GB per partition at index time). Local storage is used as deep storage.

Druid version: 0.9.1.1. Java 1.8.

Server specs: 1 server, 12 cores, 32GB RAM, 2x 256GB Samsung SSD (100k IOPS).
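One thing I'm unsure about is memory headroom: if I remember the sizing guideline correctly, a process needs roughly (druid.processing.numThreads + 1) × druid.processing.buffer.sizeBytes of direct (off-heap) memory for its processing buffers, which with the settings above works out to about (11 + 1) × 1 GiB = 12 GiB per process, so roughly 24 GiB across the broker and historical before counting the JVM heaps and the OS page cache that the memory-mapped segments rely on, all on a 32 GB machine. I may be misreading that formula, so corrections are welcome.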

I'm having trouble finding the bottleneck here. According to an administrative tool on the server, all cores are being used in parallel for every query. However, I can't find more detail about what Druid is doing when the query fails, or about the failure itself (i.e., why the query takes so long to run).

Any help would be appreciated. Thanks a lot!