Kafka Ingestion tasks fail when starting, but eventually succeed

Hello,
We have a rather large 0.14.0 cluster with 2 masters in HA mode, 2 query nodes, 15 datanodes and 110 Supervisors. We’ve noticed that once we get to about ~90 tasks some begin to fail. It looks as if they attempt to start, the overlord cannot contact them, then they fail and another one is started until there is a task associated with the supervisor. This isn’t affecting runtime as far as I can tell, but it is disconcerting.
Per the documentation i raised the heapsize of both the coordinator and the overlord, but the problem persists. It doesn’t look like any of the nodes are resource starved, is there a setting I’m missing in order to scale to this size?
Any feedback would be appreciated.

Thanks,

Mike

Overlord logs:

2019-08-14T22:56:22,777 WARN [IndexTaskClient-cds00015_commerce-0] org.apache.druid.indexing.common.IndexTaskClient - Retries exhausted for [http://datanode:8103/druid/worker/v1/chat/index_kafka_cds00015_commerce_e9daf7bb57e2ecb_cielakdp/status], last exception:

java.net.ConnectException: Connection refused (Connection refused)

at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_222]

at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_222]

at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_222]

at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_222]

at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_222]

at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_222]

at java.net.Socket.connect(Socket.java:538) ~[?:1.8.0_222]

at java.net.Socket.(Socket.java:434) ~[?:1.8.0_222]

at java.net.Socket.(Socket.java:211) ~[?:1.8.0_222]

at org.apache.druid.indexing.common.IndexTaskClient.checkConnection(IndexTaskClient.java:209) ~[druid-indexing-service-0.14.0-incubating-iap3.jar:0.14.0-incubating-iap3]

at org.apache.druid.indexing.common.IndexTaskClient.submitRequest(IndexTaskClient.java:348) ~[druid-indexing-service-0.14.0-incubating-iap3.jar:0.14.0-incubating-iap3]

at org.apache.druid.indexing.common.IndexTaskClient.submitRequestWithEmptyContent(IndexTaskClient.java:220) ~[druid-indexing-service-0.14.0-incubating-iap3.jar:0.14.0-incubating-iap3]

at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient.getStatus(SeekableStreamIndexTaskClient.java:172) ~[druid-indexing-service-0.14.0-incubating-iap3.jar:0.14.0-incubating-iap3]

at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient.lambda$getStatusAsync$9(SeekableStreamIndexTaskClient.java:373) ~[druid-indexing-service-0.14.0-incubating-iap3.jar:0.14.0-incubating-iap3]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_222]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_222]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_222]

at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]

``

Middlemanager logs:

2019-08-14T22:56:22,873 INFO [WorkerTaskManager-NoticeHandler] org.apache.druid.indexing.worker.WorkerTaskManager - Task[index_kafka_cds00015_commerce_a931b9abd995cb2_koolngbg] started.

2019-08-14T22:56:22,875 INFO [forking-task-runner-4] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_cds00015_commerce_a931b9abd995cb2_koolngbg] location changed to [TaskLocation{host=‘datanode’, port=8105, tlsPort=-1}].

2019-08-14T22:56:22,875 INFO [forking-task-runner-4] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_cds00015_commerce_a931b9abd995cb2_koolngbg] status changed to [RUNNING].

2019-08-14T22:56:22,875 INFO [forking-task-runner-4] org.apache.druid.indexing.overlord.ForkingTaskRunner - Logging task index_kafka_cds00015_commerce_a931b9abd995cb2_koolngbg output to: var/druid/task/index_kafka_cds00015_commerce_a931b9abd995cb2_koolngbg/log

2019-08-14T22:57:37,476 INFO [qtp2081703861-95] org.apache.druid.indexing.overlord.ForkingTaskRunner - Shutdown [index_kafka_cds00015_commerce_a931b9abd995cb2_koolngbg] because: [shut down request via HTTP endpoint]

2019-08-14T22:57:37,476 INFO [qtp2081703861-95] org.apache.druid.indexing.overlord.ForkingTaskRunner - Killing process for task: index_kafka_cds00015_commerce_a931b9abd995cb2_koolngbg

2019-08-14T22:57:37,477 INFO [forking-task-runner-4] org.apache.druid.storage.s3.S3TaskLogs - Pushing task log var/druid/task/index_kafka_cds00015_commerce_a931b9abd995cb2_koolngbg/log to: druid/indexing-logs/index_kafka_cds00015_commerce_a931b9abd995cb2_koolngbg/log

2019-08-14T22:57:37,595 INFO [qtp2081703861-92] org.apache.druid.indexing.overlord.ForkingTaskRunner - Shutdown [index_kafka_cds00015_commerce_a931b9abd995cb2_koolngbg] because: [shut down request via HTTP endpoint]

2019-08-14T22:57:37,595 INFO [qtp2081703861-92] org.apache.druid.indexing.overlord.ForkingTaskRunner - Killing process for task: index_kafka_cds00015_commerce_a931b9abd995cb2_koolngbg

2019-08-14T22:57:37,613 INFO [forking-task-runner-4] org.apache.druid.indexing.overlord.ForkingTaskRunner - Removing task directory: var/druid/task/index_kafka_cds00015_commerce_a931b9abd995cb2_koolngbg

2019-08-14T22:57:37,616 INFO [WorkerTaskManager-NoticeHandler] org.apache.druid.indexing.worker.WorkerTaskManager - Job’s finished. Completed [
index_kafka_cds00015_commerce_a931b9abd995cb2_koolngbg] with status [FAILED]

``

overlord runtime properties:

druid.service=druid/overlord

druid.plaintextPort=8090

druid.indexer.queue.startDelay=PT300S

druid.indexer.runner.type=remote

druid.indexer.storage.type=metadata

druid.indexer.runner.pendingTasksRunnerNumThreads=6

``

overlord jvm.config

-server

-Xms12g

-Xmx12g

-Duser.timezone=UTC

-Dfile.encoding=UTF-8

-Djava.io.tmpdir=var/tmp

-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

-Daws.region=eu-central-1

``

Coordinator runtime properties:

druid.service=druid/coordinator

druid.plaintextPort=8081

druid.coordinator.startDelay=PT30S

druid.coordinator.period=PT30S

Deleting unused segments

druid.coordinator.kill.on=true

druid.coordinator.kill.period=PT1H

druid.coordinator.kill.durationToRetain=PT72H

druid.coordinator.kill.maxSegments=1000

``

Coordinator jvm.config:

-server

-Xms12g

-Xmx12g

-Duser.timezone=UTC

-Dfile.encoding=UTF-8

-Djava.io.tmpdir=var/tmp

-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

-Dderby.stream.error.file=var/druid/derby.log

-Daws.region=eu-central-1

``

middlemanager runtime properties:

druid.service=druid/middleManager

druid.plaintextPort=8091

Number of tasks per middleManager

druid.worker.capacity=9

Task launch parameters

druid.indexer.runner.javaOpts=-server -Xmx3g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Daws.region=eu-central-1

druid.indexer.task.baseTaskDir=var/druid/task

HTTP server threads

druid.server.http.numThreads=25

Processing threads and buffers on Peons

druid.indexer.fork.property.druid.processing.buffer.sizeBytes=536870912

druid.indexer.fork.property.druid.processing.numThreads=2

Processing threads and buffers

druid.processing.buffer.sizeBytes=536870912

druid.processing.numThreads=2

druid.indexer.task.hadoopWorkingPath=var/druid/hadoop-tmp

Hadoop indexing

druid.indexer.task.defaultHadoopCoordinates=[“org.apache.hadoop:hadoop-client:2.8.3”, “org.apache.hadoop:hadoop-aws:2.8.3”]

druid.indexer.task.restoreTasksOnRestart=true

druid.indexer.runner.javaOptsArray=["-XX:MaxDirectMemorySize=3g"]

``

middlemanager jvm.config:

-server

-Xms128m

-Xmx128m

-Duser.timezone=UTC

-Dfile.encoding=UTF-8

-Djava.io.tmpdir=var/tmp

-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

-Daws.region=eu-central-1

``

Hi Michael:

Any error message from the actual ingestion task log, that says why the task fails? Can you show the last few lines of the failed task’s logs? What about the successful task’s logs, any difference there? Please also include the complete ingestion spec.

Thanks

thanks for getting back to me.

There’s no errors at the end of the failed task’s logs. i do see these errors related to lookups in both the succeeding task and the failing task. note it’s actually a different datasource than the one i posted about earlier

2019-08-15T12:02:34,500 ERROR [LookupExtractorFactoryContainerProvider-Startup-0] org.apache.druid.server.lookup.namespace.cache.CacheScheduler - CacheScheduler[namespace [JdbcExtractionNamespace{connectorConfig=DbConnectorConfig{createTables=true, connectURI=‘jdbc:mysql://mysql:3306/lookups’, user=‘lookup’, passwordProvider=org.apache.druid.metadata.DefaultPasswordProvider, dbcpProperties=null}, table=‘htp740801306_profile’, keyColumn=‘aliasValue’, valueColumn=‘primaryValue’, tsColumn=‘null’, filter=‘null’, pollPeriod=PT1M}] : org.apache.druid.server.lookup.namespace.cache.CacheScheduler$EntryImpl@18829f99] - problem during start or waiting for the first run

2019-08-15T12:02:34,500 ERROR [LookupExtractorFactoryContainerProvider-Startup-0] org.apache.druid.query.lookup.NamespaceLookupExtractorFactory - Failed to schedule and wait for lookup [namespace-factory-JdbcExtractionNamespace{connectorConfig=DbConnectorConfig{createTables=true, connectURI=‘jdbc:mysql://mysql:3306/lookups’, user=‘lookup’, passwordProvider=org.apache.druid.metadata.DefaultPasswordProvider, dbcpProperties=null}, table=‘htp740801306_profile’, keyColumn=‘aliasValue’, valueColumn=‘primaryValue’, tsColumn=‘null’, filter=‘null’, pollPeriod=PT1M}-ee44eb2d-4575-4b03-8395-9625766134bb]

2019-08-15T12:02:34,500 ERROR [LookupExtractorFactoryContainerProvider-Startup-0] org.apache.druid.query.lookup.LookupReferencesManager - Failed to start lookup [htp740801306_profile]:[LookupExtractorFactoryContainer{version=‘v0’, lookupExtractorFactory=NamespaceLookupExtractorFactory{extractionNamespace=JdbcExtractionNamespace{connectorConfig=DbConnectorConfig{createTables=true, connectURI=‘jdbc:mysql://mysql:3306/lookups’, user=‘lookup’, passwordProvider=org.apache.druid.metadata.DefaultPasswordProvider, dbcpProperties=null}, table=‘htp740801306_profile’, keyColumn=‘aliasValue’, valueColumn=‘primaryValue’, tsColumn=‘null’, filter=‘null’, pollPeriod=PT1M}, firstCacheTimeout=5000, injective=false, extractorID=‘namespace-factory-JdbcExtractionNamespace{connectorConfig=DbConnectorConfig{createTables=true, connectURI=‘jdbc:mysql://mysql:3306/lookups’, user=‘lookup’, passwordProvider=org.apache.druid.metadata.DefaultPasswordProvider, dbcpProperties=null}, table=‘htp740801306_profile’, keyColumn=‘aliasValue’, valueColumn=‘primaryValue’, tsColumn=‘null’, filter=‘null’, pollPeriod=PT1M}-ee44eb2d-4575-4b03-8395-9625766134bb’}}]

``

Here is the supervisor spec:

{

“type”: “kafka”,

“dataSchema”: {

“dataSource”: “cds00015_commerce”,

“parser”: {

 "type": "string",

 "parseSpec": {

   "format": "json",

   "flattenSpec": {

     "useFieldDiscovery": false,

     "fields": [

       {

         "name": "channel_ref",

         "expr": "$.channelRef",

         "type": "path"

       },

       {

         "name": "currency",

         "expr": "$.body.currency",

         "type": "path"

       },

       "date",

       {

         "name": "event_type",

         "expr": "$.type",

         "type": "path"

       },

       "guid",

       {

         "name": "item_currency",

         "expr": "$.body.lineItems[0].currency",

         "type": "path"

       },

       "item_index",

       {

         "name": "item_pos",

         "expr": "$.body.lineItems[0].pos",

         "type": "path"

       },

       {

         "name": "item_price_effective",

         "expr": "$.body.lineItems[0].price_effective",

         "type": "path"

       },

       {

         "name": "item_price_list",

         "expr": "$.body.lineItems[0].price_list",

         "type": "path"

       },

       {

         "name": "item_quantity",

         "expr": "$.body.lineItems[0].quantity",

         "type": "path"

       },

       {

         "name": "item_ref",

         "expr": "$.body.lineItems[0].ref",

         "type": "path"

       },

       {

         "name": "item_status",

         "expr": "$.body.lineItems[0].status",

         "type": "path"

       },

       {

         "name": "item_tax_amount",

         "expr": "$.body.lineItems[0].taxAmount",

         "type": "path"

       },

       {

         "name": "item_type",

         "expr": "$.body.lineItems[0].type",

         "type": "path"

       },

       {

         "name": "item_unit",

         "expr": "$.body.lineItems[0].unit",

         "type": "path"

       },

       {

         "name": "order_id",

         "expr": "$.body.orderId",

         "type": "path"

       },

       {

         "name": "order_value",

         "expr": "$.body.orderValue",

         "type": "path"

       },

       {

         "name": "payment_info_payment_type",

         "expr": "$.body.paymentInfo.paymentType",

         "type": "path"

       },

       {

         "name": "payment_info_status",

         "expr": "$.body.paymentInfo.status",

         "type": "path"

       },

       "profile_id",

       "profile_trace_id",

       {

         "name": "promotion_info_ref",

         "expr": "$.body.promotionInfo.ref",

         "type": "path"

       },

       {

         "name": "promotion_info_type",

         "expr": "$.body.promotionInfo.type",

         "type": "path"

       },

       {

         "name": "shipment_info_carrier",

         "expr": "$.body.shipmentInfo.carrier",

         "type": "path"

       },

       {

         "name": "shipment_info_status",

         "expr": "$.body.shipmentInfo.status",

         "type": "path"

       },

       {

         "name": "status",

         "expr": "$.body.status",

         "type": "path"

       }

     ]

   },

   "dimensionsSpec": {

     "dimensions": [

       "channel_ref",

       "currency",

       "date",

       "event_type",

       "guid",

       "item_currency",

       "item_index",

       {

         "name": "item_pos",

         "type": "long"

       },

       {

         "name": "item_price_effective",

         "type": "float"

       },

       {

         "name": "item_price_list",

         "type": "float"

       },

       {

         "name": "item_quantity",

         "type": "long"

       },

       "item_ref",

       "item_status",

       {

         "name": "item_tax_amount",

         "type": "float"

       },

       "item_type",

       "item_unit",

       "order_id",

       {

         "name": "order_value",

         "type": "float"

       },

       "payment_info_payment_type",

       "payment_info_status",

       "profile_id",

       "profile_trace_id",

       "promotion_info_ref",

       "promotion_info_type",

       "shipment_info_carrier",

       "shipment_info_status",

       "status"

     ],

     "dimensionsExclusions": []

   },

   "timestampSpec": {

     "format": "auto",

     "column": "date"

   }

 }

},

“metricsSpec”: [

 {

   "type": "count",

   "name": "count"

 }

],

“granularitySpec”: {

 "type": "uniform",

 "segmentGranularity": "HOUR",

 "queryGranularity": {

   "type": "none"

 },

 "rollup": true,

 "intervals": null

},

“transformSpec”: {

 "filter": null,

 "transforms": []

}

},

“tuningConfig”: {

“type”: “kafka”,

“maxRowsInMemory”: 1000000,

“maxBytesInMemory”: 0,

“maxRowsPerSegment”: 5000000,

“maxTotalRows”: null,

“intermediatePersistPeriod”: “PT10M”,

“basePersistDirectory”: “/mnt/imply/var/tmp/1563288588787-0”,

“maxPendingPersists”: 0,

“indexSpec”: {

 "bitmap": {

   "type": "concise"

 },

 "dimensionCompression": "lz4",

 "metricCompression": "lz4",

 "longEncoding": "longs"

},

“buildV9Directly”: true,

“reportParseExceptions”: false,

“handoffConditionTimeout”: 0,

“resetOffsetAutomatically”: true,

“segmentWriteOutMediumFactory”: null,

“workerThreads”: null,

“chatThreads”: null,

“chatRetries”: 8,

“httpTimeout”: “PT10S”,

“shutdownTimeout”: “PT80S”,

“offsetFetchPeriod”: “PT30S”,

“intermediateHandoffPeriod”: “P2147483647D”,

“logParseExceptions”: false,

“maxParseExceptions”: 2147483647,

“maxSavedParseExceptions”: 0,

“skipSequenceNumberAvailabilityCheck”: false

},

“ioConfig”: {

“topic”: “cds00015_commerce”,

“replicas”: 1,

“taskCount”: 1,

“taskDuration”: “PT3600S”,

“consumerProperties”: {

 "bootstrap.servers": "kafka1:9092,kafka2:9092,kafka3:9092"

},

“pollTimeout”: 100,

“startDelay”: “PT5S”,

“period”: “PT30S”,

“useEarliestOffset”: true,

“completionTimeout”: “PT1800S”,

“lateMessageRejectionPeriod”: “PT129600S”,

“earlyMessageRejectionPeriod”: null,

“stream”: “cds00015_commerce”,

“useEarliestSequenceNumber”: true

},

“context”: null,

“suspended”: false

}

``

here are the last few messages of the failing task:

2019-08-15T12:03:33,626 INFO [LookupExtractorFactoryContainerProvider-Startup-0] org.apache.druid.query.lookup.LookupReferencesManager - Started lookup [htv730395685_profile]:[LookupExtractorFactoryContainer{version=‘null’, lookupExtractorFactory=NamespaceLookupExtractorFactory{extractionNamespace=JdbcExtractionNamespace{connectorConfig=DbConnectorConfig{createTables=true, connectURI=‘jdbc:mysql://mysql:3306/lookups’, user=‘lookup’, passwordProvider=org.apache.druid.metadata.DefaultPasswordProvider, dbcpProperties=null}, table=‘htv730395685_profile’, keyColumn=‘aliasValue’, valueColumn=‘primaryValue’, tsColumn=‘null’, filter=‘null’, pollPeriod=PT1M}, firstCacheTimeout=5000, injective=false, extractorID=‘namespace-factory-JdbcExtractionNamespace{connectorConfig=DbConnectorConfig{createTables=true, connectURI=‘jdbc:mysql://mysql:3306/lookups’, user=‘lookup’, passwordProvider=org.apache.druid.metadata.DefaultPasswordProvider, dbcpProperties=null}, table=‘htv730395685_profile’, keyColumn=‘aliasValue’, valueColumn=‘primaryValue’, tsColumn=‘null’, filter=‘null’, pollPeriod=PT1M}-e8db70c7-ffe5-4185-9230-0eaa08fc4d81’}}]

2019-08-15T12:03:33,627 INFO [LookupExtractorFactoryContainerProvider-Startup-0] org.apache.druid.query.lookup.LookupReferencesManager - Starting lookup [testknocker_profile]:[LookupExtractorFactoryContainer{version=‘v0’, lookupExtractorFactory=NamespaceLookupExtractorFactory{extractionNamespace=JdbcExtractionNamespace{connectorConfig=DbConnectorConfig{createTables=true, connectURI=‘jdbc:mysql://mysql:3306/lookups’, user=‘lookup’, passwordProvider=org.apache.druid.metadata.DefaultPasswordProvider, dbcpProperties=null}, table=‘testknocker_profile’, keyColumn=‘aliasValue’, valueColumn=‘primaryValue’, tsColumn=‘null’, filter=‘null’, pollPeriod=PT1M}, firstCacheTimeout=5000, injective=false, extractorID=‘namespace-factory-JdbcExtractionNamespace{connectorConfig=DbConnectorConfig{createTables=true, connectURI=‘jdbc:mysql://mysql:3306/lookups’, user=‘lookup’, passwordProvider=org.apache.druid.metadata.DefaultPasswordProvider, dbcpProperties=null}, table=‘testknocker_profile’, keyColumn=‘aliasValue’, valueColumn=‘primaryValue’, tsColumn=‘null’, filter=‘null’, pollPeriod=PT1M}-116c8edb-b477-40ca-92eb-dd8e71fbfb4b’}}]

``

and successful:

2019-08-15T00:57:58,628 INFO [main] org.apache.druid.query.lookup.LookupReferencesManager - Closing lookup [htp740801306_profile]

2019-08-15T00:57:58,628 INFO [main] org.apache.druid.server.lookup.namespace.cache.CacheScheduler - Closing namespace [JdbcExtractionNamespace{connectorConfig=DbConnectorConfig{createTables=true, connectURI=‘jdbc:mysql://mysql:3306/lookups’, user=‘lookup’, passwordProvider=org.apache.druid.metadata.DefaultPasswordProvider, dbcpProperties=null}, table=‘htp740801306_profile’, keyColumn=‘aliasValue’, valueColumn=‘primaryValue’, tsColumn=‘null’, filter=‘null’, pollPeriod=PT1M}] : org.apache.druid.server.lookup.namespace.cache.CacheScheduler$EntryImpl@258e8608

2019-08-15T00:57:58,628 INFO [main] org.apache.druid.query.lookup.LookupReferencesManager - LookupExtractorFactoryContainerProvider is stopped.

2019-08-15T00:57:58,628 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.indexing.worker.executor.ExecutorLifecycle.stop() throws java.lang.Exception] on object[org.apache.druid.indexing.worker.executor.ExecutorLifecycle@37c2f0b4].

2019-08-15T00:57:58,628 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner.stop()] on object[org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner@6658f08a].

2019-08-15T00:57:58,628 INFO [main] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting graceful shutdown of task[index_kafka_cds00015_commerce_d72c36e6948cdc9_gheckidh].

2019-08-15T00:57:58,628 INFO [main] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Stopping gracefully (status: [PUBLISHING])

2019-08-15T00:57:58,630 INFO [main] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Graceful shutdown of task[index_kafka_cds00015_commerce_d72c36e6948cdc9_gheckidh] finished in 2ms.

2019-08-15T00:57:58,630 INFO [main] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_cds00015_commerce_d72c36e6948cdc9_gheckidh] status changed to [SUCCESS].

2019-08-15T00:57:58,630 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.client.cache.CaffeineCache.close() throws java.io.IOException] on object[org.apache.druid.client.cache.CaffeineCache@2570b316].

2019-08-15T00:57:58,632 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.discovery.DruidLeaderClient.stop()] on object[org.apache.druid.discovery.DruidLeaderClient@786a3477].

2019-08-15T00:57:58,632 INFO [main] org.apache.druid.discovery.DruidLeaderClient - Stopped.

2019-08-15T00:57:58,632 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[org.apache.druid.curator.discovery.ServerDiscoverySelector@63f2e0b].

2019-08-15T00:57:58,634 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.discovery.DruidLeaderClient.stop()] on object[org.apache.druid.discovery.DruidLeaderClient@5f883d90].

2019-08-15T00:57:58,634 INFO [main] org.apache.druid.discovery.DruidLeaderClient - Stopped.

2019-08-15T00:57:58,634 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[org.apache.druid.curator.discovery.ServerDiscoverySelector@20b54cfe].

2019-08-15T00:57:58,635 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider.stop()] on object[org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider@1c2d63f0].

2019-08-15T00:57:58,635 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider - stopping

2019-08-15T00:57:58,635 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider - stopped

2019-08-15T00:57:58,635 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.java.util.http.client.NettyHttpClient.stop()] on object[org.apache.druid.java.util.http.client.NettyHttpClient@5409dfdd].

2019-08-15T00:57:58,647 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.java.util.metrics.MonitorScheduler.stop()] on object[org.apache.druid.java.util.metrics.MonitorScheduler@4a6a6a69].

2019-08-15T00:57:58,648 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.java.util.emitter.service.ServiceEmitter.close() throws java.io.IOException] on object[ServiceEmitter{serviceDimensions={service=druid/middleManager, host=ip-10-64-123-201:8104, version=0.14.0-incubating-iap3}, emitter=org.apache.druid.java.util.emitter.core.NoopEmitter@292158f8}].

2019-08-15T00:57:58,648 INFO [main] org.apache.druid.curator.CuratorModule - Stopping Curator

2019-08-15T00:57:58,649 INFO [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting

2019-08-15T00:57:58,651 INFO [main] org.apache.zookeeper.ZooKeeper - Session: 0x46be143f710afb0 closed

2019-08-15T00:57:58,651 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [INIT]

2019-08-15T00:57:58,651 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x46be143f710afb0

2019-08-15T00:57:58,652 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner.stop()] on object[org.apache.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner@54567b05].

Finished peon task

``

We are also seeing that creating and updating supervisors is taking a long time. A post to the supervisor endpoint to update can take 60 seconds.

So, this is actually affecting our runtime. we are seeing it take some minutes for a new task to start, because one fails before a new one starts, so there is a time where our data is not actually realtime.
Thanks again