Hello,

I upgraded Druid to version 0.10.1. I followed the rolling-update procedure and used the rolling restart (graceful-termination-based) strategy for the Middle Manager upgrade.
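
For context, this is roughly the per-node sequence we scripted for the graceful part of the Middle Manager restarts (a minimal sketch, not our exact script; it assumes the Middle Manager answers on port 8091 and the /druid/worker/v1/disable and /druid/worker/v1/tasks endpoints described in the rolling-updates docs):

```python
# Sketch of the graceful Middle Manager restart sequence (assumptions: MM HTTP
# port 8091 and the /druid/worker/v1/disable + /druid/worker/v1/tasks endpoints).
import json
import time
import urllib.request

MM = "http://{{mm-fqdn}}:8091"  # placeholder host, same convention as the configs below

def disable_worker():
    # Tell the Overlord to stop assigning new tasks to this Middle Manager.
    req = urllib.request.Request(MM + "/druid/worker/v1/disable", data=b"", method="POST")
    urllib.request.urlopen(req)

def running_tasks():
    # List the tasks still running on this Middle Manager.
    with urllib.request.urlopen(MM + "/druid/worker/v1/tasks") as resp:
        return json.load(resp)

disable_worker()
while running_tasks():      # wait for the running tasks to finish
    time.sleep(30)
# ...then stop the Middle Manager process, upgrade it to 0.10.1 and start it again.
```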

Unfortunately, after the upgrade the Overlord logs the following warnings and errors:

- `Exception while sending request`:

```
2017-09-06 11:07:30,597 WARN i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-prod_sample_1_events-0] Exception while sending request
io.druid.java.util.common.IAE: Received 400 Bad Request with body: Task must be paused before changing the end offsets
at io.druid.indexing.kafka.KafkaIndexTaskClient.submitRequest(KafkaIndexTaskClient.java:515) [druid-kafka-indexing-service-0.10.1.jar:0.10.1]
at io.druid.indexing.kafka.KafkaIndexTaskClient.setEndOffsets(KafkaIndexTaskClient.java:293) [druid-kafka-indexing-service-0.10.1.jar:0.10.1]
at io.druid.indexing.kafka.KafkaIndexTaskClient$11.call(KafkaIndexTaskClient.java:429) [druid-kafka-indexing-service-0.10.1.jar:0.10.1]
at io.druid.indexing.kafka.KafkaIndexTaskClient$11.call(KafkaIndexTaskClient.java:425) [druid-kafka-indexing-service-0.10.1.jar:0.10.1]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_141]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
```

- `Exception while getting current/latest offsets`:

```
2017-09-06 11:28:43,877 WARN i.d.i.k.s.KafkaSupervisor [KafkaSupervisor-prod_sample_1_events-Reporting-0] Exception while getting current/latest offsets
java.util.concurrent.TimeoutException: Timeout waiting for task.
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:269) ~[guava-16.0.1.jar:?]
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96) ~[guava-16.0.1.jar:?]
at io.druid.indexing.kafka.supervisor.KafkaSupervisor.updateCurrentOffsets(KafkaSupervisor.java:1802) ~[?:?]
at io.druid.indexing.kafka.supervisor.KafkaSupervisor.lambda$updateCurrentAndLatestOffsets$12(KafkaSupervisor.java:1810) ~[?:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_141]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_141]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_141]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_141]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
```

- `Kafka partitions [...] do not match task partitions [...]`, which occurs frequently:

```
2017-09-06 11:25:31,410 WARN i.d.i.k.s.KafkaSupervisor [KafkaSupervisor-prod_sample_1_events-Reporting-0] Lag metric: Kafka partitions [0, 1, 2, 3, 4, 5, 6, 7] do not match task partitions [1, 2, 3, 4, 6]
```
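
When these warnings appear I have been checking the supervisor state via the Overlord API (again only a sketch; it assumes the default Overlord port 8090 and the supervisor id prod_sample_1_events that shows up in the log lines):

```python
# Sketch: ask the Overlord for the Kafka supervisor's status report
# (assumes Overlord port 8090 and the supervisor id seen in the log lines above).
import json
import urllib.request

OVERLORD = "http://{{overlord-fqdn}}:8090"   # placeholder host, same convention as the configs below
SUPERVISOR_ID = "prod_sample_1_events"

url = OVERLORD + "/druid/indexer/v1/supervisor/" + SUPERVISOR_ID + "/status"
with urllib.request.urlopen(url) as resp:
    status = json.load(resp)

# Print the report so the supervisor's view of its tasks can be compared
# with the partition warnings above.
print(json.dumps(status, indent=2))
```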

The Overlord hangs, and while trying to restart it there are further problems:

- `Cannot get a connection, general error` (the root cause in the trace is an `InterruptedException` while waiting on the DBCP connection pool):

```
2017-09-06 11:44:10,741 WARN i.d.j.u.c.RetryUtils [qtp1373220972-115] Failed on try 1, retrying in 720ms.
org.skife.jdbi.v2.exceptions.UnableToObtainConnectionException: java.sql.SQLException: Cannot get a connection, general error
at org.skife.jdbi.v2.DBI.open(DBI.java:230) ~[jdbi-2.63.1.jar:2.63.1]
at org.skife.jdbi.v2.DBI.withHandle(DBI.java:279) ~[jdbi-2.63.1.jar:2.63.1]
at io.druid.metadata.SQLMetadataConnector$2.call(SQLMetadataConnector.java:129) ~[druid-server-0.10.1.jar:0.10.1]
at io.druid.java.util.common.RetryUtils.retry(RetryUtils.java:63) [java-util-0.10.1.jar:0.10.1]
at io.druid.java.util.common.RetryUtils.retry(RetryUtils.java:81) [java-util-0.10.1.jar:0.10.1]
at io.druid.metadata.SQLMetadataConnector.retryWithHandle(SQLMetadataConnector.java:133) [druid-server-0.10.1.jar:0.10.1]
at io.druid.metadata.SQLMetadataConnector.retryWithHandle(SQLMetadataConnector.java:142) [druid-server-0.10.1.jar:0.10.1]
at io.druid.metadata.IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForIntervals(IndexerSQLMetadataStorageCoordinator.java:127)
[druid-server-0.10.1.jar:0.10.1]
at io.druid.metadata.IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval(IndexerSQLMetadataStorageCoordinator.java:119)
[druid-server-0.10.1.jar:0.10.1]
at io.druid.indexing.common.actions.SegmentAllocateAction.perform(SegmentAllocateAction.java:156)
[druid-indexing-service-0.10.1.jar:0.10.1]
at io.druid.indexing.common.actions.SegmentAllocateAction.perform(SegmentAllocateAction.java:54)
[druid-indexing-service-0.10.1.jar:0.10.1]
at io.druid.indexing.common.actions.LocalTaskActionClient.submit(LocalTaskActionClient.java:64)
[druid-indexing-service-0.10.1.jar:0.10.1]
at io.druid.indexing.overlord.http.OverlordResource$3.apply(OverlordResource.java:341)
[druid-indexing-service-0.10.1.jar:0.10.1]
at io.druid.indexing.overlord.http.OverlordResource$3.apply(OverlordResource.java:330)
[druid-indexing-service-0.10.1.jar:0.10.1]
at io.druid.indexing.overlord.http.OverlordResource.asLeaderWith(OverlordResource.java:673)
[druid-indexing-service-0.10.1.jar:0.10.1]
at io.druid.indexing.overlord.http.OverlordResource.doAction(OverlordResource.java:327)
[druid-indexing-service-0.10.1.jar:0.10.1]
at sun.reflect.GeneratedMethodAccessor55.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_141]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_141]
at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
[jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
[jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
[jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
[jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
[jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
[jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
[jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
[jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
[jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
[jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
[jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409) [jersey-servlet-1.19.3.jar:1.19.3]
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
[jersey-servlet-1.19.3.jar:1.19.3]
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
[jersey-servlet-1.19.3.jar:1.19.3]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) [javax.servlet-api-3.1.0.jar:3.1.0]
at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:286) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:276) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:181) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:120) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:135) [guice-servlet-4.1.0.jar:?]
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at io.druid.server.http.RedirectFilter.doFilter(RedirectFilter.java:72) [druid-server-0.10.1.jar:0.10.1]
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:224)
[jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
[jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
[jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
[jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
[jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
[jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
[jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
[jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.Server.handle(Server.java:534) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
[jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
[jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
[jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
[jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
[jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: java.sql.SQLException: Cannot get a connection, general error
at org.apache.commons.dbcp2.PoolingDataSource.getConnection(PoolingDataSource.java:131) ~[commons-dbcp2-2.0.1.jar:2.0.1]
at org.apache.commons.dbcp2.BasicDataSource.getConnection(BasicDataSource.java:1413) ~[commons-dbcp2-2.0.1.jar:2.0.1]
at org.skife.jdbi.v2.DataSourceConnectionFactory.openConnection(DataSourceConnectionFactory.java:36)
~[jdbi-2.63.1.jar:2.63.1]
at org.skife.jdbi.v2.DBI.open(DBI.java:212) ~[jdbi-2.63.1.jar:2.63.1]
... 63 more
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
~[?:1.8.0_141]
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
~[?:1.8.0_141]
at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:524) ~[commons-pool2-2.2.jar:2.2]
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:438) ~[commons-pool2-2.2.jar:2.2]
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:361) ~[commons-pool2-2.2.jar:2.2]
at org.apache.commons.dbcp2.PoolingDataSource.getConnection(PoolingDataSource.java:119) ~[commons-dbcp2-2.0.1.jar:2.0.1]
at org.apache.commons.dbcp2.BasicDataSource.getConnection(BasicDataSource.java:1413) ~[commons-dbcp2-2.0.1.jar:2.0.1]
at org.skife.jdbi.v2.DataSourceConnectionFactory.openConnection(DataSourceConnectionFactory.java:36)
~[jdbi-2.63.1.jar:2.63.1]
at org.skife.jdbi.v2.DBI.open(DBI.java:212) ~[jdbi-2.63.1.jar:2.63.1]
... 63 more
```

- `Couldn't stop Thread` and `Failed to get task runner`:

```
2017-09-06 11:44:25,739 WARN o.e.j.u.t.QueuedThreadPool [Thread-41] qtp1373220972{STOPPING,40<=40<=40,i=0,q=36} Couldn't stop Thread[qtp1373220972-102,5,main]
2017-09-06 11:44:25,739 WARN o.e.j.u.t.QueuedThreadPool [Thread-41] qtp1373220972{STOPPING,40<=40<=40,i=0,q=36} Couldn't stop Thread[qtp1373220972-84,5,main]
2017-09-06 11:44:25,739 WARN o.e.j.u.t.QueuedThreadPool [Thread-41] qtp1373220972{STOPPING,40<=40<=40,i=0,q=36} Couldn't stop Thread[qtp1373220972-91,5,main]
2017-09-06 11:51:03,706 ERROR i.d.i.k.s.KafkaSupervisor [KafkaIndexTaskClient-prod_sample_1_events-0] Failed to get task runner because I'm not the leader!
```
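
Because of the "not the leader" error we also started polling which Overlord each node thinks is the leader while this is happening (a minimal sketch, assuming the GET /druid/indexer/v1/leader endpoint is available on 0.10.1 and the Overlord listens on port 8090):

```python
# Sketch: poll the Overlord(s) for the current leader (assumes the
# GET /druid/indexer/v1/leader endpoint and port 8090).
import time
import urllib.request

OVERLORDS = ["http://{{overlord-fqdn}}:8090"]  # placeholder host(s), same convention as the configs below

for _ in range(6):                      # check every 10 s for a minute
    for overlord in OVERLORDS:
        try:
            with urllib.request.urlopen(overlord + "/druid/indexer/v1/leader", timeout=5) as resp:
                print(overlord, "-> leader:", resp.read().decode())
        except Exception as exc:        # the hung Overlord simply times out here
            print(overlord, "-> no answer:", exc)
    time.sleep(10)
```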

OUR CONFIGS:

overlord.runtime.properties:

```
druid.host={{overlord-fqdn}}
druid.port={{overlord-port}}
druid.service=druid/overlord
# Upload all task logs to deep storage
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket={{druid-s3-bucket-name}}
druid.indexer.logs.s3Prefix=prod/logs/v1
# Run in remote mode
druid.indexer.runner.type=remote
druid.indexer.runner.minWorkerVersion=#{WORKER_VERSION}
# Store all task state in the metadata storage
druid.indexer.storage.type=metadata
druid.monitoring.monitors=["com.metamx.metrics.SysMonitor","io.druid.client.cache.CacheMonitor","io.druid.server.metrics.EventReceiverFirehoseMonitor","com.metamx.metrics.JvmMonitor"]
```

middleManager.runtime.properties:

```
druid.host={{mm-fqdn}}
druid.port={{mm-port}}
druid.service=druid/middlemanager
druid.monitoring.monitors=["com.metamx.metrics.SysMonitor","io.druid.client.cache.CacheMonitor","io.druid.server.metrics.EventReceiverFirehoseMonitor","com.metamx.metrics.JvmMonitor"]
# Resources for peons
druid.indexer.runner.javaOpts=-server -Xmx6g -XX:MaxDirectMemorySize=4g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -Dcom.amazonaws.services.s3.enableV4
druid.indexer.task.baseTaskDir=/persistent/task/
# Store task logs in deep storage --> please refer to runtime.properties in the _common dir
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=s3-deep-storage
druid.indexer.logs.s3Prefix=prod/logs/v1
# Peon properties
druid.indexer.fork.property.druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=1073741824
druid.indexer.fork.property.druid.processing.numThreads=2
druid.indexer.fork.property.druid.processing.numMergeBuffers=1
#druid.indexer.fork.property.druid.segmentCache.locations=[{"path": "/persistent/zk_druid", "maxSize": 0}]
druid.indexer.fork.property.druid.server.http.numThreads=100
druid.indexer.fork.property.druid.storage.archiveBaseKey=prod
druid.indexer.fork.property.druid.storage.archiveBucket={{druid-s3-bucket-name}}-archive
druid.indexer.fork.property.druid.storage.baseKey=prod/v1
druid.indexer.fork.property.druid.storage.bucket={{druid-s3-bucket-name}}
druid.indexer.fork.property.druid.storage.type=s3
druid.worker.capacity=11
druid.worker.ip=localhost
druid.worker.version=0
```

The only way to stop the hung Overlord is to `kill -9 <overlord_pid>`.

Any idea?

Regards,
Jan