Migration of Druid from 0.10.0 to 0.11.0

Hello,

We are trying to migrate our Druid cluster (used in production) to newer version, starting by 0.11.0 due to the Double data format change coming in 0.12.0.
We are seeing big issues of performance when going to 0.11.0 with the same hardware and queries sent to the cluster as in 0.10.0.

The issue is as follow:

  • We have scripts sending a large amount of timeseries queries that works perfectly fine in 0.10.0
  • The moment we update to 0.11.0, historicals starts throwing this error :

`

2020-05-26T11:39:38,190 WARN [qtp902064508-226[timeseries_[main]_find-event]] io.druid.server.QueryLifecycle - Exception while processing queryId [find-event]
io.druid.query.QueryInterruptedException: Timeout waiting for task.
at io.druid.query.ChainedExecutionQueryRunner$1.make(ChainedExecutionQueryRunner.java:174) ~[druid-processing-0.11.0.jar:0.11.0]
at io.druid.java.util.common.guava.BaseSequence.toYielder(BaseSequence.java:65) ~[java-util-0.11.0.jar:0.11.0]
at io.druid.common.guava.CombiningSequence.toYielder(CombiningSequence.java:80) ~[druid-common-0.11.0.jar:0.11.0]
at io.druid.java.util.common.guava.MappedSequence.toYielder(MappedSequence.java:49) ~[java-util-0.11.0.jar:0.11.0]
at io.druid.java.util.common.guava.WrappingSequence$2.get(WrappingSequence.java:87) ~[java-util-0.11.0.jar:0.11.0]
at io.druid.java.util.common.guava.WrappingSequence$2.get(WrappingSequence.java:83) ~[java-util-0.11.0.jar:0.11.0]
at io.druid.query.CPUTimeMetricQueryRunner$1.wrap(CPUTimeMetricQueryRunner.java:74) ~[druid-processing-0.11.0.jar:0.11.0]
at io.druid.java.util.common.guava.WrappingSequence.toYielder(WrappingSequence.java:82) ~[java-util-0.11.0.jar:0.11.0]
at io.druid.java.util.common.guava.Yielders.each(Yielders.java:32) ~[java-util-0.11.0.jar:0.11.0]
at io.druid.server.QueryResource.doPost(QueryResource.java:193) [druid-server-0.11.0.jar:0.11.0]
at sun.reflect.GeneratedMethodAccessor85.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_131]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_131]
at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409) [jersey-servlet-1.19.3.jar:1.19.3]
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558) [jersey-servlet-1.19.3.jar:1.19.3]
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733) [jersey-servlet-1.19.3.jar:1.19.3]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) [javax.servlet-api-3.1.0.jar:3.1.0]
at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:286) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:276) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:181) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:120) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:135) [guice-servlet-4.1.0.jar:?]
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at io.druid.server.security.PreResponseAuthorizationCheckFilter.doFilter(PreResponseAuthorizationCheckFilter.java:84) [druid-server-0.11.0.jar:0.11.0]
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at io.druid.server.security.AllowAllAuthenticator$1.doFilter(AllowAllAuthenticator.java:85) [druid-server-0.11.0.jar:0.11.0]
at io.druid.server.security.AuthenticationWrappingFilter.doFilter(AuthenticationWrappingFilter.java:60) [druid-server-0.11.0.jar:0.11.0]
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at io.druid.server.security.SecuritySanityCheckFilter.doFilter(SecuritySanityCheckFilter.java:86) [druid-server-0.11.0.jar:0.11.0]
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:224) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.Server.handle(Server.java:534) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:269) ~[guava-16.0.1.jar:?]
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96) ~[guava-16.0.1.jar:?]
at io.druid.query.ChainedExecutionQueryRunner$1.make(ChainedExecutionQueryRunner.java:159) ~[druid-processing-0.11.0.jar:0.11.0]
… 64 more

`

The cluster is then not working anymore and never answering any queries.

For information, here are the config we use for historicals:

20 core CPU / 40 threads
251GB of RAM

JVM option

`

-server
-Xms12g
-Xmx12g
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Djava.io.tmpdir=var/tmp
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Dlog4j.shutdownCallbackRegistry=org.apache.druid.common.config.Log4jShutdown
-Dlog4j.shutdownHookEnabled=true
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime
-Xloggc:/var/logs/druid/historical.gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=50
-XX:GCLogFileSize=10m
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/logs/druid/historical.hprof
-XX:MaxDirectMemorySize=10240g

`

historical runtime

`

druid.service=historical

druid.segmentCache.locations=[{“path”: “/segments”, “maxSize”: 2500000000000}]
druid.segmentCache.numLoadingThreads=7
druid.segmentCache.numBootstrapThreads=7
druid.server.maxSize=2500000000000
druid.server.http.numThreads=200
druid.server.tier=_default_tier
druid.server.priority=1

druid.processing.numMergeBuffers=10

druid.processing.buffer.sizeBytes=1073741824
druid.processing.numThreads=39

druid.historical.cache.useCache=false
druid.historical.cache.populateCache=false
druid.cache.sizeInBytes=1073741824

`

I would be interested if anyone has any idea what could be causing this issue after 0.10.0 ?

This is preventing us to upgrade to any version as this performance issue is making our cluster unusable.