Increase druid.processing.buffer.sizeBytes beyond 2 GB

Hello,

How can I go beyond 2 GB for druid.processing.buffer.sizeBytes? I am facing this error:

io.druid.server.QueryResource - Exception handling request: {class=io.druid.server.QueryResource, exceptionType=class io.druid.query.ResourceLimitExceededException, exceptionMessage=Not enough aggregation table space to execute this query. Try increasing druid.processing.buffer.sizeBytes or enable disk spilling by setting druid.query.groupBy.maxOnDiskStorage to a positive number., exception=io.druid.query.ResourceLimitExceededException: Not enough aggregation table space to execute this query. Try increasing druid.processing.buffer.sizeBytes or enable disk spilling by setting druid.query.groupBy.maxOnDiskStorage to a positive number., query=GroupByQuery{..}

io.druid.query.ResourceLimitExceededException: Not enough aggregation table space to execute this query. Try increasing druid.processing.buffer.sizeBytes or enable disk spilling by setting druid.query.groupBy.maxOnDiskStorage to a positive number.

at io.druid.query.groupby.epinephelinae.GroupByMergingQueryRunnerV2.waitForFutureCompletion(GroupByMergingQueryRunnerV2.java:325) ~[druid-processing-0.10.1.jar:0.10.1]
at io.druid.query.groupby.epinephelinae.GroupByMergingQueryRunnerV2.access$700(GroupByMergingQueryRunnerV2.java:72) ~[druid-processing-0.10.1.jar:0.10.1]
at io.druid.query.groupby.epinephelinae.GroupByMergingQueryRunnerV2$1.make(GroupByMergingQueryRunnerV2.java:268) ~[druid-processing-0.10.1.jar:0.10.1]
at io.druid.query.groupby.epinephelinae.GroupByMergingQueryRunnerV2$1.make(GroupByMergingQueryRunnerV2.java:155) ~[druid-processing-0.10.1.jar:0.10.1]
at io.druid.java.util.common.guava.BaseSequence.toYielder(BaseSequence.java:65) ~[java-util-0.10.1.jar:0.10.1]
at io.druid.common.guava.CombiningSequence.toYielder(CombiningSequence.java:80) ~[druid-common-0.10.1.jar:0.10.1]
at io.druid.java.util.common.guava.MappedSequence.toYielder(MappedSequence.java:49) ~[java-util-0.10.1.jar:0.10.1]
at io.druid.java.util.common.guava.WrappingSequence$2.get(WrappingSequence.java:87) ~[java-util-0.10.1.jar:0.10.1]
at io.druid.java.util.common.guava.WrappingSequence$2.get(WrappingSequence.java:83) ~[java-util-0.10.1.jar:0.10.1]
at io.druid.java.util.common.guava.SequenceWrapper.wrap(SequenceWrapper.java:55) ~[java-util-0.10.1.jar:0.10.1]
at io.druid.java.util.common.guava.WrappingSequence.toYielder(WrappingSequence.java:82) ~[java-util-0.10.1.jar:0.10.1]
at io.druid.java.util.common.guava.MappedSequence.toYielder(MappedSequence.java:49) ~[java-util-0.10.1.jar:0.10.1]
at io.druid.java.util.common.guava.WrappingSequence$2.get(WrappingSequence.java:87) ~[java-util-0.10.1.jar:0.10.1]
at io.druid.java.util.common.guava.WrappingSequence$2.get(WrappingSequence.java:83) ~[java-util-0.10.1.jar:0.10.1]
at io.druid.query.CPUTimeMetricQueryRunner$1.wrap(CPUTimeMetricQueryRunner.java:74) ~[druid-processing-0.10.1.jar:0.10.1]
at io.druid.java.util.common.guava.WrappingSequence.toYielder(WrappingSequence.java:82) ~[java-util-0.10.1.jar:0.10.1]
at io.druid.java.util.common.guava.Yielders.each(Yielders.java:32) ~[java-util-0.10.1.jar:0.10.1]
at io.druid.server.QueryResource.doPost(QueryResource.java:259) [druid-server-0.10.1.jar:0.10.1]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_144]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_144]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_144]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_144]
at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409) [jersey-server-1.19.3.jar:1.19.3]
at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409) [jersey-servlet-1.19.3.jar:1.19.3]
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558) [jersey-servlet-1.19.3.jar:1.19.3]
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733) [jersey-servlet-1.19.3.jar:1.19.3]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) [javax.servlet-api-3.1.0.jar:3.1.0]
at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:286) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:276) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:181) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:120) [guice-servlet-4.1.0.jar:?]
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:135) [guice-servlet-4.1.0.jar:?]
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:224) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.Server.handle(Server.java:534) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]

My setup’s druid.processing.buffer.sizeBytes is already 2^31 - 1. If I increase it further, I get a NumberFormatException at broker startup (the value is parsed as a Java int, so 2^31 - 1 appears to be the hard cap).

Is there a way to extend druid.processing.buffer.sizeBytes beyond 2^31?

One option is to spill to disk (druid.query.groupBy.maxOnDiskStorage), but that would make Druid slower.
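
For reference, enabling spilling is a one-line change; here is a sketch of the relevant runtime.properties entry (the 10 GB value is just an illustrative number, not a recommendation):

# Allow groupBy v2 to spill up to ~10 GB per query to disk.
# The default of 0 disables spilling, which is what produces the
# ResourceLimitExceededException above.
druid.query.groupBy.maxOnDiskStorage=10000000000

The same limit can also be set per query via the maxOnDiskStorage key in the query context, so spilling can be enabled only for the heaviest queries.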

Hey Sergio,

groupBy doesn’t currently support in-memory aggregation buffers larger than 2 GB. You might be able to rewrite your query to use less memory, but if not, try allowing it to use disk and see how it performs.

Hey Gian, we only use groupBy queries over:

FLOOR(__time TO DAY) and a dimension with cardinality ~150

and we FILTER on a lot of dimensions, around 30-40.

  1. Does that mean we only need a small druid.processing.buffer.sizeBytes, since it never fills up for such queries, and that we should instead increase druid.processing.numMergeBuffers?

  2. What is the role of HEAP in this use case? Is it also needed?

  3. Should there be a reserve in -XX:MaxDirectMemorySize, or should it be almost equal to druid.processing.buffer.sizeBytes * (druid.processing.numMergeBuffers + druid.processing.numThreads + 1)?

  4. Should an identical setup be applied on the historical nodes too?

We use r4.xlarge instances with 30 GB of RAM, so my broker and historical setup is currently:

-XX:MaxDirectMemorySize=18g
druid.processing.numMergeBuffers=4
druid.processing.numThreads=3
druid.processing.buffer.sizeBytes=2147483647
HEAP=8g
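
For reference, plugging these numbers into the sizing formula from question 3 (a worked check, not part of the original posts):

direct memory needed = buffer.sizeBytes * (numMergeBuffers + numThreads + 1)
                     = 2147483647 * (4 + 3 + 1)
                     ≈ 16 GiB

so -XX:MaxDirectMemorySize=18g leaves roughly 2 GiB of headroom.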

The problem is that our performance dropped about 6x when we upgraded from 0.9.2 to 0.10.1.

Also, we use 6 GB for caching, which is why HEAP=8g.

Hi Jakub,

The problem is that our performance dropped about 6x when we upgraded from 0.9.2 to 0.10.1.

Would you be able to provide an example query where the performance dropped going from 0.9.2->0.10.1? Is the performance degradation you’re seeing happening on all queries or only certain types?

Hi Jonathan,

It’s a groupBy with time bucketing:

http http://host/druid/v2/sql/ query="SELECT FLOOR(__time TO DAY), d_section, COUNT(DISTINCT diid) FROM \"gwiq-daily-p\" WHERE TIMESTAMP '2017-10-01 00:00:00' <= __time AND __time <= TIMESTAMP '2017-10-10 00:00:00' AND cid = 'c0248' GROUP BY \"d_section\", FLOOR(__time TO DAY)"

We hardly use any other queries. The only thing that changes is the number of filters, from 1 to 20, but it doesn’t really make a difference.

0.9.2 didn’t support Druid SQL, so I guess you were using some other query mechanism there. Could you please let us know what it was?

One guess I have is that maybe you were using PlyQL beforehand, which might have used a granularity=day query with one dimension to execute that SQL. Druid SQL would use a granularity=all query with two dimensions (one being a time extraction fn). The Druid SQL approach is somewhat slower. If that’s what’s going on, could you create a GitHub issue for the performance problem, with some details about the queries you ran on 0.9.2 and on 0.10.1?
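
For illustration, the granularity=all shape Gian describes might look roughly like the native query below. This is a hand-written sketch, not an actual plan dump: the outputName values, the timeFormat extraction fn, the cardinality aggregator standing in for COUNT(DISTINCT diid), and the interval bounds are all assumptions.

{
  "queryType": "groupBy",
  "dataSource": "gwiq-daily-p",
  "granularity": "all",
  "intervals": ["2017-10-01/2017-10-11"],
  "filter": { "type": "selector", "dimension": "cid", "value": "c0248" },
  "dimensions": [
    "d_section",
    {
      "type": "extraction",
      "dimension": "__time",
      "outputName": "day",
      "extractionFn": { "type": "timeFormat", "granularity": "day", "timeZone": "UTC" }
    }
  ],
  "aggregations": [
    { "type": "cardinality", "name": "distinct_diid", "fields": ["diid"] }
  ]
}

Grouping on the extraction-fn time dimension is what makes this shape more expensive than a plain granularity=day query: every distinct day becomes a grouping key instead of a query-level bucket.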

It was PlyQL, but the performance drop is not related to the SQL engines; it dropped ~5-6x on both. We currently support both PlyQL and native Druid SQL.

There is a problem on the historical nodes: on Druid 0.9.2, when we repeated the same query after it got cached, there was almost no CPU load on historicals; now it’s as if the broker cache doesn’t work. We currently cache on both historicals and brokers, with druid.historical.cache.unCacheable=["select"].

I think that the actual slowdown could be caused by https://github.com/druid-io/druid/pull/3950. Brokers are now sort of idling and all the hard work happens on historicals.

Hi Gian,

if I run a benchmark with 50 queries, each with a FILTER over a different dimension, and then run it again, the first run with a cold cache takes almost 2 minutes, while the second one takes 15 seconds with the CPU poorly utilized.

I’m also trying to benchmark the groupBy v1 engine so that we could switch to it to get back the original performance, but this:

{
  "query": "SELECT FLOOR(__time TO DAY), 'c-geo:c3', COUNT(DISTINCT diid) FROM \"gwiq-daily-p\" WHERE TIMESTAMP '2017-05-01 00:00:00' <= __time AND __time <= TIMESTAMP '2017-10-10 00:00:00' AND cid = 'c0248' GROUP BY \"c-geo:c3\", FLOOR(__time TO DAY)",
  "context": { "groupByStrategy": "v1" }
}

actually errors out with:

GroupBy v1 only supports dimensions with an outputType of STRING.

Hey Jakub,

A couple things.

  1. If https://github.com/druid-io/druid/pull/3950 is the culprit, and you have caching enabled on both the broker and historical, then I guess I’m confused, since I would think the historical caching should still help you out here. Could you try profiling the historicals to see what they’re doing while this slowdown is happening? You could use a profiler like YourKit, or do “poor man’s profiling” by running jstack -l a few times. Can you also please open a GitHub issue with whatever info you’ve gathered about this, including your cache configs on both the broker and historical?

  2. groupBy v1 is giving you that error because the SQL query uses an outputType of “long” for the FLOOR(__time TO DAY) function, and groupBy v1 doesn’t support that. (Some Druid SQL features require groupBy v2.) If you write the query as a native JSON groupBy without that, it should work; see the sketch below. Maybe use PlyQL’s verbose mode to help you write it.
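
For example, a native equivalent that moves the day bucketing into the query granularity (so no long-typed dimension is needed) might look roughly like this. It is a sketch under assumptions: the cardinality aggregator standing in for COUNT(DISTINCT diid) and the exact interval bounds are mine, not from the thread.

{
  "queryType": "groupBy",
  "dataSource": "gwiq-daily-p",
  "granularity": "day",
  "intervals": ["2017-05-01/2017-10-11"],
  "filter": { "type": "selector", "dimension": "cid", "value": "c0248" },
  "dimensions": ["c-geo:c3"],
  "aggregations": [
    { "type": "cardinality", "name": "distinct_diid", "fields": ["diid"] }
  ],
  "context": { "groupByStrategy": "v1" }
}

Because the only grouping dimension left is a plain string, this shape should be acceptable to groupBy v1.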