How do I set the configuration if the size of our box is not the same as the one on the website?

For example, in http://druid.io/docs/latest/configuration/production-cluster.html the broker node uses an r3.8xlarge, but we’re using just an r3.large. How should we adjust the numbers in this configuration for the size of our box?

We have seen the box run out of memory a couple of times. Do you guys have a magic formula?

-server
-Xmx25g
-Xms25g
-XX:NewSize=6g
-XX:MaxNewSize=6g
-XX:MaxDirectMemorySize=64g
-XX:+UseConcMarkSweepGC
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Djava.io.tmpdir=/mnt/tmp

-Dcom.sun.management.jmxremote.port=17071
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false

Hi,

Do you know the reason for the OOM? Is this a broker?

Can you share some logs, maybe?

I don’t really know the reason for the OOM. And yes, it was the broker. I did some digging. We may have run some slow queries, but I don’t think that was the reason.

Do you have the GC logs?

Let me dig it up.

This is what we got.

Feb  4 21:34:17 ip-10-90-26-28 java: [Free CSet: 0.0 ms]
Feb  4 21:34:17 ip-10-90-26-28 java: [Eden: 0.0B(408.0M)->0.0B(408.0M) Survivors: 0.0B->0.0B Heap: 8190.0M(8192.0M)->8190.0M(8192.0M)]
Feb  4 21:34:17 ip-10-90-26-28 java: [Times: user=0.19 sys=0.00, real=0.12 secs]
Feb  4 21:34:17 ip-10-90-26-28 java: 586.289: [GC concurrent-root-region-scan-start]
Feb  4 21:34:17 ip-10-90-26-28 java: 586.289: [GC concurrent-root-region-scan-end, 0.0000120 secs]
Feb  4 21:34:17 ip-10-90-26-28 java: 586.289: [GC concurrent-mark-start]
Feb  4 21:34:21 ip-10-90-26-28 puppet-agent[31655]: Finished catalog run in 4.81 seconds
Feb  4 21:34:34 ip-10-90-26-28 java: 586.292: [Full GC 8190M->7095M(8192M), 16.5820380 secs]
Feb  4 21:34:34 ip-10-90-26-28 java: [Eden: 0.0B(408.0M)->0.0B(408.0M) Survivors: 0.0B->0.0B Heap: 8190.0M(8192.0M)->7095.7M(8192.0M)]
Feb  4 21:34:34 ip-10-90-26-28 java: [Times: user=20.99 sys=0.00, real=16.58 secs]
Feb  4 21:34:34 ip-10-90-26-28 java: 602.874: [GC concurrent-mark-abort]
Feb  4 21:34:34 ip-10-90-26-28 java: 2016-02-04T21:34:34,099 INFO [HttpClient-Netty-Worker-2] LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-02-04T21:34:34.093Z","service":"druid/broker","host":"10.90.26.28:8082","metric":"query/node/ttfb","value":578974,"dataSource":"sparrow-firehose-web","dimension":"pHr","duration":"PT67251S","hasFilters":"true","id":"e5c6f8e0-4f82-42ad-817c-49be1b1e27f2","interval":["2016-02-04T00:00:00.000Z/2016-02-04T01:00:00.000Z","2016-02-04T03:00:00.000Z/2016-02-04T20:40:51.000Z"],"numComplexMetrics":"1","numMetrics":"2","server":"10.90.31.17:8084","threshold":"1000","type":"topN"}]
Feb  4 21:34:35 ip-10-90-26-28 java: 603.943: [GC pause (young), 0.3521970 secs]
Feb  4 21:34:35 ip-10-90-26-28 java: [Parallel Time: 350.6 ms, GC Workers: 2]
Feb  4 21:34:35 ip-10-90-26-28 java: [GC Worker Start (ms): Min: 603943.7, Avg: 603943.7, Max: 603943.7, Diff: 0.0]
Feb  4 21:34:35 ip-10-90-26-28 java: [Ext Root Scanning (ms): Min: 12.5, Avg: 12.7, Max: 12.8, Diff: 0.2, Sum: 25.3]
Feb  4 21:34:35 ip-10-90-26-28 java: [Update RS (ms): Min: 0.4, Avg: 0.5, Max: 0.6, Diff: 0.2, Sum: 1.0]
Feb  4 21:34:35 ip-10-90-26-28 java: [Processed Buffers: Min: 8, Avg: 8.0, Max: 8, Diff: 0, Sum: 16]
Feb  4 21:34:35 ip-10-90-26-28 java: [Scan RS (ms): Min: 0.1, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.2]
Feb  4 21:34:35 ip-10-90-26-28 java: [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, S

The broker didn’t die but it was just really slow during the Full GC.
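
To see how often this happens and how long each pause lasts, the Full GC lines can be pulled out of the syslog with a small script. This is only a rough sketch: the regular expression is written against the `[Full GC 8190M->7095M(8192M), 16.5820380 secs]` format shown in the log above, and the function name `summarize_full_gcs` is made up for illustration. Other JVM versions or GC flags may print a different format.

```python
import re

# Matches lines such as: [Full GC 8190M->7095M(8192M), 16.5820380 secs]
# Assumption: sizes are always printed in whole megabytes, as in the log above.
FULL_GC = re.compile(r"\[Full GC (\d+)M->(\d+)M\((\d+)M\), ([\d.]+) secs\]")


def summarize_full_gcs(lines):
    """Return one dict per Full GC event found in the given log lines."""
    events = []
    for line in lines:
        match = FULL_GC.search(line)
        if match is None:
            continue
        before_mb, after_mb, capacity_mb = (int(match.group(i)) for i in (1, 2, 3))
        events.append({
            "before_mb": before_mb,
            "after_mb": after_mb,
            "capacity_mb": capacity_mb,
            "pause_secs": float(match.group(4)),
            # Fraction of the heap still in use after the collection.
            "occupancy_after": after_mb / capacity_mb,
        })
    return events


if __name__ == "__main__":
    sample = [
        "Feb  4 21:34:34 ip-10-90-26-28 java: 586.292: "
        "[Full GC 8190M->7095M(8192M), 16.5820380 secs]",
    ]
    for event in summarize_full_gcs(sample):
        print(event)
```

For the event in the log above, the heap goes from 8190M to 7095M out of 8192M after a 16.58 second pause, so it is still about 87% full right after the Full GC.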

Noppanit, if you are looking for alternative configuration guidelines, 0.9.0-RC has some more: http://druid.io/docs/0.9.0-rc1/tutorials/cluster.html

This is based off of http://imply.io/docs/latest/
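
As a very rough starting point (not an official formula), the memory sizes from the r3.8xlarge example can be scaled down by the ratio of RAM between the two boxes. The sketch below assumes the published EC2 figures of 244 GiB for r3.8xlarge and 15.25 GiB for r3.large; the function names are made up for illustration. It also includes the direct-memory check as I understand it from the Druid docs of this era, namely that `-XX:MaxDirectMemorySize` should be at least `druid.processing.buffer.sizeBytes * (druid.processing.numThreads + 1)`; please verify that against the documentation for your version.

```python
# Assumption: RAM figures are the published EC2 specs for these instance types.
REFERENCE_RAM_GIB = 244.0   # r3.8xlarge, used in the production-cluster docs
TARGET_RAM_GIB = 15.25      # r3.large

# Memory-related flags from the reference broker configuration, in GiB.
REFERENCE_SIZES_GIB = {
    "-Xmx": 25,
    "-Xms": 25,
    "-XX:NewSize=": 6,
    "-XX:MaxNewSize=": 6,
    "-XX:MaxDirectMemorySize=": 64,
}


def scale_flags(reference_sizes_gib, reference_ram_gib, target_ram_gib):
    """Scale each size by target/reference RAM and render it as a JVM flag in MiB."""
    ratio = target_ram_gib / reference_ram_gib
    return {
        flag: f"{flag}{int(gib * 1024 * ratio)}m"
        for flag, gib in reference_sizes_gib.items()
    }


def min_direct_memory_bytes(buffer_size_bytes, num_threads):
    """Lower bound on direct memory: one buffer per processing thread plus one.

    Assumption: this mirrors the rule buffer.sizeBytes * (numThreads + 1).
    """
    return buffer_size_bytes * (num_threads + 1)


if __name__ == "__main__":
    for flag in scale_flags(REFERENCE_SIZES_GIB, REFERENCE_RAM_GIB, TARGET_RAM_GIB).values():
        print(flag)
    # Hypothetical values for a 2-vCPU box: one processing thread, 512 MiB buffers.
    print(min_direct_memory_bytes(512 * 1024 * 1024, 1))
```

Linear scaling is only a first guess. Query patterns, result sizes, and caching can all change what the broker heap actually needs, so treat the output as numbers to test with rather than a recommendation.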