We have successfully set up a two-node cluster. We use Druid 0.8.1-rc2 and Tranquility to submit data.
We are struggling with the following issues.
1. Event loss: only 94%-96% of events are recorded, and we have observed loss in almost every iteration. We concluded this by comparing the input count against the number reported by a count aggregator.
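In case it helps, this is the style of timeseries query we poll to get the count (a sketch: the datasource name and interval are placeholders, and it assumes an ingestion-time metric named "count" that we sum at query time):

    {
      "queryType": "timeseries",
      "dataSource": "our_datasource",
      "granularity": "all",
      "intervals": ["2015-09-21T00:00:00Z/2015-09-22T00:00:00Z"],
      "aggregations": [
        { "type": "longSum", "name": "events", "fieldName": "count" }
      ]
    }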
2. Each task takes an additional 5-8 minutes to complete after the window period is over. How can we reduce this interval? See the attached log, and please advise if you see anything odd with the task spec.
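For context, the tuning section of our task spec looks roughly like this (the values here are illustrative placeholders rather than our exact spec, which you can see in the attached log):

    "tuningConfig" : {
      "type" : "realtime",
      "maxRowsInMemory" : 500000,
      "intermediatePersistPeriod" : "PT10M",
      "windowPeriod" : "PT10M"
    }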
3. Even after setting -XX:MaxDirectMemorySize=5g "everywhere", peon tasks error out with:
Please adjust -XX:MaxDirectMemorySize, druid.processing.buffer.sizeBytes, or druid.processing.numThreads: maxDirectMemory[3,491,758,080], memoryNeeded[4,900,000,000] = druid.processing.buffer.sizeBytes[700,000,000] * ( druid.processing.numThreads + 1 )
The command that gets fired is attached (peon.txt).
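For reference, working backwards from the error message: 4,900,000,000 / 700,000,000 = 7 = numThreads + 1, so druid.processing.numThreads must be 6, and the reported maxDirectMemory of 3,491,758,080 bytes is only about 3.25 GB, which suggests our 5g flag is not reaching the peon JVMs at all. Here is a sketch of what we believe is in effect on the middleManager (assuming peons pick up their JVM flags from druid.indexer.runner.javaOpts):

    druid.processing.buffer.sizeBytes=700000000
    druid.processing.numThreads=6
    # We assume this is where the peon JVMs would pick up the 5g setting:
    druid.indexer.runner.javaOpts=-server -XX:MaxDirectMemorySize=5g

Is druid.indexer.runner.javaOpts the right place for peons to inherit -XX:MaxDirectMemorySize, or is there another setting we are missing?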
4. Some of the tasks end successfully but show the following exception near the end of the attached log:
java.lang.IllegalStateException: instance must be started before calling this method
at com.google.common.base.Preconditions.checkState(Preconditions.java:176) ~[guava-16.0.1.jar:?]
at org.apache.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:347) ~[curator-framework-2.8.0.jar:?]
at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.internalUnregisterService(ServiceDiscoveryImpl.java:505) ~[curator-x-discovery-2.8.0.jar:?]
at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.close(ServiceDiscoveryImpl.java:155) [curator-x-discovery-2.8.0.jar:?]
at io.druid.curator.discovery.DiscoveryModule$5.stop(DiscoveryModule.java:222) [druid-server-0.8.1-rc2.jar:0.8.1-rc2]
at com.metamx.common.lifecycle.Lifecycle.stop(Lifecycle.java:267) [java-util-0.27.0.jar:?]
at io.druid.cli.CliPeon$2.run(CliPeon.java:220) [druid-services-0.8.1-rc2.jar:0.8.1-rc2]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_60]
5. At one point, we observed the timeseries query report a higher count, after which the count decreased and stabilized at a value lower than the original as other tasks finished. We noticed this because we were continuously submitting the timeseries query (the one sketched under issue 1) to track the running count.
task.log (36.5 KB)
peon.txt (6.14 KB)