Druid Cluster Capacity issue?

Hi

Could you please take a look at this Druid cluster setup and help us understand why we are facing the problems below?

Requirement:

  1. Handle 100-200 million data points per minute
  2. Currently we are testing with 1/4th of that full load and trying to stabilize the cluster

Problems (note: the graph's timezone is IST and the server's is MST):

  1. The indexing seems slow, and the time taken for all the data to be indexed is very high. The count at, say, 19:48 takes 15-20 minutes to reach its final value. As you can see, the counts are high at the beginning and then drift off to a smaller amount later on.
  2. The 30th minute of every hour in the graph is when the peons for the new hour are spawned.
  3. The 45th minute of every hour is when the old peons are shut down after writing to HDFS; that is when you see a sudden spike in the counts as well.
    Is this a capacity issue or a configuration problem?

The middle manager configuration is below. Nishant helped us earlier and suggested we were choking the CPU; this new configuration is meant to give the peons ample CPU. Any help is greatly appreciated, as we are trying hard to scale up Druid and may be goofing up somewhere. We are running Druid 0.7.0 and Tranquility 0.4.2. Each machine running a middle manager has:

vCPUs: 8

Memory: 24GB

Disk: 150GB

Number of middle managers: 32

Replication: none

Capacity of each middle manager: 3

Number of partitions: 48

If you require more information please let me know. Thank you for all your support.
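For what it's worth, here is a back-of-envelope slot count from the numbers above, assuming one indexing task per partition per hour, with tasks for adjacent hours overlapping between the :30 spawn and the :45 shutdown described in the problem list:

```shell
# Peon slot math for the setup above. All numbers come from the post;
# the overlap assumption comes from the :30 spawn / :45 shutdown timing.
MIDDLE_MANAGERS=32
CAPACITY_PER_MM=3
PARTITIONS=48

TOTAL_SLOTS=$((MIDDLE_MANAGERS * CAPACITY_PER_MM))
# During the 15-minute handoff window, tasks for two hours run at once.
PEAK_TASKS=$((2 * PARTITIONS))

echo "total peon slots: ${TOTAL_SLOTS}"
echo "peak concurrent tasks: ${PEAK_TASKS}"
```

If those assumptions hold, the cluster sits at exactly full slot capacity during every handoff window, which seems worth ruling out as a contributor.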

druid.host=druid-mm-0-52019.stratus.slc.XXXX.com

druid.port=8080

druid.service=middleManager

druid.extensions.remoteRepositories=

druid.extensions.localRepository=/home/appmon/druid-0.7.0/repo

druid.extensions.coordinates=["io.druid.extensions:druid-kafka-eight:0.7.0","io.druid.extensions:druid-hdfs-storage:0.7.0"]

druid.zk.service.host=zk-2-31056.stratus.slc.XXXX.com:2181,zk-1-31055.stratus.slc.XXXX.com:2181,zk-0-31054.stratus.slc.XXXX.com:2181

druid.selectors.indexing.serviceName=overlord

# Dedicate more resources to peons

druid.indexer.runner.javaCommand=/home/appmon/jdk1.6/bin/java

druid.indexer.runner.javaOpts=-server -Xmx6g -Xms6g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=7g -Duser.timezone=MST -Dfile.encoding=UTF-8

druid.indexer.task.defaultHadoopCoordinates=["org.apache.hadoop:hadoop-client:2.4.0"]

druid.indexer.task.baseTaskDir=/home/appmon/druid/persistent/task

druid.indexer.task.chathandler.type=announce

druid.indexer.runner.startPort=9080

druid.indexer.fork.property.druid.storage.type=hdfs

druid.indexer.fork.property.druid.storage.storageDirectory=hdfs://artemis-lvs-nn-ha/user/appmon/druid/segments

druid.indexer.fork.property.druid.computation.buffer.size=1000000000

druid.indexer.fork.property.druid.processing.numThreads=6

druid.indexer.fork.property.druid.request.logging.type=file

druid.indexer.fork.property.druid.request.logging.dir=request_logs/

druid.indexer.fork.property.druid.segmentCache.locations=[{"path": "/home/appmon/druid/persistent/zk_druid", "maxSize": 0}]

druid.indexer.fork.property.druid.server.http.numThreads=50

druid.indexer.fork.property.druid.computation.buffer.size=268435456

druid.indexer.fork.property.druid.emitter.logging.level=debug

druid.indexer.fork.property.druid.emitter=logging

druid.indexer.fork.property.druid.emitter.logging.loggerClass=LoggingEmitter

druid.worker.capacity=3

druid.emitter=logging

druid.emitter.logging.loggerClass=LoggingEmitter

druid.emitter.logging.level=debug

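One thing that may be worth double-checking in the javaOpts above is the worst-case memory footprint per node. A rough sketch, ignoring JVM overhead and the middle manager's own heap:

```shell
# Worst-case peon memory per node, using the flags from the config above:
# -Xmx6g heap plus -XX:MaxDirectMemorySize=7g direct memory per peon.
HEAP_GB=6
DIRECT_GB=7
PEONS_PER_NODE=3   # druid.worker.capacity=3
NODE_RAM_GB=24

PER_PEON_GB=$((HEAP_GB + DIRECT_GB))
NODE_WORST_CASE_GB=$((PER_PEON_GB * PEONS_PER_NODE))

echo "per peon worst case: ${PER_PEON_GB} GB"
echo "per node worst case: ${NODE_WORST_CASE_GB} GB (node has ${NODE_RAM_GB} GB)"
```

If all three peons fill both their heap and direct memory, that is 39 GB committed against 24 GB of RAM, so swapping under load is at least plausible and would show up as exactly this kind of slow, drifting indexing.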

If that's roughly what your CPU usage looks like, then it is totally reasonable for the 0.7.0 branch. Note that there have been a lot of improvements since 0.7.0 that help with indexing speed, so part of this is a Druid indexing-speed problem.

There's also https://github.com/druid-io/druid/pull/984, which gives you more flexibility over favoring CPU time for events vs. queries, but it is not merged yet.

The services should be able to scale horizontally pretty well, so you can add more nodes to catch events if you aren't getting the throughput you desire. But the first thing I would do, if possible, is upgrade to the latest stable (0.8.2), or even the latest RC if you're just testing things out.

Estimating the number of events per second per node depends completely on the type of data you are throwing at it: the cardinality of the dimensions, the number of high-cardinality dimensions, and the metrics chosen. So it might be worth looking at the resulting segments to make sure you are getting all the data you need, and only the data you need. For example, it is very easy to make a HyperUnique metric of an input field but accidentally end up with it as a dimension as well (which will increase indexing time due to the high cardinality).
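Since cardinality is the main driver here, it can be worth measuring it on a sample of the raw input before ingesting. A toy sketch with a throwaway two-column TSV (file name and columns are placeholders; point `cut` at your real feed):

```shell
# Measure distinct-value counts per column on a small tab-separated sample.
printf 'a\tx\na\ty\nb\tx\n' > /tmp/sample.tsv

DIM1=$(cut -f1 /tmp/sample.tsv | sort -u | wc -l)   # distinct values in column 1
DIM2=$(cut -f2 /tmp/sample.tsv | sort -u | wc -l)   # distinct values in column 2

echo "col1 cardinality: ${DIM1}"
echo "col2 cardinality: ${DIM2}"
```

Columns whose distinct count grows roughly linearly with the sample size are the ones to watch: they belong in a HyperUnique (or similar) metric rather than in the dimension list.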

charles.allen, thank you for your reply. We will upgrade Druid to 0.8.2, but while upgrading I am facing some issues. If I start the coordinator node, it throws a NoClassDefFoundError saying org.codehaus.plexus is not present on the classpath. I am pretty sure I have put the extensions-repo folder on the classpath. Any idea what I am missing here? If I start the overlord, the kafka-eight extension throws the same NoClassDefFoundError for the plexus jar.

Startup:

#!/usr/bin/env bash

set +e

set +u

shopt -s xpg_echo

shopt -s expand_aliases

trap "exit 1" 1 2 3 15

SCRIPT_DIR=$(dirname "$0")

MAVEN_DIR="${SCRIPT_DIR}/extensions-repo"

CURR_DIR=$(pwd)

cd "${SCRIPT_DIR}"

SCRIPT_DIR=$(pwd)

cd "${CURR_DIR}"

# start process

JAVA_ARGS="-Xmx4g -Duser.timezone=MST -Dfile.encoding=UTF-8"

JAVA_ARGS="${JAVA_ARGS} -Ddruid.extensions.localRepository=${MAVEN_DIR}"

DRUID_CP="${SCRIPT_DIR}/extensions-repo"

DRUID_CP="${DRUID_CP}:${SCRIPT_DIR}/config/coordinator"

DRUID_CP="${DRUID_CP}:${SCRIPT_DIR}/lib/*"

exec /home/appmon/jdk/bin/java ${JAVA_ARGS} -classpath "${DRUID_CP}" io.druid.cli.Main server coordinator
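A side note on the variable assignments in the startup script: capturing a command's output in shell requires command substitution, `$(...)` (or legacy backticks). Without it, `VAR=cmd arg` does something entirely different: it runs `arg` with `VAR=cmd` set in its environment. A minimal illustration:

```shell
# Correct: $(...) captures the command's stdout into the variable.
CURR_DIR=$(pwd)
echo "current dir: ${CURR_DIR}"

# A robust way to resolve the script's own directory as an absolute path.
SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd)
echo "script dir: ${SCRIPT_DIR}"
```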


Coordinator Config:

druid.host=druid-cn-0-26195.stratus.slc.XXXX.com

druid.service=coordinator

druid.port=8082

druid.zk.service.host=zk-2-31056.stratus.slc.XXXX.com:2181,zk-1-31055.stratus.slc.XXXX.com:2181,zk-0-31054.stratus.slc.XXXX.com:2181

druid.extensions.remoteRepositories=

druid.extensions.localRepository=/home/appmon/druid-0.8.2/extensions-repo

druid.extensions.coordinates=["io.druid.extensions:mysql-metadata-storage"]

druid.metadata.storage.type=mysql

druid.metadata.storage.connector.connectURI=jdbc:mysql://druid-cn-0-26195.stratus.slc.XXXX.com:3306/druid

druid.metadata.storage.connector.user=XXXX

druid.metadata.storage.connector.password=XXXX

# The coordinator begins assignment operations after the start delay.

# We override the default here to start things up faster for examples.

# In production you should use PT5M or PT10M.

druid.coordinator.startDelay=PT5M


Error:

2015-12-15T10:50:00,048 INFO [main] io.druid.guice.PropertiesModule - Loading properties from runtime.properties

Dec 15, 2015 10:50:00 AM org.hibernate.validator.internal.util.Version

INFO: HV000001: Hibernate Validator 5.1.3.Final

2015-12-15T10:50:00,830 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.guice.ExtensionsConfig] from props[druid.extensions.] as [ExtensionsConfig{searchCurrentClassloader=true, coordinates=[io.druid.extensions:mysql-metadata-storage], defaultVersion='0.8.2', localRepository='./extensions-repo', remoteRepositories=}]

2015-12-15T10:50:01,025 INFO [main] io.druid.initialization.Initialization - Loading extension[io.druid.extensions:mysql-metadata-storage] for class[io.druid.cli.CliCommandCreator]

Exception in thread "main" java.lang.NoClassDefFoundError: org.codehaus.plexus.configuration.PlexusConfiguration

    at io.tesla.aether.connector.AetherRepositoryConnector.<init>(AetherRepositoryConnector.java:133)

    at io.tesla.aether.connector.AetherRepositoryConnectorFactory.newInstance(AetherRepositoryConnectorFactory.java:48)

    at org.eclipse.aether.internal.impl.DefaultRepositoryConnectorProvider.newRepositoryConnector(DefaultRepositoryConnectorProvider.java:139)

    at org.eclipse.aether.internal.impl.DefaultArtifactResolver.performDownloads(DefaultArtifactResolver.java:531)

    at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolve(DefaultArtifactResolver.java:436)

    at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolveArtifacts(DefaultArtifactResolver.java:262)

    at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolveArtifact(DefaultArtifactResolver.java:239)

    at org.apache.maven.repository.internal.DefaultArtifactDescriptorReader.loadPom(DefaultArtifactDescriptorReader.java:320)

    at org.apache.maven.repository.internal.DefaultArtifactDescriptorReader.readArtifactDescriptor(DefaultArtifactDescriptorReader.java:217)

    at org.eclipse.aether.internal.impl.DefaultDependencyCollector.process(DefaultDependencyCollector.java:461)

    at org.eclipse.aether.internal.impl.DefaultDependencyCollector.process(DefaultDependencyCollector.java:573)

    at org.eclipse.aether.internal.impl.DefaultDependencyCollector.process(DefaultDependencyCollector.java:573)

    at org.eclipse.aether.internal.impl.DefaultDependencyCollector.process(DefaultDependencyCollector.java:573)

    at org.eclipse.aether.internal.impl.DefaultDependencyCollector.collectDependencies(DefaultDependencyCollector.java:261)

    at org.eclipse.aether.internal.impl.DefaultRepositorySystem.resolveDependencies(DefaultRepositorySystem.java:342)

    at io.tesla.aether.internal.DefaultTeslaAether.resolveArtifacts(DefaultTeslaAether.java:289)

    at io.druid.initialization.Initialization.getClassLoaderForCoordinates(Initialization.java:253)

    at io.druid.initialization.Initialization.getFromExtensions(Initialization.java:153)

    at io.druid.cli.Main.main(Main.java:76)

Caused by: java.lang.ClassNotFoundException: org.codehaus.plexus.configuration.PlexusConfiguration

    at java.net.URLClassLoader.findClass(URLClassLoader.java:588)

    at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:743)

    at java.lang.ClassLoader.loadClass(ClassLoader.java:711)

    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:313)

    at java.lang.ClassLoader.loadClass(ClassLoader.java:690)

    ... 19 more



Inline.

Hi

Could you please take a look at this Druid cluster setup and help us understand why we are facing the problems below?

Requirement:

  1. Handle 100-200 million data points per minute

Should be fine.

  1. Currently we are testing with 1/4th of that full load and trying to stabilize the cluster

Problems (note: the graph's timezone is IST and the server's is MST):

  1. The indexing seems slow, and the time taken for all the data to be indexed is very high. The count at, say, 19:48 takes 15-20 minutes to reach its final value. As you can see, the counts are high at the beginning and then drift off to a smaller amount later on.

I don’t think anyone will be able to take the time to fully review your cluster setup and configs in the community support channels. You can try http://imply.io/ if you need dedicated help.

  1. The 30th minute of every hour in the graph is when the peons for the new hour are spawned.
  2. The 45th minute of every hour is when the old peons are shut down after writing to HDFS; that is when you see a sudden spike in the counts as well.
    Is this a capacity issue or a configuration problem?

Unclear without more information.

Are you guys running behind a firewall by any chance? Have you set things up to be able to download extensions locally? Do you require a Druid distribution that has all extensions packaged?
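One quick way to distinguish "download is blocked" from "jar is simply missing locally" is to look for the jar the stack trace names under the local extensions repository. Sketched below with a throwaway directory so it is self-contained; in a real check, point `REPO` at your druid.extensions.localRepository path (e.g. /home/appmon/druid-0.8.2/extensions-repo from the config above):

```shell
# Demonstration with a temp dir standing in for extensions-repo; the
# directory layout and jar name here are stand-ins for illustration.
REPO=$(mktemp -d)
mkdir -p "${REPO}/org/codehaus/plexus/plexus-configuration/1.0"
touch "${REPO}/org/codehaus/plexus/plexus-configuration/1.0/plexus-configuration-1.0.jar"

FOUND=$(find "${REPO}" -name 'plexus-*.jar' | wc -l)
echo "plexus jars found: ${FOUND}"
```

Zero matches against the real repository would mean the dependency was never fetched, which points at the firewall rather than the classpath.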

Yes, I forgot that package downloads are blocked behind the firewall. I was able to download all the packages and get the latest version up and running. Thanks! Let me look at the configurations again and check whether anything is wrong with the setup.