More than one instance of coordinator

Hello,

So far I have used one machine, with each node running on that single machine. I use the Hadoop indexer for loading batch data.

I am going to run a second instance of the coordinator on another machine. Will that speed up the process of loading batch data?

Hi Tom,

No, it will not speed up loading. The coordinator doesn't do any loading work; it just assigns segments.

To speed up loading, you can either add more historical nodes or tweak the parameter replicationThrottleLimit (the maximum number of segments that can be replicated at one time).

See http://druid.io/docs/latest/configuration/coordinator.html#dynamic-configuration
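
As a rough illustration of changing that parameter, the sketch below prepares a POST to the coordinator's dynamic configuration endpoint. The host and port (localhost:8081) and the value 20 are assumptions chosen for the example, not recommendations, and the helper names are hypothetical. The request is only built, not sent.

```python
import json
import urllib.request

# Assumption: the coordinator listens on localhost:8081; adjust for your cluster.
COORDINATOR_URL = "http://localhost:8081"


def build_dynamic_config(replication_throttle_limit):
    """Return the JSON body for a coordinator dynamic-config update."""
    if replication_throttle_limit < 1:
        raise ValueError("replicationThrottleLimit must be positive")
    return {"replicationThrottleLimit": replication_throttle_limit}


def build_request(config, base_url=COORDINATOR_URL):
    """Prepare (but do not send) the POST to the dynamic configuration endpoint."""
    return urllib.request.Request(
        url=base_url + "/druid/coordinator/v1/config",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# 20 is an arbitrary example value.
request = build_request(build_dynamic_config(20))

# To apply it against a live coordinator, uncomment:
# urllib.request.urlopen(request)
```

Depending on the Druid version, fields left out of the body may fall back to their defaults, so it may be safer to GET the current configuration from the same endpoint first and send back the full object with only this field changed.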

More information about your use case and performance would help us give a better answer.

The best way to speed up batch ingestion is to leverage a remote Hadoop cluster to build Druid segments. The param Slim mentions only controls how fast segments are replicated, not how fast data is made available. Is your bottleneck how fast segments are getting created or how fast they are getting loaded after they are created?

Hi Fangjin Yang,
By the way, I'm a big fan of you and Druid! I wanted to use this thread to ask a related question. I have built a cluster to hold 30 days' worth of data. Each day is about 30 GB (raw size), so the total is about 900 GB. Batch indexing one day's worth of data takes about 7 hours.

Cluster configuration:

  • Broker: 8 cores, 61 GB - 1 node
  • Historical/MiddleManager: 8 cores, 61 GB - 3 nodes
  • Coordinator/Overlord/Metadata/ZooKeeper: 8 cores, 30 GB - 1 node (all in one box)

I have 8 worker tasks on each historical node, and here are the middle manager configs:

[ec2-user@ip-172-31-22-182 druid-0.9.1.1]$ more conf-aws-edit/druid/middleManager/jvm.config
-server
-Xms64m
-Xmx64m
-XX:MaxDirectMemorySize=10240m
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Djava.io.tmpdir=/mnt/druid/runtime/var/tmp
-Dhadoop.hadoop.tmp.dir=/mnt/druid/runtime/var/druid/hadoop-tmp
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

[ec2-user@ip-172-31-22-182 druid-0.9.1.1]$ more conf-aws-edit/druid/middleManager/runtime.properties
druid.service=druid/middleManager
druid.port=8091

# Number of tasks per middleManager
druid.worker.capacity=8

# Task launch parameters
druid.indexer.runner.javaOpts=-server -Xmx12g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
druid.indexer.task.baseTaskDir=/mnt/druid/runtime/var/druid/task

# HTTP server threads
druid.server.http.numThreads=40

# Processing threads and buffers
druid.processing.buffer.sizeBytes=1073741824
druid.processing.numThreads=8

# Hadoop indexing
druid.indexer.task.hadoopWorkingPath=/mnt/druid/runtime/var/druid/hadoop-tmp
druid.indexer.task.defaultHadoopCoordinates=["org.apache.hadoop:hadoop-client:2.3.0"]

I have set numThreads to 8 and sizeBytes to 1 GB. What is the difference between worker capacity and numThreads? I have set both to 8, and the box is an 8-core box. In my case, what would be the ideal configuration so that I can load the data into Druid within a few minutes? Is that possible?
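
For context, druid.worker.capacity is the number of tasks a middle manager will run at once (each task is a separate JVM), while druid.processing.numThreads is the number of processing threads inside one process. A back-of-the-envelope check of what the settings above could request in the worst case might look like the sketch below. It assumes that each task JVM picks up the processing settings shown, that a task may allocate one buffer per processing thread plus one, and that the 61 GB box can be treated as 61 GiB; all three are assumptions, not confirmed behaviour of this cluster.

```python
GIB = 1024 ** 3

# Values taken from the configuration in this post.
worker_capacity = 8          # druid.worker.capacity
task_heap_bytes = 12 * GIB   # -Xmx12g in druid.indexer.runner.javaOpts
buffer_bytes = 1073741824    # druid.processing.buffer.sizeBytes
num_threads = 8              # druid.processing.numThreads
box_ram_bytes = 61 * GIB     # Historical/MiddleManager box (assumed GiB)

# Worst case: every task slot is busy and every task fills its heap.
total_heap_bytes = worker_capacity * task_heap_bytes

# Assumption: one processing buffer per thread, plus one, per task.
direct_per_task_bytes = buffer_bytes * (num_threads + 1)
total_direct_bytes = worker_capacity * direct_per_task_bytes

print("task heap, all slots busy:  ", total_heap_bytes / GIB, "GiB")
print("task direct memory (approx):", total_direct_bytes / GIB, "GiB")
print("physical RAM on the box:    ", box_ram_bytes / GIB, "GiB")
print("heap alone exceeds RAM:     ", total_heap_bytes > box_ram_bytes)
```

Under these assumptions the heap ceiling alone (8 x 12 GiB) is larger than the memory on the box, before counting the historical process that shares it, so lowering either the capacity or the per-task heap may be worth considering.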

You mentioned a remote Hadoop cluster doing the indexing. Could you please elaborate on how to set this up and how to get the indexed data into Druid?

Your help is deeply appreciated!!

Thanks,

Uday.

Hi Uday,

Do these docs on batch ingestion help? They describe leveraging a remote cluster for faster batch ingest.

https://imply.io/docs/latest/ingestion-files
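
As a rough sketch of what such a job looks like, the snippet below builds the skeleton of a Hadoop batch index task (type index_hadoop) that could be submitted to the overlord. The data source name, input path, and interval are hypothetical placeholders, the helper name is made up for this example, and the parser and metricsSpec sections are left out because they depend on the data. Pointing Druid at a remote cluster is typically done through the Hadoop configuration files on Druid's classpath rather than in the task itself.

```python
import json


def build_hadoop_index_task(data_source, input_paths, interval):
    """Build a skeleton index_hadoop task; parser and metricsSpec still need filling in."""
    return {
        "type": "index_hadoop",
        "spec": {
            "dataSchema": {
                "dataSource": data_source,
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",
                    "queryGranularity": "NONE",
                    "intervals": [interval],
                },
                # "parser" and "metricsSpec" depend on your data and are omitted here.
            },
            "ioConfig": {
                "type": "hadoop",
                "inputSpec": {"type": "static", "paths": input_paths},
            },
            "tuningConfig": {"type": "hadoop"},
        },
    }


# Hypothetical values for illustration only.
task = build_hadoop_index_task(
    data_source="events",
    input_paths="hdfs://namenode/data/2016-07-01/*.gz",
    interval="2016-07-01/2016-07-02",
)
task_json = json.dumps(task, indent=2)
print(task_json)
```

Once completed, the JSON would be POSTed to the overlord's task endpoint (/druid/indexer/v1/task); the segments the Hadoop job writes to deep storage are then loaded by the historical nodes.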