Faster indexing and better compression - How?

Hi all & Fangjin Yang,
I have 2 basic questions.

  1. I have 30 GB of data (raw size) - currently it takes a lot of time to index it (details below)

  2. After indexing, the size of that data in the Druid cluster is about 9.11 GB - so we got roughly 3.3x compression. But earlier when I used Druid, I was getting 6x compression easily. Which parameters affect the compression? Is there a way to tune it to a desired level? I feel ~3.3x is probably a very low compression ratio.
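From reading the docs, I believe the knobs that control segment compression live in the ingestion spec's tuningConfig, under indexSpec. Here is a rough sketch of what I think I should be tuning (the values below are just my guesses, not my current settings - please correct me if these are not the right knobs):

```json
"tuningConfig": {
  "type": "hadoop",
  "indexSpec": {
    "bitmap": { "type": "roaring" },
    "dimensionCompression": "lz4",
    "metricCompression": "lz4"
  }
}
```

I also suspect rollup (queryGranularity and the number of high-cardinality dimensions) affects the ratio a lot - is that where the 6x vs 3.3x difference usually comes from?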

Currently I have built a cluster to hold 30 days' worth of data. Each day is about 30 GB (raw size), so the total is about 900 GB. Batch indexing 1 day's worth of data takes about 7 hours.

Cluster configuration:

  • Broker: 8 cores, 61 GB - 1 node

  • Historical/MiddleManager: 8 cores, 61 GB - 3 nodes

  • Coordinator/Overlord/Metadata/ZooKeeper: 8 cores, 30 GB - 1 node (all in 1 box)

I have 8 worker tasks on each historical node. Here are the middleManager configs:

[ec2-user@ip-172-31-22-182 druid-0.9.1.1]$ more conf-aws-edit/druid/middleManager/jvm.config
-server
-Xms64m
-Xmx64m
-XX:MaxDirectMemorySize=10240m
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Djava.io.tmpdir=/mnt/druid/runtime/var/tmp
-Dhadoop.hadoop.tmp.dir=/mnt/druid/runtime/var/druid/hadoop-tmp
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

[ec2-user@ip-172-31-22-182 druid-0.9.1.1]$ more conf-aws-edit/druid/middleManager/runtime.properties
druid.service=druid/middleManager
druid.port=8091

# Number of tasks per middleManager
druid.worker.capacity=8

# Task launch parameters
druid.indexer.runner.javaOpts=-server -Xmx12g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
druid.indexer.task.baseTaskDir=/mnt/druid/runtime/var/druid/task

# HTTP server threads
druid.server.http.numThreads=40

# Processing threads and buffers
druid.processing.buffer.sizeBytes=1073741824
druid.processing.numThreads=8

# Hadoop indexing
druid.indexer.task.hadoopWorkingPath=/mnt/druid/runtime/var/druid/hadoop-tmp
druid.indexer.task.defaultHadoopCoordinates=["org.apache.hadoop:hadoop-client:2.3.0"]

I have set druid.processing.numThreads to 8 and druid.processing.buffer.sizeBytes to 1 GB. What is the difference between druid.worker.capacity and druid.processing.numThreads? I have set both to 8, and the box is an 8-core box. In my case, what would be the ideal configuration so that I can load the data into Druid within a few minutes? Is that even possible?
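One thing I noticed while writing this: if each of the 8 worker slots launches a peon with -Xmx12g (from druid.indexer.runner.javaOpts), then heap alone could be 8 x 12 GB = 96 GB, which is well over the 61 GB on each box - so maybe the tasks are swapping or being OOM-killed? If my reading is right, a safer sizing for this box might look something like this (purely my back-of-the-envelope guess, not a recommendation from the docs):

```properties
# Hypothetical sizing for an 8-core, 61 GB box (my assumption):
# 4 concurrent tasks x (8 GB heap + direct memory) stays under 61 GB
druid.worker.capacity=4
druid.indexer.runner.javaOpts=-server -Xmx8g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
```

Is that the right way to think about how worker.capacity and the per-task heap multiply?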

You mentioned that a remote Hadoop cluster could do the indexing. Can you please elaborate on how to set this up, and how the resulting segments get into Druid?
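From the batch ingestion docs, I understand I would submit an index_hadoop task to the overlord, something like the sketch below (the datasource name and HDFS path are placeholders I made up, and I have elided the full dataSchema):

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": { "dataSource": "my_datasource" },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "hdfs://namenode:8020/data/2016-08-01/"
      }
    },
    "tuningConfig": { "type": "hadoop" }
  }
}
```

Am I right that the remote cluster's core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml need to be on the Druid classpath so the task runs the MapReduce jobs on that cluster, and that Druid then loads the segments it produces automatically?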

Your help is deeply appreciated!!

Thanks,

Uday.