Faster indexing and better compression - How?

Hi all & Fangjin Yang,
I have 2 basic questions.

  1. I have 30 GB of data (raw size); currently it takes a long time to index this (details below).

  2. After indexing, the size of that data in the Druid cluster is about 9.11 GB, i.e. roughly 3.3x compression. But earlier when I used Druid, I was getting 6x compression easily. What are the parameters that affect compression? Is there a way to tune it to a desired level? I feel ~3.3x is probably a very low compression ratio.
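For reference, my understanding (please correct me if wrong) is that segment size is mainly driven by dimension cardinality, rollup (`queryGranularity`), and the `indexSpec` block in the ingestion spec's `tuningConfig`. The values below are illustrative defaults I have been looking at, not a recommendation:

```json
"tuningConfig": {
  "type": "hadoop",
  "indexSpec": {
    "bitmap": { "type": "roaring" },
    "dimensionCompression": "lz4",
    "metricCompression": "lz4",
    "longEncoding": "auto"
  }
}
```

Is tuning these the right way to approach the compression ratio, or is cardinality/rollup usually the dominant factor?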

Currently I have built a cluster to hold 30 days' worth of data. Each day is about 30 GB (raw size), so the total is about 900 GB. Batch indexing one day's worth of data takes about 7 hours.

Cluster configuration:

  • Broker: 8 cores, 61 GB - 1 node

  • Historical/MiddleManager: 8 cores, 61 GB - 3 nodes

  • Coordinator/Overlord/Metadata/ZooKeeper: 8 cores, 30 GB - 1 node (all in one box)

I have 8 worker tasks on each historical/middleManager node, and here are the middleManager configs:

```
[ec2-user@ip-172-31-22-182 druid-]$ more conf-aws-edit/druid/middleManager/jvm.config

[ec2-user@ip-172-31-22-182 druid-]$ more conf-aws-edit/druid/middleManager/

# Number of tasks per middleManager

# Task launch parameters
druid.indexer.runner.javaOpts=-server -Xmx12g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

# HTTP server threads

# Processing threads and buffers

# Hadoop indexing
```
I have set numThreads to 8 and buffer sizeBytes to 1 GB. What is the difference between worker capacity and numThreads? I have set both to 8, and the box is an 8-core box. In my case, what would be the ideal configuration so that I can load the data into Druid within a few minutes? Is that even possible?
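To spell out my current understanding of the two settings (please correct me): `druid.worker.capacity` is how many peon (task) JVMs one middleManager runs concurrently, while `druid.processing.numThreads` is the processing thread pool inside each peon, so setting both to 8 on an 8-core box could oversubscribe to ~64 processing threads plus 8 × 12 GB of task heap. Is something like the following sketch closer to sane (values are guesses, not a recommendation)?

```properties
# Illustrative middleManager runtime.properties - placeholder values
druid.worker.capacity=2
# per-peon settings passed down via fork properties:
druid.indexer.fork.property.druid.processing.numThreads=3
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=536870912
```

The idea being worker.capacity × (numThreads + 1) stays at or below the core count - is that the right way to reason about it?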

You mentioned having a remote Hadoop cluster do the indexing. Can you please elaborate on how to set that up and how the indexed data gets into Druid?
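From what I have read so far (please confirm), this means submitting an `index_hadoop` task to the overlord and putting the remote cluster's config XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) on the middleManager classpath, so the MapReduce job runs on the remote cluster and pushes segments to deep storage. A skeleton of what I think the task looks like, with placeholder names and paths:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": { "dataSource": "my_datasource" },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "hdfs://remote-namenode:8020/data/2016-01-01/*"
      }
    },
    "tuningConfig": { "type": "hadoop" }
  }
}
```

Is that the right shape, and is there anything else needed for the segments to land in Druid afterwards?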

Your help is deeply appreciated!!