Peon configuration settings for data load

Hi All,

I have a question on data ingestion. We are ingesting one gzip file with 20 million records, and it takes 8 minutes with the default 1 GB memory for the MiddleManager. I have increased the default up to 8 GB, but the ingestion time stays the same. Is there any way we can improve ingestion performance?

I tried changing the runtime properties under the MiddleManager, but it doesn't look like there is any improvement in ingestion time.

Currently I am running the micro-quickstart configuration. Is there any way I can improve data ingestion?

# Number of tasks per middleManager
druid.worker.capacity=2

# Task launch parameters
druid.indexer.runner.javaOpts=-server -Xms2g -Xmx2g -XX:MaxDirectMemorySize=2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+ExitOnOutOfMemoryError -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
druid.indexer.task.baseTaskDir=var/druid/task

# HTTP server threads
druid.server.http.numThreads=12

# Processing threads and buffers on Peons
druid.indexer.fork.property.druid.processing.numMergeBuffers=2
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=100000000
druid.indexer.fork.property.druid.processing.numThreads=1

What type of ingestion task are you running? Can you paste your ingestion spec?

Eric Graham

Solutions Engineer - Imply

cell: 303-589-4581

email: eric.graham@imply.io

www.imply.io

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "sampledatasource",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "NONE",
        "rollup": false
      },
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "csv",
          "timestampSpec": {
            "column": "slice_endtime",
            "format": "YYYY-MM-dd HH:mm:ss"
          },
          "hasHeaderRow": false,
          "columns": [
            "division",
            "region",
            "market",
            "site",
            "group",
            "location",
            "application",
            "slice_endtime",
            "slice_endtime_utc",
            "Measure1",
            "asset"
          ],
          "skipHeaderRows": 0,
          "listDelimiter": ",",
          "dimensionsSpec": {
            "dimensions": [
              {
                "type": "string",
                "name": "Peergroup",
                "createBitmapIndex": true
              },
              "applicationName",
              "assetclass",
              "division",
              "market",
              "region",
              {
                "type": "long",
                "name": "Measure1"
              },
              {
                "type": "string",
                "name": "Location",
                "createBitmapIndex": true
              },
              "site",
              "slice_endtime_utc"
            ]
          }
        }
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "firehose": {
        "type": "local",
        "filter": "exportdata_20190727_22.gz",
        "baseDir": ""
      },
      "appendToExisting": true
    },
    "tuningConfig": {
      "type": "index_parallel"
    }
  }
}
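A note on why adding memory does not change the load time here: the ioConfig above points at a single .gz file, and gzip is not a splittable format, so even an index_parallel task reads and parses it in a single sub-task on one thread. If the export can be produced as several smaller gzip files instead, the parallel task can fan out across them. Below is a minimal sketch of how the ioConfig and tuningConfig might look in that case; the split file names and base directory are hypothetical, and the concurrency property is maxNumConcurrentSubTasks on recent Druid releases (older releases call it maxNumSubTasks):

"ioConfig": {
  "type": "index_parallel",
  "firehose": {
    "type": "local",
    "baseDir": "/path/to/split/export",
    "filter": "exportdata_20190727_22_part*.gz"
  },
  "appendToExisting": true
},
"tuningConfig": {
  "type": "index_parallel",
  "maxNumConcurrentSubTasks": 2
}

Each matching file can then be handled by its own sub-task. Keep in mind that the supervising parallel task and each sub-task all occupy MiddleManager worker slots, so druid.worker.capacity caps how many run at once.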


What is the system spec on the middle manager node? Do you have the middle manager and historical processes running on the same server?

Eric Graham

Solutions Engineer - Imply

cell: 303-589-4581

email: eric.graham@imply.io

www.imply.io

Yes, currently everything is running on the same machine.

What are your system specs - CPU/mem?

Eric Graham

Solutions Engineer - Imply

cell: 303-589-4581

email: eric.graham@imply.io

www.imply.io

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 4
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel® Xeon® Gold 6152 CPU @ 2.10GHz
Stepping: 2
CPU MHz: 2100.000
BogoMIPS: 4200.00
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 30976K
NUMA node0 CPU(s): 0-3

MemTotal: 32780724 kB
MemFree: 3661628 kB
MemAvailable: 27520440 kB

This system seems a little underpowered with 4 CPUs. If you have a system with more CPUs (16) you could increase your worker capacity, which will help.

Eric Graham

Solutions Engineer - Imply

cell: 303-589-4581

email: eric.graham@imply.io

www.imply.io

Can you try the settings below?

# Number of tasks per middleManager
druid.worker.capacity=3

# Task launch parameters
druid.indexer.runner.javaOpts=-server -Xms1g -Xmx1g -XX:MaxDirectMemorySize=4g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+ExitOnOutOfMemoryError -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
druid.indexer.task.baseTaskDir=var/druid/task

# HTTP server threads
druid.server.http.numThreads=12

# Processing threads and buffers on Peons
druid.indexer.fork.property.druid.processing.numMergeBuffers=2
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=1073741824
druid.indexer.fork.property.druid.processing.numThreads=1
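For context on these numbers, the heap/direct split above is consistent with the standard Druid sizing rule for direct memory; a rough budget for this 32 GB host (the totals are an approximation, not a measured figure):

direct memory per peon >= (druid.processing.numThreads + druid.processing.numMergeBuffers + 1) * druid.processing.buffer.sizeBytes
                        = (1 + 2 + 1) * 1 GiB = 4 GiB   (matches -XX:MaxDirectMemorySize=4g)

total peon footprint    ~= druid.worker.capacity * (heap + direct)
                        ~= 3 * (1 GB + 4 GB) = 15 GB

That leaves room on a 32 GB machine for the other micro-quickstart services, but note that it does not by itself parallelize a single gzip file, which is still parsed by one task on one thread.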

I tried the suggested settings, but we don't see any improvement in data load time.