Batch ingestion performance (local, granular, pre-aggregated)

Hello,

Seeking some general guidance on batch ingestion into Druid. My scenario is:

  • 6GB of compressed (gz) data, 400M rows, TSV format
  • data is structured at day granularity, i.e. folders are like /y=2016/m=4/d=1
  • there are 128 folders total (~4 months), each folder containing a single gzip file of ~60MB
  • data is pre-aggregated, i.e. within each file every row has a unique combination of dimension values with metric sums
  • each row contains 16 dimensions, 7 metrics

Here is my load script:

{
"type" : "index_hadoop",
"spec" : {
"dataSchema" : {
  "dataSource" : "largereport",
  "parser" : {
    "type" : "hadoopyString",
    "parseSpec" : {
      "format" : "tsv",
      "columns": ["yyyymmdd", "id_clientsites", "id_campaigns", ... MORE COLUMNS ],
      "timestampSpec" : {
        "column" : "yyyymmdd",
        "format" : "yyyyMMdd"
      },
      "dimensionsSpec" : {
        "dimensions": ["yyyymmdd", "id_clientsites", "id_campaigns", ... 13 MORE DIMS ],
        "dimensionExclusions" : [],
        "spatialDimensions" : []
      }
    }
  },
  "metricsSpec" : [ 
    { "type" : "count", "name" : "count" },
    { "type" : "longSum", "name" : "impressions", "fieldName" : "impressions" },
    { "type" : "longSum", "name" : "clicks", "fieldName" : "clicks" }

… 6 MORE METRICS

  ],
  "granularitySpec" : {
    "type" : "uniform",
    "segmentGranularity" : "DAY",
    "queryGranularity" : "DAY",
    "intervals" : [ "2016-01-01/2016-04-30" ]
  }
},
"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "granularity",
    "dataGranularity" : "day",
    "inputPath" : "/home/user/input/",
    "pathFormat" : "'y'=yyyy/'m'=M/'d'=d",
    "filePattern" : ".*"
  }
},
"tuningConfig" : {
  "type" : "hadoop",
  "maxRowsInMemory" : 5000000,
  "overwriteFiles" : true,
  "useCombiner" : true
}
}
}
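To double-check myself on the ioConfig: my understanding is that the granularity inputSpec expands inputPath + pathFormat for every day bucket in the intervals, so with the spec above it should be scanning paths like (example paths, assuming I've read the pathFormat correctly):

/home/user/input/y=2016/m=4/d=1
/home/user/input/y=2016/m=4/d=2
…

which matches how my folders are laid out.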

I have a single machine to test on: 40 cores, 256GB RAM. I've followed the quickstart instructions in general, but I've tried to give more resources to the middleManager; maybe I've missed something.

The jvm.config for middleManager:

-server
-Xms64m
-Xmx64m
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Djava.io.tmpdir=var/tmp
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

The runtime.properties for middleManager:

druid.service=druid/middleManager
druid.port=8091

# Number of tasks per middleManager
druid.worker.capacity=10

# Task launch parameters
druid.indexer.runner.javaOpts=-server -Xmx10g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
druid.indexer.task.baseTaskDir=var/druid/task

# HTTP server threads
druid.server.http.numThreads=25

# Processing threads and buffers
druid.processing.buffer.sizeBytes=256000000
druid.processing.numThreads=20

# Hadoop indexing
druid.indexer.task.hadoopWorkingPath=var/druid/hadoop-tmp
druid.indexer.task.defaultHadoopCoordinates=["org.apache.hadoop:hadoop-client:2.3.0"]
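Putting those numbers together (just my rough sizing check, assuming all worker slots get used): druid.worker.capacity=10 with -Xmx10g per task means the peons alone could take 10 x 10 GB = ~100 GB of heap, plus the 64 MB middleManager heap, which should still fit comfortably within the 256 GB on the box.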

I've also attempted to play a bit with a partitionsSpec and assumeGrouped, with no significant boost:

"partitionsSpec": {
  "type": "dimension",
  "partitionDimension": "id_clientsites",
  "targetPartitionSize": 2500000,
  "assumeGrouped": "true"
}
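In case it helps, this is how I'm nesting it; as far as I understand, the partitionsSpec belongs inside the hadoop tuningConfig, so the full tuningConfig looks roughly like this (same values as above):

"tuningConfig" : {
  "type" : "hadoop",
  "maxRowsInMemory" : 5000000,
  "overwriteFiles" : true,
  "useCombiner" : true,
  "partitionsSpec" : {
    "type" : "dimension",
    "partitionDimension" : "id_clientsites",
    "targetPartitionSize" : 2500000,
    "assumeGrouped" : "true"
  }
}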

What should I look for in the logs for hints? And what ingestion speed can I realistically expect?

thanks, Ramunas

Hi Ramunas, providing the task log would give more info about why the ingestion is slow or has failed.
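If you're on the quickstart setup, you can usually pull the log from the overlord console on port 8090, or via http://<overlord-host>:8090/druid/indexer/v1/task/<taskId>/log (replace <taskId> with the id of your indexing task).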