Large CSV of Historical Data taking forever to load into the cluster

We’ve been hard at work integrating Druid into a new application that we’re building. We have a great deal of historical data that we need to load into the cluster.

Format: Structured CSV

Ingest Type: index_hadoop

Source: S3

Approximate Size: ~3TB

We’re seeing these files take a very long time (24 hours plus) to ingest into the cluster. The behavior we’re seeing is that only a single MiddleManager takes the task.

These CSV files have a large number of columns (100+) containing a large amount of demographic data.

Can you paste your tuningConfig/ioConfig settings here? Also, if you are using the Hadoop Indexer, it’s recommended to have a remote Hadoop cluster do the indexing. It is possible to ingest 10 TB/hr using the Hadoop Indexer.
If a Hadoop cluster is not an option, you can use the parallel native batch indexer; a rough sketch is below. Just make sure that you have enough MiddleManagers to do the work.
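For reference, here is a rough sketch of what a parallel native batch (index_parallel) spec could look like. Field names have changed between Druid releases, the dataSchema is omitted, and the bucket, prefix, and sub-task count are placeholders, so treat this as a starting point rather than a drop-in spec:

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "prefixes": ["s3://bucket-name/"]
      },
      "inputFormat": {
        "type": "csv",
        "findColumnsFromHeader": true
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 8,
      "partitionsSpec": {
        "type": "dynamic"
      }
    }
  }
}

The key knob is maxNumConcurrentSubTasks (maxNumSubTasks in older releases), which caps how many sub-tasks the supervisor task can fan out across your MiddleManagers at once.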

Rommel Garcia

Here are the configuration settings:

"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "static",
    "paths": "s3n://bucket-name/cvsfile.csv"
  },
  "appendToExisting": true
},
"tuningConfig": {
  "type": "hadoop",
  "maxParseExceptions": 10000000,
  "partitionsSpec": {
    "type": "hashed",
    "targetPartitionSize": 5000000
  },
  "hadoopDependencyCoordinates": [
    "org.apache.hadoop:hadoop-client:2.8.3"
  ]
}

Nothing glaring here.

What is the value of the druid.worker.capacity property in your MiddleManager runtime.properties? That essentially defines how many tasks can be spawned to do the ingest using the Hadoop Indexer that comes with Druid.
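For context, that property lives in the MiddleManager's runtime.properties, something like the lines below (the values here are only illustrative):

druid.worker.capacity=8
druid.indexer.runner.javaOpts=-server -Xmx3g

Each running task occupies one worker slot, so the capacity bounds how many ingest tasks that MiddleManager can run concurrently.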

Rommel Garcia
Director, Field Engineering

Thanks for replying Rommel! I’m a member of Eric’s team. druid.worker.capacity=9.

Tristian,

Can you paste your runtime properties for the MiddleManager and Historical here? You have 9 workers in the MM but only 1 task is running for the ingest. That doesn’t line up.

Rommel Garcia

Director, Field Engineering

I don’t think multiple MM workers will pick up the same file. If you have a list of, say, 500 files, then each of your workers will run approximately 55 times (in parallel) and share the workload.

Also, if it’s a single chunk/file, I’m not sure how the MM works internally, but I’d imagine it will load the whole file into memory and then segment it.
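In other words, the parallelism comes from having many input files rather than one big one. If the 3TB CSV were split into smaller files before uploading, either the Hadoop input splits or the parallel indexer's sub-tasks could work on them concurrently. For illustration, the static inputSpec accepts a comma-separated list of paths (the file names here are made up):

"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "static",
    "paths": "s3n://bucket-name/part-000.csv,s3n://bucket-name/part-001.csv,s3n://bucket-name/part-002.csv"
  }
}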