Tuning .csv.gz batch ingestion


Ingesting a 66 MB .csv.gz local file with 35 columns and ~1.2 million rows takes about 3 minutes 35 seconds, and I’d like to reach maximum performance.

The setup I’m using is a physical machine with 32 cores and 128 GB of memory.

The Druid config files are the ones provided with the new 0.15.0 version.

The ingestion spec was generated using the unified console “Load Data” tool.

I’ve played around with the different Druid configs before but, back then, I only had a small VM so I couldn’t do much. Today I found that 0.15.0 ships with configuration files for different server sizes out of the box, so I used those instead (now that I have a new machine to play with). I started out with ./bin/start-single-server-medium and then tried ./bin/start-single-server-large, but there was no difference in performance. Looking at the values in “conf/druid/single-server/*”, as far as I can tell, worker counts increased and, supposedly, the MiddleManager (and the Historical seems relevant, too) has more resources to work with.

I’ve tried to play around with the different ingestion spec parameters, such as “maxRowsInMemory” (set it above the number of rows in the file), “rollup”: “true”/“false” (removing metric calculations), and “segmentGranularity” (set it to encompass the entire .csv; there’s only one timestamp, anyway). I also tried the partition parameter “Max rows per segment” with a value of 5 million (more than the entire .csv). Nothing helped at all.
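For reference, this is roughly where those knobs sit in the native index task spec. A sketch only, with the values I tried rather than recommendations, and field names as I understand them from the 0.15 docs (“maxRowsPerSegment” may differ slightly between versions):

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "YEAR",
        "rollup": false
      }
    },
    "tuningConfig": {
      "type": "index",
      "maxRowsInMemory": 2000000,
      "maxRowsPerSegment": 5000000
    }
  }
}
```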

Attached are the ingestion spec, task log and dstat output during ingestion (with cpu, mem and page cache enabled; there’s a lot of idle CPU). I changed a lot of field names to generic ones, so don’t mind them.

Can anyone advise what I can do to increase performance? Are there other parameters in the ingestion spec/workers I should modify? Should I split the .csv.gz into smaller files and increase the MiddleManager task worker count? Or load it through Kafka instead?
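On the split-files idea: my understanding is that a single .csv.gz can’t be read in parallel (gzip isn’t splittable), so splitting would mean pre-splitting the file on disk and pointing an index_parallel task at the directory. A sketch of what I have in mind, assuming the 0.15 local firehose syntax (the baseDir path is hypothetical, and I believe the concurrency field was called “maxNumSubTasks” in 0.15 before being renamed later):

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "firehose": {
        "type": "local",
        "baseDir": "/data/split",
        "filter": "*.csv.gz"
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumSubTasks": 8
    }
  }
}
```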

one_csv_dstat_20190715164201.log (15.4 KB)

one_csv_index_20190715164201.log (228 KB)

one_csv_ingestion_v4.json (3.07 KB)

Is your goal to ingest this 66MB file as fast as possible or are you trying to optimize the MB/s ingest rate for a larger dataset?

I’m trying to optimize the MB/s ingest rate for a larger dataset, yes.

To ingest only 66 MB of data, a simple schema should be enough; you shouldn’t need any extra properties in your tuning config.
Use Hadoop batch ingestion through MapReduce and you will see the ingestion performance improve. Add “jobProperties” to your tuning config; follow the link below.
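For example, something like this in the tuningConfig of an index_hadoop task. A sketch only: the MapReduce property names are standard Hadoop ones, and the memory values are placeholders you would tune for your cluster:

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "mapreduce.map.memory.mb": "2048",
      "mapreduce.reduce.memory.mb": "4096",
      "mapreduce.job.user.classpath.first": "true"
    }
  }
}
```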


I followed the Hadoop tutorial and tried fine-tuning the Hadoop ingestion.
Namely, the number of files, memory for MapReduce and “segmentGranularity”.

The last line in the log says

2019-07-18T07:25:37,604 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Running job: job_1563432531797_0002

so I assume no errors, and it still runs extremely slowly.

I tried going through the Kafka tutorial and fine-tuning according to posts on this board about Kafka ingestion.

Namely, shorter “taskDuration”, “segmentGranularity”, “rollup” and “replicas”.
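Concretely, these are roughly the fields I changed, shown in a trimmed-down supervisor spec. A sketch only: the topic name is a placeholder, and the values are what I ended up with rather than recommendations:

```json
{
  "type": "kafka",
  "dataSchema": {
    "granularitySpec": {
      "segmentGranularity": "HOUR",
      "rollup": false
    }
  },
  "ioConfig": {
    "topic": "one_csv",
    "taskDuration": "PT10M",
    "replicas": 1
  }
}
```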

It ran much faster than local or Hadoop ingestion, at roughly 30k rows/sec, but we produce 100k rows/sec. The local/Hadoop rates were in the low thousands.

Attached the Hadoop ingestion spec.

Used the Kafka ingestion spec in the tutorial (other than the mentioned fields).

Are there other things I might’ve missed, or is my dataset simply unfit for Druid? (An assumption I made based on this doc.)

one_csv_index_hadoop_v2.json (5.01 KB)