Ingesting a 66 MB .csv.gz local file with 35 columns and ~1.2 million rows takes about 3 minutes 35 seconds, and I’d like to reach maximum performance.
The setup I’m using is a physical machine with 32 cores and 128 GB of memory.
The Druid config files are the stock ones shipped with the new 0.15.0 release.
The ingestion spec was generated using the unified console “Load Data” tool.
I’d played around with the different Druid configs before but, back then, I had a small VM so I couldn’t do much. Today I found that 0.15.0 ships with configuration files for different server sizes out of the box, so I used those instead (now that I have a new machine to play with). I started with
./bin/start-single-server-medium and then tried
./bin/start-single-server-large, but there was no difference in performance. Looking at the values in “conf/druid/single-server/*”, as far as I can tell the worker counts increased, so supposedly the MiddleManager (and the Historical, which seems relevant too) has more resources to work with.
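For example, the main difference I could spot between the sizes is along these lines (an illustrative excerpt from memory; the exact values in the shipped 0.15.0 files may differ):

```properties
# conf/druid/single-server/medium/middleManager/runtime.properties
druid.worker.capacity=4

# conf/druid/single-server/large/middleManager/runtime.properties
druid.worker.capacity=8
```

But with a single local file to ingest, extra worker capacity alone doesn’t seem to make a difference.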
I’ve tried playing with different ingestion spec parameters, such as “maxRowsInMemory” (set above the number of rows in the file), “rollup”: “true”/“false” (removing the metric calculations), and “segmentGranularity” (set to encompass the entire .csv; there’s only one timestamp anyway). I also tried the partition parameter “Max rows per segment” with a value of 5 million (more than the entire .csv). Nothing helped at all.
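To be concrete, the combinations above boil down to something like this (values here are illustrative, not my exact spec — that one is attached below):

```json
{
  "dataSchema": {
    "granularitySpec": {
      "segmentGranularity": "YEAR",
      "rollup": false
    }
  },
  "tuningConfig": {
    "type": "index",
    "maxRowsInMemory": 2000000,
    "maxRowsPerSegment": 5000000
  }
}
```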
Attached are the ingestion spec, the task log, and dstat output captured during ingestion (with cpu, mem, and page-cache stats enabled; note the large amount of idle CPU). I changed a lot of field names to generic ones, so don’t mind those.
Can anyone advise what I can do to increase performance? Are there other parameters in the ingestion spec or the worker configs I should modify? Should I split the .csv.gz into smaller files and increase the MiddleManager task worker count? Or load it through Kafka instead?
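For the splitting idea, I imagine something along these lines (file names and chunk size are placeholders, shown here on a tiny demo file; the thinking is that gzip is not a splittable format, so a single .csv.gz can only be read by one task):

```shell
# Sketch: split a .csv.gz into smaller gzipped chunks so that a parallel
# ingestion task could assign one chunk per subtask.
# "data.csv.gz" is a placeholder; the chunk size is tiny for demo purposes.

# build a small stand-in for the real ~1.2M-row file
printf 'ts,val\n' > data.csv
for i in 1 2 3 4 5 6; do printf '2019-07-15T00:0%d:00Z,%d\n' "$i" "$i" >> data.csv; done
gzip -f data.csv

zcat data.csv.gz | head -n 1 > header.csv              # keep the header row
zcat data.csv.gz | tail -n +2 | split -l 2 - chunk_    # 2 data rows per chunk
for f in chunk_*; do
  cat header.csv "$f" | gzip > "$f.csv.gz" && rm "$f"  # re-attach header, recompress
done
ls chunk_*.csv.gz | wc -l                              # prints 3
```

With the real file I’d use something like `split -l 300000` so each chunk is a few hundred thousand rows, then point the ingestion task at the directory of chunks.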
one_csv_dstat_20190715164201.log (15.4 KB)
one_csv_index_20190715164201.log (228 KB)
one_csv_ingestion_v4.json (3.07 KB)