Batch ingestion performance

Hi folks,

I’m new to Druid, and am loading 1.5M rows (around 1 GB) of data from a CSV file, with timestamps spanning a year or so.

It seems very slow to ingest the data - taking around 25 seconds to create a segment for each day.

Is this normal? I’m only using a simple setup at this point - 1 coordinator, 1 historical node, 1 indexing service node.

Is there a better way to load the data? Maybe separate CSV files per day? I also notice that it seems to parse the entire file twice per segment.

Thanks in advance,

Nick

Hey Nick,

If you’re using local batch indexing (i.e. not a remote Hadoop cluster) then you’ll get much better performance by splitting your data up into one task per segment. The Hadoop indexing code does this automatically, but the local indexing code does not.
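For the local index task, that means submitting one task per day, each with its interval pinned to that day and its firehose pointed at only that day’s file. Roughly something like the sketch below - the datasource name, column names, paths and dates are all placeholders, and the exact spec layout depends on which Druid version you’re on:

  {
    "type": "index",
    "spec": {
      "dataSchema": {
        "dataSource": "nick_csv",
        "parser": {
          "type": "string",
          "parseSpec": {
            "format": "csv",
            "timestampSpec": { "column": "timestamp", "format": "auto" },
            "columns": ["timestamp", "dim1", "dim2", "value"],
            "dimensionsSpec": { "dimensions": ["dim1", "dim2"] }
          }
        },
        "metricsSpec": [
          { "type": "doubleSum", "name": "value_sum", "fieldName": "value" }
        ],
        "granularitySpec": {
          "type": "uniform",
          "segmentGranularity": "DAY",
          "queryGranularity": "NONE",
          "intervals": ["2015-01-01/2015-01-02"]
        }
      },
      "ioConfig": {
        "type": "index",
        "firehose": {
          "type": "local",
          "baseDir": "/data/split",
          "filter": "2015-01-01.csv"
        }
      }
    }
  }

You’d then submit one of these per day, changing the interval and the filter each time.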

That said, 1.5M rows is pretty small, so you might be able to get away with using segmentGranularity = MONTH or YEAR. MONTH will probably work best, as that will still allow you to parallelize queries. A single YEAR segment will make all your queries single-threaded.
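If you go that route, the only change from your current spec should be the granularitySpec - something like the following, with the interval assuming your data covers calendar 2015 (set it to whatever your timestamps actually span):

  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "MONTH",
    "queryGranularity": "NONE",
    "intervals": ["2015-01-01/2016-01-01"]
  }

That way a single task produces roughly one segment per month, which keeps queries parallelizable without you having to split the input files yourself.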