Druid CSV indexing is extremely slow

While I was trying to index a CSV of just 2 lines into Druid, the indexing task was extremely slow.

The CSV consists of 2 lines:

```
1,0,90.95,"P",385,"MA",2018-05-08
1,0,91,6000,"MN",2018-05-08
```

The indexing JSON config file:

```json
{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "hkexsales",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "csv",
          "timestampSpec" : { "column" : "data_date", "format" : "iso" },
          "columns" : ["stockcode","seq","price","flag","quantity","session","data_date"],
          "dimensionsSpec" : {
            "dimensions" : ["stockcode","seq","price","flag","quantity","session"],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [
        { "type" : "count", "name" : "count" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE",
        "intervals" : [ "2013-08-31/2020-09-01" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "/data/HkexDayQuot/output/Sales/d180508e_1525922483177_skipheader2.htm"
      }
    },
    "tuningConfig" : {
      "type" : "hadoop"
    }
  }
}
```

The reported task duration was 8906764 ms, which is unacceptable for 2 rows.

If I index the entire CSV file (more than 400,000 rows), it gives me a wrong timestamp format error.

Please advise

"granularitySpec" : { "type" : "uniform", "segmentGranularity" : "DAY", "queryGranularity" : "NONE", "intervals" : [ "2013-08-31/2020-09-01" ] }

Given the size of your interval, I would suggest using a larger segmentGranularity, like YEAR or MONTH. (Or shrinking the interval to cover the data you are ingesting more tightly).
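For example, something along these lines (the MONTH granularity and the tighter 2018-05 interval are only an illustration based on your sample rows; adjust both to the actual range of your data):

```json
"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "MONTH",
  "queryGranularity" : "NONE",
  "intervals" : [ "2018-05-01/2018-06-01" ]
}
```

With DAY segments over a 2013-2020 interval, the job has far more potential segment buckets to deal with than your two rows actually need.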

Also, have you tried the local indexing task instead of the hadoop indexing task? http://druid.io/docs/latest/ingestion/tasks.html

I would give that a try as well, and see how long it takes.
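For reference, a rough sketch of the same spec as a local (native) index task, assuming the Druid version those docs describe. The dataSchema block is unchanged and elided here, and splitting your file path into baseDir and filter is my guess, so check the exact firehose fields against the docs:

```json
{
  "type" : "index",
  "spec" : {
    "dataSchema" : { /* same dataSchema as in the Hadoop spec above */ },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "/data/HkexDayQuot/output/Sales/",
        "filter" : "d180508e_1525922483177_skipheader2.htm"
      }
    },
    "tuningConfig" : {
      "type" : "index"
    }
  }
}
```

The native task doesn't spin up MapReduce jobs, and for a tiny input like this the Hadoop job startup overhead is likely most of your wall-clock time.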

If I index the entire CSV file (more than 400,000 rows), it gives me a wrong timestamp format error.

What is the specific error you’re getting? I would check the row that’s causing that error and fix the timestamp there.

If you’re using the hadoop task, you can set ignoreInvalidRows to ignore the row with the bad timestamp.
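With the Hadoop task that flag goes in the tuningConfig, something like this (note that rows dropped this way are skipped silently, so the bad timestamps simply won't show up in the datasource):

```json
"tuningConfig" : {
  "type" : "hadoop",
  "ignoreInvalidRows" : true
}
```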

If you’re using the local indexing task, a similar property is "reportParseExceptions".
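It also sits in the tuningConfig, but as far as I recall the logic is inverted from ignoreInvalidRows: leaving reportParseExceptions at false lets the task skip unparseable rows, while setting it to true makes the task fail on the first bad row (which can be handy for locating the offending line):

```json
"tuningConfig" : {
  "type" : "index",
  "reportParseExceptions" : false
}
```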