Batch loading data from a CSV file

Hi,

I am struggling to craft a correct spec file for loading CSV data. Quite simply, I want to load data in the following format from a CSV file.

<app_id,time,name,version,num_starts,num_crashes>

Is this the correct format of the spec file?

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "AppInfo",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "csv",
          "timestampSpec": {
            "column": "time"
          },
          "dimensionsSpec": {
            "dimensions": [
              "app_id",
              "time",
              "name",
              "version",
              "num_starts",
              "num_crashes"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        },
        {
          "type": "doubleSum",
          "name": "num_starts",
          "fieldName": "num_starts"
        },
        {
          "type": "doubleSum",
          "name": "num_crashes",
          "fieldName": "num_crashes"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2010-01-01/2015-03-15"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "dataFolder/",
        "filter": "appdata.csv",
        "parser": {
          "timestampSpec": {
            "column": "time"
          },
          "data": {
            "format": "csv",
            "columns": ["app_id", "time", "name", "version", "num_starts", "num_crashes"],
            "dimensions": ["app_id", "time", "name", "version", "num_starts", "num_crashes"]
          }
        }
      }
    },
    "tuningConfig": {
      "type": "index",
      "targetPartitionSize": 0,
      "rowFlushBoundary": 0
    }
  }
}

Thanks,

Ranjit


When I submit the indexing task, I get the following error:

Error 500

HTTP ERROR: 500

Problem accessing /druid/indexer/v1/task. Reason:

    javax.servlet.ServletException: com.fasterxml.jackson.databind.JsonMappingException: Instantiation of [simple type, class io.druid.data.input.impl.CSVParseSpec] value failed: columns

Powered by Jetty://

Hi Ranjit:
Here is an example index task:

http://druid.io/docs/latest/Tasks.html

Please read this to understand how to construct a data schema:

http://druid.io/docs/latest/Ingestion.html

You are missing a "columns" field.
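For reference, a minimal sketch of the parseSpec with a "columns" entry added, reusing the column list from the spec above (untested, just to illustrate where "columns" goes):

"parseSpec": {
  "format": "csv",
  "timestampSpec": {
    "column": "time"
  },
  "columns": ["app_id", "time", "name", "version", "num_starts", "num_crashes"],
  "dimensionsSpec": {
    "dimensions": ["app_id", "time", "name", "version", "num_starts", "num_crashes"],
    "dimensionExclusions": [],
    "spatialDimensions": []
  }
}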

Hi Fangjin,

Thanks for your response. I am running into an error after adding "columns"; perhaps I am off by something tiny here, and I thought it would help to have another pair of eyes look at it. If you could be so kind as to take a look at my spec, I'd highly appreciate it.

Spec (attached)

I got past that error after I removed the "data" section from "firehose".

Now, I am running into a timestamp issue. Does the timestamp have to be the first column in the data? My data looks like this:

"app_a","2013-08-31T01:02:33Z","A",1,2,1
"app_b","2013-09-31T01:02:33Z","B",1,2,1
"app_c","2013-07-31T01:02:33Z","C",1,2,1
"app_d","2013-06-31T01:02:33Z","D",1,2,1

But the indexer seems to treat the full line as the app_id.

Null timestamp in input: {app_id=app_a,2013-08-31T01:02:33Z,A,1,2,1}

Caused by: java.lang.NullPointerException: Null timestamp in input: {app_id=app_a,2013-08-31T01:02:33Z,A,1,2,1}
at io.druid.data.input.impl.MapInputRowParser.parse(MapInputRowParser.java:46) ~[druid-api-0.3.4.jar:0.3.4]
… 11 more
2015-03-18T18:03:41,860 INFO [task-runner-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
“id” : “index_AppInfo_2015-03-18T18:03:32.070Z”,
“status” : “FAILED”,
“duration” : 62
}

Thanks,

Ranjit
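As a side note, the timestamp does not have to be the first column in a CSV parseSpec: the "columns" list defines the field order, and the timestampSpec's "column" only needs to name one of those columns. An error like the one above, where the entire line ends up in app_id, suggests the line is not being split into columns at all. A minimal fragment of the relevant parseSpec pieces (a sketch matching the sample rows above, not a verified fix):

"columns": ["app_id", "time", "name", "version", "num_starts", "num_crashes"],
"timestampSpec": {
  "column": "time"
}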

I reduced it further, but I still get date parse errors.

Caused by: java.lang.IllegalArgumentException: Invalid format: "2015-08-31T01:02:33Z,"app_a"" is malformed at ","app_a""
	at org.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187) ~[joda-time-2.6.jar:2.6]

The string in the error, "2015-08-31T01:02:33Z,"app_a"", looks like it is missing a closing ".

Although the error seems to say I’m missing quotes, I don’t have them right now around the date in the CSV file. Do I need to have double quotes around the date?

CSV (attached)

Can you include a line of your CSV data?

Sure,

CSV file (just one line, attached)

Instead of "type": "csv", can you use "format": "csv"?
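That is, inside the parseSpec:

"format": "csv"

instead of:

"type": "csv"

with everything else left as it was.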

That fixed it! Thanks so much! My sincere apologies for this newbie mistake; I got confused between "type" and "format".

I think I got confused because the "CSVParser" section in this doc, http://druid.io/docs/latest/Ingestion.html, mentions "type". Not sure if I interpreted this incorrectly or if the doc needs to be updated…

Hi Ranjit, this is a bug in the documentation. Great catch. Fixing now.

Hey Fangjin,

I am loading a ~700 MB CSV, and the indexer has been running for a good 5 hours now but hasn't completed. I see the logs getting updated, but I am not sure whether the loading is complete. What should I look for to know how close it is to finishing?

Thanks,

Ranjit

Hi Ranjit, that is very slow, even for the index task. How do you have your peons set up?

I only ran the Indexer. The Middle Manager and Peons were not running during this time. Do they need to be explicitly started?

Hi Ranjit,
You seem to be running with an interval of around 5 years at daily granularity; that will generate around 1800 segments for 700 MB of data.

The IndexTask is not well optimized for generating lots of small segments.

We usually suggest segment sizes of around 500 MB.

Can you try changing the segment granularity to YEAR or MONTH?

Also, specifying the exact interval over which the data is being ingested helps.
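For example, a granularitySpec along these lines; the MONTH granularity and the interval are illustrative, and the interval here is only a guess based on the sample rows earlier in the thread, so set it to your data's actual range:

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "MONTH",
  "queryGranularity": "NONE",
  "intervals": ["2013-06-01/2013-10-01"]
}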