Out of memory on CSV import

Hi

I’m trying to import a CSV file with 21 million rows. This gave me an out-of-memory error, so now I’m trying with just 1,000 rows in the CSV file.

I still get this error though: https://gist.github.com/Danielss89/23764fb7783e919432d7966951f857a4

This is the setup config: https://gist.github.com/Danielss89/e18f1953af2d69380f6078631f362a1d

The server has 24 GB of memory and 4 cores.

Did I configure something wrong?

Thanks,

Daniel

You have at least two knobs to tune.

The first is the Hadoop job properties, where you can set the container memory limits and Java opts.

Here is an example:

"jobProperties" : {
  "mapreduce.map.memory.mb" : 2048,
  "mapreduce.map.java.opts" : "-server -Xmx1536m -Duser.timezone=UTC -Dfile.encoding=UTF-8",
  "mapreduce.reduce.memory.mb" : 2048,
  "mapreduce.reduce.java.opts" : "-server -Xmx2560m -Duser.timezone=UTC -Dfile.encoding=UTF-8"
}

The second is the target partition size for segments.

Ok, thank you.
I suppose you mean the "targetPartitionSize" in "tuningConfig" -> "partitionsSpec".

Should I increase or decrease it?

Here is an example:

"tuningConfig" : {
  "type" : "hadoop",
  "workingPath" : "/tmp/slim",
  "leaveIntermediate" : true,
  "partitionsSpec" : {
    "type" : "hashed",
    "targetPartitionSize" : 5000000,
    "maxPartitionSize" : 75000000,
    "assumeGrouped" : false,
    "numShards" : -1,
    "partitionDimensions" : []
  }
}

Unfortunately I’m still getting the same error (I think): https://gist.github.com/Danielss89/337d82edd8f2a0ac734147f846d20238
Any ideas?

Would you share your jobProperties? It would be great if you could also share the entire index task JSON.

Another factor that can affect memory usage is maxRowsInMemory (http://druid.io/docs/latest/ingestion/batch-ingestion.html).

Would you try to decrease maxRowsInMemory in tuningConfig?
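For illustration, a minimal sketch of where that setting lives (the value is only an example; smaller values mean more frequent spills to disk but a smaller heap footprint):

"tuningConfig" : {
  "type" : "hadoop",
  "maxRowsInMemory" : 10000
}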

Jihoon

On Fri, May 19, 2017 at 4:24 AM, Daniel Strøm danielss89@gmail.com wrote:

The task looks like this: https://gist.github.com/Danielss89/5d65e09445d145851ef9612245d1f029
I’m now trying to set maxRowsInMemory to 500. I just don’t understand why this happens; I mean, 1,000 rows is not a lot, is it?

Still running out of memory :(

Ah, the max heap size in ‘mapreduce.reduce.java.opts’ should be smaller than the YARN container memory limit (‘mapreduce.reduce.memory.mb’) to reserve some memory for non-heap JVM overhead (http://stackoverflow.com/questions/24070557/what-is-the-relation-between-mapreduce-map-memory-mb-and-mapred-map-child-jav).

Would you try again after tuning those params?
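As a sketch of that rule (illustrative numbers, not a recommendation): keep each -Xmx comfortably below its container’s memory.mb, e.g.

"jobProperties" : {
  "mapreduce.map.memory.mb" : 2048,
  "mapreduce.map.java.opts" : "-server -Xmx1536m -Duser.timezone=UTC -Dfile.encoding=UTF-8",
  "mapreduce.reduce.memory.mb" : 3072,
  "mapreduce.reduce.java.opts" : "-server -Xmx2560m -Duser.timezone=UTC -Dfile.encoding=UTF-8"
}

Here 1536m < 2048 and 2560m < 3072, leaving roughly 25% of each container for non-heap memory.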

On Fri, May 19, 2017 at 3:52 PM, Daniel Strøm danielss89@gmail.com wrote:

Still getting an out-of-memory error: https://gist.github.com/Danielss89/c013b67e1147c0ea5236e7b7df02f494
The job properties look like this:

"jobProperties" : {
  "mapreduce.map.memory.mb" : 5048,
  "mapreduce.map.java.opts" : "-server -Xmx3536m -Duser.timezone=UTC -Dfile.encoding=UTF-8",
  "mapreduce.reduce.memory.mb" : 2048,
  "mapreduce.reduce.java.opts" : "-server -Xmx2560m -Duser.timezone=UTC -Dfile.encoding=UTF-8"
}

The reducer configuration doesn’t seem to have changed: the container memory limit (2048) should be larger than the Java heap size (2560), but it isn’t.

Same error :( https://gist.github.com/Danielss89/e07a3abada1e71a20485df11098f00bd

"jobProperties" : {
        "mapreduce.map.memory.mb" : "3500",
        "mapreduce.map.java.opts" : "-server -Xmx5536m -Duser.timezone=UTC -Dfile.encoding=UTF-8",
        "mapreduce.reduce.memory.mb" : "5048",
        "mapreduce.reduce.java.opts" : "-server -Xmx2560m -Duser.timezone=UTC -Dfile.encoding=UTF-8"
      },

Going out on a limb, but:

"segmentGranularity" : "MINUTE"

Is that what you really need?
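For what it’s worth (general Druid behavior, not specific to this task): segmentGranularity controls how segments are cut in time, while queryGranularity controls the rollup resolution of the rows, so minute-level data can still live in coarser segments. A hypothetical sketch:

"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "DAY",
  "queryGranularity" : "MINUTE",
  "intervals" : ["2017-01-01/2017-06-01"]
}

With MINUTE segmentGranularity over a multi-month interval, the job may end up managing one segment per minute of data, which multiplies per-segment overhead.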

Yes, I need to plot minute data on a map.

What I don’t understand is that this is only a 1,000-line file; shouldn’t it be more than easy for Druid to process this?

Yes, definitely. It’s weird.
I’m also looking at this.

I got the same error while uploading 8 million rows.
I changed the format from CSV to pretty-printed JSON and that solved it.

I’m trying to load CSV data that has only 20 records, but I end up with the following error:

java.sql.SQLTransactionRollbackException: An SQL data change is not permitted for a read-only connection, user or database. [statement:"INSERT INTO druid_tasks (id, created_date, datasource, payload, active, status_payload) VALUES (:id, :created_date, :datasource, :payload, :active, :status_payload)"


Could anyone guide me in the right direction to resolve the above error?

Thanks

Are you trying to insert data via a SQL INSERT INTO statement?

That is not supported by the native Druid SQL layer; you need to send an index task to the overlord.
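In other words, POST the ingestion spec JSON to the overlord’s task endpoint, something like the following (the file name index_task.json is a placeholder; host and port depend on your deployment):

curl -X POST -H 'Content-Type: application/json' -d @index_task.json http://localhost:8090/druid/indexer/v1/task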

Thanks for responding.

The following command is used to load the CSV data:

curl -X 'POST' -H 'Content-Type: application/json' -d @test2.json localhost:8090/druid/indexer/v1/task

Basically, I’m new to Druid and trying to understand the basics, so I started with CSV. My intention is to load CSV data for now and then do some analysis. This CSV is exported from my DWH dimension table, not from a transactional table.

Here is my .json file

{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "/home/helical/Documents/dim_channel.csv"
      }
    },
    "dataSchema" : {
      "dataSource" : "dimchannel",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "minute",
        "queryGranularity" : "none",
        "intervals" : ["2017-01-01/2017-06-01"]
      },
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "csv",
          "timestampSpec" : [{"column" : "audit_insert", "format" : "yyyy-M-d H:m:s"}, {"column" : "audit_update", "format" : "yyyy-M-d H:m:s"}],
          "columns" : ["dim_channel_id", "channelname", "channelcode", "subchannelname", "subchannelcode", "active_flag", "audit_flag", "audit_insert", "version", "audit_update"],
          "dimensionsSpec" : {
            "dimensions" : [
              "channelname",
              "subchannelname"
            ]
          }
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        },
        {
          "name" : "speed",
          "type" : "longSum",
          "fieldName" : "dim_channel_id"
        }
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000,
        "maxPartitionSize" : 75000000,
        "assumeGrouped" : false,
        "numShards" : -1,
        "partitionDimensions" : []
      },
      "jobProperties" : { }
    }
  }
}

Kindly point me in the right direction.

Thanks