I am trying to ingest data into a remote Hadoop cluster. When I submit a task with one or two files it succeeds, but when I try with many files it fails.
The failure is a memory error:
Container [pid=19644,containerID=container_1464869502526_0007_01_001594] is running beyond physical memory limits. Current usage: 6.0 GB of 6 GB physical memory used; 7.8 GB of 30 GB virtual memory used. Killing container.
2016-06-03T09:17:41,571 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 100% reduce 97%
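For what it's worth, the 6 GB limit in that message is YARN's per-container memory cap, not a Druid setting. As a sketch (the property names are standard Hadoop MapReduce configuration; the values are assumptions sized for 15 GB nodes, not recommendations), those limits can be raised through the jobProperties map of the Hadoop index task's tuningConfig:

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "mapreduce.map.memory.mb": "4096",
      "mapreduce.map.java.opts": "-Xmx3276m",
      "mapreduce.reduce.memory.mb": "8192",
      "mapreduce.reduce.java.opts": "-Xmx6553m"
    }
  }
}
```

The usual convention is to keep the JVM heap (`-Xmx`) around 80% of the container size so there is headroom for off-heap memory.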
I think I need to persist data to disk using maxRowsInMemory, or perhaps partitionsSpec could do the trick, but I really want to understand how Druid works with Hadoop first.
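To make the two knobs I mean concrete, here is a sketch of the tuningConfig fragment I have in mind (the numeric values are just illustrative defaults, not something I have tested):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "maxRowsInMemory": 75000,
    "partitionsSpec": {
      "type": "hashed",
      "targetPartitionSize": 5000000
    }
  }
}
```

My understanding is that maxRowsInMemory bounds how many rows are buffered before spilling to disk during indexing, while partitionsSpec controls how rows are split into segments.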
My Hadoop cluster nodes have 15 GB of memory each (m3.xlarge), so:
1- Why does Hadoop say I'm out of memory after only 6 GB of data? This question really amounts to: how do Druid and Hadoop work together? (My current understanding: the Overlord sends the task to a MiddleManager, which spawns a peon; the peon submits the job to Hadoop, Hadoop then works mysteriously and returns the dataset to the peon, which persists the data, to S3 in my case.)
2- Is it worth using instances with more memory for multi-file ingestion? My segment granularity is "day" and I have one file per day, so each segment will be ~500 MB, and I don't want many shards (500 MB is a good segment size).
3- Is there any optimization to tell Druid that my files already match my segment granularity, so that ingestion time improves?
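For context on question 3, this is the kind of granularitySpec I'm using (the interval is a placeholder for my actual date range); my understanding is that listing explicit intervals at least spares Druid from scanning the data to determine them:

```json
{
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "NONE",
    "intervals": ["2016-06-01/2016-06-30"]
  }
}
```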