Hadoop Indexing Service, Memory issue

Hi,

I'm trying to ingest some data with a remote Hadoop cluster. When I submit a task with one or two files it succeeds, but when I try with many files it fails.

The error is memory-related:

Hadoop:

Container [pid=19644,containerID=container_1464869502526_0007_01_001594] is running beyond physical memory limits. Current usage: 6.0 GB of 6 GB physical memory used; 7.8 GB of 30 GB virtual memory used. Killing container.

Overlord logs:

2016-06-03T09:17:41,571 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 100% reduce 97%

I think I need to persist data to disk using maxRowsInMemory, or maybe partitionsSpec could do the trick, but I'd really like to understand how Druid works with Hadoop first.
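In case it helps the discussion, this is the kind of tuningConfig fragment I have in mind; just a rough sketch with illustrative values (I'm not sure these are the right numbers for my data):

"tuningConfig": {
  "type": "hadoop",
  "maxRowsInMemory": 75000,
  "partitionsSpec": {
    "type": "hashed",
    "targetPartitionSize": 5000000
  }
}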

My Hadoop cluster's core nodes have 15 GB of memory each (m3.xlarge), so:

1- Why does Hadoop say I'm out of memory after only 6 GB? This question really amounts to: how do Druid and Hadoop work together? (My understanding: the Overlord sends the task to a MiddleManager, which spawns a peon, which submits the job to Hadoop; then Hadoop does its mysterious work and hands the result back to the peon, which persists the data, to S3 in my case.)

2- Is it worth using instances with more memory for multi-file ingestion? My segment granularity is “day” and I have one file per day, so each segment will be ~500 MB, and I don't want many shards (500 MB is a good size for a segment).

3- Is there any optimization to tell Druid that my files match my segment granularity, to improve ingestion time?

Thanks,

Ben

Hi Ben, do you have more info about where it is running out of memory? Some more comments inline.

Hi,

I'm trying to ingest some data with a remote Hadoop cluster. When I submit a task with one or two files it succeeds, but when I try with many files it fails.

The error is memory-related:

Hadoop:

Container [pid=19644,containerID=container_1464869502526_0007_01_001594] is running beyond physical memory limits. Current usage: 6.0 GB of 6 GB physical memory used; 7.8 GB of 30 GB virtual memory used. Killing container.

Overlord logs:

2016-06-03T09:17:41,571 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 100% reduce 97%

I think I need to persist data to disk using maxRowsInMemory, or maybe partitionsSpec could do the trick, but I'd really like to understand how Druid works with Hadoop first.

My Hadoop cluster's core nodes have 15 GB of memory each (m3.xlarge), so:

1- Why does Hadoop say I'm out of memory after only 6 GB? This question really amounts to: how do Druid and Hadoop work together? (My understanding: the Overlord sends the task to a MiddleManager, which spawns a peon, which submits the job to Hadoop; then Hadoop does its mysterious work and hands the result back to the peon, which persists the data, to S3 in my case.)

The indexing service acts as a driver and sends the job to a remote Hadoop cluster to create the segment.
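If the remote job itself needs different memory settings, one option is to pass Hadoop properties through jobProperties in the tuningConfig; a sketch only, assuming your version supports it, with purely illustrative values:

"tuningConfig": {
  "type": "hadoop",
  "jobProperties": {
    "mapreduce.reduce.memory.mb": "8192",
    "mapreduce.reduce.java.opts": "-Xmx6144m"
  }
}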

2- Is it worth using instances with more memory for multi-file ingestion? My segment granularity is “day” and I have one file per day, so each segment will be ~500 MB, and I don't want many shards (500 MB is a good size for a segment).

Can you share the details of your Hadoop cluster?

3- Is there any optimization to tell Druid that my files match my segment granularity, to improve ingestion time?

I’m not exactly sure what you mean by this.

Hi, thanks for your answer Fangjin! I resolved my issue: it was simply a bad setting in my EMR configuration file. “mapreduce.reduce.memory.mb” was set to 6 GB, so I wasn't using all my memory!
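For anyone else hitting this, the fix went into the EMR configurations JSON, something along these lines (the values here are illustrative, not my exact ones):

[
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapreduce.reduce.memory.mb": "10240",
      "mapreduce.reduce.java.opts": "-Xmx8192m"
    }
  }
]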
See inline.

Hi Ben, do you have more info about where it is running out of memory? Some more comments inline.

Hi,

I'm trying to ingest some data with a remote Hadoop cluster. When I submit a task with one or two files it succeeds, but when I try with many files it fails.

The error is memory-related:

Hadoop:

Container [pid=19644,containerID=container_1464869502526_0007_01_001594] is running beyond physical memory limits. Current usage: 6.0 GB of 6 GB physical memory used; 7.8 GB of 30 GB virtual memory used. Killing container.

Overlord logs:

2016-06-03T09:17:41,571 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 100% reduce 97%

I think I need to persist data to disk using maxRowsInMemory, or maybe partitionsSpec could do the trick, but I'd really like to understand how Druid works with Hadoop first.

My Hadoop cluster's core nodes have 15 GB of memory each (m3.xlarge), so:

1- Why does Hadoop say I'm out of memory after only 6 GB? This question really amounts to: how do Druid and Hadoop work together? (My understanding: the Overlord sends the task to a MiddleManager, which spawns a peon, which submits the job to Hadoop; then Hadoop does its mysterious work and hands the result back to the peon, which persists the data, to S3 in my case.)

The indexing service acts as a driver and sends the job to a remote Hadoop cluster to create the segment.

2- Is it worth using instances with more memory for multi-file ingestion? My segment granularity is “day” and I have one file per day, so each segment will be ~500 MB, and I don't want many shards (500 MB is a good size for a segment).

Can you share the details of your Hadoop cluster?

3- Is there any optimization to tell Druid that my files match my segment granularity, to improve ingestion time?

I’m not exactly sure what you mean by this.

Here I just want to know: given that my files are one per day and my granularity is day, is there a way to tell Druid something like “if this file has a timestamp that falls in this segment, then all of its rows go in that segment too”? Some kind of optimization?
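To make it concrete, this is roughly the kind of day-granularity granularitySpec I mean (a sketch only; the dates and queryGranularity are illustrative). My question is whether Druid can exploit the one-file-per-day layout beyond this:

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "NONE",
  "intervals": ["2016-06-01/2016-06-04"]
}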

Thanks,

Ben

Again, thanks for your awesome work, I'm a fan! :slight_smile: