Hadoop disk usage (mapred/local cache cleanup) issue

Hi,

I have a single-node pseudo-distributed setup using Imply.io (1.2.1, installed manually), Hadoop (2.7.1, installed from the Apache BigTop repo) and MySQL for metadata on CentOS 7.2. I've noticed that disk usage is growing faster than I would expect.

Investigating, I found a large amount of disk usage under /var/lib/hadoop-hdfs/cache/imply/mapred/local in the form of many directories whose names appear to be timestamps (e.g. '1463738866867', '1463738866870').

"ls -1a | wc -l" returns 1,499,549
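For a sense of total size (not just entry count), something like this should work, though it may take a while over ~1.5 million entries:

# Summarize total disk usage of the mapred/local cache directory
du -sh /var/lib/hadoop-hdfs/cache/imply/mapred/local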

Looking into a number of these directories, they all seem to contain a single file:

[root@server 1463738866890]# ls -al
total 55800
drwxr-xr-x 2 imply imply 55 May 20 11:07 .
drwxr-xr-x 1499549 imply imply 38014976 Jun 23 10:20 ..
-rw-r--r-- 1 imply imply 8976 May 20 11:07 .tmp_aws-java-sdk-dynamodb-1.10.21.jar.crc

[root@server 1463738866879]# ls -al
total 55796
drwxr-xr-x 2 imply imply 60 May 20 11:07 .
drwxr-xr-x 1499549 imply imply 38014976 Jun 23 10:20 ..
-rw-r--r-- 1 imply imply 4160 May 20 11:07 .tmp_aws-java-sdk-swf-libraries-1.10.21.jar.crc

[root@server 1463738866893]# ls -al
total 55792
drwxr-xr-x 2 imply imply 50 May 20 11:07 .
drwxr-xr-x 1499549 imply imply 38014976 Jun 23 10:20 ..
-rw-r--r-- 1 imply imply 1956 May 20 11:07 .tmp_aws-java-sdk-sqs-1.10.21.jar.crc

It looks to me like something isn’t getting cleaned up correctly. Is this a known issue? Where should I look next to work out what is/isn’t happening and why? Are there any settings I should look at to improve the setup so that this doesn’t happen?

Thanks in advance

AllenJB

Hey Allen,

Do you mean /var/lib/hadoop-hdfs/cache/imply/mapred/local on HDFS or on your actual local filesystem? What is your hadoop.tmp.dir set to?

Hi,

The specified path is what I see on the local filesystem, not on HDFS.

hadoop.tmp.dir does not appear to be set in either the Imply configuration or the Hadoop configuration (under /etc/hadoop), so I assume it is using the default value.
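I believe the effective value can also be queried directly with hdfs getconf; the stock Hadoop default should be /tmp/hadoop-${user.name}, so a path under /var/lib/hadoop-hdfs/cache presumably comes from the BigTop packaging:

# Prints the resolved hadoop.tmp.dir from the active configuration
hdfs getconf -confKey hadoop.tmp.dir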

AllenJB

I have the same issue (druid-0.8.3), and it seems to be related to local Hadoop ingestion not cleaning up after itself.

I also have many folders in this local Hadoop directory.

/data/hadoop-tmp/username/mapred/local/
ls -1a | wc -l
1293570


A couple of quotes about local Hadoop ingestion:

https://groups.google.com/d/msg/druid-user/kvvQtb4F1Lw/jFc-ndAJBAAJ

Fangjin Yang

26 Jun

Answers duplicated with https://groups.google.com/forum/#!topic/druid-user/SFYlum_wu38.

Do not use local hadoop ingestion for anything beyond a small POC data set and expect good performance.

https://groups.google.com/d/msg/druid-user/SFYlum_wu38/9TsEW8YJBAAJ

Fangjin Yang

26 Jun

The local hadoop task is only meant for quickstarts and PoCs, it is not designed to be performant at all. For ingestion of large batch static files, we recommend using a remote Hadoop cluster or if you have your data in Kafka and are using Kafka 0.9.1, you can stream your data via the new Kafka indexing task.

Be aware that if you wish to use a remote Hadoop cluster, it may require a custom Druid distribution.

The following commands might be helpful: the first lists the directories older than three days that would be removed, the second actually removes them.

find /data/hadoop-tmp/username/mapred/local -mindepth 1 -maxdepth 1 -type d -mtime +3 -exec echo {} \;
find /data/hadoop-tmp/username/mapred/local -mindepth 1 -maxdepth 1 -type d -mtime +3 -exec rm -rf {} \;
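With well over a million directories, forking one rm per directory is slow; a variant using find's '+' terminator (same path and three-day retention assumed) batches many directories into each rm invocation:

# '+' ends -exec so find batches paths, instead of running rm once per directory
find /data/hadoop-tmp/username/mapred/local -mindepth 1 -maxdepth 1 -type d -mtime +3 -exec rm -rf {} +

Running something like this from cron (daily, say) should keep the growth in check until the underlying cleanup issue is fixed.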
