Taking out the trash in HDInsight

One thing Hadoop doesn’t do very effectively (right now, anyway) is clean up after itself. Like most file systems it has a trash bin (see “Space Reclamation” in the HDFS Architecture guide), which is supposed to empty itself after “a configurable amount of time”. In the HDInsight default setup that appears to be 360 minutes (6 hours), going by the fs.trash.interval setting in core-site.xml.
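If you want to check that interval on your own cluster, something like the following would pull it out of the config file. Treat it as a rough sketch: it assumes a POSIX shell with awk is available, the function name conf_value is made up for illustration, and the path in the usage comment is a placeholder for wherever your core-site.xml actually lives.

```shell
# Sketch: print the <value> that follows a given <name> in a Hadoop XML
# config file. conf_value is a hypothetical helper, not a Hadoop command.
conf_value() {
    # $1 = property name, $2 = path to the config file
    awk -v key="$1" '
        index($0, "<name>" key "</name>") { found = 1 }
        found && match($0, /<value>[^<]*<\/value>/) {
            # Strip the 7-character <value> and 8-character </value> tags.
            print substr($0, RSTART + 7, RLENGTH - 15)
            exit
        }' "$2"
}

# Usage (placeholder path, replace with your own):
#   conf_value fs.trash.interval /path/to/core-site.xml
```

The value is in minutes, so the 360 mentioned above corresponds to 6 hours.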

However, I’ve found this doesn’t always happen at the speed I’d like. Some processes (which ones, I haven’t yet confirmed) also leave stuff lying around in the /tmp folder, which has to be cleaned up manually. As long as there’s nothing running, it seems to be safe to kill whatever is stored in /tmp. However, don’t blame me if it all goes wrong for you :)

HDFS Commands to help free up space

So there are a few things you can do to get out of this. The first is to avoid using trash disk space at all, by adding the -skipTrash option to your deletes:

hadoop fs -rmr -skipTrash /user/hadoop/data

This sidesteps the Trash altogether. Of course, it also means you can’t retrieve anything from the Trash bin afterwards, so use it wisely.
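Since a -skipTrash delete can’t be undone, it may be worth wrapping it in a small guard. The sketch below is one way to do that, assuming a POSIX shell; the function name purge is hypothetical, and it uses the same -dus and -rmr spellings as the rest of this post (newer Hadoop releases prefer -du -s and -rm -r).

```shell
# Sketch: show the size of a path, then delete it bypassing the Trash.
# purge is a made-up helper name, not part of Hadoop.
purge() {
    case "$1" in
        ""|"/")
            # Refuse the two most dangerous arguments outright.
            echo "refusing to delete '$1'" >&2
            return 1
            ;;
    esac
    # Only delete if the size lookup succeeded, i.e. the path exists.
    hadoop fs -dus "$1" && hadoop fs -rmr -skipTrash "$1"
}

# Usage:
#   purge /user/hadoop/data
```

Printing the size first gives you one last chance to notice you’re about to remove far more than you meant to.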

The next thing you can do is reach for the expunge command, which forces the Trash to be emptied:

hadoop fs -expunge

However, this didn’t always seem to work for me, so it’s worth checking that it has had the desired effect.
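One way to check is to measure the trash directory before and after. The sketch below assumes a POSIX shell and the default HDFS trash location of /user/<name>/.Trash; the function name trash_report is made up for illustration. One possible explanation for expunge seeming not to work, going by the Hadoop shell documentation, is that it only deletes trash checkpoints older than the retention interval and rolls the current contents into a new checkpoint, so recently deleted files may survive it.

```shell
# Sketch: report trash usage, run an expunge, and report usage again.
# trash_report is a hypothetical helper; the .Trash path is the HDFS default.
trash_report() {
    # $1 = user name, defaulting to the current user
    trash_dir="/user/${1:-$USER}/.Trash"
    echo "Before expunge:"
    hadoop fs -dus "$trash_dir"
    hadoop fs -expunge
    echo "After expunge:"
    hadoop fs -dus "$trash_dir"
}

# Usage:
#   trash_report hadoop
```

If the two sizes match, the space is still being held and a -skipTrash delete of the trash contents may be the only way to get it back immediately.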

HDFS Commands to find what is using disk space

Sometimes the key thing is to find out where that disk space is being eaten up. Say hello to du (disk usage), here in its summary form, dus:

hadoop fs -dus /

This gives you the size of that data on your datanodes. Then dig deeper with ls:

hadoop fs -ls /

This lists the directories in root. Use du to size them, find the unexpected space, and delete using rm or rmr as required.
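To make the hunt a bit quicker, the du output can be sorted so the biggest directories float to the top. This is a sketch under a couple of assumptions: a POSIX shell, and a Hadoop version whose `hadoop fs -du` prints the size in bytes as the first column (check the output on your own cluster first). The function names largest_first and hdfs_biggest are made up for illustration.

```shell
# Sketch: list the entries directly under an HDFS path, largest first.
# Assumption: the size in bytes is the first column of `hadoop fs -du`.

largest_first() {
    # Drop the "Found N items" header some versions print,
    # then sort numerically in descending order.
    grep -v '^Found ' | sort -rn
}

hdfs_biggest() {
    # $1 = HDFS path to inspect, defaulting to the root
    hadoop fs -du "${1:-/}" | largest_first
}

# Usage:
#   hdfs_biggest /
#   hdfs_biggest /user
```

Run it on root, then again on whichever directory comes out on top, and repeat until you find the culprit.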

The full set of file system shell commands is listed in the Hadoop File System Shell guide.
