One thing Hadoop doesn’t do very effectively (right now, anyway) is clean up after itself. Like most file systems it has a trash bin (see “Space Reclamation” in the HDFS Architecture guide), which is supposed to empty itself after “a configurable amount of time” – the fs.trash.interval setting, which is 360 minutes (6 hours) according to core-site.xml in the HDInsight default setup.
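For reference, that retention period lives in core-site.xml as fs.trash.interval, measured in minutes. A sketch of what the relevant property looks like – the value shown just mirrors the HDInsight default mentioned above:

```xml
<!-- core-site.xml: how long deleted files linger in the trash, in minutes.
     360 minutes = 6 hours; 0 would disable the trash feature entirely. -->
<property>
  <name>fs.trash.interval</name>
  <value>360</value>
</property>
```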
However, I’ve found this doesn’t always happen at the speed I’d like, and some processes (which ones, I haven’t yet confirmed) also leave stuff lying around in the /tmp folder, which has to be cleaned up manually. As long as there’s nothing running, it seems to be safe to delete whatever is stored in /tmp – but don’t blame me if it all goes wrong for you.
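If you do go after /tmp by hand, it’s worth reviewing what’s in there before deleting anything. A minimal sketch, assuming an idle cluster – the old_tmp_entries helper, the cutoff date, and the canned directory names are all illustrative, not real output:

```shell
# Pick out HDFS /tmp entries older than a cutoff date so they can be
# reviewed before deletion. `hadoop fs -ls` prints lines in the form
# "perms repl owner group size YYYY-MM-DD HH:MM path", so field 6 is
# the date and field 8 the path. Real usage:
#   hadoop fs -ls /tmp | old_tmp_entries 2013-01-01
old_tmp_entries() {
    awk -v cutoff="$1" 'NF >= 8 && $6 < cutoff { print $8 }'
}

# Canned ls-style lines for illustration (real ones come from HDFS):
printf 'drwxr-xr-x - hadoop supergroup 0 2012-11-01 10:00 /tmp/hive-hadoop\ndrwxr-xr-x - hadoop supergroup 0 2013-02-01 09:00 /tmp/mapred\n' | old_tmp_entries 2013-01-01
```

Anything it prints is a candidate for hadoop fs -rm -r – but only once you’re sure nothing is using it.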
HDFS Commands to help free up space
So there are a few things you can do to get out of this. First, avoid using trash disk space at all by adding the -skipTrash option to your deletes:
hadoop fs -rmr -skipTrash /user/hadoop/data
This avoids the problem of using the Trash altogether. Of course, this also means you avoid being able to retrieve stuff from the Trash bin, so use wisely.
The next thing you can do is reach for the expunge command, which forces an empty of the Trash:
hadoop fs -expunge
However this didn’t always seem to work for me, so it’s worth checking it has had the desired effect.
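One way to check is to size the trash directory before and after the expunge. A sketch, assuming the usual /user/&lt;username&gt;/.Trash location – the “hadoop” user and the trash_bytes helper name are examples:

```shell
# `hadoop fs -du -s <path>` prints "<bytes> <path>"; this helper pulls
# out just the byte count so before/after values are easy to compare.
trash_bytes() {
    awk '{ print $1; exit }'
}

# Real usage (needs a cluster):
#   before=$(hadoop fs -du -s /user/hadoop/.Trash | trash_bytes)
#   hadoop fs -expunge
#   after=$(hadoop fs -du -s /user/hadoop/.Trash | trash_bytes)
# Canned du-style line for illustration:
echo '52428800 /user/hadoop/.Trash' | trash_bytes
```

If the after value hasn’t dropped, the expunge hasn’t done its job and you’re back to -skipTrash deletes.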
HDFS Commands to find what is using disk space
Sometimes the key thing is to find out where that disk space is being eaten up. Say hello to du (disk usage):
hadoop fs -dus /
Which gives you the total size of the data under that path (-dus is the older shorthand for -du -s). Then dig deeper with ls:
hadoop fs -ls /
Which gives you the directories in root. Use du to size them, find unexpected space, and delete using rm or rmr as required.
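Putting those steps together, here’s a sketch for ranking directories by usage so the space hog floats to the top – the biggest_dirs helper and the sample byte counts are made up for illustration:

```shell
# `hadoop fs -du <path>` prints "<bytes> <path>" per child directory;
# sort numerically, largest first, for a quick usage ranking.
# Real usage:  hadoop fs -du / | biggest_dirs
biggest_dirs() {
    sort -rn | awk '{ printf "%s\t%s\n", $1, $2 }'
}

# Canned du-style output for illustration (real numbers come from HDFS):
printf '1024 /apps\n1073741824 /user\n52428800 /tmp\n' | biggest_dirs
```

Whatever sits at the top of that list is where to repeat the du/ls dance, then rm or rmr the culprit.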
The full set of file system shell commands is listed in the Hadoop FileSystem Shell guide.