benji shine

handy shell tools for finding large files

Heard over and over in development shops everywhere: "We're out of disk space! Who is the spacehog?" "Not me! It must be your project!" "Let's delete temp files | log files | core dumps | stuff that looks old." I will spare you the lecture on the heartbreak of irreplaceable data loss, and instead I provide a few one-line shell goodies to identify where the disk space is going, with human-readable text reports sufficient for mailing to all your co-workers.
The classic command for analyzing disk usage is
$ du -k
which will print something like this
32 Documents/Standards/sac-1.3/doc/org/w3c/css/sac/helpers
568 Documents/Standards/sac-1.3/doc/org/w3c/css/sac
568 Documents/Standards/sac-1.3/doc/org/w3c/css
568 Documents/Standards/sac-1.3/doc/org/w3c
568 Documents/Standards/sac-1.3/doc/org
That lists the size in kilobytes, followed by the directory name. Output like this quickly gets unreadable. We can apply some concepts of information visualization to improve it. Let's put the most important stuff at the end, by adding a sort command:
$ du -k Documents | sort -n
The last few lines of the output list the biggest directories and their sizes in kilobytes:
80180 Documents/Reference/docs/api/java
82924 Documents/Reference/docs/api/javax
110788 Documents/Speed Download 4
205708 Documents/Reference/docs/api
251800 Documents/Reference/docs
254668 Documents/Reference
434216 Documents
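To see what sort -n is actually doing, here is a small sketch using three lines from the listing above, deliberately shuffled. The helper name sample_du_lines is made up for this example. sort -n orders by the leading number only, so a name with spaces in it (like Speed Download 4) does not confuse it.

```shell
# Hypothetical helper: prints three lines from the du -k listing, out of order.
sample_du_lines() {
  printf '%s\n' \
    '205708 Documents/Reference/docs/api' \
    '80180 Documents/Reference/docs/api/java' \
    '110788 Documents/Speed Download 4'
}

# sort -n compares the leading number; the rest of each line rides along.
sample_du_lines | sort -n
# 80180 Documents/Reference/docs/api/java
# 110788 Documents/Speed Download 4
# 205708 Documents/Reference/docs/api
```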
Comparing six-digit numbers at a glance requires brain work. To make it easier, get human-readable output from du by replacing the -k flag with -h. Now a line of output looks like this:
4.1M Documents/Standards
That breaks our sort, though: sort -n compares only the leading number and ignores the unit suffix, so it ranks 2M below 4K. Wrong. Let's just throw out any du output smaller than 1 MB. I do that by piping the output through sed and deleting every line whose size ends in K or B. I also want to limit how deep we descend into directories, since a directory's size already includes the sizes of its children. On the Mac, pass in a -d depth flag; on Linux, use --max-depth=depth.
$ du -h -d 3 . | sort -n | sed -e '/^ *[0-9.]*[KB]/d'
Then to get just the highlights, pipe that through a tail command, to select just the last 30 or so big guys:
$ du -h -d 3 . | sort -n | sed -e '/^ *[0-9.]*[KB]/d' | tail -n 30
But wait, this is kind of stupid; I'm asking sort to sort a whole lot of stuff, then promptly throwing out most of the sorted things. Let's switch the order of the sed and the sort, which will make the sort smaller and faster.
$ du -h -d 3 . | sed -e '/^ *[0-9.]*[KB]/d' | sort -n | tail -n 30
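One wrinkle the pipeline above leaves open: once sizes reach gigabytes, sort -n will rank 1.5G below 200M, for the same reason it ranked 2M below 4K. A possible workaround, sketched here as an alternative to the sed approach, is to sort while the sizes are still plain kilobytes and convert to readable units only at the very end. The names humanize_kb and big_dirs are made up for this example, and it assumes a du that accepts -d (BSD du does, and so do recent GNU versions).

```shell
# Hypothetical helper: reads du -k style lines ("KILOBYTES<TAB>PATH") on stdin
# and rewrites each size with a K, M, or G suffix. Line order is left alone.
humanize_kb() {
  awk -F '\t' '{
    size = $1; unit = "K"
    if (size >= 1024) { size /= 1024; unit = "M" }
    if (size >= 1024) { size /= 1024; unit = "G" }
    printf "%.1f%s\t%s\n", size, unit, substr($0, length($1) + 2)
  }'
}

# Hypothetical wrapper: big_dirs [directory] [depth]
# Drops everything under 1 MB (1024 KB), sorts numerically on kilobytes,
# keeps the 30 largest, and only then makes the sizes readable.
big_dirs() {
  du -k -d "${2:-3}" "${1:-.}" |
    awk '$1 >= 1024' |
    sort -n |
    tail -n 30 |
    humanize_kb
}
```

Call it as big_dirs ~/Documents, or big_dirs . 2 for a shallower report. If your sort has the -h flag (GNU sort does), sort -h on du -h output is another way to get the same ordering.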
Props to Unix Power Tools and Jeffrey Friedl's Mastering Regular Expressions. We're just mortals, here, folks, but we're living in a well-documented world.
On the Mac, for an easier way to do this, try OmniDiskSweeper.