Thursday, April 30, 2015

Mac OS / Linux : Finding all the files that meet some criteria and returning an escaped list of the paths, sorted by length

This is another post "for the next poor bastard."

One of the side effects of running AnimEigo is that I end up with a lot of video on hard disc -- terabytes and terabytes of it. This is amusing when you consider my first hard drive cost $5000, stored 20MB, and shook the table when the disc heads moved.

When projects are finished, we archive all the video and project files in case we need them later. Just copying the files onto an archive disc would be easy and fast, but much of the original source material is uncompressed audio and video, so compressing these files before archiving saves a few bucks, frees up some hard drives for reuse, and lets us replicate the files on multiple drives as insurance.

The Mac Finder has a built-in feature that lets you compress files and folders, but it occurred to me recently to check whether there were better options.

After doing a little research, I settled on using pbzip2, the multicore implementation of bzip2, which seems to do a good job of compressing uncompressed video files -- often down to 20-25% of the original size. If you're using a Mac, the easiest way to install it is with the Fink package manager.

As pbzip2 is a command-line tool, you invoke it using the Terminal app, by typing something like this:

pbzip2 -v "path to the first file you want to compress" "path to the next file" ...

and pbzip2 happily goes off and (slowly) compresses the files for you. All fine and good, and you can just type the pbzip2 -v part and then drag files in from a folder window to enter the paths.

However, because I'm lazy and thus willing to spend many hours automating things to save myself a few seconds of drudgery, I started playing around in the default Bash shell that Terminal provides; it had been a while since I'd done more than trivial things in it and a refresher couldn't hurt.

The basic philosophy of Unix shell tools is "lots of little tools that do a small number of things well that you can hook up to do something complicated". You send the output of one tool into the next tool using a pipe, represented by the | character. Here's the command sequence I came up with:
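If pipes are new to you, here is about the smallest example I can think of, with made-up numbers standing in for real data: printf emits three lines, sort -n orders them numerically, and head keeps the first.

```shell
# Three small tools chained with pipes. The numbers are just sample input.
smallest=$(printf '%s\n' 30 4 200 | sort -n | head -n 1)
echo "$smallest"
```

Each tool only sees the text coming out of the one before it, which is why the order of the stages matters.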

find . \( -iname '*.aiff' -or -iname '*.aif' -or -iname '*.wav' -or \( -size +500000 -iname '*.mov' -not -iname '*ProRes*' -not -iname '*H264*' \) \) -print0 2> /dev/null | xargs -0 du -s | sort -n | cut -f 2 | while IFS= read -r line; do printf "%q " "$line" ; done ; echo

Here's what it does. The first part invokes the find command; this finds any file whose name ends in .aiff, .aif, .wav or .mov (-iname ignores case), with the added restriction that .mov files must be bigger than 500000 blocks and have a name that doesn't include the strings ProRes or H264; this eliminates most if not all of the compressed video files. Note that find counts -size in 512-byte blocks when you give it a bare number, so +500000 means "more than about 256MB"; a 5GB cutoff would be -size +9765625. The -print0 option says to separate the output file paths with a nul character instead of a linefeed (needed so the next tool doesn't get confused by spaces in filenames), and the 2> /dev/null redirects any error messages to the great bit bucket in the sky.
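To see how the grouping and the name tests behave without touching real files, something like the following sketch can be run in a scratch directory. The file names are invented for the demo, and the -size test is left out so that empty files will do.

```shell
# Scratch directory with hypothetical file names, so nothing real is touched.
demo=$(mktemp -d)
touch "$demo/theme song.wav" "$demo/notes.txt" \
      "$demo/Clip ProRes.mov" "$demo/raw capture.mov"

# The escaped parentheses group the alternatives. Inside the inner group the
# two tests are ANDed: name ends in .mov AND name does not contain ProRes.
matches=$(find "$demo" \( -iname '*.wav' -or \
    \( -iname '*.mov' -not -iname '*ProRes*' \) \) -print 2> /dev/null)
echo "$matches"

rm -rf "$demo"
```

Only the .wav file and the .mov without ProRes in its name should be listed; notes.txt and the ProRes clip are skipped.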

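The paths get processed by xargs, which is a tool that reads items from its input and hands them to another tool as arguments, as many at a time as will fit on a command line. The -0 means the items are separated by nul characters rather than whitespace, and the tool it runs is du -s (disk usage), with the file paths as its arguments.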
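Here is a small sketch of the find-to-xargs handoff on its own, again with invented file names. The point is that paths containing spaces arrive at du in one piece because both sides agree on nul as the separator.

```shell
# Two hypothetical files with spaces in their names.
demo=$(mktemp -d)
printf 'x' > "$demo/small file.wav"
head -c 200000 /dev/zero > "$demo/big file.wav"

# find emits nul-separated paths; xargs -0 splits on nul and passes each
# path to du as a single argument, spaces and all.
sizes=$(find "$demo" -iname '*.wav' -print0 | xargs -0 du -s)
echo "$sizes"

rm -rf "$demo"
```

The block counts you see depend on your system (du does not use the same block size everywhere), so only the shape of the output matters here: a number, a tab, and the path.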

The output of that is a set of lines, each containing the size of the file in disk blocks plus the path, separated by a tab. This gets piped into the sort tool, which is told to sort the lines by their numeric value by using the -n flag. I want them in this order so pbzip2 can compress the smallest files first, freeing up space for the larger ones; often an archive drive will be almost full when I start to compress it.

Next the cut tool is used to extract the second field, which gets us back our list of paths, now sorted smallest to largest.
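The sort and cut steps are easy to try on a few made-up du-style lines. The block counts and paths below are hypothetical; what matters is that each line is a number, a tab, then a path.

```shell
# Hypothetical du-style lines: block count, tab, path.
sample=$(printf '%s\t%s\n' 1200 './b take.wav' 88 './a.aif' 9 './c.wav')

# Without -n, sort compares character by character, so 1200 lands before 88.
plain=$(printf '%s\n' "$sample" | sort)
# With -n, the leading number is compared as a number.
numeric=$(printf '%s\n' "$sample" | sort -n)

# cut splits on tabs by default; field 2 is the path, spaces included.
paths=$(printf '%s\n' "$numeric" | cut -f 2)
echo "$paths"
```

Because cut only splits on the tab, a path with spaces in it comes through whole.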

Finally, I need to put all these paths on a single line, separated by spaces, and properly escaped (spaces changed to "\ ", for example). There is a printf (print formatted) tool for this, but the "%q" formatting code that does the escaping is not implemented in the standalone MacOS version of printf (bitch moan bitch moan). However, printf is also implemented as a built-in command in the Bash shell, and that version does implement "%q", so a little inline shell script will do what I need - it reads each line, prints it out escaped with a space after it, and then the echo at the end finishes the line with a newline.

The final result is a single long line containing all the file paths, which admittedly looks like crap but I can just copy it, type in pbzip2 -v (or any other compression command) and paste it in. Actually, given how pbzip2 spawns multiple threads and can chew up a lot of your cpu resources, you probably want to do something like nice -n 5 pbzip2 -v to make it a bit more polite.
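To see what the escaping step produces, here is the loop on its own with two made-up paths. It is run explicitly through bash -c so that Bash's built-in printf is the one doing the work, whatever shell you happen to be typing in.

```shell
# Feed two hypothetical paths, one with a space, through the escaping loop.
# bash -c guarantees the Bash builtin printf, which understands %q.
escaped=$(printf '%s\n' './a.aif' './b take.wav' |
  bash -c 'while IFS= read -r line; do printf "%q " "$line"; done; echo')
echo "$escaped"
```

The space inside the second path comes out as a backslash followed by a space, so the shell will treat the pasted path as one argument.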

This won't work if the filename has really weird characters in it, like newlines (du, sort, cut and read all work line by line), but that isn't a problem for me.
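If you want to see the limitation for yourself, this sketch creates one file with a newline in its (invented) name and counts how many lines a line-oriented tool thinks it is looking at.

```shell
# One hypothetical file whose name contains a newline.
demo=$(mktemp -d)
name=$(printf 'odd\nname.wav')
touch "$demo/$name"

# find reports one file, but anything that reads line by line sees two lines.
lines=$(find "$demo" -iname '*.wav' -print | wc -l | tr -d ' ')
echo "$lines"

rm -rf "$demo"
```

A single file shows up as two lines, and every later stage of the pipeline would treat it as two separate, nonexistent paths.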

Let me end with a big shoutout to all the contributors to the many postings on Stack Overflow that helped me find the right tools and combinations.

PS: I later stumbled upon this excellent comparison of various compression tools which includes an efficiency/time tradeoff chart. Of course, depending on what you are compressing, your mileage may vary!