It’s a kind of strange feeling, but while half of the IT world seems to be either burning already (or trembling with fear), I can choose freely whatever topic I want to write about this month. I haven’t had a Windows box for almost a decade now, and the people I work with or keep in contact with are also mostly *nix only. So this post is not about encryption or ransomware at all. It is about useful, respectable compression. Or more precisely: the art of re-compressing already compressed data!
In January, Precomp, a precompression utility, was open-sourced! The first two sections tell a bit about how I became interested in this topic and in Precomp. Skip them if you don’t want to read that kind of stuff.
Compressing compressed data?
When I was young and new to PCs, I once tried to compress a ZIP archive with ACE (a lesser known archiver that once was comparable to the more popular RAR). I knew that ACE offered stronger compression and so I thought that this should make the file smaller. Just imagine my surprise when it turned out that I was wrong!
I guess that most of us have a story like that to tell, a story from our childhood when compression was nothing short of magic. Later I began to understand that even though it does in fact start with “m”, it’s not magic but math (a subject that I totally sucked at in school, but fortunately I grasped enough to get a rough idea of how compression works ;)). Now there was no surprise anymore: a compressor works by removing redundancy, so its output looks nearly random and leaves little for a second compressor to exploit. Compressed data is therefore not a good fit for any other general-purpose compression method, even if it was compressed with a weak algorithm.
How to work around that? Well, decompressing the ZIP file and creating a new ACE archive does the trick in the case mentioned above. Of course things are not always that straightforward. If they were, I wouldn’t really have much to write about right now and this post would be really, really short!
For whatever reason, compression continued to fascinate me and I loved compressing things to sizes as tiny as possible. It was fun to try out new experimental compression programs specialized for specific types of files. I did that for years, until I had to stop due to a lack of time.
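To make the ZIP-inside-ACE effect concrete, here is a minimal Python sketch. It is only an illustration under assumptions: `zlib` at level 1 stands in for the "weak" ZIP compression and `lzma` for the "stronger" archiver (ACE itself is not available in Python), and the sample text is synthetic.

```python
import lzma
import random
import zlib

# A small vocabulary; random word sequences give text-like, compressible data.
WORDS = ("archive stream header packet window buffer dictionary literal "
         "match offset length symbol entropy block checksum payload").split()

def sample_text(n_words: int = 50_000, seed: int = 0) -> bytes:
    """Build reproducible pseudo-text from the vocabulary above."""
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDS) for _ in range(n_words)).encode("ascii")

original = sample_text()
weak = zlib.compress(original, 1)        # stand-in for the weak ZIP compression
strong_on_weak = lzma.compress(weak)     # strong compressor on compressed data
strong_on_raw = lzma.compress(original)  # strong compressor on the raw data

for label, blob in [("original", original),
                    ("weak (zlib -1)", weak),
                    ("strong on weak output", strong_on_weak),
                    ("strong on raw data", strong_on_raw)]:
    print(f"{label:>22}: {len(blob)} bytes")
```

On data like this, the strong compressor typically gains next to nothing when fed the already compressed bytes, while it does noticeably better on the raw input. That is exactly why decompressing first pays off.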
Games
Let’s fast forward some years from that failed compression experiment with ACE; I had replaced DOS 6.22 with Win95 which I had replaced with Win98 (SE) that I had replaced with WinME, … One day I wanted to install Quake ]|[ Arena (yes, friends, I once was 1337 young enough to spell it like that!) on my main computer to get into it again for a LAN party the following weekend. So I went looking for the darn CD. It took me a while but I finally found the CD case. I opened it up and… the CD itself was missing. Oh great! Since I didn’t feel like looking into all the other cases to find out into which I might have put it accidentally, I decided to just copy it off an older computer which had it already installed (ID were nice people. I don’t remember which version of Q3A it was, but there eventually was an official patch which also removed the CD check for the game so there was no need for a crack or anything).
Now, different versions of Windows didn’t always play together too well on the LAN and since my Quake installation was on a computer with an older Windows (and I didn’t have another cable at hand), I decided that I’d just burn it to CD. It turned out, however, that the other machine didn’t just have vanilla Q3A installed but the expansion set as well. Together it was obviously too big to fit on one CD. There would have been easy solutions: Leave out the resource files for the expansion, burn two CDs, put the hard drive into the new computer, … Sure, easy solutions are nice and all. But sometimes they are also boring! And when you’re young and have some free time, you don’t do boring stuff. So of course I opted for the more challenging solution: Get it all on one CD!
Quake 3’s resource containers go by the file extension of .pk3 and, more importantly, are in fact ZIP files without any compression. This meant that they could be compressed well because there was no ZIP compression getting in the way. But guess what: Even after applying the most extreme compression programs, the result simply would not fit onto one CD…
Bad luck, eh? Well, not really. Unpacking the container files was in fact the solution in this case. Not because of weak compression but because it enabled me to test each of the contained files separately with all compressors and to group together the files that compressed best with one compression utility or another! I think that I was able to shrink it down almost as much as needed, ending up just a couple of megs over the CD limit. There were blank CDs with 800 MB capacity as well, so it would have fit onto one of those, but I didn’t have any. So I replaced the ID video with an empty video file and I was set.
Since I liked doing these things, I began doing backups like that for a lot of my favorite games: ripping apart (and later rebuilding) resource containers, converting between file formats, decompressing whatever could be decompressed before applying stronger compression, etc.
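The "test every file with every compressor and group by winner" idea can be sketched roughly like this. It is an illustration only: the three standard-library codecs stand in for whatever compression programs were used back then, and the names `COMPRESSORS`, `best_compressor` and `group_by_best` are made up for this example.

```python
import bz2
import lzma
import zlib
from typing import Callable, Dict, List, Tuple

# Stand-ins for "all compressors"; each maps raw bytes to compressed bytes.
COMPRESSORS: Dict[str, Callable[[bytes], bytes]] = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data, preset=9),
}

def best_compressor(data: bytes) -> Tuple[str, int]:
    """Try every compressor and return (name, size) of the smallest result."""
    sizes = {name: len(fn(data)) for name, fn in COMPRESSORS.items()}
    name = min(sizes, key=sizes.get)
    return name, sizes[name]

def group_by_best(files: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Group file names by the compressor that shrinks each file the most."""
    groups: Dict[str, List[str]] = {name: [] for name in COMPRESSORS}
    for filename, data in files.items():
        name, _ = best_compressor(data)
        groups[name].append(filename)
    return groups

if __name__ == "__main__":
    # Hypothetical unpacked container contents, held in memory for the demo.
    unpacked = {
        "readme.txt": b"rocket launcher, rail gun, plasma gun. " * 400,
        "table.bin": bytes(range(256)) * 40,
    }
    for name, members in group_by_best(unpacked).items():
        print(name, members)
```

Each group would then go into its own archive, created with the tool that won for those files.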
How Precomp works
The more I got into free and open source things, the more I wondered if some of them wouldn’t benefit from better compression. A friend and former classmate of mine invented Precomp and I of course was among the first to make use of it and provide feedback. But what is Precomp?
Precomp is what the name says: A pre-compressor. It is not directly meant to reduce the size of files. On the contrary: It can make some files even bigger than the original input. But that’s a good thing really! How’s that? Well, it’s meant to prepare files for compression so that eventually these files can be compressed to a smaller size than the original file could – without losing data of course!
What Precomp does is look for streams in its input file that are compressed with a compression method known to Precomp. It then decompresses each stream and recompresses the result, so that the recompressed stream can be compared with the original one. If they are identical, Precomp will write the decompressed stream (plus how to recompress it properly) to its output file.
While this sounds quite simple in theory, it is in fact a bit more complex. The reason for that lies in the flexibility of some compression algorithms. Have you ever zipped up a file? Then you know that there are a lot of parameters that you can provide which affect how the file will be compressed: “fast”, “normal”, “strong” or “maximum” compression? What about the dictionary size? A lot of things like that. Any combination of compression parameters will result in a valid zip stream that can be decompressed by any zip-compatible utility. Replacing such a stream with a compatible one is fairly easy. Reproducing the exact, bit-for-bit identical stream is not.
To be truly lossless, Precomp uses trial and error on each stream. If it can figure out the combination of parameters that results in the original stream: Great! If not, that stream has to be left untouched.
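The detect, decompress, recompress and compare loop can be sketched in a few lines of Python. This is not Precomp's actual code, just a simplified illustration under assumptions: it only looks for zlib streams, it recognizes them with a simple check of the two-byte zlib header, and it tries a single, fixed compression level. The function name `find_zlib_streams` is made up for this example.

```python
import zlib
from typing import Iterator, Tuple

def find_zlib_streams(blob: bytes, level: int = 6) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (offset, compressed_length, raw_data) for every zlib stream in
    `blob` that recompresses to exactly the same bytes at `level`."""
    pos = 0
    while pos < len(blob) - 1:
        # zlib header: method 8 (deflate) in the low nibble of the first byte,
        # and the two header bytes read as a big-endian number divide by 31.
        looks_like_zlib = (blob[pos] & 0x0F == 8
                           and ((blob[pos] << 8) | blob[pos + 1]) % 31 == 0)
        if looks_like_zlib:
            decomp = zlib.decompressobj()
            try:
                raw = decomp.decompress(blob[pos:])
            except zlib.error:
                raw = None  # false positive: not a valid stream after all
            if raw is not None and decomp.eof:
                used = len(blob) - pos - len(decomp.unused_data)
                # Only accept the stream if recompression is bit-identical.
                if zlib.compress(raw, level) == blob[pos:pos + used]:
                    yield pos, used, raw
                    pos += used
                    continue
        pos += 1

if __name__ == "__main__":
    payload = b"precompression demo " * 200
    container = b"\x00\x01\x02\x03" + zlib.compress(payload, 6) + b"\xff\xfe"
    for offset, length, raw in find_zlib_streams(container):
        print(f"stream at offset {offset}: {length} -> {len(raw)} bytes")
```

A real precompressor would write the raw data plus the recompression settings to its output, and copy everything it could not reproduce unchanged.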
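As a rough illustration of that trial-and-error search, the following Python sketch tries combinations of zlib's compression level and memory level until one reproduces a given stream bit for bit. It is a simplification under assumptions: the window size is fixed at 15 bits and the strategy at its default, the real search in Precomp may look quite different, and `guess_parameters` is a name made up for this example.

```python
import itertools
import zlib
from typing import Optional, Tuple

def guess_parameters(stream: bytes) -> Optional[Tuple[bytes, Tuple[int, int]]]:
    """Return (raw_data, (level, mem_level)) if some parameter combination
    recreates `stream` exactly, or None if the stream must be left untouched."""
    raw = zlib.decompress(stream)
    for level, mem_level in itertools.product(range(1, 10), range(1, 10)):
        # Assumption: 15-bit window, default strategy.
        comp = zlib.compressobj(level, zlib.DEFLATED, 15, mem_level)
        if comp.compress(raw) + comp.flush() == stream:
            return raw, (level, mem_level)
    return None

def restore(raw: bytes, params: Tuple[int, int]) -> bytes:
    """Recreate the original stream from raw data and the stored parameters."""
    level, mem_level = params
    comp = zlib.compressobj(level, zlib.DEFLATED, 15, mem_level)
    return comp.compress(raw) + comp.flush()

if __name__ == "__main__":
    data = b"trial and error, trial and error. " * 300
    maker = zlib.compressobj(9, zlib.DEFLATED, 15, 9)
    original_stream = maker.compress(data) + maker.flush()
    found = guess_parameters(original_stream)
    if found is not None:
        raw, params = found
        print("reproducible:", restore(raw, params) == original_stream)
```

Several parameter combinations can produce the same bytes, so the search simply keeps the first one that matches. Only the raw data and that combination need to be stored to get the original stream back later.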
What Precomp can do
Early versions of Precomp were only available on Windows but there have been Linux versions for quite a while as well. I also use it on FreeBSD without any problems. The .PCF files are platform-independent. You can restore the original file on Windows from a file precompressed on Linux or BSD and vice versa.
While Precomp originally was only a pre-compressor for zlib streams (which are used in a variety of file formats like ZIP, GZIP, PNG, PDF, …), it can do more things now. It can use bzip2 to compress its input file after precompression. It can losslessly compress some JPEG pictures to smaller sizes (thanks to an external library). And in the current development version there’s even support for compressing MP3 music files further (also using an external lib)!
Currently, Precomp relies on temporary files for all the extracted streams and thus puts heavy load on your hard drive (and is a bit slow due to that bottleneck). SSDs obviously perform better, but it totally makes sense to use a memdrive if you can spare some RAM for it. I’ve forked the project on GitHub and added an experimental shell script to assist with the creation of such a memdrive. It’s currently FreeBSD only (I’ve migrated all of my boxes to *BSD and currently have no Linux machine remaining, but I will set one up for cases like that some time in the future). Feel free to take a look at it if you’re into portable shell scripting and please do tell me if you have any suggestions!
Precomp is not at all at the limit of its possibilities. There are a lot of things that can be tweaked, optimized or added. If you feel like that could be a fun project, go ahead and play with it; it’s on GitHub. Or perhaps you have an idea what this could be useful for? Please help yourself and use it. It’s free software after all (Apache licensed).