Zip your zipper ^_^
Today I encountered a problem where I needed to transfer huge amounts of data (we are talking about more than 3Gb of numbers…) to my other computer. Naturally USB-sticks were nowhere to be found. For many different reasons, my only option was to transfer the data files (there were really many of them) one by one.
But hey, we have the zipping tool in every Windows, so I though it wasn’t so big deal as I could form zipped objects that included many files. (Just like if you have many files and your email wants you to attach them one by one, zip them into one zipped folder, and the email will deal with the zipped folder as one file).
That was what I thought. First I tested the zipping by zipping couple of the .csv-files into one zipped folder. It worked great: the size of my data shrank almost into nonexistent. Then I started to zip many files in same time into a one zipped folder. The combined size didn’t shrink. It grew.
I have to confess that I have no idea what the zipping-tool does in Windows but this incident inspired me to write about the zipping-method in general. Or…ummm…. Maybe I’ll just take the easiest of the methods as I’m lazy and the others need quite much math :P
This easy method is called LZ77-algorithm and it is still used as it works quite fine. Other compressing methods are (to my knowledge) just extended versions of this.
In this method, the data consist of two types of symbols. First type is just the symbol: any regular character. With this type you can write normal text. The second type consists of two numbers: the first number tells us where to start in the text already written, the second numbers tells us how many from that position onwards are the same. Like this:
Okay, to be honest, it probably looks a bit different in reality, and quite often the start/place number tells where the last number is. Anyhow, the method is quite simple, implementation is just one little thing :)
And there IS one method that doesn’t base on the previous one: Run-length encoding. It does pretty much this:
Aaahh, now I just have to talk also about Huffman-coding :) Not going into details but in this one, the idea is to give the most common character the shortest (==the most space efficient) bit string, the next most common gets the second shortest string and so on. This way the long bit strings are encoded in more space efficient way, we just use different set of symbols. We search the most efficient set of symbols to represent our data. Like if there is no letter A, you really don’t need that letter in your symbols.
One important thing: because of the methods how the new strings are made, we absolutely can NOT start reading the new compressed text from just any point. Zipped things must be read from the beginning. Always. (Unless there has been done some magic ^^).