The need to explore a compression algorithm has struck again. After playing with LZSS (LZ77), I thought LZW (LZ78) was something I should eventually get around to studying. I've played with several other things since having that thought, but it finally took hold and became the thing that consumed my "free time".

As with my other compression implementations, my intent is to publish an easy to follow ANSI C implementation of the Lempel-Ziv-Welch (LZW) encoding/decoding algorithm. Anyone familiar with ANSI C and LZW or LZ78 should be able to follow and learn from my implementation. I'm sure that there's room for improvement of compression ratios, speed, and memory usage, but this project is about learning and sharing, not perfection.

Click here for a link to my LZW source. The rest of this page discusses LZW and my implementation.


Algorithm Overview

Like its predecessor LZSS (LZ77), the Lempel-Ziv-Welch algorithm uses a dynamically generated dictionary and encodes strings by a reference to the dictionary. It is intended that the dictionary reference should be shorter than the string it replaces. As you will see, LZW achieves its goal for all strings longer than one character.

Encoding

The LZSS algorithm uses a sliding window dictionary. Each entry in the dictionary is a character. LZSS code words consist of an offset into the dictionary and the number of characters following the offset to include in an encoded string.

Unlike LZSS, entries in the LZW dictionary are strings, and every LZW code word is a reference to a string in the dictionary.

Okay, so where does the dictionary come from, and why can't I find an entry for my whole file in it?

The Dictionary and Encoded Strings

The LZW dictionary is not an external dictionary that lists all known symbol strings. Instead, the dictionary is initialized with an entry for every possible byte. Other strings are added as they are built from the input stream. The code word for a string is simply the next available value at the time it is added to the dictionary.

An "encoded" string is used to add new strings to the dictionary. The encoded string is built from the input stream, which is read one byte at a time. If the string formed by concatenating the new byte to the encoded string is in the dictionary, the new byte is appended to the end of the encoded string. Otherwise, a dictionary entry is made for the new string, and the code word for the encoded string is written to the output stream. Then the byte that was just read becomes the encoded string.

The Basic Encoding Algorithm

Based on the discussion above, encoding input consists of the following steps:
Step 1. Initialize dictionary to contain one entry for each byte. Initialize the encoded string with the first byte of the input stream.
Step 2. Read the next byte from the input stream.
Step 3. If the byte is an EOF goto step 6.
Step 4. If concatenating the byte to the encoded string produces a string that is in the dictionary:

  • concatenate the byte to the encoded string.
  • go to step 2.

Step 5. If concatenating the byte to the encoded string produces a string that is not in the dictionary:

  • add the new string to the dictionary.
  • write the code for the encoded string to the output stream.
  • set the encoded string equal to the new byte.
  • go to step 2.

Step 6. Write out code for encoded string and exit.

Example 1: The string "this_is_his_thing" is encoded as follows:

New Byte | Encoded String | New Code  | Code Output
t        | t              | None      | None
h        | h              | 256 (th)  | t
i        | i              | 257 (hi)  | h
s        | s              | 258 (is)  | i
_        | _              | 259 (s_)  | s
i        | i              | 260 (_i)  | _
s        | is             | None      | None
_        | _              | 261 (is_) | 258 (is)
h        | h              | 262 (_h)  | _
i        | hi             | None      | None
s        | s              | 263 (his) | 257 (hi)
_        | s_             | None      | None
t        | t              | 264 (s_t) | 259 (s_)
h        | th             | None      | None
i        | i              | 265 (thi) | 256 (th)
n        | n              | 266 (in)  | i
g        | g              | 267 (ng)  | n
None     | None           | None      | g

In the example above, a 17 character string is represented by 13 code words. Any actual compression that would occur would be based on the size of the code words. In this example code words could be as short as 9 bits. Typically code words are 12 to 16 bits long. Of course the typical input stream is also longer than 17 characters.
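To make the steps concrete, here is a minimal sketch of the encoder in C. It is an illustration, not my actual implementation: it encodes an in-memory buffer into an array of integer code words, caps the dictionary at 12 bit code words like my version 0.1, and uses a brute force dictionary search. LzwEncode, FindString, and the flat dictionary arrays are made-up names for this sketch.

```c
#include <stddef.h>

#define MAX_CODES  4096             /* 12 bit code words, as in version 0.1 */
#define FIRST_CODE 256              /* codes 0-255 are single bytes */

/* dictionary entry: a string is a prefix code word plus one suffix byte */
static int dictPrefix[MAX_CODES];
static unsigned char dictSuffix[MAX_CODES];
static int dictSize;

/* brute force search for (prefix, suffix); returns its code word or -1 */
static int FindString(int prefix, unsigned char suffix)
{
    int i;

    for (i = FIRST_CODE; i < dictSize; i++)
    {
        if (dictPrefix[i] == prefix && dictSuffix[i] == suffix)
        {
            return i;
        }
    }
    return -1;
}

/* Encode inLen bytes from in[] into code words in out[]; returns the
 * number of code words written.  out[] must hold at least inLen codes. */
size_t LzwEncode(const unsigned char *in, size_t inLen, int *out)
{
    size_t i, nCodes = 0;
    int encoded;                    /* code word for the string built so far */

    dictSize = FIRST_CODE;          /* dictionary starts with all 1 byte strings */

    if (inLen == 0)
    {
        return 0;
    }

    encoded = in[0];                /* step 1: encoded string = first byte */

    for (i = 1; i < inLen; i++)     /* steps 2 through 5 */
    {
        int match = FindString(encoded, in[i]);

        if (match != -1)
        {
            encoded = match;        /* string + byte is in the dictionary */
        }
        else
        {
            if (dictSize < MAX_CODES)   /* add string + byte to dictionary */
            {                           /* (full dictionary stops changing) */
                dictPrefix[dictSize] = encoded;
                dictSuffix[dictSize] = in[i];
                dictSize++;
            }
            out[nCodes++] = encoded;    /* write code for the encoded string */
            encoded = in[i];            /* new encoded string = the new byte */
        }
    }

    out[nCodes++] = encoded;        /* step 6: write out the final string */
    return nCodes;
}
```

Running it on "this_is_his_thing" reproduces the 13 code words in the table above.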

Decoding

It shouldn't be a big surprise that LZW data is decoded pretty much the opposite of how it's encoded. The dictionary is initialized so that it contains an entry for each byte. Instead of maintaining an encoded string, the decoding algorithm maintains the last code word and the first character in the string it encodes. New code words are read from the input stream one at a time, and the string encoded by each new code word is output.

During the encoding process, the code word prior to the current one was written out because concatenating the first character of the current string to the string encoded by the prior code word produced a string that wasn't in the dictionary. When that happened, the string formed by the concatenation was added to the dictionary. The same string needs to be added to the dictionary when decoding.

The Basic Decoding Algorithm

Based on the discussion above, decoding input consists of the following steps:
Step 1. Initialize dictionary to contain one entry for each byte.
Step 2. Read the first code word from the input stream and write out the byte it encodes.
Step 3. Read the next code word from the input stream.
Step 4. If the code word is an EOF exit.
Step 5. Write out the string encoded by the code word.
Step 6. Concatenate the first character of the string encoded by the new code word to the string encoded by the previous code word, and add the resulting string to the dictionary.
Step 7. Go to step 3.

Example 2: Decode the stream 't' 'h' 'i' 's' '_' 258 '_' 257 259 256 'i' 'n' 'g' produced by the previous example:

Input Code | Encoded String | New Dictionary Entry | String Output
t          | t              | None                 | t
h          | h              | 256 (th)             | h
i          | i              | 257 (hi)             | i
s          | s              | 258 (is)             | s
_          | _              | 259 (s_)             | _
258        | is             | 260 (_i)             | is
_          | _              | 261 (is_)            | _
257        | hi             | 262 (_h)             | hi
259        | s_             | 263 (his)            | s_
256        | th             | 264 (s_t)            | th
i          | i              | 265 (thi)            | i
n          | n              | 266 (in)             | n
g          | g              | 267 (ng)             | g

The decoded string matches the original string, so I must have done something right. One of my favorite things about LZW is that the decoder doesn't require any additional information from the encoder. There's no need to include the extra information commonly required by statistical algorithms like Huffman coding and arithmetic coding, so compression ratios are never reduced by the cost of that extra data.

Exception to the Rules

Whoever said that there's an exception to every rule must have studied LZW. It turns out that an exception may occur: when decoding certain input streams, the decoder may see a code word that's one larger than anything in its dictionary. Fortunately for me, others have figured out when the exception happens and how to deal with it.

The exception occurs if the dictionary contains an entry for string + character and the input stream is string + character + string + character + string.

When the exception occurs, concatenate the first character of the string encoded by the previous code word to the end of the string encoded by the previous code word. The resulting string is the value of the new code word. Write it to the output stream and add it to the dictionary.

Example 3: The string "abcabcabcabcabcabc" demonstrates an occurrence of the special exception:

New Byte | Encoded String | New Code    | Code Output
a        | a              | None        | None
b        | b              | 256 (ab)    | a
c        | c              | 257 (bc)    | b
a        | a              | 258 (ca)    | c
b        | ab             | None        | None
c        | c              | 259 (abc)   | 256 (ab)
a        | ca             | None        | None
b        | b              | 260 (cab)   | 258 (ca)
c        | bc             | None        | None
a        | a              | 261 (bca)   | 257 (bc)
b        | ab             | None        | None
c        | abc            | None        | None
a        | a              | 262 (abca)  | 259 (abc)
b        | ab             | None        | None
c        | abc            | None        | None
a        | abca           | None        | None
b        | b              | 263 (abcab) | 262 (abca)
c        | bc             | None        | None
None     | None           | None        | 257 (bc)

Wow! That's a lot of work to force an exception to occur. The results still need to be decoded in order to witness the exception in action.

Input Code | Encoded String | New Dictionary Entry | String Output
a          | a              | None                 | a
b          | b              | 256 (ab)             | b
c          | c              | 257 (bc)             | c
256        | ab             | 258 (ca)             | ab
258        | ca             | 259 (abc)            | ca
257        | bc             | 260 (cab)            | bc
259        | abc            | 261 (bca)            | abc
262        | abca           | 262 (abca)           | abca
NOTE: The step above demonstrates the exception and how it is handled. Code word 262 is not yet in the dictionary, so the new encoded string = old encoded string + the first character of the old encoded string = abc + a = abca.
257        | bc             | 263 (abcab)          | bc

There you have it, the exception and how it's handled. It can be made to occur with a shorter pattern of repeated 2 character strings like "ababababababab", but that doesn't make as nice of an example.
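Putting the decoding steps and the exception handling together, a decoder might be sketched in C as follows. Again, this is an illustration rather than my implementation: it decodes an array of integer code words held in memory, and LzwDecode, ExpandCode, and the flat dictionary arrays are made-up names.

```c
#include <stddef.h>

#define MAX_CODES  4096             /* assumes 12 bit code words */
#define FIRST_CODE 256              /* codes 0-255 are single bytes */

/* dictionary entry: a string is a prefix code word plus one suffix byte */
static int dPrefix[MAX_CODES];
static unsigned char dSuffix[MAX_CODES];

/* recursively expand code into out[]; returns the string's length */
static size_t ExpandCode(int code, unsigned char *out)
{
    size_t len;

    if (code < FIRST_CODE)
    {
        out[0] = (unsigned char)code;           /* 1 byte string */
        return 1;
    }
    len = ExpandCode(dPrefix[code], out);       /* decode the prefix first */
    out[len] = dSuffix[code];
    return len + 1;
}

/* Decode nCodes code words into out[]; returns the number of bytes
 * written.  out[] must be large enough for the decoded data. */
size_t LzwDecode(const int *codes, size_t nCodes, unsigned char *out)
{
    size_t i, outLen = 0;
    int dictSize = FIRST_CODE;
    int prevCode;

    if (nCodes == 0)
    {
        return 0;
    }

    prevCode = codes[0];                        /* always a single byte */
    out[outLen++] = (unsigned char)prevCode;

    for (i = 1; i < nCodes; i++)
    {
        size_t len;

        if (codes[i] < dictSize)                /* normal case */
        {
            len = ExpandCode(codes[i], out + outLen);
        }
        else    /* exception: code word is not in the dictionary yet */
        {
            len = ExpandCode(prevCode, out + outLen);
            out[outLen + len] = out[outLen];    /* old string + its first char */
            len++;
        }

        if (dictSize < MAX_CODES)   /* add previous string + first char of new */
        {
            dPrefix[dictSize] = prevCode;
            dSuffix[dictSize] = out[outLen];
            dictSize++;
        }

        outLen += len;
        prevCode = codes[i];
    }

    return outLen;
}
```

Decoding the code word stream from Example 3 exercises the exception branch when code word 262 arrives before it has been added to the dictionary.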

That's all there is to the LZW algorithm. There are a few more issues to consider when actually implementing the algorithm on a computer. Read on if you care about them.


Implementation Issues

As I stated in the introduction, my intent is to publish an easy to follow ANSI C implementation of the Lempel-Ziv-Welch (LZW) encoding/decoding algorithm. I have also tried to make the code portable. Those goals drove much of this implementation.

Size of Code Word

One of the first things that needs to be decided when implementing LZW is how big of a code word to use. The following items should be considered when choosing a code word size:

  • Code words must be larger than a single symbol of whatever is being encoded.
  • All encoded strings (including length 1 strings) will require a code word to represent them.
  • Larger code words mean more entries may be contained in the dictionary.
  • Consider code word endian/byte order issues if the LZW implementation will be used on different platforms.

I encode strings of bytes, so the size of a code word must be bigger than a byte. That rules out any code words 8 bits or less.

I also don't want the code word to be huge, because even size 1 strings will be written as a code word. Example 1 encodes a 17 byte (136 bit) string into 13 code words. If the code words were only 9 bits long, the encoded data would be 117 bits. However if the code words were 16 bits long, the encoded data would be 208 bits. Typically, a larger data set must be encoded before longer code words produce any compression.

Code word size also has an impact on the maximum number of strings in a dictionary. An n bit code word may have as many as 2^n strings in its dictionary. If you're encoding bytes, the first 256 of those strings will be single bytes.

Larger dictionaries can contain codes for more strings and that typically improves compression. The downside is that the encoding algorithm needs to search the dictionary for matches to the current string. Search time increases with the size of the dictionary.

With so many factors to consider, I ended up using 12 bit code words for my version 0.1 implementation. It's really easy to modify my version 0.1 implementation to use 9 to 15 bits. I settled on 12 bits after trying all values between 9 and 15 on a random set of files between 1K and 128K in size. The test was highly unscientific, but the 12 bit code words had the best average compression results.

I got a little fancier with my version 0.2 implementation. I begin with a 9 bit code word, but allow the bit length to increase up to 15 bits as more code words are generated. It is usually the case that a data stream can be compressed better with code words that start at 9 bits and grow to n bits than with fixed n bit code words. I use the word "usually" because there is a little overhead required to indicate that the code word size has changed (see Indicating Code Word Length).

Feel free to change the code word size for any reason that you might have. If you increase it beyond 15 bits, things will blow up on machines with 16 bit integers. If you increase the code word size to more than 16 bits, you'll need to modify the routines that read and write code words.

Representing Strings in the Dictionary

The dictionary in the Lempel-Ziv-Welch algorithm provides a way of associating strings with code words. Code words are of limited length. However, the LZW algorithm does not impose a limit on the length of strings that are encoded.

One way to represent strings of arbitrary length is by a NULL terminated array. With LZW, it's possible to have thousands of strings thousands of bytes long. As machine memory sizes increase, there may come a time when the memory requirements of NULL terminated strings aren't a big deal. I'm still using a vintage PC, so NULL terminated strings are out of the question.

Fortunately, somebody much smarter than me observed that all strings entered into the dictionary either have a size of 1, or consist of a character appended to a string that is already in the dictionary. The string that's already in the dictionary must also have a code word associated with it. Rather than representing the new string as old string + character, it can be represented as code word for old string + character.

In Example 1 the code 263 represents the string "his". Rather than creating a dictionary entry for code word 263 and "his", an entry can be made for the code word 263, the prefix code 257, and byte 's'. Every dictionary entry consists of the code word, the prefix code, and the last character in the string.

The big advantage to this scheme is that all strings in the dictionary are stored as small, fixed length values. The downside is that you need to traverse the dictionary to determine what string a code word encodes. 263 encodes 257 + 's'. 257 encodes 'h' + 'i'. That's not too bad with 3 character strings, but it's not much fun with 1000 character strings. Still, the benefit of a known fixed length for dictionary entries outweighs the cost of traversing the dictionary.

Encoding Issues

The biggest considerations when implementing the LZW encoding algorithm are:

  • Writing of code words that aren't an integral number of bytes.
  • Indicating the length of a code word.
  • The layout of the dictionary.
  • Finding string matches in the dictionary.
  • Inserting new strings into the dictionary.

Fortunately, writing code words wasn't that big of a deal. I have a bit file library containing functions for writing files one bit at a time. I use the same library for all of my other compression implementations, and it works just fine for LZW.

Indicating Code Word Length

The problem of indicating how long code words are only occurs when variable length code words are used. If fixed length code words are used, there's no question about how many bits the decoder should use when reading the encoded data.

I'm not sure what the common practice is for variable length code words and leaving hints that allow the decoder to recognize a code word length change. My implementation is based on some ideas that kept me up one night, and it works just fine. Two observations led me to my implementation:

  1. Code words may be n bits long until the encoder is required to write a code word that needs n + 1 bits to represent it.
  2. The indication that n + 1 bits are needed must be n bits long, because the decoder is still reading n bits at a time.

With those observations, everything fell into place. When the encoder needs n + 1 bits to write out a code word, it writes an n bit indicator. Then all code words from that indicator until the next indicator will be written using n + 1 bits. Once an indicator signals increased code word size, there is no way to decrease code word size.

For my indicator I use a code word of all ones at the current code word length. The consequence of using an all ones indicator is that the code word represented by that value also requires an extra bit. Suppose the encoder is using 9 bits and it needs to output the code word 511. 511 is 9 bits of all ones. A single 511 will cause the decoder to switch to reading 10 bit code words without decoding code word 511. To get around this, the encoder must write an indicator (511 in this case) to switch to 10 bits, then it must write the code word 511 using 10 bits.
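The growth rule can be captured in a tiny helper function. This is an illustrative sketch (CodeLengthFor is a made-up name, not part of my source): given a code word and the current bit length, it counts how many all ones indicators must be written before the code word itself. It loops, because a single code word can occasionally force the length to grow by two or more bits at once (the bug fixed in version 0.6).

```c
/* Return the bit length needed to write codeWord, counting one all ones
 * indicator for each single bit increase from currentBits.  A code word
 * equal to the all ones value at the current length also forces a switch,
 * because that value is reserved as the indicator itself. */
unsigned CodeLengthFor(unsigned codeWord, unsigned currentBits,
    unsigned *indicators)
{
    *indicators = 0;

    while (codeWord >= ((1u << currentBits) - 1))
    {
        (*indicators)++;        /* write (1 << currentBits) - 1 at currentBits bits */
        currentBits++;          /* all following code words use the longer length */
    }
    return currentBits;
}
```

For the 511 example above, the helper reports one indicator and a new length of 10 bits.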

Version 0.2 of my implementation starts out using 9 bit code words and allows the code word length to grow to 15 bits.

Dictionary Layout

Determining the layout of the dictionary most definitely impacts finding and inserting strings. The dictionary stores strings represented as a code word prefix and a byte suffix.

Initially, the dictionary will contain one entry for every one character string. However, the number of strings in the dictionary may grow as the encoding process proceeds. Fortunately there is an upper bound on the number of strings that may be in the dictionary. If code words are n bits long, there can only be 2^n unique code words.

The only real requirement for an LZW dictionary is that it can store 2^n code word/string pairs. It seemed natural to me to use an array of 2^n code word/string entries. I started out with a dictionary that was an array of 2^n entries, but as I played with the algorithm I realized that all strings 1 byte long are encoded by the value of the byte. It's much faster to just write out the value of the byte than to look it up in the dictionary. So my array based dictionary only stores up to (2^n - 256) strings; single byte strings are handled without using the dictionary.

As natural as an array based dictionary seems, it can be really slow to search when the dictionary fills up (see below). Recently I have replaced the array based dictionary with a binary tree based dictionary to speed up average search time. Anybody who has seen a binary tree should be comfortable with my implementation.

Searching for and Inserting Strings in Dictionary

Though being able to store all the code words is the only real dictionary requirement, the ability to perform fast search and insertion operations is also desirable.

Every time the encoding algorithm reads a byte, it appends it to a string that it is building. The first thing it does with the string is look for it in the dictionary. If the string isn't in the dictionary and the dictionary isn't full, the encoding algorithm will insert the string into the dictionary. There is one dictionary search for each byte in the uncoded data stream, and there will be one insertion for each new code word.

The simplest way to search an array based dictionary is by brute force, from start to finish. Similarly, the simplest way to insert strings into the dictionary is to start at the beginning and keep looking until an empty location is found. These simple brute force approaches are O(N), where N is the number of code words (which is O(2^n), where n is the number of bits in a code word).

It would really be nice to have something faster without adding much computational overhead. Others have successfully used hashing to speed up searches and insertions. The ideal hash function would make searching and insertion O(1) operations. I haven't come up with that function yet. Rather than being creative, I use a simple function to generate a hash key that serves as an initial index into the dictionary. First I shift the string's prefix code left by 8 bits and OR in its final character. That gives me an n + 8 bit number. I use that number modulo the size of the dictionary as my initial key.

key = (codeWord << 8) | lastCharacter;
key = key % dictionarySize;

The first dictionary index I check is the value of the key. Unfortunately, collisions are possible. In the case of searching for a match, a collision occurs when there's a string already located in the dictionary position, but it doesn't match the string that we're looking for. My collision resolution is simple, but no more efficient than a brute force search: I search the following dictionary indices until I've either found a match, searched the entire dictionary, or found an empty dictionary location.

If the dictionary is full, the array based algorithm will take O(N) searches to discover that a string is not in the dictionary. I realized that I could increase the size of the dictionary array, so that it will always include empty entries and stop the search if I arrive at an empty entry, but that just cuts the search time to an average of O(N/2).

Rather than using a larger array, I chose to use a binary tree. It takes an average of O(log(N)) searches to discover that a string isn't in the full dictionary. I'm not making much of an effort to balance my tree, so worst case is still O(N), but I don't think there will be many naturally occurring data streams that produce that result.

In order to keep the binary tree close to balanced, dictionary entries are ordered using a simple key. I append the code for the prefix string to the MS nibble of the new character, then I append the LS nibble of the new character to that.

key = (lastCharacter & 0xF0) << codeWordLength;
key |= (prefix << 4);
key |= (lastCharacter & 0x0F);

It's possible that a future code release may use balanced trees, but that will take motivation and time.

One nice side effect of these methods is that they don't require any additional work to find an insertion location for a new string. Strings are only inserted into the dictionary when a match isn't found and there is room to insert the string. If the array based search algorithm ends because it found an empty dictionary location, there wasn't a match and the dictionary isn't full. The new string and its new code word are inserted into the empty dictionary location where the search algorithm stopped. In the case of the tree based search, when it ends without a match, the new entry is inserted as a child of the leaf where the search ended.
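Because an unsuccessful search ends exactly where the insertion belongs, the tree based search and insert can share one routine. Here is a sketch of that idea; the node layout, MakeKey, and TreeFindOrInsert are illustrative names rather than my actual code, and a 12 bit code word length is assumed for building the key.

```c
#include <stdlib.h>

#define CODE_WORD_BITS 12           /* assumed fixed code word length */

typedef struct node_t
{
    unsigned int key;               /* ordering key built from the string */
    int codeWord;                   /* code word for this string */
    struct node_t *left, *right;
} node_t;

static node_t *treeRoot = NULL;

/* key: MS nibble of the suffix, then the prefix code, then the LS nibble;
 * each (prefix, suffix) pair produces a unique key */
static unsigned int MakeKey(int prefix, unsigned char suffix)
{
    return ((unsigned int)(suffix & 0xF0) << CODE_WORD_BITS) |
           ((unsigned int)prefix << 4) | (suffix & 0x0F);
}

/* Search for the string (prefix, suffix).  If it is found, return its
 * code word.  Otherwise insert it with code word newCode (when newCode
 * != -1) as a child of the leaf where the search ended, and return -1. */
int TreeFindOrInsert(int prefix, unsigned char suffix, int newCode)
{
    unsigned int key = MakeKey(prefix, suffix);
    node_t **link = &treeRoot;
    node_t *n;

    while (*link != NULL)
    {
        if (key == (*link)->key)
        {
            return (*link)->codeWord;       /* match found */
        }
        link = (key < (*link)->key) ? &(*link)->left : &(*link)->right;
    }

    if (newCode != -1)
    {
        n = malloc(sizeof(node_t));
        if (n != NULL)
        {
            n->key = key;
            n->codeWord = newCode;
            n->left = n->right = NULL;
            *link = n;                      /* insert where the search ended */
        }
    }
    return -1;
}
```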

Idea for Super-fast Search/Insertion

The whole dictionary search and insertion process may be trivialized by increasing the size of the dictionary so that it has one index for every possible code word prefix/byte suffix string. If code words are n bits long, then there are 2^(n + 8) different possible strings, no more than 2^n of which will actually be used. The dictionary can then be implemented as an array of 2^(n + 8) entries, where each entry is either empty or contains the code word for the code word prefix/byte suffix pair represented by that index.
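A sketch of that idea, assuming 12 bit code words (so the table has 2^20 entries): DirectInit, DirectFind, and DirectInsert are hypothetical names, and the obvious trade-off is memory, since a table of this size costs 4 MB of ints.

```c
#include <stddef.h>

#define CODE_BITS  12                       /* assumed code word length */
#define TABLE_SIZE ((size_t)1 << (CODE_BITS + 8))

/* one slot for every possible prefix/suffix pair;
 * index = (prefix << 8) | suffix */
static int directTable[TABLE_SIZE];

void DirectInit(void)
{
    size_t i;

    for (i = 0; i < TABLE_SIZE; i++)
    {
        directTable[i] = -1;                /* -1 marks an empty slot */
    }
}

/* O(1) search: returns the code word for (prefix, suffix), or -1 */
int DirectFind(int prefix, unsigned char suffix)
{
    return directTable[((size_t)prefix << 8) | suffix];
}

/* O(1) insertion */
void DirectInsert(int prefix, unsigned char suffix, int codeWord)
{
    directTable[((size_t)prefix << 8) | suffix] = codeWord;
}
```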

Decoding Issues

The biggest considerations when implementing the LZW decoding algorithm are:

  • Reading of code words that aren't an integral number of bytes.
  • Determining the length of a code word.
  • The layout of the dictionary.
  • Decoding code words to strings.

As with writing code words, reading code words wasn't that big of a deal. I use my bit file library here too.

Determining Code Word Length

The problem of determining the length of code words is only an issue when variable length code words are allowed. In the section on encoding, I discuss a method that uses an all ones indicator when code word bit lengths are to increase by one. When my decoder reads a code word that is all ones bits, it must increase the code word length by one bit without decoding that code word. All code words that follow are read using the new length and operation proceeds normally.

Version 0.2 of my implementation starts out using 9 bit code words and allows the code word length to grow to 15 bits.

Dictionary Layout

The dictionary organization used for encoding would work just fine for decoding; however, it's more complicated than it needs to be. During the encoding process, strings are created and the dictionary is searched to see if a code word already exists for each string. The decoding algorithm reads the encoded stream one code word at a time and uses the dictionary to find out how to decode the code word. Simply restated: encoding searches the dictionary based on string value; decoding searches the dictionary based on code word value.

My decoding dictionary is just an array of 2^n strings, where the array index is the code word that encodes the string.

Decoding Code Words into Strings

After it reads a code word, the decoding algorithm looks up the code word in the dictionary and writes out the string that it encodes. Normally that wouldn't be an issue. However, strings in the dictionary are stored as a prefix code word + a suffix byte. In order to decode a code word into a string of bytes, the prefix code of the string must also be decoded. The process of decoding prefix codes continues until a prefix code decodes to a 1 byte string, so the string being decoded is built in reverse order.

The process of decoding a code word into another code word and a byte, and writing all the bytes out in the reverse order of their decoding, made me think of recursion. I don't have much of an opportunity to use recursion in my paying job, so the idea of implementing a recursive decode routine seemed "fun". If you'd rather avoid recursion, store the decoded characters into an array that is at least 2^n - 256 bytes long, then write out the results in reverse order.
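For illustration, here is what the non-recursive alternative might look like: collect suffix bytes while walking the prefix chain, then emit them in reverse. DecodeToString and the array names are made up for this sketch; the scratch array is sized for the longest string a full 12 bit dictionary can produce.

```c
#include <stddef.h>

#define MAX_CODES  4096             /* assumes 12 bit code words */
#define FIRST_CODE 256              /* codes 0-255 are single bytes */

/* decoding dictionary: string = prefix code word + suffix byte */
int prefixCode[MAX_CODES];
unsigned char suffixByte[MAX_CODES];

/* Decode codeWord into out[] without recursion.  Suffix bytes are found
 * last character first, so they are collected in a scratch array and
 * then copied out in reverse.  Returns the decoded string's length. */
size_t DecodeToString(int codeWord, unsigned char *out)
{
    unsigned char reversed[MAX_CODES - FIRST_CODE + 1]; /* longest string */
    size_t len = 0;
    size_t i;

    while (codeWord >= FIRST_CODE)  /* walk back through the prefix codes */
    {
        reversed[len++] = suffixByte[codeWord];
        codeWord = prefixCode[codeWord];
    }
    reversed[len++] = (unsigned char)codeWord;  /* chain ends at a 1 byte string */

    for (i = 0; i < len; i++)       /* write the bytes in forward order */
    {
        out[i] = reversed[len - 1 - i];
    }
    return len;
}
```

With the dictionary from Example 1, code word 263 walks 263 to 257 to 'h' and produces "his".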


Further Information

Further discussion of LZW with links to other documentation and libraries may be found at http://www.datacompression.info/lzw.shtml.

I found Mark Nelson's Dr. Dobb's Journal article to be very helpful. He even provides source for a slightly different implementation.


Actual Software

I am releasing my implementations of the LZW algorithm under the LGPL. If you've actually read this page to get all the way down here, you already know that I have different implementations. As I add enhancements or fix bugs, I will post them here as newer versions. The larger the version number, the newer the version. I'm retaining the older versions for historical reasons. I recommend that most people download the newest version unless there is a compelling reason to do otherwise.

Each version is contained in its own zipped archive which includes the source files and brief instructions for building an executable. None of the archives contain executable programs. A copy of the archives may be obtained by clicking on the links below.

Version Comment
Version 0.6 Fixed bug that occurs when output code word grows by two or more bits.
  Uses BitFilePutBitsInt/BitFileGetBitsInt to allow for code words as large as sizeof(int). (#define limits code words to 20 bits).
Version 0.5 Replaces getopt with my optlist library.
  Explicitly license the library under LGPL version 3.
Version 0.4 Uses latest bit file library. This may fix memory access errors with non-GNU compilers.
Version 0.3 Separated encoded and decode source to simplify creation of encode or decode only programs.
  Uses a binary tree to store encoding dictionary.
Version 0.2 Uses variable length code words 9 to 15 bits long.
Version 0.1 Initial release using 12 bit fixed length code words.
  Once the dictionary is full, it does not change.

Portability

All the source code that I have provided is written in strict ANSI C. I would expect it to build correctly on any machine with an ANSI C compiler. I have tested the code compiled with gcc on Linux and mingw on Windows XP.

The software makes no assumptions about the endianness of the machine that it is executing on. However, it does make some assumptions about the size of data types. The software makes use of the #if and #error pre-processor directives to check for violations of my assumptions at compile time.
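A check of that style might look like the following. These particular limits are illustrative, not copied from my source:

```c
#include <limits.h>

/* halt compilation if the platform breaks the size assumptions */
#if (UCHAR_MAX != 0xFF)
#error This code assumes 8 bit chars
#endif

#if (UINT_MAX < 0xFFFF)
#error This code assumes unsigned int is at least 16 bits
#endif
```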

If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipper@alumni.engr.ucsb.edu

For more information on the LZW algorithm and other compression algorithms, visit DataCompression.info.


Last updated on October 4, 2010