I had to scratch yet another compression itch. I know very little about compressing wave audio files, but some questions I received made me want to find out slightly more. It turns out that it's possible to convert wave files into a format where the bulk of the data can be represented by low value numbers. Once that is done, Rice Encoding (a special case of Golomb Coding) can be applied to reduce the bits required to represent the lower value numbers.

Rice's algorithm seemed easy to implement, so I felt compelled to give it a try. I was right (sometimes that happens). A rough cut of an implementation of the algorithm didn't take that long to implement. If you're going to try your own implementation, you'll need a way to read and write individual bits, the rest is not that big of a deal. I already have a bitfile library that handles the reading and writing of bits, so I was left with the stuff that is not a big deal.

As with my other compression implementations, my intent is to publish an easy to follow ANSI C implementation of the Rice encoding and decoding algorithms. Anyone familiar with ANSI C and Rice (or Golomb) Encoding should be able to follow and learn from my implementation. I'm sure that there's room for improvement of compression ratios, speed, and memory usage, but this project is about learning and sharing, not perfection.

## Algorithm Overview

Given a constant `M`, any symbol `S` can be represented as a quotient (`Q`) and remainder (`R`), where:
`S = Q × M + R`.

If `S` is small (relative to `M`) then `Q` will also be small. Rice encoding is designed to reduce the number of bits required to represent symbols where `Q` is small.

Rather than representing both `Q` and `R` as binary values, Rice encoding represents `Q` as a unary value and `R` as a binary value.

For those not familiar with unary notation, a value `N` may be represented by `N` 1s followed by a 0.
Example: 3 = 1110 and 5 = 111110.

Note: The following is true for binary values, if ```log2(M) = K``` where `K` is an integer:

1. `Q = S >> K` (`S` left shifted `K` bits)
2. `R = S & (M - 1)` (`S` bitwise ANDed with `(M - 1)`)
3. `R` can be represented using `K` bits.

### Encoding

Rice coding is fairly straightforward.

Given a bit length, `K`. Compute the modulus, `M` using by the equation `M = 2K`. Then do following for each symbol (`S`):

1. Write out `S & (M - 1)` in binary.
2. Write out `S >> K` in unary.

That's it. I told you it was straightforward.

Example:
Encode the 8-bit value 18 (0b00010010) when `K` = 4 (`M` = 16)

1. `S & (M - 1)` = 18 & (16 - 1) = 0b00010010 & 0b1111 = 0b0010
2. `S >> K` = 18 >> 4 = 0b00010010 >> 4 = 0b0001 (10 in unary)

So the encoded value is 100010, saving 2 bits.

### Decoding

Decoding isn't any harder than encoding.

As with encoding, given a bit length, `K`. Compute the modulus, `M` using by the equation ```M = 2K ```. Then do following for each encoded symbol (`S`):

1. Determine `Q` by counting the number of 1s before the first 0.
2. Determine `R` reading the next `K` bits as a binary value.
3. Write out `S` as `Q × M + R`.

Example:
Decode the encoded value 100010 when `K` = 4 (`M` = 16)

1. `Q` = 1
2. `R` = 0b0010 = 2
3. `S` = `Q × M + R` = 1 × 16 + 2 = 18

## Effectiveness

So how well does Rice coding work for generic data? Not very well at all. Rice coding only works well when symbols are encoded with small values of `Q`. Since `Q` is unary, encoded symbols can become quite large for even slightly large `Q`s. It takes 8 bits just to represent the value 7.

One way to improve the compression obtained by Rice coding on generic data is to apply reversible transformation on the data that reduces the average value of a symbol. The Burrows-Wheeler Transform (BWT) with Move-To-Front (MTF) encoding is such a transform.

I used the Calgary Corpus as a data set to test the effectiveness of Rice coding compared to my implementations of Huffman coding and LZSS. The executive summary is that even with the help of BWT and MTF, Rice coding couldn't match the compression ratios of Huffman coding or LZSS. However BWT and MTF allowed Rice coding to actually reduce the size of the data sets. The results of my test appear in the following table:

 File Original Huffman BWT + MTF + Huffman LZSS BWT + MTF + LZSS Rice (K = 4 bits) BWT + MTF + Rice (K = 4 bits) Rice (K = 2 bits) BWT + MTF + Rice (K = 2 bits) bib 111261 73173 50849 52551 69792 132690 73271 310945 63729 book1 768771 438792 375140 423905 531367 983146 498044 2411218 415556 book2 610856 368788 272857 285896 384448 780344 396846 1912774 327191 geo 102400 73845 74328 83349 80692 127322 92851 300282 164232 news 377109 246891 187912 194679 264605 466867 251512 1122082 232063 obj1 21504 17338 12835 12267 14233 27379 17040 66546 23962 obj2 246814 195384 103745 103090 129296 322336 179249 812091 203702 paper1 53161 33819 24041 24464 33116 66994 34729 162996 29119 paper2 82199 48077 37889 39693 52814 106376 53286 262680 43819 paper3 46526 27702 22428 23467 31066 60251 30261 149090 25339 paper4 13286 8267 6756 6795 8934 17006 8676 41738 7340 paper5 11954 7893 6069 6072 7862 14933 7873 36108 6827 paper6 38105 24495 17319 17619 23596 47046 24949 112869 21022 pic 513216 107354 102814 105373 118892 392391 327985 494376 227137 progc 39611 26381 17376 17559 23447 46306 26174 106976 22658 progl 71646 43425 24027 22536 32692 83408 46170 193202 34977 progp 49379 30666 17077 15449 22777 57154 32098 130982 25144 trans 93695 65720 36146 33796 49277 105406 61614 237650 51727 Total 3251493 1838010 1389608 1468560 1878906 3837355 2162628 8864605 1925544 Percent 56.53% 42.74% 45.17% 57.79% 118.02% 66.51% 272.63% 59.22%

## Implementation Issues

### Size of Symbol

Rice's algorithm does not place any restrictions on the size of symbols being encoded. However the size of the encoded symbols must be known when the algorithm is actually being applied to data. My implementation of Rice's algorithm only encodes bytes. This places the additional restriction that `K` is between 1 and 7.

### Handling End-of-File (EOF)

The EOF is of particular importance, because it is likely that an encoded file will not have a number of bits that is a integral multiple of bytes. Most file systems require that files be stored in bytes, so it's likely that encoded files will have spare bits. If you don't know where the `EOF` is, the spare bits may be decoded as another symbol.

There are at least three solutions to the `EOF` problem:

1. Encode an `EOF` in the data stream.
2. Keep a count of the number of data bytes encoded.
3. Prevent the decoder from spitting out another data byte until the actual `EOF` is reached.

I choose the third option for my implementation. My decoder will not produce another symbol until it sees the 0 indicating the end of the unary portion followed by `K` bits. Filling the spare bits in the last byte of a file with ones is enough to keep my decoder from decoding an extra symbol.

## Library Code Interface

### Encoding Data

#### RiceEncodeFile

Declaration:

```int RiceEncodeFile(FILE *inFile, FILE *outFile, const unsigned char k)```

Description:

This routine reads an input file one character at a time and writes out a Rice-Golomb encoded version of that file.

Parameters:

`inFile` - The file stream to be encoded. It must opened. NULL pointers will return an error.
`outFile` - The file stream receiving the encoded results. It must be opened. NULL pointers will return an error.
`k` - The length of binary portion of encoded word.

Effects:

`inFile` is encoded using the Rice-Golomb algorithm with the binary portion of code words `k` bits long. The encoded output is written to `outFile`. Neither file is closed after exit.

Returned:

0 for success, -1 for failure. `errno` will be set in the event of a failure.

### Decoding Data

#### RiceDecodeFile

Declaration:

```int RiceDecodeFile(FILE *inFile, FILE *outFile, const unsigned char k)```

Description:

This routine reads an input file one encoded string at a time and decodes it using the Rice-Golomb algorithm.

Parameters:

`inFile` - The file stream to be decoded. It must opened. NULL pointers will return an error.
`outFile` - The file stream receiving the decoded results. It must be opened. NULL pointers will return an error.
`k` - The length of binary portion of encoded word.

Effects:

`inFile` is decoded using the Rice-Golomb algorithm with the binary portion of code words `k` bits long. The decoded output is written to `outFile`. Neither file is closed after exit.

Returned:

0 for success, -1 for failure. `errno` will be set in the event of a failure.

## Further Information

Further discussion of Rice Encoding may be found at Wikipedia and Monkey's Audio - a fast and powerful lossless audio compressor.

## Actual Software

I am releasing my implementations of Rice encoding/decoding under the LGPL. At this time I only have one revision of the code to offer. As I add enhancements or fix bugs, I will post them here as newer versions. The larger the version number, the newer the version. I will retain the older versions for historical reasons. I recommend that most people download the newest version unless there is a compelling reason to do otherwise.

Each version is contained in its own zipped archive which includes the source files and brief instructions for building an executable. None of the archives contain executable programs. A copy of the archives may be obtained by clicking on the links below.

 Version Comment Version 0.2 Changed return value to 0 for success and -1 for failure with reason in errno. Changed the API so that encode and decode routines accept opened file streams instead of file names. Upgrade to latest optlist and bitfile libraries. Written with tighter adherence to Michael Barr's Top 10 Bug-Killing Coding Standard Rules. Version 0.1 Initial release.

### Portability

All the source code that I have provided is written in strict ANSI C. I would expect it to build correctly on any machine with an ANSI C compiler. I have tested the code compiled with gcc on Linux and mingw on Windows XP.

The software makes no assumptions about the endianess of the machine that it is executing on. However, it does make some assumptions about the size of data types. The software makes use of the #if and #error pre-processor directives as an attempt to check for violations of my assumptions at compile time.

If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipper@alumni.engr.ucsb.edu

Home
Last updated on November 24, 2014