My experience with Rice (Golomb) coding and a challenge that a student sent me had me thinking about creating a library that uses a small number of bits to encode the differences between consecutive symbols, but it didn't seem useful or all that interesting. It became interesting and useful after I attended BodyNets 2009 and saw a presentation on "Adaptive Lossless Compression in Wireless Body Area Sensor Networks". The presenter concluded that adaptive delta encoding was a code space and time efficient method for compressing data sampled from heart and gait monitors. So with renewed enthusiasm, I threw together a quick hack that implements adaptive delta encoding.

As with my other compression implementations, my intent is to publish an easy to follow ANSI C implementation of the adaptive delta coding algorithm. Anyone familiar with ANSI C and the adaptive delta coding algorithm should be able to follow and learn from my implementation. There's a lot of room for improvement of compression ratios, speed, and memory usage, but this project is about learning and sharing, not perfection.

Click here for information on downloading my source code.

The rest of this page discusses adaptive delta coding and the results of my efforts so far.


Algorithm Overview

Delta encoding represents a stream of symbols as the difference between the current symbol and the previous symbol. This results in compression for any data stream where it takes fewer bits to represent the differences between consecutive symbols than it does to represent the symbols themselves.

If you happen to have data that is well suited to delta encoding (like audio or heart monitor data), you will find the simplicity of this algorithm attractive. However, delta encoding is not well suited for all data types.

Encoding

Delta coding is fairly straightforward. The first symbol of a delta encoded stream is always written out unencoded. After that, every symbol is replaced with an S bit long signed value representing the value of the current symbol minus the value of the previous symbol (or the other way around if you like).

Using S bits to represent a signed value allows for differences ranging from -2^(S-1) to 2^(S-1) - 1.

Keeping S small is what allows for compression, but if S is too small, the range of values that can be represented by S bits might not include the range of differences between consecutive symbols in the data stream. Rather than giving up when a delta is too large to be represented by S bits, an escape symbol may be issued to indicate that the next symbol is not delta encoded, then the current symbol may be output.

The Delta encoding algorithm does not specify an escape symbol. I used the smallest signed value that can be represented in S bits (-2^(S-1)) as my escape symbol.

Based on the discussion above, encoding input consists of the following steps:

Step 1. Write the first symbol out unencoded.

Step 2. Read the next symbol from the input stream. Exit on EOF.

Step 3. Compute the difference between the current symbol and the previous symbol.

Step 4. If the difference may be represented in S bits:

  • write the difference to the output stream
  • go to step 2

Step 5. If the difference cannot be represented in S bits:

  • write the escape symbol (the minimum S bit value)
  • write the current symbol unencoded
  • go to step 2

That's it. I told you it was straightforward.

Decoding

Decoding isn't any harder than encoding. Since the first symbol of a delta encoded stream is always written out unencoded, it can be read back and written without modification. After that, read in an S bit encoded symbol. If the S bit symbol is the escape code, read the next symbol and write it out; otherwise, write out the symbol that is the sum of the previous symbol and the difference that is represented by the S bit encoded symbol.

Continue this process until the end of file is reached.

Based on the discussion above, decoding encoded input consists of the following steps:

Step 1. Copy the first symbol to the decoded output.

Step 2. Read the next symbol from the input stream. Exit on EOF.

Step 3. If the current delta is an escape symbol:

  • read the next symbol as an uncoded symbol
  • write the symbol to the decoded output
  • go to step 2

Step 4. If the current delta is not an escape symbol:

  • compute the value of the previous symbol plus the delta
  • write the resulting symbol to the decoded output
  • go to step 2

Implementation Issues

Adapting to Delta Ranges

If you choose a value for the code word size (S) that is too small, the Delta encoding algorithm will generate a lot of escapes, and you'll end up with an encoded data stream that's larger than the unencoded stream. If you choose a value for S that is too large, your encoded stream will be far from optimal size. One solution to this problem is to implement an adaptive algorithm that allows the value of S to change as the data stream is processed.

I'm not aware of any literature that provides details on different adaptation algorithms, so I invented my own. My goals for the algorithm and its implementation were for it to be code space and time efficient, and easy to modify. It also had to produce results that were improvements over the results of the non-adaptive algorithm.

I incorporated an overflow counter to determine if the code word size (S) should be adjusted upward, and an underflow counter to determine if S should be adjusted downward. When a delta value requires more bits than S provides, the overflow counter is incremented; when a delta value could be represented in fewer bits, the underflow counter is incremented. Otherwise the counter is decremented (but never below 0). When a counter reaches 3, S is increased (or decreased) by 1 and the counter is reset to zero.

The source code used to adjust the code word size is all contained in the file adapt.c. If you want to experiment with different adaptive algorithms, you should start there. I look forward to hearing from anybody that has experimental results to report.

Handling End-of-File (EOF)

The EOF is of particular importance, because it is likely that an encoded file will not contain a number of bits that is an integral multiple of eight. Most file systems require that files be stored as whole bytes, so it's likely that encoded files will have spare bits. If you don't know where the EOF is, the spare bits may be decoded as another symbol.

There are at least three solutions to the EOF problem:

  1. Encode an EOF in the data stream.
  2. Keep a count of the number of data bytes encoded.
  3. Prevent the decoder from spitting out another data byte before the actual EOF is reached.

I chose the third option for my implementation. I write an escape symbol at the end of an encoded file. The decoder expects an unencoded symbol to follow the escape, but the file always ends in fewer bits than it takes to write an unencoded symbol, so the decoder will reach EOF before it gets the symbol it was expecting.


Effectiveness

So how well does delta coding work for generic data? Not very well at all. Delta coding only works well when consecutive symbols have similar values. The greater the difference between two consecutive symbols, the more bits are required to encode them.

One way to improve the compression obtained by delta coding on generic data is to apply a reversible transformation on the data that reduces the average value of a symbol as well as the difference between consecutive symbols. The Burrows-Wheeler Transform (BWT) with Move-To-Front (MTF) encoding is such a transform.

I used the Calgary Corpus as a data set to test the effectiveness of delta coding compared to my implementations of Rice coding, Huffman coding, and LZSS. The executive summary is that even with the help of BWT and MTF, delta coding couldn't match the compression ratios of Huffman coding or LZSS. Rice coding with BWT and MTF even has better compression ratios if the proper value for K is selected. However, delta coding without BWT and MTF performed better than Rice coding without BWT and MTF, which is the way they would be used in a system with limited memory and processing power. The adaptive delta coding algorithm that I used also seems robust enough that there is very little penalty for choosing the wrong initial value for S.

The results of my test appear in the following tables:

TABLE 1. Uncompressed and Traditionally Compressed Calgary Corpus Files
File Original Huffman + BWT LZSS + BWT
bib 111261 50849 69792
book1 768771 375140 531367
book2 610856 272857 384448
geo 102400 74328 80692
news 377109 187912 264605
obj1 21504 12835 14233
obj2 246814 103745 129296
paper1 53161 24041 33116
paper2 82199 37889 52814
paper3 46526 22428 31066
paper4 13286 6756 8934
paper5 11954 6069 7862
paper6 38105 17319 23596
pic 513216 102814 118892
progc 39611 17376 23447
progl 71646 24027 32692
progp 49379 17077 22777
trans 93695 36146 49277
Total 3251493 1389608 1878906
Percent 100.00% 42.74% 57.79%

TABLE 2. Rice Calgary Corpus Files
File Rice (K = 2) Rice (K = 4)
No Processing BWT + MTF No Processing BWT + MTF
bib 310945 63729 132690 73271
book1 2411218 415556 983146 498044
book2 1912774 327191 780344 396846
geo 300282 164232 127322 92851
news 1122082 232063 466867 251512
obj1 66546 23962 27379 17040
obj2 812091 203702 322336 179249
paper1 162996 29119 66994 34729
paper2 262680 43819 106376 53286
paper3 149090 25339 60251 30261
paper4 41738 7340 17006 8676
paper5 36108 6827 14933 7873
paper6 112869 21022 47046 24949
pic 494376 227137 392391 327985
progc 106976 22658 46306 26174
progl 193202 34977 83408 46170
progp 130982 25144 57154 32098
trans 237650 51727 105406 61614
Total 8864605 1925544 3837355 2162628
Percent 272.63% 59.22% 118.02% 66.51%

TABLE 3. Delta Calgary Corpus Files
File Delta (S = 2) Delta (S = 4)
No Processing BWT + MTF No Processing BWT + MTF
bib 123019 84489 123017 84492
book1 847240 605189 847239 605192
book2 659954 466702 659953 466707
geo 117961 92639 117963 92640
news 404171 319905 404171 319906
obj1 22737 17481 22743 17482
obj2 295535 174819 295534 174821
paper1 57070 40565 57071 40567
paper2 89217 62711 89218 62711
paper3 49215 36416 49216 36416
paper4 14184 10398 14183 10403
paper5 13172 9394 13173 9394
paper6 41690 28949 41690 28949
pic 211592 182724 211593 182725
progc 41443 29544 41440 29544
progl 73771 44693 73772 44695
progp 51802 30931 51800 30934
trans 101770 66050 101770 66053
Total 3215543 2303599 3215546 2303631
Percent 98.89% 70.85% 98.89% 70.85%


Library Code Interface

Encoding Data

DeltaEncodeFile

Declaration:

int DeltaEncodeFile(FILE *inFile, FILE *outFile, unsigned char codeSize);

Description:

This function reads from the specified input stream and writes an adaptive delta encoded version to the specified output stream.

Parameters:

inFile - The file stream to be encoded. It must be opened. NULL pointers will return an error.
outFile - The file stream receiving the encoded results. It must be opened. NULL pointers will return an error.
codeSize - The number of bits used for code words at the start of coding.

Effects:

Data from the inFile stream will be delta encoded and written to the outFile stream.

Returned:

0 for success, -1 for failure. errno will be set in the event of a failure.

Decoding Data

DeltaDecodeFile

Declaration:

int DeltaDecodeFile(FILE *inFile, FILE *outFile, unsigned char codeSize);

Description:

This function reads from the specified adaptive delta encoded input stream and writes a decoded version to the specified output stream.

Parameters:

inFile - The file stream to be decoded. It must be opened. NULL pointers will return an error.
outFile - The file stream receiving the decoded results. It must be opened. NULL pointers will return an error.
codeSize - The number of bits used for code words at the start of coding.

Effects:

Data from the inFile stream will be decoded and written to the outFile stream.

Returned:

0 for success, -1 for failure. errno will be set in the event of a failure.

Portability

All the source code that I have provided is written in strict ANSI C. I would expect it to build correctly on any machine with an ANSI C compiler. I have tested the code compiled with gcc on Linux and mingw on Windows XP.

The software makes no assumptions about the endianness of the machine that it is executing on. However, it does make some assumptions about the size of data types. The software makes use of the #if and #error pre-processor directives as an attempt to check for violations of my assumptions at compile time.


Actual Software

I am releasing my implementations of adaptive delta encoding and decoding under the LGPL. At this time I only have two revisions of the code to offer. As I add enhancements or fix bugs, I will post them here as newer versions. The larger the version number, the newer the version. I will retain the older versions for historical reasons. I recommend that most people download the newest version unless there is a compelling reason to do otherwise.

Each version is contained in its own zipped archive which includes the source files and brief instructions for building an executable. None of the archives contain executable programs. A copy of the archives may be obtained by clicking on the links below.

Version Comment
Version 0.3 Upgrade to latest bit file and optlist libraries
  Changed the API so that encode and decode routines accept opened file streams instead of file names.
  Corrected maximum underflow value. Previous versions are off by one.
  Written with tighter adherence to Michael Barr's Top 10 Bug-Killing Coding Standard Rules.
  Made adaptive code word length module (adapt) reentrant. Now caller must use helper function to allocate and free data structures.
Version 0.2 Refactored code to allow for easy modifications of the adaptive rules.
Version 0.1 Initial release.

If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipper@alumni.engr.ucsb.edu

Home
Last updated on November 23, 2014