Frequency Substitution is a term I made up for a transformation that I saw discussed on another web site. I have since lost the link to the site, but if somebody can find it I'll be happy to reference it.

The idea behind Frequency Substitution is simple. The goal is to transform data by replacing the highest frequency symbols with the lowest values. It sounded simple enough, so I thought that I'd give it a try.

Normally this type of transformation doesn't produce useful results, but there are a few compression techniques, such as Rice (Golomb) Coding, that produce smaller code words for lower value symbols. By making the lower valued symbols the most frequent symbols, these encoding schemes may achieve better compression ratios.

As with my other compression implementations, my intent is to publish an easy to follow ANSI C implementation of frequency substitution encoding and decoding. Anyone familiar with ANSI C should be able to follow and learn from my implementation. I'm sure that there's room for improvement of speed and memory usage, but this project is about learning and sharing, not perfection.

Click here for a link to my frequency substitution source. The rest of this page discusses the concepts involved with frequency substitution and my implementation.


Algorithm Overview

The goal of frequency substitution is to transform data by replacing the highest frequency symbols with the lowest values. In a data set contains 256 unique symbols, the most common symbol will be replaced with 0 and the least common symbol will be replaced with 255.

Encoding

Encoding a data stream using frequency substitution is a two pass process. The first pass is used to count the frequency of the symbols. After the first pass, a table of substitution codes is generated. Then a second pass is made, where an encoded data set is generated using the substitution table.

The first pass should be straight forward, start with a table indicating a frequency of 0 for all possible symbols, then add 1 to a symbol's frequency count every time it is encountered in the data stream. Next sort the substitution table most frequent to least frequent.

On the second pass, write out enough information for the decoder to reconstruct the table, then read in each unencoded symbol and write out the sorted position of that symbol.

Example:
Given a symbol set of 0 ... 9, encode the following data stream:
3,1,4,1,5,9,2,6,5,3,5,8,9,7,9,3,2,3,8,4,6,2,6,4,3,3,8,3,2,7,9

Based on the discussion above, encoding input consists of the following steps:

Step 1. Start with an empty table.

Symbol Count
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0

Step 2. Generate a frequency table for the data stream.

Symbol Count
0 0
1 2
2 4
3 7
4 3
5 3
6 3
7 2
8 3
9 4

Step 3. Sort the symbols by frequency.

Sorted
Position
Symbol Count
0 3 7
1 2 4
2 9 4
3 4 3
4 5 3
5 6 3
6 8 3
7 1 2
8 7 2
9 0 0

Step 4. Write out the encoded data by substituting a symbol for it's position in the frequency table.

3,1,4,1,5,9,2,6,5,3,5,8,9,7,9,3,2,3,8,4,6,2,6,4,3,3,8,3,2,7,9
becomes
0,7,3,7,4,2,1,5,4,0,4,6,2,8,2,0,1,0,6,3,5,1,5,3,0,0,6,0,1,8,2

A quick sanity check shows that the sum of the symbols in the initial string (150) is greater than the sum of symbols in the encoded string (96) so we must have lowered the average symbol value.

Decoding

Decoding data that has been encoded by frequency substitution is a single pass process. Given a table of frequency substitutions, reverse the substitutions preformed during the encoding process.

Example:
The following sequence and table of frequency substitutions
0,7,3,7,4,2,1,5,4,0,4,6,2,8,2,0,1,0,6,3,5,1,5,3,0,0,6,0,1,8,2

Sorted
Position
Symbol Count
0 3 7
1 2 4
2 9 4
3 4 3
4 5 3
5 6 3
6 8 3
7 1 2
8 7 2
9 0 0

Produces the following sequence
3,1,4,1,5,9,2,6,5,3,5,8,9,7,9,3,2,3,8,4,6,2,6,4,3,3,8,3,2,7,9

If you were paying attention, you're probably wondering how the encoder passed the table of frequency substitutions to the decoder. The next section provides some ideas and discusses my implementations.

The Substitution Key

Before the decoder can do anything with the encoded data, it needs a key that tells it how to undo the substitutions. Since the key is needed immediately, it makes sense for the encoder to output the key before it outputs the encoded data.

So what kind of information do we need to have in the key? In order for the key to be useful, it must contain the original symbol and the value that it was substituted for.

If there are N possible symbols, the simplest thing to do is write out the complete list of all N (symbol, substitution) pairs. Such a list contains 2 × N entries. My goal was to use frequency substitution for data compression, so it would be nice if the list could be made even smaller.

As it turns out there are at least two ways to make the list smaller:

  1. Use the position of the list entries as the symbol value.
  2. Only transmit keys for symbols that appear in the data.

Position As Symbol Value

Since the substitution key is the first thing output by the encoder, the key for all N symbols will be output in positions 0 ... (N - 1). Position 0 contains the substitution used for symbol 0, position 1 contains the substitution used for symbol 1, and so on, the symbol number may be omitted from the key. This method requires just N entries for N symbols.

Excluding Unused Symbols

Sometimes the data being encoded is sparse; it only uses a fraction of all of the symbols in the symbol set. In this case there will be a large number of unneeded entries in a position indexed key.

For spare data sets, it is more efficient to transmit (symbol, substitution) pairs omitting symbols that don't appear in the input. To make this method work an end of key or a key length must also be transmitted. Otherwise the decoder would not be able to distinguish the encoded data from the substitution key.

My implementation first outputs the key length - 1, followed by (symbol offset, substitution) pairs. Where the symbol offset is the value of the current symbol minus the value of the previous symbol. The value of the previous symbol is taken as 0 at start up.

Example:
(symbol: 30, substitution: 5), (symbol: 31, substitution: 42)
becomes:
(offset: 30, substitution: 5), (offset: 1, substitution: 42)

Using a symbol offset instead of the actual symbol value has the advantage of producing substitution keys with lower value entries. Normally lower value entries would not be a big deal, but the whole purpose of frequency substitution is to produce data with lower values than the original source.

Effectiveness

If you actually read my introduction, you'd notice that I thought frequency substitution might be helpful when used with Rice (Golomb) Coding. In order to test that theory, I used the Calgary Corpus as a test data set.

The table below shows the results of applying frequency substitution to the Calgary Corpus and then using Rice coding (with K = 4) to compress encoded files. Both position indexed and sparse versions of my frequency substitution implemention were used. For reference, the table also provides results for Rice coding alone and Rice coding with BWT and MTF.

The results of my test appear in the following tables:

File Orig Rice BWT + Rice FreqSub + Rice Sparse FreqSub + Rice Rice BWT + Rice FreqSub + Rice Sparse FreqSub + Rice
bib 111261 132690 73271 76878 76601 310945 63729 85571 84629
book1 768771 983146 498044 500524 500249 2411218 415556 456305 455366
book2 610856 780344 396846 405157 404909 1912774 327191 396634 395744
geo 102400 127322 92851 90109 90272 300282 164232 152572 152676
news 377109 466867 251512 260957 260712 1122082 232063 290736 289854
obj1 21504 27379 17040 19186 19349 66546 23962 33097 33201
obj2 246814 322336 179249 212587 212749 812091 203702 358843 358947
paper1 53161 66994 34729 36213 35962 162996 29119 38018 37125
paper2 82199 106376 53286 54097 53839 262680 43819 50855 49947
paper3 46526 60251 30261 30918 30647 149090 25339 29792 28859
paper4 13286 17006 8676 9142 8864 41738 7340 9435 8490
paper5 11954 14933 7873 8444 8186 36108 6827 9330 8422
paper6 38105 47046 24949 26193 25939 112869 21022 28018 27117
pic 513216 392391 327985 325191 325081 494376 227137 219083 218491
progc 39611 46306 26174 27916 27659 106976 22658 32171 31266
progl 71646 83408 46170 47819 47553 193202 34977 47601 46680
progp 49379 57154 32098 33961 33700 130982 25144 36173 35258
trans 93695 105406 61614 67082 66840 237650 51727 82272 81394
Total 3251493 3837355 2162628 2232374 2229111 8864605 1925544 2356506 2343466
Percent 118.02% 66.51% 68.66% 68.56% 272.63% 59.22% 72.47% 72.07%

There are three things that you should learn from this table:

  1. The Calgary Corpus contains data ill suited for Rice encoding.
  2. Frequency substitution may be applied to the Calgary Corpus data to enable Rice encoding to compress the data.
  3. Better compression results may be achieved by applying BWT and MFT.

So why use frequency substitution? If your goal is to achieve the best possible compression ratios, most data sets don't provide a compelling reason. However, the computational and memory resources required for frequency substitution is less than that required by BWT and MFT. I didn't provide any proof of that, you'll just have to trust me or see for yourself.


Actual Software

I am releasing my implementations offrequency substitution encoding and decoding under the LGPL. As I add enhancements or fix bugs, I will post them here as newer versions. The larger the version number, the newer the version. I will retain the older versions for historical reasons. I recommend that most people download the newest version unless there is a compelling reason to do otherwise.

Each version is contained in its own zipped archive which includes the source files and brief instructions for building an executable. None of the archives contain executable programs. A copy of the archives may be obtained by clicking on the links below.

Version Comment
Version 0.1 Initial release.

Portability

All the source code that I have provided is written in strict ANSI C. I would expect it to build correctly on any machine with an ANSI C compiler. I have tested the code compiled with gcc on Linux and mingw on Windows XP.

If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipper@alumni.engr.ucsb.edu

For more information on compression algorithms, visit DataCompression.info.

DataCompression.Info

Home
Last updated on October 4, 2010