# Burrows-Wheeler Transform Discussion and Implementation

### by Michael Dipperstein

After playing with: Huffman coding, LZSS, arithmetic coding, and run length encoding (RLE), you'd figure that I'd have enough. I thought so too. I briefly diverted my attention from compression with a venture into C++ implementations of my bit libraries, but compression curiosities came back to haunt me. I had long known about bzip2 and its better compression ratios than gzip. I even knew that the "b" is for "block", but I never really understood how it worked. Where do blocks come into play and how does that let me improve on my compression ratios?

It turns out that the blocks are constructs of the Burrows-Wheeler block-sorting transform (BWT). This page discusses the Burrows-Wheeler transform and the related Move-To-Front coding (MTF). I have also provided links to my implementation of BWT and MTF.

As with my compression implementations, my intent is to publish an easy to follow ANSI C implementation of the Burrows-Wheeler transform. Anyone familiar with ANSI C and BWT should be able to follow and learn from my implementation. There's a lot of room for improvement of speed, memory usage, and block size optimizations, but this project is about learning and sharing, not perfection.

If you just want my source code, click here for information on downloading it.

The rest of this page discusses BWT, Move-To-Front coding and the results of my efforts so far.

## Algorithm Overview

The Burrows-Wheeler transform (BWT) is not actually a compression algorithm. In fact BWT requires that a small amount of additional information be stored with the transformed data so that the transformation may be reversed. This makes the transformed data larger than its original form.

BWT is useful because it converts the data into a format that is generally more compressible by run length encoders and statistical encoders with order greater than 0. By additionally applying move-to-front coding, the data will be in a format which is generally more compressible by even zero order statistical encoders such as traditional implementations of Huffman coding or arithmetic coding.

### Transform

The Burrows-Wheeler transform is applied on blocks of input data (symbols). It is usually the case that larger blocks result in greater compressibility of the transformed data at the expense of time and system resources.

One of the effects of BWT is to produce blocks of data with more and longer runs (strings of identical symbols) than those found in the original data. The increasing the number of runs and their lengths tends to improve a the compressibility of data.

The first step of BWT is to read in a block of N symbols C_{0}
… C_{N-1}.

**Example:**

Read in the following block: `this is a test.`

N = 15

C_{0} = 't'

C_{1} = 'h'

…

C_{13} = 't'

C_{14} = '.'

The next step is to think of the block as a cyclic buffer. N strings
(rotations) S_{0} … S_{N-1} may be constructed such
that:

S_{0} = C_{0}, …, C_{N-1}

S_{1} = C_{1}, …, C_{N-1}, C_{0}

S_{2} = C_{2}, …, C_{N-1}, C_{0},
C_{1}

…

S_{N-1} = C_{N-1}, C_{0}, …, C_{N-2}

**Example:**

"this is a test." yields the following rotations:

```
S
```

_{0} = "this is a test."

S_{1} = "his is a test.t"

S_{2} = "is is a test.th"

S_{3} = "s is a test.thi"

S_{4} = " is a test.this"

S_{5} = "is a test.this "

S_{6} = "s a test.this i"

S_{7} = " a test.this is"

S_{8} = "a test.this is "

S_{9} = " test.this is a"

S_{10} = "test.this is a "

S_{11} = "est.this is a t"

S_{12} = "st.this is a te"

S_{13} = "t.this is a tes"

S_{14} = ".this is a test"

The third step of BWT is to lexicographically sort S_{0}
… S_{N-1}.

**Example:**

"this is a test." yields the following sorted rotations:

```
S
```

_{7} = " a test.this is"

S_{4} = " is a test.this"

S_{9} = " test.this is a"

S_{14} = ".this is a test"

S_{8} = "a test.this is "

S_{11} = "est.this is a t"

S_{1} = "his is a test.t"

S_{5} = "is a test.this "

S_{2} = "is is a test.th"

S_{6} = "s a test.this i"

S_{3} = "s is a test.thi"

S_{12} = "st.this is a te"

S_{13} = "t.this is a tes"

S_{10} = "test.this is a "

S_{0} = "this is a test."

The final step in the transform is to output a string, `L`

,
consisting of the last character in each of the rotations in their sorted
order along with `I`

, the sorted row containing S_{0}.

**Example:**

"this is a test." yields the following output:

```
L = "ssat tt hiies .", I = 14
```

The input went from a block of 0 runs to a block of 3 runs of length 2 (not bad for 15 symbols). This transformation works on data blocks where there are symbols that are frequently preceded by the same symbol. In my example ' ' is twice preceded by 's'.

Believe it or not the string of last characters (`L`

) along
with the row S_{0} appears in after sorting (`I`

) is
enough information to reconstruct the original block (S_{0}).

### Reverse Transform

Reversing BWT is a little more complicated than the initial transform.
The reversal process starts with a string `L`

composed of last
characters of sorted rotations (S_{0} … S_{N-1}) and
`I`

, the position of the contribution S_{0} made to
`L`

. The reversal process must yield S_{0}, the original
block.

The approach to reversing the transform wasn't the least bit obvious to me. If it's obvious to you, feel free to skip this section.

It turns out there are a few ways to reverse the transform. The method discussed here is the one that I ended up implementing.

If `L`

is composed of the symbols V_{0} …
V_{N-1}, the transformed string may be parsed to determine the
following pieces of additional information:

- The number of symbols in the substring V
_{0}… V_{i-1}that are identical to V_{i}. - For each unique symbol, V
_{i}, in`L`

, the number of symbols that are lexicographically less than that symbol.

**Example:**

`L = "ssat tt hiies ."`

produces the following:

Position | Symbol | Number Matching |
Position | Symbol | Number Matching |

0 | 's' | 0 | 8 | 'h' | 0 |

1 | 's' | 1 | 9 | 'i' | 0 |

2 | 'a' | 0 | 10 | 'i' | 1 |

3 | 't' | 0 | 11 | 'e' | 0 |

4 | ' ' | 0 | 12 | 's' | 2 |

5 | 't' | 1 | 13 | ' ' | 2 |

6 | 't' | 2 | 14 | '.' | 0 |

7 | ' ' | 1 |

Symbol | Number Less Than |

' ' | 0 |

'.' | 3 |

'a' | 4 |

'e' | 5 |

'h' | 6 |

'i' | 7 |

's' | 9 |

't' | 12 |

Now we have:

`L`

, the string of last characters of rotations.`I`

, the position of the contribution that S_{0}made to`L`

.- The number of times the symbol V
_{i}appears in the substring V_{0}… V_{i-1}. - The number of times a symbol lexicographically less than a unique
symbol appears in
`L`

.

Amazingly this is all we need to reconstruct S_{0}.

Remember our string S_{0} = C_{0} …
C_{N-1}. So C_{N-1} is the contribution from S_{0},
and it's in position `I`

. (From #2)

Now that we've found the last character in S_{0},
C_{N-1}, we'll try to reconstruct the rest of S_{0}
backwards. (Trust me, there's a technique that will let us do that.)

That means we need to find the symbol C_{N-2}. C_{N-2}
is the contribution from S_{N-1}. so where is contribution from
S_{N-1} in `L`

?

It just so happens that S_{N-1} is the string that starts with
C_{N-1} and we just found C_{N-1}. So we know how
S_{N-1} starts, which can help us find out where it is in the
sort order.

The contributions from strings starting with symbols less than
C_{N-1} will occur earlier in `L`

than the contribution
from S_{N-1}. `L`

may also contain contributions from
strings starting with symbols equal to C_{N-1}. Some of these
strings occur before S_{N-1} and others after.

From #3, we know the number of symbols identical to C_{N-1} that
occur in `L`

before C_{N-1}. So we also know the number
of strings starting with identical symbols that made contributions before
S_{N-1}.

From #4, we know how many symbols there are that are less than
C_{N-1}. So we also know the number of strings starting with lesser
symbols that made contributions before S_{N-1}.

No other strings may contribute to `L`

before S_{N-1},
therefore the contribution from S_{N-1} occurs in the position
obtained by summing the values from #3 and #4. As stated earlier the
contribution from S_{N-1} is C_{N-2}.

Use the same technique to find C_{N-3}, C_{N-4},
…, C_{0}.

**Example:**

Using tables 1 and 2 reverse BWT
where `L = "ssat tt hiies ."`

and `I`

= 14.

We start with:

S_{0} = `???????????????`

We're given that C_{14} is V_{14} = '.'.

S_{0} = `??????????????.`

Table 1 tells us that there are 0 other '.' before V_{14} and Table 2
tells us that there are 3 characters < '.', so C_{14}
must be V_{0 + 3} = V_{3} = 't'.

S_{0} = `?????????????t.`

Table 1 tells us that there are 0 other 't' before V_{3} and Table 2
tells us that there are 12 characters < 't', so C_{13}
must be V_{0 + 12} = V_{12} = 's'.

S_{0} = `????????????st.`

Table 1 tells us that there are 2 other 's' before V_{12} and Table 2
tells us that there are 9 characters < 's', so C_{12}
must be V_{9 + 2} = V_{11} = 'e'.

S_{0} = `???????????est.`

Table 1 tells us that there are 0 other 'e' before V_{11} and Table 2
tells us that there are 5 characters < 'e', so C_{11}
must be V_{0 + 5} = V_{5} = 't'.

S_{0} = `??????????test.`

Table 1 tells us that there is 1 other 't' before V_{5} and Table 2
tells us that there are 12 characters < 't', so C_{10}
must be V_{1 + 12} = V_{13} = ' '.

S_{0} = `????????? test.`

Table 1 tells us that there is 2 other ' ' before V_{13} and Table 2
tells us that there are 0 characters < ' ', so C_{9}
must be V_{2 + 0} = V_{2} = 'a'.

S_{0} = `????????a test.`

…

The rest is left as an exercise to the student. (I always loved that statement)

### So What Do I Do with BWT?

Now that you understand BWT, you can publish web pages about it. Okay, that's not what you're looking for.

As I stated earlier BWT tends to transform blocks of symbols into blocks with more runs. This makes BWT useful for improving compression of run length encoders. The method for compression would be:

- Apply Burrows-Wheeler Transform on data to be encoded.
- Apply run length encoding on data to be encoded.

The method for decompression would be:

- Remove run length encoding on from the encoded data.
- Apply the reverse Burrows-Wheeler Transform on the encoded data.

Typically compression performed by run length encoders are far from optimal. Even after BWT, run length encoders don't perform as well as statistical encoders. BWT doesn't do anything to change the frequency distribution of symbols, so BWT alone won't do anything to make data more compressible by order zero statistical algorithms. That's where Move-To-Front Coding comes into play.

### Move-To-Front Coding

Move-To-Front (MTF) coding is a technique that encodes streams of symbols based on an adapting code. Imagine that symbols are encoded by their position in a list of every symbol in the alphabet. Initially the list is in a lexicographical (or some predetermined) order. Once a symbol in a stream is encoded, that symbol is moved from it's current position to the front of the list and its former predecessors are moved one position back.

**Example:**

Given an alphabet of ' ','.','a','e','h','i','s','t' encode
"ssat tt hiies .":

Renaming Symbols | Symbol List | Encoded Symbols |

ssat tt hiies . | ' ','.','a','e','h','i','s','t' | |

sat tt hiies . | 's',' ','.','a','e','h','i','t' | 6 |

at tt hiies . | 's',' ','.','a','e','h','i','t' | 6,0 |

t tt hiies . | 'a','s',' ','.','e','h','i','t' | 6,0,3 |

tt hiies . | 't','a','s',' ','.','e','h','i' | 6,0,3,7 |

tt hiies . | ' ','t','a','s','.','e','h','i' | 6,0,3,7,3 |

t hiies . | 't',' ','a','s','.','e','h','i' | 6,0,3,7,3,1 |

hiies . | 't',' ','a','s','.','e','h','i' | 6,0,3,7,3,1,0 |

hiies . | ' ','t','a','s','.','e','h','i' | 6,0,3,7,3,1,0,1 |

iies . | 'h',' ','t','a','s','.','e','i' | 6,0,3,7,3,1,0,1,6 |

ies . | 'i','h',' ','t','a','s','.','e' | 6,0,3,7,3,1,0,1,6,7 |

es . | 'i','h',' ','t','a','s','.','e' | 6,0,3,7,3,1,0,1,6,7,0 |

s . | 'e','i','h',' ','t','a','s','.' | 6,0,3,7,3,1,0,1,6,7,0,7 |

. | 's','e','i','h',' ','t','a','.' | 6,0,3,7,3,1,0,1,6,7,0,7,6 |

. | ' ','s','e','i','h','t','a','.' | 6,0,3,7,3,1,0,1,6,7,0,7,6,4 |

'.',' ','s','e','i','h','t','a' | 6,0,3,7,3,1,0,1,6,7,0,7,6,4,7 |

Notice that with MTF coding, the second and successive characters in any run are converted to 0 ("ss" became 6,0, "tt" became 1,0, and "ii" became 7,0). This works well with BWT, since it produces blocks with a lot of runs. You should also notice that there is nothing about MTF coding that requires prior use of BWT.

### Move-To-Front Decoding

Unlike the reverse BWT, Move-To-Front (MTF) decoding was fairly obvious to me. It's very much like the encoding process, but a little less time intensive. This time the position of a symbol in the list of every symbol in the alphabet is used to decode a symbol. Like the encode process, the list starts out in a lexicographical (or some predetermined) order. The encoded data indicates the position of the decoded symbol. After decoding the symbol, move it to the front of the list.

**Example:**

Given an alphabet of ' ','.','a','e','h','i','s','t' decode
decode 6,0,3,7,3,1,0,1,6,7,0,7,6,4,7:

Renaming Symbols | Symbol List | Decoded Symbols |

6,0,3,7,3,1,0,1,6,7,0,7,6,4,7 | ' ','.','a','e','h','i','s','t' | |

0,3,7,3,1,0,1,6,7,0,7,6,4,7 | 's',' ','.','a','e','h','i','t' | s |

3,7,3,1,0,1,6,7,0,7,6,4,7 | 's',' ','.','a','e','h','i','t' | ss |

7,3,1,0,1,6,7,0,7,6,4,7 | 'a','s',' ','.','e','h','i','t' | ssa |

3,1,0,1,6,7,0,7,6,4,7 | 't','a','s',' ','.','e','h','i' | ssat |

1,0,1,6,7,0,7,6,4,7 | ' ','t','a','s','.','e','h','i' | ssat_ |

0,1,6,7,0,7,6,4,7 | 't',' ','a','s','.','e','h','i' | ssat t |

1,6,7,0,7,6,4,7 | 't',' ','a','s','.','e','h','i' | ssat tt |

6,7,0,7,6,4,7 | ' ','t','a','s','.','e','h','i' | ssat tt_ |

7,0,7,6,4,7 | 'h',' ','t','a','s','.','e','i' | ssat tt h |

7,0,7,6,4,7 | 'h',' ','t','a','s','.','e','i' | ssat tt h |

0,7,6,4,7 | 'i','h',' ','t','a','s','.','e' | ssat tt hi |

7,6,4,7 | 'i','h',' ','t','a','s','.','e' | ssat tt hii |

6,4,7 | 'e','i','h',' ','t','a','s','.' | ssat tt hiie |

4,7 | 's','e','i','h',' ','t','a','.' | ssat tt hiies |

7 | ' ','s','e','i','h','t','a','.' | ssat tt hiies_ |

'.',' ','s','e','i','h','t','a' | ssat tt hiies . |

### So What Do I Do with MTF?

Now that you understand MTF, you can publish web pages about it too. Okay, it was a bad joke the first time.

As I stated earlier MTF tends to increase the frequency of low value symbols within a block. This makes MTF useful for improving compression of statistical encoders such as Huffman coding or arithmetic coding.

The method for compression would be:

- Apply Burrows-Wheeler Transform on data to be encoded.
- Apply Move-To-Front encoding on data to be encoded.
- Apply statistical encoding on data to be encoded.

The method for decompression would be:

- Remove statistical from the encoded data.
- Remove Move-To-Front encoding from the encoded data.
- Apply the reverse Burrows-Wheeler Transform on the encoded data.

## Implementation

I was extremely uncreative here. My implementation comes directly from "A Block-sorting Lossless Data Compression Algorithm" by Michael Burrows and David Wheeler. The source code includes comments which reference sections of their paper. My hope is that anybody reading their paper will be able to follow along with my source. This section addresses issues that were either discussed by Burrows and Wheeler or were required to implement their transform on real data.

### Block Size

BWT is a block sorting algorithm, but the transform isn't dependent upon the size of the block. BWT can be used on blocks of any size. When I was debugging my code I used 16 byte blocks, they are currently 4096 bytes. I chose 4096 bytes because it can be handled by most modern computers. The larger the block, the more opportunity you have for runs and repeated patterns. Burrows and Wheeler demonstrate improved compression ratios all the way to 103 megabyte blocks for one benchmark file.

The obvious downsides to large blocks are that blocks must be stored somewhere, so you need enough memory to hold the block and it's rotations. An N byte block will have N rotations. The rotations are also sorted, and sorting is not a linear operation; larger blocks will also take more time.

### End Of Block

In the ideal world any encoded data will be evenly divisible by whatever block size the transform happens to be using. Unfortunately, I don't live in that world. More often than not, the last block is a partial block. There are a few ways to deal with this that I can think of:

- Prefix the encoded data stream with a number of bytes encoded
- Encode an End-of-Block symbol into each block
- Prefix each block with a block size
- Use EOF to indicate end of last block

When I started playing with BWT I thought that I might want to dynamically adjust the block size, so I prefixed each block with its size. It also allowed for some error checking. If the file ended before I read all the data in a block something bad happened. The dynamic block size adjustment never came into being, so the code I'm releasing actually takes the more compact approach of using the EOF as an indicator of a partial block.

In order to handle a partial block at the end of a file, I made one minor
change to the BWT algorithm. Burrows and Wheeler output the last characters
of the sorted rotations (`L`

), followed by the index of the
contribution of the unrotated block (`I`

). If `L`

is
written before `I`

, an attempt to read all the characters of a
full size block will result in reading some or all of `I`

before
the EOF is reached. `L`

can always be adjusted by removing the
data that belongs to `I`

, but there's an easier approach. My
implementation writes `I`

then `L`

. If we hit an EOF
and `L`

is a partial block we don't need to do any magic to get
`I`

; we already have it.

### Block Rotations

Burrows and Wheeler discuss a conceptual N × N matrix containing rotations of N long blocks, but they leave implementation of the concept to the implementer. I'm pretty sure they don't expect anybody to use an actual N × N matrix.

Since rotations are just strings starting at index i in the block and wrapping around until, index (i - 1) mod N, I just refer to my rotations by their starting indices. My matrix of rotations is just an array of induces into the unrotated block.

### Sorting of Rotations

In order to keep things simple, I use the `qsort()`

function
to sort rotations. The efficiency of `qsort()`

really depends
upon your compiler. You'd think it was just a text book quick sort, but it
isn't always. I think gcc uses a merge sort.

If you want to play with different sort functions, my
sort library includes implementations of
common sort algorithms which are plug-in replacements for
`qsort()`

.

It turns out that the sorting of rotations is one of the most time consuming steps in the encode process. Optimizing the sorting process seems to be a pretty hot topic, with several published papers. Burrows and Wheeler suggest optimizing the process by using a radix sort to sort all rotations by their first two characters. Then using quick sort is use to sort within the groups of rotations that match at their first two characters. Versions 0.4 and later of my software introduce this optimization.

It should be pointed out that using a radix sort to sort rotations by their first two characters prior using quick sort isn't always more optimal than using quicksort alone. Suppose an entire block is just one repeated character. All rotations in that block start with the same two characters, so a radix sort of such a block will not reduce the amount of comparisons performed by quick sort.

### MTF Encode/Decode

My implementation of MTF encode/decode uses brute force. Apparently there are efficient implementations of MTF that are well known. Not by me, but I didn't go looking for them either. One possible area to look for optimization is in the search for a symbol's code during the encode process. I do a sequential search. It will find the code for a symbol in a run on the first try. The average search time for other symbols will be on the order of half the alphabet length. Other searches might have better average cases.

Another potential from optimization is the Move-To-Front operation itself. My code is stored in an array, and I move each character over one at a time. Linked lists will make the move faster, but then you can't use the array index as the code.

## Library Code Interface

### Transforming Data

#### BWXform

Declaration:

```
int BWXform(FILE *fpIn, FILE *fpOut, const xform_t method);
```

Description:

This function performs a Burrows-Wheeler transformation on a file (with optional move to front) and writes the resulting data to the specified output file. Comments in this function indicate corresponding variables, labels, and sections in "A Block-sorting Lossless Data Compression Algorithm" by M. Burrows and D.J. Wheeler.

Parameters:

`fpIn`

- The file stream to be transformed. It must non-NULL and opened.

`fpOut`

- The file stream receiving the transformed data. It must non-NULL
and opened.

`method`

- `xform_t`

type value indicating whether indicate
whether or not MTF is used.

Effects:

A Burrows-Wheeler transformation (and possibly move to front
encoding) is applied to `fpIn`

. The results of the
transformation are written to `fpOut`

.

Returned:

0 for success, non-zero for failure. `errno`

will be set
in the event of a failure.

### Reverse Transforming Data

#### BWReverseXform

Declaration:

```
int BWReverseXform(FILE *fpIn, FILE *fpOut,
const xform_t method);
```

Description:

This function reverses a Burrows-Wheeler transformation on a file (with optional move to front) and writes the resulting data to the specified output file. Comments in this function indicate corresponding variables, labels, and sections in "A Block-sorting Lossless Data Compression Algorithm" by M. Burrows and D.J. Wheeler.

Parameters:

`fpIn`

- The file stream to be reverse transformed. It must non-NULL and
opened.

`fpOut`

- The file stream receiving the reverse transformed data. It must
non-NULL and opened.

`method`

- `xform_t`

type value indicating whether indicate
whether or not MTF is used.

Effects:

A Burrows-Wheeler reverse transformation (and possibly move to front
encoding) is applied to `fpIn`

. The results of the
transformation are written to `fpOut`

.

Returned:

0 for success, non-zero for failure. `errno`

will be set
in the event of a failure.

### Portability

All the source code that I have provided is written in strict ANSI C. I would expect it to build correctly on any machine with an ANSI C compiler. I have tested the code compiled with gcc on Linux and mingw on Windows 98/XP.

My implementation defines a value `BLOCK_SIZE`

as 4096. This
value may easily be changed. Increasing `BLOCK_SIZE`

generally
improves the compressibility of transformed data at the expense of memory
and computation time. If `BLOCK_SIZE`

is made larger than
`INT_MAX`

the preprocessor will issue and error which may be
corrected by changing several `int`

s to `long`

s. If
`BLOCK_SIZE`

is made larger than the maximum `size_t`

,
I would expect calls to `fread()`

and `fwrite()`

to
produce errors. Whether the errors are compile-time or run-time is probably
dependent on the compiler.

## Further Information

Michael Burrows and David Wheeler document their algorithm in "A Block-sorting Lossless Data Compression Algorithm". I highly recommend that anybody interested in studying their algorithm make a point of reading this document.

Further information about BWT and other block-sorting algorithms may be found at http://www.datacompression.info/BWT.shtml. The site includes links to discussions of the algorithm, studies, libraries, and tools using BWT.

## Actual Software

I am releasing my implementation of the Burrows-Wheeler Transform and Move-To-Front coding under the GNU LGPL. As I add enhancements or fix bugs, I will post them here as newer versions. The larger the version number, the newer the version. I'm retaining the older versions for historical reasons. I recommend that most people download the newest version unless there is a compelling reason to do otherwise.

Each version is contained in its own zipped archive which includes the source files and brief instructions for building an executable. None of the archives contain executable programs. A copy of the archives may be obtained by clicking on the links below.

Version | Comment |

Version 0.6 | Upgrade to latest optlist library |

Written with tighter adherence to Michael Barr's Top 10 Bug-Killing Coding Standard Rules. | |

Changed the API so that encode and decode routines accept opened file streams instead of file names. | |

Version 0.5 | Replaces getopt with my optlist library. |

Explicitly licensed the library under LGPL version 3. | |

Version 0.4 | Uses faster sorting algorithm discussed in "A Block-sorting Lossless Data Compression Algorithm". A radix sort is used to sort rotations by their first two characters before quicksort is used to further sort rotations. |

Version 0.3 | Blocks are allocated on the heap instead of the stack. This change enabled gcc to use larger blocks. |

Version 0.2 | Comments match code with Burrows-Wheeler paper |

Version 0.1 | Initial release. |

If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipper@alumni.engr.ucsb.edu