Szymon Grabowski

Text Preprocessing for Burrows-Wheeler Block Sorting Compression

7th Conference "Networks and Information Systems - Theory, Projects, Implementations", Lodz, October 1999

Department of Applied Computer Science, Technical University of Lodz

Abstract

The Burrows-Wheeler block sorting algorithm is one of the most attractive compression methods. In this paper we present several ideas that attempt to improve text compression while retaining speed. All of them are applied before the actual coding, in an independent preprocessing phase, which leaves the choice of the coder flexible. Experiments with bzip, a well-known block sorting implementation, indicate compression gains of 2-4% on typical English texts.

1. Introduction

The Burrows-Wheeler transform (BWT) compression algorithm is a relatively new method, especially appropriate for text and other data with stable statistics. Although only five years have passed since the method was presented, several outstanding implementations have already been released. In this paper we refer to the canonical post-BWT coding described in [1]; it should be mentioned, however, that this is not the only possible approach.

The BWT takes all rotations of a given block of n bytes and sorts them lexicographically. The sorting groups similar contexts together, and the last byte of each rotation forms the transform output. The reversibility of the transform is shown, e.g., in [1, 3]. It is noteworthy that in most implementations the contexts consist of the bytes following the predicted character rather than the more traditional preceding bytes.

To compress the resulting data, the output of the transform is run through a Move-To-Front encoder and finally submitted to an arithmetic or Huffman coder. Move-To-Front (MTF) coding outputs the rank of each byte in the sequence of its predecessors. A careful study of modelling and coding the output of the BWT was presented in [3, 4].

Although conceptually simple, the BWT-based compression algorithm (also called a block sorting coder, or BS for short) is not easy to improve in general terms, that is, without strong assumptions about the input data. Its advantage over the more traditional Prediction by Partial Matching (PPM) coders lies in the seamless utilization of many similar contexts of different orders; the price, however, is the lack of knowledge of the actual contexts. What hurts most in block sorting coding is rapid changes in successive contexts, which result in high and costly MTF ranks.

To relieve rapid context changes, contexts with similar probability distributions should be placed close together. Ways to find appropriate alphabet orders for given data were proposed in [2], where a hard-coded heuristic order chosen for English text was also presented. We use that heuristic order, in a slightly modified form, as one of our enhancements. In this paper we also present another technique for removing certain rapid context changes introduced by line breaks and punctuation marks in typical text. Converting capitals to lower case is tested as well; it results in grouping intrinsically similar but formally different (and thus, after BWT, distant) contexts. Yet another idea examined here is replacing frequent phrases in the given text with single characters outside the range of the actually used characters. Finally, we briefly discuss the chances of successful coding of End-Of-Line characters.
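To make the role of MTF ranks concrete, below is a minimal Move-To-Front encoder sketch in Python. It is an illustration only, not the coder used in bzip or IMP, and the plain list-based search is chosen for clarity rather than speed.

    def mtf_encode(data):
        # Keep a list of all 256 byte values, most recently seen first.
        table = list(range(256))
        ranks = []
        for b in data:
            r = table.index(b)      # rank of b among its predecessors
            ranks.append(r)
            table.pop(r)            # move b to the front of the list
            table.insert(0, b)
        return ranks

    # After the BWT, similar contexts are adjacent, so the same bytes
    # recur and produce long runs of low ranks, e.g.:
    # mtf_encode(b"ttttTtt") -> [116, 0, 0, 0, 85, 1, 0]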
2. Capital conversion

Capital letters preceding actual contexts usually result in high MTF ranks, while their lower case equivalents would very often yield MTF ranks of 0. For example, the (following) context "he day..." predicts t's as well as T's. We found it beneficial to replace such capitals with the equivalent lower case letter preceded by a unique flag. To improve compression, the conversion is omitted if the next character is not a lower case letter [6]. Example:

    The Title --> #the #title    but    THE TITLE --> THE TITLE    (# - a flag)

Why does it work? Instead of a single, rather high MTF rank, we usually obtain two lower ranks, one of them often being zero. Below is an example. Table 1 refers to the original text; the current context is shown in the last column and '_' denotes a blank space.

Table 1
+--------------+--------------+------------+
| the current  | current      | following  |
| character's  | (predicted)  | context    |
| predecessor  | character    |            |
+--------------+--------------+------------+
| _            | T            | hen h...   |
| _            | W            | hen i...   |
| _            | w            | hen i...   |
| _            | w            | hen i...   |
| _            | t            | hen i...   |
| _            | t            | hen s...   |
| _            | t            | hen s...   |
| _            | T            | hen t...   |
+--------------+--------------+------------+

The MTF rank for 'T' in the last row is 3. After the conversion, instead of "Then t..." we obtain "#then t...", '#' being the flag (see Table 2).

Table 2
+-----------+------------+    +-----------+------------+
| current   | context    |    | current   | context    |
| character |            |    | character |            |
+-----------+------------+    +-----------+------------+
| t         | hen h...   |    | #         | then h...  |
| t         | hen i...   |    | _         | then i...  |
| t         | hen s...   |    | _         | then s...  |
| t         | hen s...   |    | _         | then s...  |
| t         | hen t...   |    | #         | then t...  |
+-----------+------------+    +-----------+------------+

The resulting MTF ranks are 0 (for 't' on the left) and 1 (for '#' on the right). According to Fenwick [3], MTF codes of 0, 1 and 3 cost, for English texts, about 0.5, 3 and almost 5 bits per symbol, respectively. It is thus clear that we can expect a gain in the presented case.

We found that following the flag with a blank space [9] is even slightly better overall, though the improvement is inconsistent: for example, it is better on book1 and book2 but worse on bib (all files from the Calgary Corpus). We decided to use the extended idea for the tests in Section 7. To the best of our knowledge, the described capital conversion idea has never been presented before.
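The conversion can be implemented as a single linear scan over the block. The sketch below is our illustration of the basic variant; the flag value '#' is an assumption (any byte unused in the data would do), and the extended variant from [9] would additionally emit a blank after the flag.

    FLAG = ord('#')   # assumed flag byte; any symbol unused in the data would do

    def capital_conversion(data):
        out = bytearray()
        for i, c in enumerate(data):
            nxt = data[i + 1] if i + 1 < len(data) else 0
            # Convert only if a capital letter is followed by a lower case
            # one, so words in all capitals (THE TITLE) stay untouched.
            if 65 <= c <= 90 and 97 <= nxt <= 122:
                out.append(FLAG)
                out.append(c + 32)    # the lower case equivalent
            else:
                out.append(c)
        return bytes(out)

    # capital_conversion(b"The Title") == b"#the #title"
    # capital_conversion(b"THE TITLE") == b"THE TITLE"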
3. Space stuffing

Most text files have End-Of-Line characters (EOLs) followed by a letter. An EOL character (or the latter of the two in the CarriageReturn/LineFeed text convention) is not predicted well by its context. To alleviate this, in the preprocessing phase we expand the data by adding a blank space at the beginning of each line. The added blank is predicted very well by a context starting with a full word, and the character distributions for contexts starting with a letter are no longer adversely affected by the EOL character. In other words, it is better to sacrifice the prediction of contexts starting with a blank space in order to improve the prediction of contexts starting with almost any other character.

An additional small improvement is to refrain from adding spaces after EOLs if the next character is neither a letter nor a space [5]. This can help, e.g., on program sources with many lines starting with a tab character. Some texts have almost all lines starting with one or more spaces (e.g., plays, poems). A careful implementation could detect such cases and then refrain from adding spaces after EOLs for the whole file/block.
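A sketch of the space stuffing filter follows (again our illustration, assuming LF-terminated lines; for CR/LF text the test would apply to the LF byte). The filter is invertible: since any original line starting with a blank also gets stuffed, the decoder simply removes one blank after every EOL that is followed by a blank.

    def space_stuffing(data):
        out = bytearray()
        for i, c in enumerate(data):
            out.append(c)
            if c == 0x0A:   # LF, i.e. the (latter) EOL byte
                nxt = data[i + 1] if i + 1 < len(data) else 0
                # Stuff a blank only if the new line starts with a letter
                # or a space; lines starting with tabs etc. are left alone.
                if 65 <= nxt <= 90 or 97 <= nxt <= 122 or nxt == 0x20:
                    out.append(0x20)
        return bytes(out)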
4. Phrase substitution

Replacing frequent phrases in the original data with new symbols from an augmented alphabet before the actual compression is a known idea [8]; we are, however, not aware of any application of this concept to block sorting compression. Teahan also reports increased memory requirements for PPM after phrase substitution, due to the expanded alphabet size. This should not be a problem for BS, since most sorting algorithms for the BWT require memory resources proportional to the block size and not dependent on its contents, as long as a single symbol in the augmented alphabet occupies no more bits than in the original alphabet.

Plain ASCII text uses 7 bits only, so the natural choice was not to exceed 256 symbols in the expanded alphabet. Possible collisions (for non-"pure ASCII" data) must be handled by data expansion. In a practical implementation for general usage, data type detection should be performed before applying phrase replacement. For example, English text might be detected by finding a reasonable number of occurrences of the word "the" in the given file/block; it is also advisable to estimate the rate of non-ASCII characters in the file/block. If it does not exceed, e.g., 5 per cent, we can be fairly confident that phrase replacement will not hurt much.

The set of phrases may be static (hard coded) or selected dynamically for the given data. From a practical point of view, dynamic phrase substitution offers little for its increased complexity and computational cost, so we chose to test static phrase substitution only. Our phrase set was selected for the English language, since it is estimated that 80% of all texts on the Internet are in English [8].

The advantage of phrase substitution in block sorting compression comes from removing certain runs of zeroes in the MTF values, as well as from fitting more actual data into one block. The negative effect is context decorrelation. For example, the string "tha..." might be placed far from the string "the..." if "the" belongs to the set of substituted phrases. This undesirable phenomenon can be mitigated by proper alphabet reordering (see Section 5).

It is not easy to give a rule for finding the most appropriate phrases to replace. One simple and rather obvious remark is that some frequent phrases may not really be good candidates. For example, selecting the phrase "ther" (a prefix of the word "there") seems inappropriate, while the phrase "the" is a natural choice. Our set was found experimentally. We limited the set to phrases of length two (digrams), three (trigrams) and four. The text was scanned three times: in the first pass the phrases of length 4 were substituted, while in the last pass the digrams were replaced. Lookup tables enabled the replacement to be performed very fast. All phrases consisted of lower case letters only, and the 4-character phrases had unique 3-character prefixes, so the memory requirements for the lookup tables were modest. Our set of phrases is presented below (Table 3). It consists of 9 quadgrams, 26 trigrams and 88 digrams.

Table 3
+---------+------------------+------------------------------------------+
| 4-char  | trigrams         | digrams                                  |
| phrases |                  |                                          |
+---------+------------------+------------------------------------------+
| that    | all, and, but,   | ac, ad, ai, al, am, an, ar, as, at,      |
| said    | dow, for, had,   | ea, ec, ed, ee, el, en, er, es, et,      |
| with    | hav, her, him,   | id, ie, ig, il, in, io, is, it,          |
| have    | his, man, mor,   | of, ol, on, oo, or, os, ou, ow,          |
| this    | not, now, one,   | ul, un, ur, us, ba, be, ca, ce, co, ch,  |
| from    | she, the, was,   | de, di, ge, gh, ha, he, hi, ho,          |
| whic    | wer, whi, whe,   | ra, re, ri, ro, rs, la, le, li, lo, ld,  |
| were    | wit, you, any,   | ll, ly, se, si, so, sh, ss, st, ma, me,  |
| tion    | are              | mi, ne, nc, nd, ng, nt, pa, pe,          |
|         |                  | ta, te, ti, to, th, tr, wa, ve           |
+---------+------------------+------------------------------------------+

Two digrams from the selected set, "th" and "on", were split, i.e., each of them was substituted with one of two symbols depending on whether the character preceding the phrase was a blank or not. Altogether, the substitution introduced 9+26+88+2=125 new symbols.
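A sketch of the three-pass substitution is given below. It is illustrative only: just a few phrases from Table 3 are listed, the code points >= 128 assigned to the new symbols are our assumption (the paper fixes only that the expanded alphabet must not exceed 256 symbols), and bytes.replace stands in for the faster lookup-table scan described above. The split digrams "th" and "on" need two codes each, chosen on the preceding byte, which a plain replace cannot express; a real implementation would scan byte by byte with the lookup tables.

    # Longest phrases are substituted first, one pass per phrase length.
    QUADGRAMS = [b"that", b"said", b"with", b"have"]  # 9 phrases in Table 3
    TRIGRAMS  = [b"and", b"the", b"was", b"her"]      # 26 phrases in Table 3
    DIGRAMS   = [b"th", b"he", b"in", b"er"]          # 88 phrases in Table 3

    def substitute(data):
        code = 128    # assumed first free symbol: plain ASCII is 7-bit
        for group in (QUADGRAMS, TRIGRAMS, DIGRAMS):
            for phrase in group:
                data = data.replace(phrase, bytes([code]))
                code += 1
        return data

    # substitute(b"that man said that he was there") turns each
    # occurrence of "that", "said", "the", "he", ... into a single
    # symbol outside the 7-bit ASCII range.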
5. Alphabet reordering

The degree of similarity of neighboring contexts depends on the sorting order. Any alphabet order is allowed as long as both the coder and the decoder know it, and it turns out that the usual ASCII order is not the best choice. More information on finding a better alphabet order, also dynamically for the given data, can be found in [2], to the best of our knowledge the first and so far only study of alphabet reordering for the BWT. In our tests we use a modified version of the heuristic order for the English language taken from that paper. The adaptive strategies presented there yield less consistent (and, overall, not really greater) improvements, and their usage is limited by the long search time.

The heuristic order from the cited paper groups vowels, similar consonants, and punctuation. For the lower case letters the order is "aeioubcdgfhrlsmnpqjktwvxyz", and it is analogous for the upper case. The other groupings are "?!" and "+-,.". The order performs well on text and, since it is similar to the original ASCII order, it usually does not hurt much on non-text data. Our modification intermingles the digrams chosen for the phrase substitution with the original symbols (e.g., the symbol for "th" is put close to 't'). We additionally group "special characters" (punctuation marks, EOL characters, space, the capital flag, etc.).

An idea intrinsically similar to grouping punctuation, EOLs, etc. together is adding spaces before each of several chosen punctuation marks; this is clearly an extension of the idea of adding spaces after EOLs from Section 3. The key observation is that it is a widely respected convention in typed text not to precede a dot or a comma with a blank space; instead, they are followed by one. The convention is not equally strict for other punctuation marks, but it nevertheless seems reasonable to precede some of them with a blank as well. Which of those two almost equivalent concepts is better? We chose symbol grouping and gave up adding spaces before punctuation, in order to fit more actual data into the BS blocks and to keep the preprocessing slightly faster.

6. EOL coding

End-Of-Line characters are hard to predict. If we had spaces instead of EOLs, most text files would be noticeably more compressible. Hence the idea of replacing EOLs with spaces and grouping the EOL information together, represented as line lengths in words (more precisely, each EOL is represented as the number of blank spaces since the previous EOL) [7]. Then all the data (the "real" data and the EOL information) are submitted together to the BWT. This idea has an unstable influence: on some files (e.g., book1 from the Calgary Corpus) it gives a noticeable improvement, on others it hurts.

The following order-1 entropy heuristic was tested. The sum of the sizes of the order-1 arithmetically compressed actual data (with EOLs replaced by spaces) and the order-1 arithmetically compressed EOL information was estimated and compared to the size of the original data compressed with an order-1 arithmetic model. If and only if the sum of the compressed sizes was lower, the described EOL coding and reordering was performed. The heuristic gave limited success. We decided to stop exploring the EOL coding idea, but it looks promising for further research. For example, the EOLs could be encoded separately, e.g. with an order-1 arithmetic coder, which seems to give better results (in this work, however, we limited ourselves to mere preprocessing, so that solution is out of our scope). Another interesting question concerns the proper representation of the EOL information. In many text files the line lengths (in characters) are similar (i.e., fixed left and right margins are kept); this regularity could be exploited.

We would like to mention that our result on book1, with all the ideas including the EOL coding applied and all data submitted to one BWT block, was about 220300 bytes with bzip. This is, as can be seen below, an additional gain of over 2000 bytes. Book1 was, however, the only file from the Calgary Corpus for which the EOL coding was helpful.
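For concreteness, here is one possible realization of the EOL separation in the spirit of [7]. The function name and the details (LF-terminated lines, whether the substituted blank is counted) are our assumptions; both resulting streams would then go into one BWT block.

    def split_eols(data):
        text = bytearray()
        counts = []        # for each EOL: blanks seen since the previous EOL
        blanks = 0
        for c in data:
            if c == 0x0A:              # LF; CR/LF text assumed normalized to LF
                counts.append(blanks)
                text.append(0x20)      # the EOL itself becomes a blank
                blanks = 0
            else:
                if c == 0x20:
                    blanks += 1
                text.append(c)
        return bytes(text), counts

    # split_eols(b"one two\nthree four five\n")
    # -> (b"one two three four five ", [1, 2])
    #    i.e. roughly the line lengths measured in words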
7. Test results

The presented filters are applied in the following order (a sketch of the composed pipeline follows Table 5):

- capital conversion (to increase the number of phrase substitutions in the next step);
- phrase replacement;
- alphabet reordering (the new symbols for phrases are intermingled with the original symbols);
- space stuffing (with respect to the new alphabet order) if EOL coding is to be omitted; otherwise
- EOL coding, if the order-1 entropy check heuristic suggests it; if the decision is negative, space stuffing (i.e., adding spaces after EOLs) may still be used.

We did not use the EOL coding in our tests.

In Table 4 we present the results on 14 text files from the Calgary Corpus obtained with bzip and IMP. Bzip, by Julian Seward, is a well-known block sorting compressor based on Fenwick's research in this area; it is available with sources [10]. IMP, from Technelysium Pty Ltd, is a very fast and relatively new implementation of block sorting compression [11]. The main difference between them is that bzip uses an arithmetic coder while IMP uses a Huffman coder. Each of the tested files from the two corpora below fitted into a single bzip or IMP block.

Table 4
+---------+-----------+------------+------+-----------+------------+------+
| file    | bzip      | bzip       | gain | IMP -2    | IMP -2     | gain |
|         | (original | (after     | (%)  | (original | (after     | (%)  |
|         | data)     | filtering) |      | data)     | filtering) |      |
+---------+-----------+------------+------+-----------+------------+------+
| bib     |    27097  |    26839   | 0.95 |    27549  |    27264   | 1.03 |
| book1   |   230247  |   222567   | 3.34 |   232103  |   223962   | 3.51 |
| book2   |   155944  |   150291   | 3.63 |   157155  |   150933   | 3.96 |
| news    |   118112  |   114927   | 2.70 |   118598  |   115009   | 3.03 |
| paper1  |    16360  |    15772   | 3.59 |    16645  |    16073   | 3.44 |
| paper2  |    24826  |    23839   | 3.98 |    25203  |    24152   | 4.17 |
| paper3  |    15704  |    15194   | 3.25 |    16005  |    15481   | 3.27 |
| paper4  |     5072  |     4881   | 3.77 |     5291  |     5089   | 3.82 |
| paper5  |     4704  |     4570   | 2.85 |     4934  |     4760   | 3.53 |
| paper6  |    12132  |    11666   | 3.84 |    12480  |    11981   | 4.00 |
| progc   |    12379  |    12086   | 2.37 |    12668  |    12377   | 2.30 |
| progl   |    15387  |    15023   | 2.37 |    15774  |    15435   | 2.15 |
| progp   |    10533  |    10376   | 1.49 |    10855  |    10708   | 1.35 |
| trans   |    17488  |    17142   | 1.98 |    17801  |    17499   | 1.70 |
+---------+-----------+------------+------+-----------+------------+------+
| total   |   665985  |   645173   | 3.12 |   673061  |   650723   | 3.32 |
| avg.    |           |            | 2.86 |           |            | 2.95 |
+---------+-----------+------------+------+-----------+------------+------+

As can be seen, compression was never hurt on the tested files. The overall gain is about 3 per cent, a little greater for IMP. On plain texts the improvement for IMP was usually greater (e.g., book1, book2, news); on more structured texts (e.g., progc, progp, progl, trans) the opposite tended to be true. In other words, IMP was more sensitive to text preprocessing. With both compressors, more gain was obtained for plain, human-language files than for more structured, "artificial" files.

Since the presented filtering was tuned to the Calgary Corpus files, with slightly more attention to plain text (book1) than to other text types, it is interesting to evaluate the performance of the filtering on another set of files. We chose the Canterbury Corpus (Table 5). Again, we skip the files containing characters with ASCII codes > 127.

Table 5
+----------+-----------+------------+------+-----------+------------+------+
| file     | bzip      | bzip       | gain | IMP -2    | IMP -2     | gain |
|          | (original | (after     | (%)  | (original | (after     | (%)  |
|          | data)     | filtering) |      | data)     | filtering) |      |
+----------+-----------+------------+------+-----------+------------+------+
| alice29  |    42823  |    41159   | 3.89 |    43321  |    41601   | 3.97 |
| fields   |     2911  |     2829   | 2.82 |     3114  |     3076   | 1.22 |
| lcet10   |   106629  |   102394   | 3.97 |   107208  |   102711   | 4.19 |
| plrabn12 |   144223  |   141491   | 1.89 |   145467  |   142102   | 2.31 |
| cp       |     7517  |     7350   | 2.22 |     7750  |     7577   | 2.23 |
| grammar  |     1184  |     1161   | 1.94 |     1348  |     1338   | 0.74 |
| xargs.1  |     1655  |     1595   | 3.63 |     1819  |     1769   | 2.75 |
| asyoulik |    39315  |    38454   | 2.19 |    39823  |    38857   | 2.43 |
+----------+-----------+------------+------+-----------+------------+------+
| total    |   346257  |   336433   | 2.84 |   349850  |   339031   | 3.09 |
| avg.     |           |            | 2.82 |           |            | 2.48 |
+----------+-----------+------------+------+-----------+------------+------+

As before, most gain is achieved on plain texts (lcet10, alice29).
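As announced above, here is a sketch of the whole pipeline in the listed order, with EOL coding omitted as in the reported tests. It assumes the illustrative filters sketched in the previous sections are in scope and that a 256-byte permutation table perm encoding the reordering of Section 5 is given.

    def reorder_alphabet(data, perm):
        # perm is a 256-byte table: perm[old_code] = new_code.
        return data.translate(perm)   # bytes.translate remaps every byte

    def preprocess(data, perm):
        data = capital_conversion(data)       # Section 2
        data = substitute(data)               # Section 4
        data = reorder_alphabet(data, perm)   # Section 5
        data = space_stuffing(data)           # Section 3, see note below
        return data

    # Note: after remapping, the space stuffing test must use the
    # remapped EOL, blank and letter codes; the Section 3 sketch assumes
    # identity codes, so a real pipeline would parameterize it.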
8. Conclusions

The appeal of BWT-based compression comes from its speed, comparable to that of typical LZ coders, combined with superior compression rates for text data, only several per cent worse than those achieved by the best text compressors existing today, RKIVE by Malcolm Taylor [12] and BOA by Ian Sutton [13].

We showed three ways to further improve the text compression of blocksorters. "Context alleviation" (space stuffing and capital conversion) and phrase substitution are of more general use: the former also helps PPM, and the latter helps both PPM and, above all, LZ schemes. Alphabet reordering, which we used for an additional slight gain, is a BWT-specific technique, although we noticed a slight improvement also for the associative coder by George Buyanovsky (ACB), a very interesting algorithm having much in common with blocksorters [14].

The overall gain for bzip and for IMP, two notable block sorting implementations, was about 3 per cent, which is a significant step toward the best known compression algorithms. Our preprocessing was fast, and the overall compression speed was almost unaffected. In special cases the speed could even increase, since phrase substitution outputs a noticeably smaller volume of data for further coding. A too large set of substituted phrases may hurt compression while making it faster at the same time; this clearly raises a speed/compression trade-off. We would also like to stress that the considered ideas are independent of the block sorting coder used, in the sense that we may treat the actual coder as a black box. Hence their simplicity and ease of application.

The main difficulty with our ideas is that they are not general. Context alleviation helps with most human languages, but not necessarily with other structured texts such as computer programs in various languages. The alphabet order was optimized for English, although, on the positive side, some key observations supporting the chosen symbol permutation are common to many languages. Phrase replacement is also language specific. Moreover, the most frequent phrases of different texts in one language may vary significantly. On the other hand, shorter phrases, particularly digrams, are more or less equally common within one language. Our experiments showed that replacing phrases longer than 3 characters does not give much additional benefit, so we restricted ourselves to digrams, trigrams and a few 4-character phrases only; this also enabled an easy and fast implementation of the idea.

In a practical application, for each of the ideas lacking robustness it is recommended to perform a preliminary scan with a quick heuristic that tries to determine whether the idea in question is applicable. Several simple hints were given in the appropriate parts of this paper, but they were not implemented. Our purpose was to demonstrate that the presented simple ideas often give significant benefits for compression and are worth further exploration, rather than to deal with heuristics for handling special cases. The question about the range of applicability of those ideas therefore remains for further research.

Acknowledgements

The author would like to thank Malcolm Taylor, Uwe Herklotz, George Lyapko, Ian Sutton and Vadim Yoockin for valuable suggestions, ideas and interesting discussions.

References

[ 1] M. Burrows and D. J. Wheeler, A Block-sorting Lossless Data Compression Algorithm, SRC Research Report 124, Digital Systems Research Center, Palo Alto, CA, May 1994. Also available at http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-124.html.

[ 2] B. Chapin and S. R. Tate, Higher Compression from the Burrows-Wheeler Transform by Modified Sorting, University of North Texas, Department of Computer Science, 1998. Also available at http://www.cs.unt.edu/~srt/papers/bwtsort.pdf.
[ 3] P. Fenwick, Experiments with a Block Sorting Text Compression Algorithm, The University of Auckland, Department of Computer Science, Tech. Rep. 111, May 1995. Also available at ftp://ftp.cs.auckland.ac.nz/out/peter-f/TechRep111.ps.

[ 4] P. Fenwick, Block Sorting Text Compression - Final Report, The University of Auckland, Department of Computer Science, Tech. Rep. 130, April 1996. Also available at ftp://ftp.cs.auckland.ac.nz/out/peter-f/TechRep130.ps.

[ 5] U. Herklotz, private correspondence, 1999.

[ 6] G. Lyapko, private correspondence, 1999.

[ 7] M. Taylor, private correspondence, 1998.

[ 8] B. Teahan, Modelling English Text, PhD thesis, Department of Computer Science, The University of Waikato, Hamilton, New Zealand, May 1998.

[ 9] V. Yoockin, private correspondence, 1999.

[10] Bzip, version 0.21, sources available at ftp://ftp.elf.stuba.sk/pub/pc/pack/bzip021.tgz or ftp://ftp.elf.stuba.sk/pub/pc/pack/bzip021.zip.

[11] IMP, version 1.1, available at http://www.technelysium.com.au/.

[12] RKIVE, version 1.92 beta 1, available at http://www.geocities.com/SiliconValley/Peaks/9463/.

[13] BOA, version 0.58 beta, available at ftp://ftp.elf.stuba.sk/pub/pc/pack/boa058.zip.

[14] ACB, version 2.00c, available at ftp://ftp.elf.stuba.sk/pub/pc/pack/acb_200c.zip.