EDWARD R. FIALA and DANIEL H. GREENE

     1. INTRODUCTION
     2. OVERVIEW OF THE DATA STRUCTURE
     3. THEORETICAL CONSIDERATIONS
     4. A SIMPLER DATA STRUCTURE
     5. A MORE POWERFUL ENCODING
     6. A FASTER COMPRESSOR
     7. IMPROVING THE COMPRESSION RATIO
     8. EMPIRICAL STUDIES
     9. CONCLUSIONS
     REFERENCES
     ABOUT THE AUTHORS


ABSTRACT:
   Several  methods  are  presented  for  adaptive,   invertible    data
compression  in  the  style  of  Lempel's  and  Ziv's   first    textual
substitution proposal.  For the first two methods, the article describes
modifications of McCreight's suffix tree  data  structure  that  support
cyclic maintenance of a window on the most recent source  characters.  A
percolating update is used to keep node positions within the window, and
the updating process is shown to have  constant  amortized  cost.  Other
methods explore the tradeoffs between compression time, expansion  time,
data structure size, and amount of  compression  achieved.  The  article
includes a graph-theoretic analysis of the compression penalty  incurred
by our codeword selection policy in comparison with an  optimal  policy,
and it includes empirical studies of the performance of various adaptive
compressors from the literature.

                            1. INTRODUCTION

   Compression is the coding of data to minimize its representation.  In
this article, we are concerned with fast. one-pass, adaptive, invertible
(or lossless) methods  of  digital  compression  which  have  reasonable
memory requirements.  Such methods can be used, for example,  to  reduce
the storage requirements for files, to increase the  communication  rate
over a channel, or to reduce redundancy prior to encryption for  greater
security.
   By "adaptive" we mean that a  compression  method  should  be  widely
applicable to different kinds of source data.  Ideally, it should  adapt
rapidly to the source to achieve significant compression on small files,
and it should adapt to any subsequent internal changes in the nature  of
the source.  In  addition,  it  should  achieve  very  high  compression
asymptotically on large regions with stationary statistics.
   All  the  compression  methods  developed  in  this    article    are
substitutional.  Typically, a  substitutional  compressor  functions  by
replacing large blocks  of  text  with  shorter  references  to  earlier
occurrences of identical text [3, 5, 29, 34, 36, 39,  41-43).  (This  is
often called Ziv-Lempel compression, in recognition of their  pioneering
ideas.  Ziv and Lempel, in fact, proposed two methods.  The  unqualified
use of the phrase  "Ziv-Lempel  compression"  usually  refers  to  their
second proposal [43].  In this article, we will be  primarily  concerned
with  their  first  proposal  [42].)  A  popular  alternative    to    a
substitutional compressor is  a  statistical  compressor.  A  symbolwise
statistical  compressor  functions  by   accurately    predicting    the
probability of individual symbols, and then encoding these symbols  with
space close to -log2 of the predicted  probabilities.  The  encoding  is
accomplished with either Huffman compression  {17]  which  has  recently
been made one-pass and adaptive [II, 22, 37], or with arithmetic coding,
as described in [1, 14, 20, 25. 26, 31-33].  The major  challenge  of  a
Statistical compressor is to predict the  symbol  probabilities.  Simple
strategies, such as keeping zero-order (single  symbol)  or  first-order
(symbol pair) statistics of the input, do not compress English text very
well.  Several  authors  have  had  success    gathering    higher-order
statistics, but  this  necessarily  involves  higher  memory  costs  and
additional mechanisms for dealing  with  situations  where  higher-order
statistics are not available [6. 7. 26.]
   It is hard to give a rigorous foundation to  the  substitutional  vs.
statistical distinction described above.  Several authors have  observed
that statistical methods can be used to simulate  textual  substitution,
suggesting tnat the statistical  category  includes  the  substitutional
category [4, 24].  However, this takes no account of the  simplicity  of
mechanism: the virtue of textual substitution is that it recognizes  and
removes coherence on a large  scale,  oftentimes  ignoring  the  smaller
scale statistics.  As a result, most  textual  substitution  compressors
process their compressed representation  in  larger  blocks  than  their
statistical counterparts, thereby gaining a significant speed advantage.
It was previously believed that the speed gained by textual substitution
would necessarily  cost  something  in  compression  achieved.  We  were
surprised to discover that with careful  attention  to  coding,  textual
substitution compressors can match the compression  performance  of  the
best statistical methods.
   Consider the following scheme, which we will  improve  later  in  the
article. Compressed files contain two types of codewords:

 literal x   pass the next x characters directly into the uncompressed
             output

 copy x, -y  go back y characters in the output and copy x characters
             forward to the current position.

So, for example, the following piece of literature:

 IT WAS THE BEST OF TIMES,
 IT WAS THE WORST OF TIMES

would compress to

 (literal 26)IT WAS THE BEST OF TIMES,
 (copy 11-26)(literal 3)WOR(copy 11-27)

   The compression achieved depends on the space required for  the  copy
and literal codewords. Our simplest scheme, hereafter denoted A1, uses 8
bits for a literal codeword and 16 for a copy codeword.  If the first  4
bits are 0, then the codeword is a literal; the next  4  bits  encode  a
length x in the range [1..16] and the following x characters are literal
(one byte per character). Otherwise, the codeword is a copy; the first 4
bits encode a length x in the range [2..16] and the next 12 bits  are  a
displacement y in the range [1..4096]. At each step, the policy by which
the compressor chooses between a literal and a copy is  as  follows:  If
the compressor is idle (just finished a copy, or  terminated  a  literal
because of the 16-character limit), then the longest copy of length 2 or
more is issued; otherwise, if the longest copy is less than  2  long,  a
literal  is  started.  Once  started,  a  literal  is  extended   across
subsequent characters until a copy of length 3 or more can be issued  or
until the length limit is reached.
   A1 would break the first  literal  in  the  above  example  into  two
literals and compress the source from 51 bytes down to 36. A1  is  close
to Ziv and  Lempel's  first  textual  substitution  proposal  [42].  One
difference is that A1 uses a separate literal codeword,  while  Ziv  and
Lempel combine each copy codeword with a single  literal  character.  We
have found  it  useful  to  have  longer  literals  during  the  startup
transient; after the startup, it is better to have no literals consuming
space in the copy codewords.
   Our empirical studies showed that, for source code and English  text.
the field size choices for A1 are good; reducing the size of the literal
length field by 1 bit increases compression slightly but  gives  up  the
bytealignment property of the A1 codewords.  In short, if one desires  a
simple method based upon the copy and literal idea, A1 is a good choice,
   A1 was designed for 8-bit per character text or program sources, but,
as we will see shortly, it achieves good compression on other  kinds  of
source data, such as compiled code and images, where the word model does
not match the source data particularly well, or where no  model  of  the
source is easily perceived.  A1 is, in fact, an  excellent  approach  to
general purpose data compression.  In the remainder of this article,  we
will study A1 and several more powerful variations.

                   2. OVERVIEW OF THE DATA STRUCTURE

   The fixed window suffix tree of this article  is  a  modification  of
McCreight's suffix tree [23] (see also [21, 34, 381).  which is itself a
modification of Morrison's PATRICIA tree (30), and  Morrison's  tree  is
ultimately based on a Trie data structure [22, page 481]. We will review
each of these data structures briefly.
   A Trie is a tree structure where the branching  occurs  according  to
"digits" of the keys, rather than according to comparisons of the  keys.
In English, for  example,  the  most  natural  "digits"  are  individual
letters, with the l'th level of the tree branching according to the l'th
letter of the words in the tree.
   In Figure 1, many internal nodes are  superfluous,  having  only  one
descendant. If we are building an index for a file, we can save space by
eliminating the superfluous nodes and putting pointers to the file  into
the nodes rather than including characters in  the  data  structure.  In
Figure 2, the characters in parentheses are not actually represented  in
the data structure, but they can be recovered from the (position, level)
pairs in the nodes.  Figure 2 also shows a suffix  pointer  (as  a  dark
right arrow) that will be explained later.

     ..............................
    Ú            **
               A /\ S
             S /    \ T
           T /        \ R
         R /        A / \ I
       A / \ I   STRAY   STRIDE
    ASTRAY     ASTRIDE           Ù
    ..............................
         FIGURE 1. A Trie

   Figure 2  represents  some,  but  not  all,  of  the  innovations  in
Morrison's PATRICIA trees.  He builds the  trees  with  binary  "digits"
rather than full characters, and this allows him to save more  space  by
folding the leaves into the internal nodes.  Our "digits" are bytes,  so
the branching factor can be as large as 256. Since there are rarely  256
descendants of a node, we do not reserve that much space in  each  node,
but instead hash the arcs.  There is also  a  question  about  when  the
strings in parentheses are checked in the  searching  process.  In  what
follows, we usually check characters immediately when we cross  an  arc.
Morrison's scheme can avoid file access by skipping  the  characters  on
the arcs, and doing only one file access and comparison at  the  end  of
the search.  However,  our  files  will  be  in  main  memory,  so  this
consideration is unimportant.  We will use the simplified tree  depicted
in Figure 2.

     ...........................................
    Ú            ( 1, 0 )
              A    /   \    S
         (STR)   /       \   (TR)
           ( 1, 4 )     ( 2, 3 )
        A    /    |     A  |  \  I
     (Y)   /      |    (Y) |    \  (DE)
       ( 9, ~ ) ( 1, ~ ) ( 10, ~) ( 2, ~ )    Ù
    ...........................................
    FIGURE 2. A PATRICIA Tree with a Suffix Pointer

   For A1, we wish to find the longest (up to 16 character) match to the
current string beginning anywhere in the preceding  4096  positions.  If
all preceding 4096 strings were stored in a PATRICIA tree with depth d =
16, then finding this match would be straightforward. Unfortunately, the
cost of inserting these strings can be prohibitive, for if we have  just
descended d levels in the tree to insert the string starting at position
i then we will descend at least d - 1 levels  inserting  the  string  at
i+1. In the worst case this can lead to 0(nd) insertion time for a  file
of size n.  Since later encodings will use much larger values for d than
16, it is important to eliminate d from the running time.
   To insert the strings in 0(n) time, McCreight added additional suffix
pointers to the tree.  Each internal node, representing the string aX on
the path from the root to the internal node.  has a pointer to the  node
representing X, the string obtained by stripping a  single  letter  from
the beginning of aX. If a string starting at i has just been inserted at
level d we do not need to return to the root to  insert  the  string  at
i+1; instead, a nearby suffix pointer  will  lead  us  to  the  relevant
branch of the tree.
   Figure 3 shows how suffix links are created and used. On the previous
iteration, we  have  matched  the  string  aXY,  where  a  is  a  single
character, X and Y are strings, and b is the first  unmatched  character
after Y. Figure 3 shows a complicated case where a new internal node, a,
has been added to the tree, and the suffix link of aX must be  computed.
We insert the next string XYb by going  up  the  tree  to  node  <beta>,
representing the string aX, and crossing its  suffix  link  to  <gamma>,
representing X.  Once we have crossed the suffix link, we descend  again
in the tree, first by "rescanning" the string Y, and then by  "scanning"
from <delta> until the new string is inserted.  The first part is called
"rescanning" because it covers a portion of the string that was  covered
by the previous insert, and so it does not require checking the internal
strings on the arcs. (In fact, avoiding these checks is essential to the
linear time functioning of the algorithm.) The rescan either ends at  an
existing node <delta>, or <delta> is created to insert  the  new  string
XYb; either way we have the destination for the suffix  link  of  a.  We
have restored the invariant that every internal  node,  except  possibly
the one just created, has a suffix link.

     ..................................................
    Ú                 *
                    /   \
            [aX]  /       \  [X]
                /           \
            [beta]        [gamma]      rescan
     [gamma] /                 \  [Y]
           /                     \
       [alpha]                 [delta]
        /    .  .............>      \  scan
      /   [b]  .                      \
    [k]        []                     []             Ù
    ..................................................


   For the A1 compressor, with a 4096-byte fixed window, we need  a  way
to dFIGURE 3. Building a Suffix Tree elete and reclaim the  storage  for
portions of the suffix tree representing strings further back than  4096
in the file.  Several things must be  added  to  the  suffix  tree  data
structure.  The leaves of the tree are placed in a circular  buffer,  so
that the oldest leaf can be identified and reclaimed, and  the  internal
nodes are given "son count" fields.  When an internal "son count"  falls
to one, the node is deleted and two consecutive arcs  are  combined.  In
Section 3, it is shown that this approach will never leave a  "dangling"
suffix link pointing to deleted nodes.  Unfortunately, this is  not  the
only problem in maintaining a valid suffix tree.  The modifications that
avoided a return to the root for each new  insertion  create  havoc  for
deletions.  Since we have not always returned to the root, we  may  have
consistently entered a branch of the tree  sideways.  The  pointers  (to
strings in the 4096-byte window) in the higher levels of such  a  branch
can become out-of-date.  However, traversing the branch and updating the
pointers would destroy any advantage gained by using the suffix links.
   We can keep valid pointers and avoid extensive updating by  partially
updating according to a percolating update.  Each internal  node  has  a
single "update" bit.  If the update bit is true when we are  updating  a
node, then we set the bit false and propagate the update recursively  to
the node's  parent.  Otherwise,  we  set  the  bit  true  and  stop  the
propagation.  In the worst case, a long string of true updates can cause
the update to propagate to the root.  However, when amortized  over  all
new leaves, the cost of updating is constant, and the effect of updating
is to keep all internal pointers  on  positions  within  the  last  4096
positions of the file. These facts will be shown in Section 3.
   We can now summarize the operation of the inner loop, using Figure  3
again.  If we have just created node <alpha>, then we  use  as  parent's
suffix link to find <gamma>.  From <gamma> we move  down  in  the  tree,
first rescanning, and  then  scanning.  At  the  end  of  the  scan,  we
percolate an update from the leaf, moving toward the root,  setting  the
position fields equal to the current position, and  setting  the  update
bits false, until we find a node with an  update  bit  that  is  already
false, whereupon we set  that  node's  update  bit  true  and  stop  the
percolation. Finally, we go to the circular buffer of leaves and replace
the oldest leaf with the new leaf.  If the oldest leaf's parent has only
one remaining son, then it must also  be  deleted;  in  this  case.  the
remaining son is attached to its grandparent,  and  the  deleted  node's
position is percolated upward before, only at  each  step  the  position
being percolated and the position already in the node must  be  compared
and the more recent of these sent upward in the tree.

                     3. THEORETICAL CONSIDERATIONS

   The correctness and linearity of  suffix  tree  construction  follows
from McCreight's original paper [28].  Here we  will  concern  ourselves
with the correctness and  the  linearity  of  suffix  tree  destruction-
questions raised in Section 2.

THEOREM 1.
   Deleting leaves in FIFO order and deleting internal nodes with single
sons will never leave dangling suffix pointers.

PROOF.
   Assume the contrary.  We have a node <gamma> with a suffix pointer to
a node <delta> that has just been deleted. The existence of a means that
there are at least two strings that  agree  for  l  positions  and  then
differ at l+1. Assuming that these two strings start at positions i  and
j, where both i and j are within the window of recently scanned  strings
and are not equal to the current position, then there are are  two  even
younger strings at i+1 and j+1 that differ first at 1. This  contiadicts
the assumption that S has one son. (If either i or j are  equal  to  the
current position, then <alpha> is a new node,  and  can  temporarily  be
without a suffix pointer.)
   There are two issues related to the percolating update: its cost  and
its effectiveness.

THEOREM 2.
   Each percolated update has constant amortized cost.

PROOF.
   We assume that  the  data  structure  contains  a  "credit"  on  each
internal node where the "update" flag is true.  A new leaf can be  added
with two "credits." One is spent immediately to update the  parent,  and
the other is combined with  any  credits  remaining  at  the  parent  to
either: 1) obtain one credit to leave at the parent  and  terminate  the
algorithm or 2) obtain two credits to apply the algorithm recursively at
the parent.  This gives an amortized cost of two updates  for  each  new
leaf.

   For the next theorem, define the "span" of a suffix tree to be  equal
to the size of its fixed window.  So far  we  have  used  examples  with
"span" equal to 4096. but the value is flexible.

THEOREM 3.
   Using the percolating update, every internal node will be updated  at
least once during every period of length "span".

PROOF.
   It is useful  to  prove  the  slightly  stronger  result  that  every
internal node (that remains for an entire period) will be updated  twice
during a period, and thus propagate at least one update to  its  parent.
To show a contradiction, we find the earliest period and the node <beta>
farthest from the root that does not propagate an update to its  parent.
If <beta> has at least two children that have remained  for  the  entire
period, then 13 must have received updates from these  nodes:  they  are
farther from the root.  If <beta> has only one remaining child, then  it
must have a new child, and so it will  still  get  two  updates.  (Every
newly created arc causes a  son  to  update  a  parent,  percolating  if
necessary.) Similarly, two new children also cause two updates. By every
accounting, <beta> will receive two updates during the period, and  thus
propagate an update-contradicting our assumption of <beta>'s failure  to
update its parent.
   There is some flexibility  on  how  updating  is  handled.  We  could
propagate the current position upward before rescanning, and then  write
the current position into those nodes passed during the rescan and scan;
in this case, the proof of Theorem 3 is conservative.  Alternatively,  a
similar, symmetric proof can be  used  to  show  that  updating  can  be
omitted when new arcs are added so long as we propagate an update  after
every arc is deleted. The choice is primarily a matter of implementation
convenience, although the method used above is slightly faster.

   The last major theoretical consideration is the effectiveness of  the
A1 policy in choosing between literal and copy codewords. We have chosen
the following one-pass policy for A1: When the encoder is idle, issue  a
copy if it is possible to copy two or more characters: otherwise,  start
a literal.  If the  encoder  has  previously  started  a  literal,  then
terminate the literal and issue a copy only if the  copy  is  of  length
three or greater.
   Notice that this policy can sometimes go astray. For example, suppose
that the compressor is idle at position i and  has  the  following  copy
lengths available at subsequent positions:

 i  i+1  i+2  i+3  i+4  i+5
 1   3    16   15   14   13

Under the policy, the compressor  encodes  position  i  with  a  literal
codeword, then takes the copy of length 3, and finally takes a  copy  of
length 14 at position i + 4. It uses 6 bytes in the encoding:

 (literal 1)X(copy 3 - y)(copy 14 - y)

If the compressor had foresight it could avoid the  copy  of  length  3,
compressing the same material into 5 bytes:

 (literal 2)XX(copy 16 - y)

   The optimal solution can be computed by dynamic programming [36]. One
forward pass records the length of the longest  possible  copy  at  each
position (as in equation 1) and the displacement for the copy (not shown
in equation 1).  A second backward pass  computes  the  optimal  way  to
finish compressing the file from each position  by  recording  the  best
codeword to use and the length  to  the  end-of-file.  Finally,  another
forward pass reads off the solution and  outputs  the  compressed  file.
However, one would probably never want to use dynamic programming  since
the one-pass heuristic is a lot faster, and  we  estimated  for  several
typical files that the heuristically compressed output was only about  1
percent larger than the  optimum.  Furthermore,  we  will  show  in  the
remainder of this section that the size of the compressed file is  never
worse than 5/4 the size of the optimal  solution  for  the  specific  A1
encoding.  This will require developing  some  analytic  tools,  so  the
non-mathematical reader should feel free to skip to Section 4.
   The following definitions are useful:

Definition
   f(i) is the longest feasible copy at position i in the file.

   Sample f(i)'s were given above in equation 1. They are  dependent  on
the encoding used.  For now, we are assuming that they  are  limited  in
magnitude to 16, and must correspond to copy  sources  within  the  last
4096 characters.

Definition.
   B(i) is the size of the best way to compress  the  remainder  of  the
file, starting at position i.

   B(i)'s would be computed in the reverse pass of the optimal algorithm
outlined above.

   The following Theorems are given without proof:

THEOREM,
   f(i + 1) >= B(i) - 1.

THEOREM.
   There exists an optimal solution where copies  are  longest  possible
(i.e., only copies corresponding to F{i)'s are used].

THEOREM.
   B(i) is monotone decreasing.

THEOREM.
   Any solution can be  modified,  without  affecting  length,  so  that
(literal x1) followed immediately by (literal x2)  implies  that  x1  is
maximum (in this case 16).

   We could continue to reason in this vein, but there  is  an  abstract
way of looking at the problem that is both  clearer  and  more  general.
Suppose  we  have  a  nondeterministic  finite  automaton  where    each
transition is given a cost.  A simple example is shown in Figure 4.  The
machine accepts (a + b)*, with costs as shown in parentheses.  The total
cost of accepting a string is the sum of the transition costs  for  each
character. (While it is  not  important  to  our  problem,  the  optimal
solution can be computed by forming a transition matrix for each letter,
using the costs shown in parentheses, and then multiplying the  matrices
for a given string, treating the coefficients as elements of the  closed
semiring with operations of addition and minimization.) We can obtain  a
solution that approximates the minimum by deleting  transitions  in  the
original  machine  until  it  becomes  a  deterministic  machine.   This
corresponds to choosing  a  policy  in  our  original  data  compression
problem. A policy for the machine in Figure 4 is shown in Figure 5.

    FIGURE 4. A Nondeterministic Automaton with Transition Costs
    FIGURE 5. A Deterministic "Policy" Automaton for Figure 4

   We now wish to compare, in the worst  case.  the  difference  between
optimally accepting a string with  the  non-deterministic  machine,  and
deterministically accepting the same string with the  "policy"  machine.
This is done by taking a cross product of the two machines, as shown  in
Figure 6.
   In Figure 6 there are now two weights on each transition;  the  first
is the cost in the non-deterministic graph, and the second is  the  cost
in the policy graph.  Asymptotically, the relationship  of  the  optimal
solution to the policy solution is dominated by the smallest ratio on  a
cycle in this graph.  In the case of Figure 6, there is a cycle from  I,
I' to 1, 2" and back that has cost in the non-deterministic graph of  2+
1 = 3, and cost in the policy graph of 3+3 = 6, giving a ratio  of  1/2.
That is.  the policy solution can be twice as bad as the optimum on  the
string ababababab... .

    FIGURE 6. The Cross Product

   In  general,  we  can  find  the  cycle  with  the  smallest    ratio
mechanically, using well known  techniques  [8,  27].  The  idea  is  to
conjecture a ratio r and then reduce the pairs of weights (x.  y) on the
arcs to single weights x - ry.  Under this reduction, a cycle with  zero
weight has ratio exactly r.  If a cycle has negative weight, then  r  is
too large.  The ratio on the negative cycle is used as a new conjecture,
and the process is iterated. (Negative cycles are detected by running  a
shortest path algorithm and checking  for  convergence.)  Once  we  have
found the minimum ratio cycle, we can create a worst case string in  the
original automata problem by finding a path from the start state to  the
cycle and then repeating the cycle indefinitely.  The ratio of the costs
of accepting the string non-deterministically and deterministically will
converge to the ratio of the cycle. (The path taken in the cross product
graph will not necessarily bring us  to  the  same  cycle,  due  to  the
initial path fragment; we will, nevertheless,  do  at  least  as  well.)
Conversely, if we have a sufficiently long string with non-deterministic
to deterministic ratio r, then the string will eventually  loop  in  the
cross product graph.  If we remove loops with ratio greater  than  r  we
only improve the ratio of the string, so we must eventually find a  loop
with ratio at least as small as r.
   The above discussion gives us an algorithmic  way  of  analyzing  our
original data compression problem.  The  possible  values  of  F(i)  are
encoded in a 17 character alphabet p0..p16, representing the  length  of
copy available at each position.  The compression algorithm is described
by a non-deterministic machine that accepts strings of pj  this  machine
has costs equal to the lengths of the codewords used by  the  algorithm.
There are two parameterized states in this machine: lx means that  there
is a literal codeword under construction with x spaces still  available;
c, means that a copy is in progress with y characters remaining to copy.
The idle state is  lo  =  Co.  In  the  non-deterministic  machine,  the
possible transitions are:

 start a literal
 -> li-i   continue a literal {x > 1)
 , r.m
 I, -> c,_i  start a copy
 r.(o)
 Cy -> c"-i continue a copy
 P.12)
 P.W

(2)

(An asterisk is used as a wild card to denote any state.) Based  on  the
theorems above we have already eliminated some transitions  to  simplify
what follows.  For example, Cy -* lis start a literal from inside a copy
(y ^ 1) (3) is unnecessary. The deterministic machine, given

(6)

 1   2   3   4   5   6   7
 p10 p10 p9  p8  P7  p6  P5

 8   9   10  11  12  13  14  15
 p4  p3  p2  p1  p2  p10 p10 p9

   (There is nothing special about 10; it was  chosen  to  illustrate  a
long copy and to match the example in  Appendix  A.)  The  deterministic
algorithm takes a copy of length 10 in  the  first  position,  and  then
switches to a literal for positions 11 and 12. Five bytes  are  used  in
each repetition of the pattern. The optimal solution is one position out
of phase.  It takes a copy of length 10 in the second position, and then
finds a copy of length 2 at position 12, fora total  of  four  bytes  on
each iteration.
   We have abstracted the problem so that the possible  copy  operations
are described by a string of pj and we have shown a pathological pattern
of pj that results in % of the optimal encoding.  There might  still  be
some doubt that such a string exists, since the condition that our third
machine (5) guarantees,  F(i+1)  >=  F(i)-1,  is  a  necessary  but  not
sufficient  condition.  Nevertheless,  the  details   of    an    actual
pathological string can be found in Appendix A.

                      4. A SIMPLER DATA STRUCTURE

   Although the quantity of code associated with A1 is not enormous,  it
is complicated, and the  data  structures  are  fairly  large.  In  this
section, we present simpler methods  for  finding  the  suffix  and  for
propagating the window position.
   The alternative to a percolating update is to update the positions in
all nodes back to the root whenever a new  leaf  is  inserted.  Then  no
updates are needed when nodes are  deleted.  The  update  flags  can  be
eliminated.  The alternative to suffix pointers is more complicated. The
cost of movement in a tree is not uniform: moving deeper requires a hash
table lookup, which is more expensive than following  a  parent.pointer.
So we can determine the suffix  by  starting  at  the  suffix  leaf  and
following parent pointers back toward the root until the suffix node  is
reached.  The suffix leaf is known because the string at i  matched  the
string at some earlier window position <the suffix leaf>+1 is  the  next
entry in the leaf array.  With this change, the suffix pointers  can  be
eliminated.
   From a theoretical perspective, these modifications, which have 0(nd)
worst case performance for a file of size n and  cut-off  depth  d,  are
inferior to the 0(n) performance of the suffix  tree.  For  A1,  with  a
cutoff of 16, these modifications improve average performance,  but  the
A2 method discussed in the next section has such  a  deep  cut-off  that
suffix pointers and percolated updates are preferable.

                      5. A MORE POWERFUL ENCODING

   The 4,096-byte window of A1 is roughly optimal for  fixed  size  copy
and literal codewords.  Longer copies would, on average, be found  in  a
larger window, but a larger displacement  field  would  be  required  to
encode them.  To exploit a larger window, we must use  a  variable-width
encoding  statistically  sensitive  to  the  fact  that  recent   window
positions are more likely to  be  used  by  copy  codewords  than  those
positions  further  back.  Similarly,  it  is  advantageous    to    use
variablewidth encodings for copy and literal lengths.
   There  are  several  approaches  we  might  use  for  variable-length
encoding.  We could use fixed or  adaptive  Huffman  coding,  arithmetic
encoding, a variable-length encoding of the integers,  or  a  manageable
set  of  hand-designed  codewords.  We  eliminated  from   consideration
adaptive Huffman and arithmetic coding because they are slow.  Moreover,
we felt they would provide (at  best)  a  secondary  adaptive  advantage
since the "front end" textual substitution is  itself  adapting  to  the
input.  We experimented with a fixed Huffman encoding,  a  hand-designed
family of codewords, and a variable-length encoding of the integers,  so
we will compare these options briefly:

Hand-Designed Codewords.
   This is a direct generalization of A1, with  short  copies  that  use
fewer bits but cannot address the full window, and  longer  copies  that
can address larger blocks  further  back  in  the  window,  With  a  few
codewords, this is fast and relatively easy to implement.  However, some
care must be taken in the choice of codewords to maximize compression.

Variable-Length Integers.
   The simplest method we tried uses  a  unary  code  to  specify  field
width, followed by the field itself. Copy length and displacement fields
are coded independently via this  technique,  so  any  correlations  are
ignored.  There are more elaborate codings of the integers (such as [9],
[10], or [13]),  that  have  been  used  by  [15],  and  [34]  in  their
implementations of Lempel-Ziv compression.  These  encodings  have  nice
asymptotic properties for very large integers, but  the  unary  code  is
best for our purposes since, as we will see shortly,  it  can  be  tuned
easily to the statistics of the application.  The  unary  code  has  the
additional advantage of a simple hardware implementation. We will return
to the unary code in more detail shortly.

Fixed Huffman.
   Ideally,  a  fixed  Huffman  encoder  should  be  applied  to  source
consisting of the copy length and displacement concatenated together (to
capture the correlation of these two fields).  However, since we wish to
expand window size to  16384  and  maximum  copy  length  to  2000,  the
realities of gathering statistics  and  constructing  an  implementation
dictate that we restrict the input of the fixed Huffman compressor to  a
size much smaller than 2000 x 16384  by  grouping  together  codes  with
nearly equal copy lengths and displacements.  To improve  speed  we  use
tables to encode and decode a byte at a time.  Nevertheless,  the  fixed
Huffman approach is the most complex and slowest of  the  three  options
compared here.

   To decide how much  compression  could  be  increased  with  a  Fixed
Huffman approach, we experimented with several groupings of nearly equal
copy lengths and displacements, using  a  finer  granularity  for  small
values, so that the input to the Fixed Huffman compressor had only about
30,000 states, and we computed the entropv to give a  theoretical  bound
on the compression.  The smallest entropy we obtained was only 4 percent
more compact  than  the  actual  compression  achieved  with  the  unary
encoding described below; and any real  implementation  would  do  worse
than an entropy bound.  Consequently, because the Fixed Huffman approach
did not achieve significantly higher compression, we favor  the  simpler
unary code, though this is not an overwhelmingly clear choice.
   Define a (start, step, stop) unary code of the integers  as  follows;
The n'th codeword has n ones followed by a zero followed by a  field  of
size start + n step.  If the field  size  is  equal  to  stop  then  the
preceding zero can be omitted.  The integers are laid  out  sequentially
through these codewords. For example, (3, 2, 9) would look like:

    Range      Codeword

      0-7        Oxxx
      8-39      1Oxxxxx
     40-167    11Oxxxxxxx
    168-679    111xxxxxxxxx

Appendix B contains a simple procedure that generates unary codes.
   The A2  textual  substitution  method  encodes  copy  length  with  a
(2,1,10) code, leading to a maximum copy length of 2044. A  copy  length
of zero signals a literal, for which literal length is then encoded with
a (0,1,5) code, leading to a maximum literal length of 63 bytes. If copy
length is non-zero, then copy displacement is encoded with  a  (10,2,14)
code.  The exact maximum copy and literal lengths are  chosen  to  avoid
wasted states in the unary progressions; a maximum copy length  of  2044
is sufficient for the kinds of data studied in Section 8. The A1  policy
for choosing between copy and literal codewords is used.
   Three  refinements  are  used  to  increase  A2's   compression    by
approximately 1 to 2 percent. First, since neither another literal nor a
copy of length 2 can immediately follow a literal of less  than  maximum
literal length, in this situation, we shift copy length codes down by 2.
In other words, in the (2,1,10) code for copy length,  0  usually  means
literal, 1 means copy length 2, etc.; but after a literal of  less  than
maximum literal length, 0 means copy length 3, 1 means  copy  length  4,
etc.
   Secondly, we phase-in the copy displacement encoding for small files,
using a (10-x, 2, 14-x) code, where x starts at 10 and descends to 0  as
the number of window positions grows: for example, x = 10 allows
2^1 + 2^2 + 2^4 = 21 values to be coded, so when the  number  of  window
positions exceeds 21, x is reduced to 9; and so forth.
   Finally,  to  eliminate  wasted  states  in  the  copy   displacement
encoding, the largest field in the (10-x,2,14-x) progression  is  shrunk
until it  is  just  large  enough  to  hold  all  values  that  must  be
represented; that is, if v values remain to be encoded  in  the  largest
field then smaller values  are  encoded  with  Floor(log2(v))  bits  and
larger values with Ceil(log2(v)) bits rather than 14-x bits.  This trick
increases compression during startup, and, if the window size is  chosen
smaller than the number of values in the  displacement  progression,  it
continues to be useful thereafter.  For example, the compression studies
in Section 8 use  an  A2  window  size  of  16,384  characters,  so  the
(10,2,14) code would waste 5,120 states in the 14-bit field without this
trick.
   Percolating update seems preferable  for  the  implementation  of  A2
because  of  the  large  maximum  copy  length;  with    update-to-root,
pathological input  could  slow  the  compressor  by  a  factor  of  20.
Unfortunately, the percolating update does not guarantee that the suffix
tree will report the nearest position for a match, so  longer  codewords
than necessary may sometimes  be  used.  This  problem  is  not  serious
because the tree is often shallow, and nodes near the root usually  have
many sons, so updates propagate much more rapidly than  assumed  in  the
analysis of Section 3. On typical  files,  compression  with  percolated
update is 0.4 percent less than with update-to-root.

                         6. A FASTER COMPRESSOR

   A2 has very fast expansion with a  small  storage  requirement,  but,
even though compression has constant  amortized  lime,  it  is  5  times
slower than expansion.  A1 and A2 are most appropriate  in  applications
where compression speed is not critical and where the performance of the
expander needs to be optimized, such as the mass release of software  on
floppy disks.  However, in applications such as file  archiving,  faster
compression is needed.  For this reason, we have developed the B1 and B2
methods described here, which use the  same  encodings  as  A1  and  A2,
respectively,  but  compute  window  displacement   differently.    Copy
codewords are restricted to start at the beginning of the  yth  previous
codeword or literal character emitted: they can no longer address  every
earlier character, but only those where literal characters  occurred  or
copy codewords started; we refer to displacements computed this  way  as
"compressed displacements" throughout.  Copy length is still measured in
characters, like A1. By  inserting  this  level  of  indirection  during
window access, compression speed typically triples, though expansion and
the rate of adaptation are somewhat slower.
   With  "compressed  displacements"  suffix   pointers    and    update
propagation are unnecessary and a simpler PATRICIA tree can be used  for
the  dictionary.  Entries  are  made  in  the  tree  only  on   codeword
boundaries, and this can be done in linear time by starting at the  root
on each iteration.  It is useful to create an array of  permanent  nodes
for all characters at depth 1. Since copy  codewords  of  length  1  are
never  issued,  it  doesn't  matter  that  some  permanent  nodes  don't
correspond to any window character.  Each iteration begins  by  indexing
into this node array with the next character.  Then hash  table  lookups
and arc character comparisons are used to descend deeper, as in A1.  The
new window position is written into nodes passed on  the  way  down,  so
update propagation is unnecessary.
   In short, the complications necessary  to  achieve  constant  average
time per source character with  A2  are  eliminated.  However,  one  new
complication is  introduced.  In  the  worst  case,  the  16,384  window
positions of B2 could require millions of characters,  so  we  impose  a
limit of 12 X 16,384 characters; if the full window exceeds this  limit,
leaves for the oldest window positions are purged from the tree.
   Because of slower adaptation, B2  usually  compresses  slightly  less
than A2 on small files.  But  on  text  and  program  source  files,  it
surpasses A2 by 6 to 8 percent asymptotically; the crossover from  lower
compression to higher occurs after about 70,000 characters! A2 codewords
find all the near-term context, while  B2  is  restricted  to  start  on
previous codeword boundaries but can consequently reach further back  in
the file.  This gives B2 an advantage  on  files  with  a  natural  word
structure, such as text,  and  a  disadvantage  on  files  where  nearby
context is especially important, such as scanned images.
   We also tried variations where the tree is  updated  more  frequently
than on every codeword boundary and literal character. All variations up
to and including A2 can be implemented within the general  framework  of
this method, if speed is not an issue.  For example, we found that about
1 percent higher  compression  can  be  achieved  by  inserting  another
compressed position between  the  two  characters  represented  by  each
length 2 copy  codeword  and  another  0.5  percent  by  also  inserting
compressed positions after each character represented by length  3  copy
codewords. However, because these changes slow compression and expansion
we haven't used them.

                   7. IMPROVING THE COMPRESSION RATIO

   In Section 6 we considered ways to speed up compression at  the  cost
of slower adaptation and expansion.  In this section we will explore the
other direction: improving the compression ratio with a slight  cost  to
the running time of the algorithm.
   When a string occurs frequently in a file, all the  methods  we  have
considered so far waste space in their encoding: when they are  encoding
the  repeating  string,  they  are  capable  of  specifying  the    copy
displacement to multiple previous occurrences of the  string,  yet  only
one string needs to be copied.  By contrast, the data structures we have
used do not waste space.  The repeating strings share a common path near
the root.  If we base the copy codewords directly on the data  structure
of the dictionary, we can improve the compression  ratio  significantly.
(This brings us closer to the second style of Ziv and  Lempel's  textual
substitution work [19, 29, 43), where a dictionary is maintained by both
the compressor and expander. However, since we still use a window and an
explicit  copy  length  coding,  it  is  natural  to  view  this  as   a
modification of our  earlier  compressors,  in  the  style  of  Ziv  and
Lempel's first textual substitution work.)
   The C2 method uses the same PATRICIA tree data structures  as  B2  to
store its dictionary. Thus it takes two pieces of information to specify
a word in the dictionary: a node, and a location along the  arc  between
the node and its parent (since PATRICIA  tree  arcs  may  correspond  to
strings with more than one character). We will distinguish two cases for
a copy: if the arc is at a leaf of the tree, then we will use a LeafCopy
codeword, while if the arc is  internal  to  the  tree  we  will  use  a
NodeCopy codeword.  Essentially, those strings  appearing  two  or  more
times in the window are coded with NodeCopies, avoiding  the  redundancy
of A2 or B2 in these cases.
   The C2 encoding begins with a single prefix  bit  that  is  0  for  a
NodeCopy, 1 for a LeafCopy or Literal.
   For NodeCopy codewords, the prefix is followed by a  node  number  in
[0..maxNodeNo], where maxNodeNo is the largest node  number  used  since
initialization: for most files tested, maxNodeNo is about 50 percent the
number of leaves.  Following the node number, a displacement  along  the
arc from the node to its parent is encoded; for most NodeCopy  codewords
the incoming arc is of length 1, so no length field is  required.  If  a
length field is required, 0 denotes a match exactly at  the  node,  1  a
displacement 1 down the arc from the parent node,  etc.  Rarely  is  the
length field longer than one or two bits because  the  arc  lengths  are
usually short, so all possible displacements can be enumerated with only
a few bits.  For both the node number and the incoming arc displacement,
the trick described in Section 5 is used to eliminate wasted  states  in
the field; that is, if v values must be encoded, then the smaller values
are encoded with Floor(log2(v)) bits and larger values with Ceil(log2(v))
bits.
   LeafCopies are coded with unary progressions like those of A2 or  B2.
A (1. I, II) progression is used to specify the distance of the  longest
match down the leaf arc from its parent node. with 0 denoting a literal;
this progression leads to a maximum copy length  of  4094  bytes.  Since
another literal never occurs immediately after a literal  of  less  than
maximum literal length, the LeafCopy arc distance progression is shifted
down by  1  when  the  preceding  codeword  was  a  literal  (i.e.,  arc
displacement 1 is coded as 0, 2 as 1, etc.). On a cross section of files
from the data sets discussed later,  distance  down  the  leaf  arc  was
highly skewed, with about  half  the  arc  displacements  occurring  one
character down the leaf arc.  Because of this probability spike at 1 and
the rapid drop off at larger distances,  the  average  length  field  is
small.  Following the length field, the  window  position  is  coded  by
gradually phasing in a (10,2,14) unary progression exactly like B2's.
   Literals are coded by first coding a LeafCopy arc displacement  of  O
and then using a  (0,1,5)  unary  progression  for  the  literal  length
exactly like B2.
   Unlike A2 and B2, the expander for C2 must maintain a dictionary tree
exactly like the compressor's tree to permit decoding.  Notice that this
is not as onerous as it might seem.  During compression,  the  algorithm
must search the tree downward (root toward leaves) to find  the  longest
match, and this requires a hash table access at each node.  By contrast,
the expander is told which node was matched, and can recover the  length
and window position of the  match  from  the  node.  No  hash  table  is
required, but the encoding is restricted: a copy  codeword  must  always
represent the longest match  found  in  the  tree,  in  particular,  the
superior heuristic used  by  B2  to  choose  between  Literal  and  Copy
codewords must be discarded; instead,  when  the  longest  match  is  of
length 2 or more, a copy codeword must always  be  produced.  With  this
restriction, the expander  can  reconstruct  the  tree  during  decoding
simply by hanging each new leaf from the node or arc  indicated  by  the
NodeCopy or LeafCopy codeword, or in the case of  literals,  by  hanging
the leaf from the permanent depth I node for each literal character.

                          8. EMPIRICAL STUDIES

   In this section, we compare the  five  compression  methods  we  have
developed with other one-pass, adaptive methods. For most other methods,
we do not have welltuned implementations  and  report  only  compression
results. For implementations we have tuned for efficiency, speed is also
estimated (for our 3 MIP, 16-bit word size.  8  megabyte  workstations).
The execution times used to determine speed include the  time  to  open,
read, and write files on the local disk (which has  a  relatively  slow,
maximum transfer rate of 5 megabits per second): the speed  is  computed
by dividing the uncompressed file size by the execution time for a large
file.
   We tested file types  important  in  our  working  environment.  Each
number in Table 1 is the sum of the compressed file sizes for all  files
in the group divided by the sum of the original file sizes.  Charts  1-3
show the  dependency  of  compression  on  file  size  for  all  of  the
compression methods tested on the source code (SC) data set.

                               Data Sets

SC Source Code.
   All 8-bit Ascii source  files  from  which  the  boot  file  for  our
programming environment is built.  Files include some English  comments,
and a densely-coded collection of formatting information at the  end  of
each file reduces compressibility.  The files themselves are written  in
the Cedar language. (1185 files, average size 11 Kbytes, total size 13.4
Mbytes)

TM Technical Memoranda.
   All files from a directory where computer science technical memoranda
and reports are filed, excluding those containing  images.  These  files
are 8-bit Ascii text with densely-coded formatting  information  at  the
end (like the source code). (134 files, average size  22  Kbytes,  total
size 2.9 Mbytes)

NS News Service.
   One file for each work day of a week from a major wire service: these
files are 8-bit Ascii with  no  formatting  information.  Using  textual
substitution methods, these do not compress as  well  as  the  technical
memoranda of the previous study group, even though they are much  larger
and should be less impacted by startup  transient:  inspection  suggests
that the larger vocabulary and extensive use of proper  names  might  be
responsible for this. (5 files, average size 459 Kbytes, total size  2.3
Mbytes)

CC Compiled Code.
   The compiled-code files produced from the  SC  data  set.  Each  file
contains several  different  regions:  symbol  names,  pointers  to  the
symbols, statement boundaries and source positions for the debugger, and
executable code.  Because each region is  small  and  the  regions  have
different  characteristics,  these  files  severely  test  an   adaptive
compressor. (1220 files, average size 13 Kbytes, total size 16.5 Mbytes)

BF Boot File.
   The boot file for our programming environment, basically a core image
and memory map. (1 file, size 525 Kbytes)

SF Spline Fonts.
   Spline-described character fonts used to  generate  the  bitmaps  for
character sets at a variety of resolutions. (94 files, average  size  39
Kbytes, total size 3.6 Mbytes)

RCF Run-coded Fonts.
   High-resolution character fonts, where the original bitmaps have been
replaced by a run-coded  representation.  (68  files,  average  size  47
Kbytes, total size 3.2 Mbytes)

SNI Synthetic Images.
   All 8 bit/pixel synthetic  image  files  from  the  directory  of  an
imaging researcher.  The 44 files are the red,  green,  and  blue  color
separations for 12 color images, 2 of which also have an extra  file  to
encode background transparency; in addition,  there  are  6  other  grey
scale images. (44 files,, average  size  328  Kbytes,  total  size  14.4
Mbytes)

SCI Scanned Images.
   The red separations for all 8 bit/pixel scanned-in color images  from
the directory of an imaging researcher. The low-order one or two bits of
each pixel are probably  noise,  reducing  compressibility.  (12  files,
average size 683 Kbytes, total size 8.2 Mbytes)

BI Binary Images.
   CCITT standard images used to evaluate binary  facsimile  compression
methods.  Each file consists of a 148-byte header followed by  a  binary
scan of I page (1728 pixels/scan line x 2376  scan  lines/  page).  Some
images have blocks of zeros more than 30,000 bytes long.  Because  these
files are composed of 1-bit rather than 8-bit items, the general-purpose
compressors do worse than they otherwise might. (8 files,  average  size
513 Kbytes, total size 4.1 Mbytes)

   The special-purpose CCITT 1D and 2D compression methods  reported  in
[18] achieve, respectively, 0.112 and 0.064 compression ratios on  these
standard images when the extraneous end-of-line codes  required  by  the
facsimile standard are removed and when the extraneous  148-byte  header
is removed.  The special-purpose CCITT 2D result is  significantly  more
compact than any general purpose method we tested, and only  CW  and  C2
surpassed the ID result.

(7)
(8)

                  Measurements and Compression Methods

H0 and H1.
   These are entropy calculations made on a per file basis according to:

  H0 = -Sum( P{x=Ci}*log2(P{x=Ci}), i=0..? )
  H1 = -Sum( P{x=Ci | y=Cj}*log2(P{x=Ci | y=Cj}), i=0..?, j=0..? )

where x is a random symbol of the source, xy is a randomly  chosen  pair
of adjacent source characters, and Ci ranges over all possible  symbols.
Because of the small file size, the curves in Charts 1 to 3 drop off  to
the left.  In theory, this  small  sampling  problem  can  be  corrected
according to [2], but we have found it difficult to estimate  the  total
character set size in order to apply  these  corrections.  Nevertheless,
Chart 1 shows that H0 is a good estimator  for  how  well  a  memoryless
(zero-order) compressor can do when file size is a large multiple of 256
bytes, and H1 bounds the compression for a  first-order  Markov  method.
(None of  our  files  were  large  enough  for  H1  to  be  an  accurate
estimator.)

KG and V.
   These adaptive methods maintain a Huffman tree based on the frequency
of characters seen so far in a file.  The compressor and  expander  have
roughly equal performance.  The theory behind the KG approach appears in
[11] and [23]. The similar V method, discussed in [37] should get better
compression during the startup transient at the expense of  being  about
18 percent slower.  It is also possible  to  bound  the  performance  of
Vitter's scheme closely to that  of  a  fixed  non-adaptive  compressor.
Except on the highly compressible CCITT images,  these  methods  achieve
compression slightly worse than HO, as  expected.  But  because  of  bit
quantization, the compression of the  CCITT  images  is  poor-arithmetic
coding would compress close to H0  even  on  these  highly  compressible
sources.

CW.
   Based on [6], this method gathers higher-order statistics than KG  or
V above (which we ran only on zeroth-order statistics).  The method that
Cleary and Witten describe keeps statistics to some order o and  encodes
each new character based on the context of the o  preceding  characters.
(We've used o = 3, because any higher order exhausts storage on most  of
our data sets.) If the new character has never before  appeared  in  the
same context, then an escape mechanism is used to back down  to  smaller
contexts to encode the character using  those  statistics.  (We've  used
their escape mechanism A with  exclusion  of  counts  from  higher-order
contexts.) Because of high event probabilities  in  some  higher-ordered
contexts and the possibility of multiple escapes before a  character  is
encoded, the fractional bit loss of Huffman encoding is  a  concern,  so
[6] uses arithmetic encoding.  We have used the  arithmetic  encoder  in
[40].

TABLE 1. Comparison of Compression Methods

Method    Text       Binary    Fonts        Images
      SC   TM   NS   CC   BF   SF   RCF  SNI  SCI  BI
HO   .732 .612 .590 .780 .752 .626 .756 .397 .845 .148
H1 . .401 .424 .467 .540 .573 .380 .597 .181 .510 .101
KG   -751 .625 .595 .804 .756 .637 .767 .415 .850 .205
V    .749 .624 .595 .802 .756 .637 .766 .414 .850 .205
CW   -369 .358 .326 .768 .544 .516 .649 .233 .608 .106
MW1  .508 .470 .487 .770 .626 .558 .705 .259 .728 .117
MW2  -458 .449 .458 .784 .594 .526 .692 .270 .774 .117
UW   .521 .476 .442 .796 .638 .561 .728 .255 .697 .118
BSTW .426 .434 .465   -  .684   -  .581   -    -    -
A1   .430 .461 .520 .741 .608 .502 .657 .351 .766 .215
A2   .366 .395 .436 .676 .538 .460 .588 .259 .709 .123
B1   -449 .458 .501 .753 .616 .505 .676 .349 .777 .213
B2   .372 .403 .410 .681 .547 .459 .603 .255 .714 .117
C2   .360 .376 .375 .668 .527 .445 .578 .233 .662 .105

   As Table 1  shows,  CW  achieves  excellent  compression.  Its  chief
drawbacks are its space and time performance.  Its space requirement can
grow in proportion to file size; for example, statistics for o  =  3  on
random input could require a tree with 256^4 leaves, though English text
requires much less.  The space (and consequently time) performance of CW
degrades dramatically on "more random" data sets like  SNI  and  SCI.  A
practical implementation would have to limit storage  somehow.  Even  on
English,  Bell,  Cleary,  and  Witten  estimate  that  Moffat's    tuned
implementation of CW is 3 times slower compressing and  5  times  slower
expanding than C2 [41].

MW1.
   This method, described in [29], is related to  the  second  style  of
Lempel-Ziv compression, alluded to in the introduction, it uses  a  Trie
data structure and 12-bit codes.  Initially (and always) the  dictionary
contains 256 one-character strings.  New material is encoded by  finding
the longest match in the dictionary, outputting the associated code, and
then inserting a new dictionary entry that is the longest match plus one
more character. After the dictionary has filled, each iteration reclaims
an old code from among dictionary leaves, following  a  LRU  discipline,
and reuses that code for the new dictionary entry.  The  expander  works
the same way. MW1 is simple to implement and is balanced in performance,
with good speed both compressing and expanding (250,000 bits/sec and 310
bits/sec respectively). The original method used 12-bit codes throughout
for simplicity and efficiency.  However, our  implementation  starts  by
using 9-bit codewords, increasing to 10, 11, and finally to 12  bits  as
the dictionary grows to its maximum size: this saves up to 352 bytes  in
the compressed file size.  On text and source code,  Miller  and  Wegman
determined that the 12-bit codeword size is close to  optimal  for  this
method.

MW2.
   One drawback of MW1  is  the  slow  rate  of  buildup  of  dictionary
entries.  If, for example, the word abcdefghi appears  frequently  in  a
document, then ab will be in the dictionary after the first  occurrence,
abc after the second, and so on, with the full word present only after 8
occurrences (assuming no help from similar words in  the  document).  A1
below, for example, would be able to copy the whole word abcdefghi after
the first occurrence, but it pays a penalty for the  quick  response  by
having a length field in its copy codeword.  The idea of MW2 is to build
dictionary entries faster by combining adjacent  codewords  of  the  MW1
scheme.  Longer words like abcdefghi are  built  up  at  an  exponential
rather than linear rate.  The chief disadvantage of MW2 is its increased
complexity  and  slow  execution.  Our  implementation    follows    the
description in [29] and uses an upper limit of 4096  dictionary  entries
(or 12-bit codewords).  We did not implement the 9-12 bit  phasein  that
was used  in  MW1  so  the  size-dependent  charts  underestimate  MW2's
potential performance on small files.

UW.
   This is the Compress utility found in the Berkeley  4.3  Unix,  which
modifies a method described in a paper by Welch  [39]:  the  authors  of
this method are S.Thomas, J.McKie, S.Davies, K.Turkowski,  J.Woods,  and
J.  Orost. It builds its dictionary like MW1.  gradually  expanding  the
codeword size from 9 bits initially up to 16  bits.  The  dictionary  is
frozen  after  65,536  entries,  but  if  the  compression  ratio  drops
significantly, the dictionary is discarded and rebuilt from scratch.  We
used this compressor remotely on  a  Vax-785,  so  it  is  difficult  to
compare its running time and implementation difficulties with the  other
methods we implemented.  Nevertheless, because it does not use  the  LRU
collection of codes, it should be faster than MW1.  However,  it  has  a
larger total storage requirement and gets worse compression than MW1  on
most data sets studied.

BSTW.
   This  method  first  partitions  the  input  into  alphanumeric   and
non-alphanumeric "words", so it is specialized for text, though we  were
able to run it on some other kinds of data as  well.  The  core  of  the
compressor is a move-to-front heuristic.  Within each  class,  the  most
recently seen words are kept on a list (we have used list size 256).  If
the next input word is already in the word  list,  then  the  compressor
simply encodes the position of the word in the list and then  moves  the
word to the front of the list.  The move-to-front heuristic  means  that
frequently used words will be near the front of the list, so they can be
encoded with fewer bits.  If the next word in the input stream is not on
the word list, then the new word is added to  the  front  of  the  list,
while another word is removed from the end of the list, and the new word
must be compressed character-by-character.
   Since the empirical results in [5] do not actually give  an  encoding
for the positions of words in the list or  for  the  characters  in  new
words that are output,  we  have  taken  the  liberty  of  using  the  V
compressor as a  subroutine  to  generate  these  encodings  adaptively.
(There are actually four copies of Vitter's algorithm  running,  one  to
encode  positions  and  one  to  encode  characters  in  each  of    two
partitions.) Using an adaptive Huffman is slow; a fixed  encoding  would
run faster, but we expect that a fixed encoding  would  slightly  reduce
compression on larger files  while  slightly  improving  compression  on
small files.  We could not run BSTW for all of the data sets, since  the
parsing mechanism assumes human-readable text and long "words" appear in
the other data sets.  When the unreadable input parsed well, as  in  the
case of run-coded fonts, the compression was very good.

A1.
   This is our basic method described earlier.  It has a fast and simple
expander (560,000 bits/sec) with a  small  storage  requirement  (10,000
bytes).  However, the compressor  is  much  slower  and  larger  (73,000
bits/sec, 145,000 bytes using scan-from-leaf  and  update-to-root).  The
encoding has a maximum compression to 1/8 = 12.5 percent of the original
file size because the best it can do is copy 16 characters with a 16-bit
codeword,
   Caveat: As we mentioned above, the running times reported include the
file system overhead for a relatively slow disk.  To provide a baseline,
we timed a file copy without compression and obtained a rate of  760,000
bits per second.  Thus, some of the faster expansion rates we report are
severely limited by the disk. For example, we estimate that without disk
overhead the A1 expander would be about twice  as  fast.  On  the  other
hand, removing disk overhead would hardly affect the  compression  speed
of A1.

A2.
   This method, discussed in Section 5, enlarges the  window  to  16,384
characters  and  uses  variable-width  unary-coded  copy  and    literal
codewords to significantly increase compression.  The running  time  and
storage requirements are 410,000 bits/sec and 21,000 bytes for expansion
and 60,000 bits/sec and 630,000  bytes  for  compression  (using  suffix
pointers and percolated update).

B1.
   This method, discussed in Section 6, uses the A1 encoding but triples
compression speed by updating the tree only at codeword  boundaries  and
literal characters.  The  running  time  and  storage  requirements  are
470,000 bits/sec and 45,000 bytes for expansion and 230,000 bits/sec and
187,000 bytes for compression.

B2.
   This method, discussed in Section 6. uses the same encoding as A2 but
triples  compression  speed  by  updating  the  tree  only  at  codeword
boundaries and literal characters.  The compressor and expander  run  at
170,000  and  380,000  bits/sec,  respectively,   and    have    storage
requirements of 792,000 and 262,000 bytes.

C2.
   This method, discussed in Section 7, uses the same data structures as
B2 but a more powerful encoding based directly upon the structure of the
dictionary tree.  Compression is about the same and expansion  about  25
percent slower than B2; the compressor uses about the  same  storage  as
B2, but the expander uses more (about 529,000 bytes).

   Table 1 highlights  some  differences  between  textual  substitution
methods like C2  and  statistical  methods  like  CW.  (Time  and  space
performance differences have been discussed earlier.) There are  several
data sets  where  these  methods  differ  dramatically.  On  NS,  CW  is
significantly better than C2. We believe that this is because  NS  shows
great diversity in  vocabulary:  a  property  that  is  troublesome  for
textual substitution,  since  it  cannot  copy  new  words  easily  from
elsewhere in the document, but is benign for CW,  since  new  words  are
likely to follow the existing English statistics. On CC, for example, C2
is significantly better than CW.  We believe that  this  is  because  CC
contains several radically different  parts,  e.g.  symbol  tables,  and
compiled code.  C2 is able to adjust to dramatic shifts within  a  file,
due to literal codewords and copy addressing that favors nearby context,
while CW has no easy  way  to  rapidly  diminish  the  effect  of  older
statistics.
   For all of our methods, A2, B2, and C2. window size is a  significant
consideration because it determines  storage  requirements  and  affects
compression ratios.  Chart 4 shows compression as a function  of  window
size for the NS data set (concatenated  into  a  single  file  to  avoid
start-up effects), and for the BF boot file.  These two data  sets  were
typical of the bimodal behavior we observed  in  our  other  data  sets:
large human-readable files benefit greatly from increasing window  size,
while other test groups show little improvement beyond a window size  of
4K.

                             9. CONCLUSIONS

   We  have  described  several  practical  methods  for  lossless  data
compression and developed data structures to support them. These methods
are strongly adaptive in the sense  that  they  adapt  not  only  during
startup but also to context changes occurring later.  They are  suitable
for most high speed applications because they make only  one  pass  over
source data, use only a conslant amount of storage,  and  have  constant
amortized execution time per character.
   Our empirical studies point to several broad generalizations.  First,
based on the HO and H1 theoretical limits, textual substitution via  A2,
B2, or C2 surpasses memoryless or first-order Markov methods applied  on
a character-by-character basis on half the data sets. On the other half,
even the CW third-order method can't achieve the H1 bound. This suggests
that.  to surpass textual substitution for general purpose  compression,
any Markov method must be at least second-order, and to date,  all  such
methods have poor space and time performance.
   Secondly, the methods we've developed adapt  rapidly  during  startup
and at transitions  in  the  middle  of  files.  One  reason  for  rapid
adaptation is the use of smaller representations  for  displacements  to
recent positions in the window.  Another  reason  is  the  inclusion  of
multi-character literal  codewords.  Together  the  literals  and  short
displacements allow our methods to perform well on  short  files,  files
with major internal shifts of vocabulary or statistical properties,  and
files with bursts of poorly compressing material - all properties  of  a
significant number of files in our environment.
   Thirdly, it appears  that  the  displacement-and-length  approach  to
textual  substitution  is  especially  effective  on  small  files.   On
11,000-byte program source files, for example, A2 and B2  were  over  20
percent more compact than textual substitution methods which did not use
a length field (UW.  MWI, and MW2). This is not surprising  because  the
particular advantage of the length field  in  copy  codewords  is  rapid
adaptation on small files. However, even on the largest files tested, A2
and B2 usually achieved significantly higher compression. Only on images
did other methods compete with  them;  our  most  powerful  method,  C2,
achieved higher compression than any other textual  substitution  method
we tested on all data sets.  The effect of a length field is to  greatly
expand dictionary  size  with  little  or  no  increase  in  storage  or
processing time; our results suggest that textual  substitution  methods
that use a length field will work better than those which do not.
   Fourthly, studies of A2, B2, and  C2  using  different  window  sizes
showed that, for human-readable input (e.g., English, source code), each
doubling of window size improves the  compression  ratio  by  roughly  6
percent (for details see Chart  4).  Furthermore,  the  data  structures
supporting these methods scale well:  running  time  is  independent  of
window size, and memory usage grows  linearly  with  window  size.  Thus
increasing window size is an easy way to improve the  compression  ratio
for large files of human-readable input.  For other types of  input  the
window size can be  reduced  to  4096  without  significantly  impacting
compression.
   Going  beyond  these  empirical  results,  an  important    practical
consideration is the  tradeoff  among  speed.  storage,  and  degree  of
compression;  speed  and  storage  have  to  be  considered  for    both
compression and  expansion.  Of  our  own  methods,  A2  has  very  fast
expansion with a minimal  storage  requirement;  its  weakness  is  slow
compression; even though the suffix tree data structure  with  amortized
update has constant amortized time per character, compression  is  still
seven times slower than expansion.  However, in applications  which  can
afford relatively slow compression, A2 is  excellent;  for  example,  A2
would be good for mass distribution of software on floppy disks  or  for
overnight compression of files on a file  server.  Furthermore,  if  the
parallel matching in the compression side  of  A2  were  supported  with
VLSI, the result would be a  fast,  powerful  method  requiring  minimal
storage both compressing and expanding.
   B2 provides nearly three times faster compression  than  A2  but  has
somewhat slower expansion and adaptation.  Thus B2 is  well  suited  for
communication and archiving applications.
   Al and B1 do not compress as well as A2  and  B2,  respectively,  but
because of their two-codeword, bytealigned  encodings  they  are  better
choices for applications where simplicity or  speed  is  critical.  (For
example, ).  Gasbarro has designed and implemented an  expansion  method
like A1 to improve the bandwidth of a VLSI circuit tester [12].)
   C2  achieves  significantly  higher  compression  than  B2,  but  its
expander is somewhat slower and has a larger storage requirement. In the
compression study  reported  in  Section  8,  C2  achieved  the  highest
compression of all methods tested on 6 of the 10 data sets.  We  believe
that our implementations and empirical results demonstrate the value  of
window-based textual substitution. Together the A, B and C methods offer
good options that can be chosen according to resource requirements.
size of the arithmetic progression.

                            Acknowledgments.

   We would like to thank Jim Gasbarro, John Larson,  Bill  Marsh,  Dick
Sweet, lan Witten, and the anonymous referees for helpful  comments  and
assistance.

                               REFERENCES

1. Abramson, N. Information Theory and Ceding. McCraw-Hill. 1963, 61.
2. Basharin, G.P. On a stalistical estimate for the entropy of a
   sequence of independent random variables. Theory Prob. Appl. 4,
   (1959), 333-336.
3, Bell, T.C. Better OPM/L text compression. IEEE Trans. Commun.
   COM-U, 12 (1980), 1176-1182.
4. Bell, T.C., Cleary, ).G.. and Witten, i.H. Tell Compression. In press
   with Prentice-Hall.
5. Bentley, ).L., Sleator, D.D.. Tarjan, R.E.. and Wei. V.K. A locally
   adaptive data compression scheme. Commun. ACM 29, 4 (1885), 320-330.
6. Cieary, ).G., and Witten. I.H. Data compression ilsing adaptive
   coding and partial string matching. IEEE Trans. Commun. COM-32, 4
   (1984), 396-402.
7. Cormack, G.V., and Uorspool. R.N.S. Data compression using dynamic
   markov modelling. The Computer fournal, 30, 6 (1987), 541-550.
8. Dantzig, G.B., Blattnct. W.O.. and Rao, M.R. Finding a Cycle in a .
   Graph with Minimum Cost to Time Ratio with Application to a Ship
   Routing Problem. Theory of Graphs, P. Rosenstiehl, ed. Gordon and
   Breach, 1966.
9. Elias, P. Universal codeword sets and representations of the
   integers. IEEE Trans. Info. Theon, IT-11. 1 (1975), 194-203.
10. Even, S., and Rodeh, M. Economical Encoding of Commas Between
    Strings. Commun. ACM 21 (1978), 315-317.
11. Gallager, R.G. Variations on a theme by Huffman. IEEE Trans. info.
    Theory IT-24, 6 (1378), 668-674.
12. Gasbarro, ). An Architecture for High-Performance VLSI Testers.
    Ph.D. dissertation, Department of Electrical Engineering, Stanford
    University, 1988.
13. Golomb, S.W. Run-Length Encodings. IEEE Trans. Info. Theory IT-12
    (1966), 399-401.
14. Gua^zo, M. A General Minimum Redundancy Source-coding Algorithm.
    IEEE Trans. Info. Theory IT-26. 1 (1980), 15-25.
15. Guoan, G., and Hobby. ). Using string matching to compress Chinese
    characters. Stanford Tech. Rep. STAN-CS-82-914, 1982.
16. Horspool, R.N., and Cormack, G.V. Dynamic Markov Modelling - A
    Prediction Technique. In Proceedings of the 19th Annual Hawaii
    Mernational Conference on System Sciences (1986), 700-707.
17. Huffman. D.A. A method for the construction of minimum-redundancy
    codes. In Procefiluigs of the 1.R.E. 40 (1952), 1098-1101.
18. Hunter, R., and Robinson, A.H. International digital facsimile
    coding standards. In Proceedings of the IEEE 68. 7 (1980). 854-867.
19. Jakobsson, M. Compression of character strings by an adaptive
    dictionary. B;T 25 (1985), 593-603.
20. Jones, C.B. An efficient coding system for long source sequences.
    IEEE Trans. Info. Theom IT-27. 3 (1981), 280-291.
21. Kempt, M., Bayer, R.. and Cuntzer, U. Time Optimal l.eft to Right
    Construction of Position Trees. Ada Informatica. 24 (1987), 461-474.
22. Knuth, D.E. The Art of Computer Programming, Volume 3: Sorting and
    Searching. Addison-Wesluy, second printing. 1975.
23. Knuth, D.E. Dynamic Huffman coding. /. Algo. 6 (1985), 163-180.
24. Langdon, G.G., )r. A note on the Ziv-Lempel model for compressing
    individual sequences f?E? Trail-. Info. Theory IT-29. 2 (1983),
    284-287.
25. Langdon, G.G., )r., and Rissanen. ). Compression of Black-White
    Images with Arithmetic Coding. IF-EE Trans. Commun. COM-29. 6
    (1981), 858-867.
26, Langdon, C.G., )r.. and Rissanen. ). A Double Adaptive File
    Compression Algorithm. IF.[:E Trans. Commun. COM-31, II (1983),
    1253-1255.
27. Lawler, E.L. Coml'inaiifii.il Optimization: Nelu'orks and Malroids.
    Holt. Rinehart and Winslon. 197b.
28, McCreight. E.M. A sp.K:e-ccoiiomical suffix tree construction
    algorithm. /. ACM 23. 2 (1976). 262-272.
29. Miller, V.S.. and VVegman, M.N. Variations on a theme by Ziv and
    Lempel. IBM Res. Rep. RC 10630 (#47798). 1984. Combinatorial
    Algorithms on Words, NATO ASI Series F. 12 (1985), 131-140.
30. Morrison, D.R. PATRICIA-Practical Algorithm To Retrieve information
    Coded in Alphanumeric. /. ACM 15, 4 (1968), 514-534.
31. Pasco, R.C. Source Coding Algorithms for Fast Data Compression.
    Ph.D. dissertation, Department of Electrical Engineering, Stanford
    University, 1976.
32. Rissanen, )., and Langdon, G.G., )r. Arithmetic Coding. IBM lournal
    of Research and Development 23. 2 (1979), 149-162.
33. Rissanen, )., and Langdon. G.G., )r. Universal Modeling and Coding.
    IEEE Trans. Info. Theory IT-27. 1 (1981), 12-23.
34. Rodeh, M., Pratt, V.R., and Even. S. Linear algorithm for data
    compression via string matching, f. ACM 28. 1 (1981), 16-24.
35. Shannon, C.E. A mathematical theory of communication. The Bell
    System Technical journal 27,3. 379-423 and 27, 4 (1948), 623-656.
36. Storer, ).A., and Szymanski, T.G. Data compression via textual
    substitution. /. ACM 29. 4 (1982). 928-951.
37. Vitter, ).S. Design and analysis of dynamic Huffman coding. Brown
    University Technical Report No. CS-85-13. 1985.
38. Weiner, P. Linear Pattern Matching Algorithms. Fourteenth Annual
    Symposium on Switching and Automata Theory, 1-11,1973.
39. Welch, T.A. A technique for high performance data compression.
    IEEE Comp. 17. 6 (1984). 8-19.
40, Witlen, I.H.. Neal, R.M., and Cieary ).G. Arithmetic coding for data
    compression. Commun. ACM 30. 6 (1987), 520-540.
41. Ziv, ). Coding theorems for individual sequences IEEE Trans. Info.
    Theory IT-24, 4 (1978), 405-412.
42. Ziv, )., and Lempel, A. A universal algorithm for sequential data
    compression. IEEE Trans. Info. Theory IT-23. 3 (1977), 337-343.
43. Ziv, ).. and Lempel, A. Compression of individual sequences v'a
    variable-rate coding. IEEE Trails. Info. Theory IT-24. 5 (1978),
    530-536.

CR  Categories  and  Subject  Descriptors:  E.4  [Data]:   Coding    and
Information Theory-data compaction and compression: F.2.2  [Analysis  of
Algorithms  and  Problem  Complexity):  Nonnumerical   Algorithms    and
Problems-computations on discrete structures, pattern  matching  General
Terms: Algorithms, Design, Experimentation, Theory Additional Key  Words
and Phrases: Amortized efficiency, automata theory, minimum cost to time
ratio cycle, suffix trees, textual substitution

                           ABOUT THE AUTHORS:

EDWARD R. FIALA
   is a member of the research staff in the Computer Sciences Laboratory
at the Xerox Palo Alto Research Center.  He has general system interests
including data compression and digital video.  Author's Present Address:
Xerox PARC, 3333 Coyote Hill Rd..  Palo Alto, CA 94304:  ARPA  Internet:
Fiala.pa@Xerox.com

DANIEL H. GREENE
   is presently a member of the research  staff  at  Xerox's  Palo  Alto
Research Center (PARC).  He received his Ph.D.  in  1983  from  Stanford
University.  His current  research  interests  are  in  the  design  and
analysis of algorithms for data compression, computational geometry, and
distributed computing.  Author's  Present  Address:  Xerox  PARC:,  3333
Coyote Hill Rd.. Palo Alto, CA 94304: Crcene.pa@Xerox.com