---------------------------------------------------------------------------
On a new LZ codec implementation

1. Rangecoder optimizations

These days it just doesn't make sense (not for PC use, anyway) to avoid
arithmetic coding, so even LZ should use it, of which LZMA is a great example.
As it turns out, the usual rangecoding-vs-bitcoding issues - multiplication,
renormalization and overall complexity - don't really matter anymore, thanks
to the probability-sorting idea, which allows decoding to run asynchronously
in parallel on another core.

Note that it has always been possible to run _encoding_ in parallel - the
model outputs a stream of symbols and intervals, and the coder just has to
combine these intervals into a solid code. Probability sorting is another step
along that line: given an array of probability;bit pairs, we can sort it by
probability, and then recover the bit sequence by looking up the probabilities
produced by the decoder's model (the bit paired with the x-th occurrence of a
given probability provides the context for the next probability, which allows
looking up another bit, etc).

The point of such sorting is that a sequence of bits with the same probability
can be entropy-coded without a model - in other words, to decode all the data
bits we would only need the probability counts, and the model would only be
needed to _unsort_ these bits into a copy of the input data. Thus the model
doesn't have to sync with the entropy coder.

Note also that symmetry (probability p for bit b is the same as 1-p for !b,
so p>0.5 can be avoided in entropy coding) and quantization (probability
values don't really have to be that precise - huffman approximates them by
2^-l and it still works) can be used to reduce the number of bit pools,
simplify the sorting, and reduce the redundancy of counts encoding. Also,
more efficient coding methods can be applied for specific probability ranges
- i.e. RLE for p near 0 and plain storage for p near 0.5.
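The sort/unsort mechanics can be sketched in a few lines of Python. This is a
toy illustration with hypothetical names (sort_bits, unsort_bits), ignoring
the entropy coding of the pools; in a real decoder the probability sequence is
produced incrementally by the model, since each recovered bit determines the
context of the next probability.

```python
# Toy sketch of probability sorting: the model emits (probability, bit)
# pairs; bits are grouped into pools keyed by probability, and the decoder
# replays the model's probability sequence to pull the bits back out.
from collections import defaultdict

def sort_bits(pairs):
    """Group the bit stream into per-probability pools, preserving order
    within each pool (the x-th occurrence of probability p stays x-th)."""
    pools = defaultdict(list)
    for p, bit in pairs:
        pools[p].append(bit)
    return dict(pools)

def unsort_bits(pools, prob_sequence):
    """Recover the original bit sequence: the model (replayed here as
    prob_sequence) names the next probability, and we pop that pool's
    next bit in order."""
    counters = {p: 0 for p in pools}
    out = []
    for p in prob_sequence:
        out.append(pools[p][counters[p]])
        counters[p] += 1
    return out

# toy model output, with probabilities already quantized to a few values
pairs = [(0.9, 0), (0.5, 1), (0.9, 0), (0.1, 1), (0.5, 0), (0.9, 1)]
pools = sort_bits(pairs)
assert unsort_bits(pools, [p for p, _ in pairs]) == [b for _, b in pairs]
```

Each pool here is exactly the "sequence of bits with the same probability"
that can be entropy-coded without the model.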
Also, for speed and simplicity, huffman coding can be used for the RLE counts
and bit strings instead of rc, although that doesn't make much sense if we
intend to run it in parallel with the model.

Anyway, let's suppose we've completely eliminated the entropy coding problem
and only have to consider the model for further speed optimization. With the
described method, the model would have to look up bits by computed probability
- hopefully symmetry, quantization and the different bit unpacking schemes
(RLE, and bits packed into uint32s, most likely) can be implemented without
branches. But normally this would still be slow if we had to do a lookup per
input data bit, as many modern CM models do (including ccm). This can be
improved by applying bitcodes before modelling (especially dynamic ones, like
unary coding by MtF rank), but it remains a question whether the lookup itself
can be optimized to output multiple bits at once.

- Automated probability quantization? (it can actually improve compression)
- How to read a bit by probability w/o branches?
  (read bits from an uint, then refill it from RLE etc?)
- Efficient bitcoding, especially dynamic?
  (freq/total to quantized-probability LUT conversion?)

2. LZ implementation

The main reason to use LZ is that its decoder can be fast and simple - fast
because it can output whole strings at once, which is good, especially
compared to modern CMs, which have to process every bit of the input data.
But the usual ways to encode these strings (offset;length in LZ77,
context;offset;length in ROLZ, context;length in LZP) allow the same string
to be encoded in multiple ways, and there are even more ways to split the
data into strings and literals, so LZ coding is redundant and commonly
compresses worse than CM/PPM.
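The LZ77 redundancy is easy to demonstrate: at a given position the same
longest match can usually be reproduced from several different offsets, so
several distinct (offset, length) codes decode to the same string. A minimal
sketch (hypothetical helper, not a real codec):

```python
# Enumerate every (offset, length) pair that reproduces the longest match
# at data[pos:]; each pair is an equally valid LZ77 code for that string.
def match_candidates(data, pos, min_len=3):
    cands = []
    for off in range(1, pos + 1):
        length = 0
        # note: this comparison intentionally allows overlapping matches,
        # as LZ77 does (the copy source may run into the copy target)
        while (pos + length < len(data)
               and data[pos + length - off] == data[pos + length]):
            length += 1
        if length >= min_len:
            cands.append((off, length))
    return cands

data = b"abcabcabcabc"
# at pos 6, the string "abcabc" can be coded at offset 3 or offset 6
assert match_candidates(data, 6) == [(3, 6), (6, 6)]
```

An optimal parser must therefore choose among these equivalent codes (and
among string/literal splits), while the code space still pays for all of them.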
Considering this, the least redundant LZ algorithm is LZ78 - we can guarantee
that different string codes won't encode the same strings, and other sources
of redundancy (like encoding a dictionary string as another string plus a
literal) can be masked. So the next questions are:

- How to build/maintain the dictionary(-ies) for indefinite volumes of data?
- How to efficiently compress dictionaries?
- How to compress word/literal sequences, taking word clustering into account?

2.1. Building a dictionary

- BWT for finding strings with enough occurrences
- text handling
- in-memory compression of the LZ window
- interval coding for radix sorting and variable-order ST
- a good static compression method for intermediate coding
- detecting blocks with different low-order stats (segmentation)

2.2. Compressing a dictionary

- dictpack attempts
- semantic dependencies
- morphology dependencies

2.3. Compression of word/literal sequences

- Masking
- Contexts (word code suffixes and prefixes)
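For reference, the low-redundancy property of LZ78 follows from its code
structure: each code is a (parent_index, next_char) pair, so every dictionary
entry names a distinct string, and two different codes can never decode to the
same phrase. A minimal sketch (illustration only, no entropy coding):

```python
# Minimal LZ78: each emitted code extends an existing dictionary phrase
# by one byte, so the phrase->index map stays a bijection.
def lz78_encode(data):
    dict_ = {b"": 0}              # phrase -> index; 0 is the empty phrase
    out, phrase = [], b""
    for ch in data:
        ext = phrase + bytes([ch])
        if ext in dict_:
            phrase = ext          # keep extending the current phrase
        else:
            out.append((dict_[phrase], ch))
            dict_[ext] = len(dict_)
            phrase = b""
    if phrase:                    # flush an incomplete final phrase
        out.append((dict_[phrase[:-1]], phrase[-1]))
    return out

def lz78_decode(codes):
    phrases = [b""]
    out = b""
    for parent, ch in codes:
        phrase = phrases[parent] + bytes([ch])
        phrases.append(phrase)
        out += phrase
    return out

data = b"abababcabab"
assert lz78_decode(lz78_encode(data)) == data
```

The open problems listed above start from this baseline: the dictionary here
grows without bound and is rebuilt from scratch per stream, which is exactly
what 2.1/2.2 are about.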