---------------------------------------------------------------------------
Practical CM design

- It's much easier to parallelize the encoding, but so be it: a coder with 10 MB/s encoding and 5 MB/s decoding is better than a symmetric coder with 5 MB/s for both.
- The main bottleneck is memory, but it's only slow when an instruction depends on a value from memory, and at least in encoding it's always possible to prefetch as many locations as we want. In decoding it's actually also possible, at least a symbol ahead - when a symbol is already known to the decoder, there can still be a bunch of counters (etc.) to update.
- Storing sequentially accessed structures sequentially in memory is very good too. It's practically impossible with hashtables, but readily available for trees... I'd imagine reading context nodes, updating them, and rewriting them to a buffer (e.g. 1M in size). This is also very different from the allocators used in ppmd or ash, which is good, I guess.
- [Weeding] This is a method for discarding outdated statistics, similar to what lzma and zlib do, but continuous instead of all-at-once. The idea is to do tree cleanup all the time (possibly in sync with i/o) in small blocks, instead of processing the whole tree at once, which causes a noticeable delay. For a tree with sequentially written nodes, as described above, this means that the nodes have to include a parent pointer (the address of the parent's pointer to oneself, to be precise). Which is not very efficient, but doable. Thus we get a tree design like this: there are variable-size nodes, but they're just rewritten sequentially to a buffer block, while the original instances are "freed". When the block fills up, we get the next one and continue. Meanwhile, in the background, there's a "garbage collector" which processes chunks of the tree (the same blocks? then they'd have to be more like 64k, not 1M); sequential storage makes it easy to parse the variable-size nodes.
  This cleanup process simply shifts the nodes while removing unused ones, so some free space appears at the end of a block... though I guess there's no need to shift stuff on a per-block basis - it can just continue until there's a whole free block, at which point it can be given to the main thread... but on the other hand, having to wait for a whole free block may take much more time than allocating just some space.
  * The same "garbage collector" can also delete too-old or low-probability nodes.
  * It would make sense to collect some allocation statistics (how much tree memory is consumed per block of input), which would allow computing a "window size" (out-of-window nodes are deleted) such that the nodes written while processing the next input block are balanced by the nodes deleted by the "garbage collector" at the same time.
  * There's the problem of two processes writing to the same memory. It can be avoided by interleaving the "weeding" iterations with the processing iterations, but being able to run the "garbage collector" as an async thread also sounds good. One solution is to not perform the "weeding" of the working block (where new nodes are written), and to let the weeding thread update the parent-child pointers (added to a special queue). Then again, presumably most of the parent nodes would also be stored in the work block...
- [Hashcoding] The idea is that the nodes would contain some kind of binary trees with shorter paths than full symbol codes (for redundant data, at least). It may be possible to use multiple huffman tables and specify the table index in the node, but table allocation would become a problem too, and there's a good reason to require that a "suffix node" use the same code, which would be hard to maintain. Anyway, an interesting alternative is to hash the symbols (well, xor them with a hash of the context or something) and then stop branching once a unique suffix (of the symbol code's bits) is reached.
  That way, supposedly, we would have log2(alphabet_size) bits per symbol on average, with a completely static code (while original symbol codes generally require the full 8 bits to differentiate symbols, even when there are only 2 symbols in the alphabet). However, something like o1 huffman coding would be much more efficient in that sense... but also more complicated and not static. Maybe implement the hashcoding first, then compare its performance with static o1 huffman, and then find a way to implement a somewhat dynamic huffman (blockwise-static) if the huffman version really turns out that much better. There's a problem with the hashcoding context though - we need at least o2 there, which means that contexts which don't include that o2 won't use the same coding. Well, there's no way to directly map o2 hashcodes to the original ones, so for o1 and o0 we'd either have to switch to masking and a linear scan (which would be bad), or just avoid them completely and use SSE or something instead.
- [FSM] Counters have to be bytewise fsms. Or 16-bit fsms, if it turns out possible to make them much more efficient than 8-bit ones. (Note that a full 16-bit state lookup won't fit into the cache, so the real state can only be ~12 bits at most; but it may be possible to store history bits in addition to the state and use them somehow.) Also it could be cool to make an fsm logistic mixer, but that's only feasible by completely avoiding 3D tables somehow (do the linear mixing in the logistic domain explicitly; update the mixer weight using a function of the inputs), which might hurt the precision too much. Then again, mixing doesn't use much memory, so it may be fine as is - as something to do while prefetching the counters.
- Encoding threads:
  0. Read/write
  1. Weeding (garbage collection)
  2. Context lookups (nodes copied to work buffer in order of access)
  3. Probability estimation (fsm lookups, mixing)
  4. Context updates
  5. psrc (secondary model + rc)
  6. [opt] LZ matchfinder thread
  7. [opt] LZ parsing optimizer thread
- Decoding threads:
  0. Read/write
  1. Weeding (garbage collection)
  2+3+4. Merged model, with some possible prefetching
  5. psrc
- The list of proposed improvements:
  1. New cache-friendly structure for counter storage
  2. Hashcoding or huffman coding for symbols in tree nodes (faster lookup)
  3. fsm counters (speed+compression)
  4. Bitwise escaping to lower orders
  5. Length coding instead of unary escape flags
  6. Stats prefetching (especially at encoding)
  7. MT-friendly design
  8. psrc
  9. [opt] LZ77 with optimal parsing as a base format (just with support for context-coded matches)
  A. Explicit alignment context in the context model
- [psrc improvement]
- [LZ77 base]
- [FMA]
- [Number models]