---------------------------------------------------------------------------
Key points for recompressor implementation, based on lzmarec experience.

1. stream interleaving

>strsign.txt
>http://stackoverflow.com/questions/5145558/adding-control-symbols-to-a-byte-stream

It's still a question how to encode the size of a data block within a stream. The easiest and most obvious solution - adding a length prefix - doesn't really work with streams, because when blocks longer than the buffer size are possible, encoding the block size after reaching the block end requires random access to the output stream.

In that sense, the approach with escape markers (used in lzmarec) seems like a good solution: encoding is really simple (just write a marker), there are no block size limits, and redundancy is normally minimal. Unfortunately there are issues too:
- only a GetByte function is currently implemented (for sequential reading of the masked stream), but no buffer processing; and that GetByte contains a complex state machine, where two bugs already had to be fixed.
- there remains a possibility of an "exploit", where the output stream would be expanded by 33% if the input happens to be a run of escape markers. To prevent that, it's possible to use dynamic markers (or fast encryption of the unencoded input), but that would considerably complicate things.
- it hurts speed, because the only easy way is to add masking to the byte i/o functions. If we instead tried to filter whole buffers normally after a coroutine yield, that could produce more output than necessary (up to 33% more), and it's impossible to do in place because of the data expansion.

Overall that seems like a lot of moving the data around, while the ideal solution is to read a buffer from one file and write it to another without any data movement. So it's necessary to either find an efficient implementation of masking with buffered processing, or consider something different.
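For illustration, here's a minimal single-byte variant of such escape-marker masking (a sketch only: the marker value `0xAA` and the `END`/`LIT` code bytes are made up, not lzmarec's actual scheme, and the worst case for a run of escape bytes here is 100% expansion rather than the 33% mentioned above):

```python
ESC = 0xAA  # hypothetical escape value, not lzmarec's actual marker
END = 0x00  # code meaning "end of block"
LIT = 0x01  # code meaning "literal escape byte occurred in the data"

def mask(block):
    """Escape occurrences of ESC in the data and append an end-of-block marker."""
    out = bytearray()
    for b in block:
        if b == ESC:
            out += bytes([ESC, LIT])   # stuff the escape byte
        else:
            out.append(b)
    out += bytes([ESC, END])           # terminate the block
    return bytes(out)

def unmask(stream):
    """Decode one masked block; return (data, number of stream bytes consumed)."""
    out = bytearray()
    i = 0
    while i < len(stream):
        b = stream[i]
        if b == ESC:
            code = stream[i + 1]
            i += 2
            if code == END:
                return bytes(out), i   # marker found - block is complete
            out.append(ESC)            # LIT: restore the literal escape byte
        else:
            out.append(b)
            i += 1
    raise ValueError("unterminated block")
```

A buffered version of this would scan for ESC and pass clean runs through untouched, which is roughly the "masking with buffered processing" that the note above asks for.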
The solution with a 1-2-3 byte buffer length prefix seems like it could be faster, but:
- it's not easy to implement;
- there's that annoying fixed redundancy proportional to file size (while ideally it should be possible to make it 0, or at least constant, for unprocessed files);
- it's too buffer-specific. Having to change the size of the output buffer would make it really ugly.

The EOF-finding problem only applies to interleaving of raw streams, like in lzmarec v2. But v3+ has EOFs in the recompressed streams anyway, so technically the marker after EOF in a recompressed stream is not necessary, and then there's no need to mask the recompressed stream either. Btw, it seems that enabling EOF coding for lzma_sample increases the output size by 27 bytes, while there are actually 30 streams, so it looks pretty efficient.

1.1. end-of-stream coding ideas not mentioned in the SO post

- dynamic markers for small blocks.
  With 64k blocks it's always possible to find 1 unused 16-bit value, so we can encode it at block start and block end. It's certainly different from the other ideas, but in the end it just adds more overhead than buffer length coding, without any real benefits.
- arithmetic coding of the whole output stream.
  Looks like the real solution in the end. It's a "slow" idea by itself, but if we're supposed to use custom rc-based coders for most of the data anyway, it looks like the most efficient one. Although in the lzmarec case it requires passing the "skipped bytes" through the rc as well, which is deadly for "open" recompression (where external codecs are used, like implied precomp usage), because the rc would garble the data. But an integrated solution would be another matter, I guess.
  Another issue is related to simple parsing of the interleaved stream. Escape markers would make it easy (so at least the "skipped bytes" would always be accessible), while wrapping it all into arithmetic code would make it completely solid - no error tolerance etc. Also fast parsing is helpful for recompressed audio files, for seek implementations.
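The "always 1 unused 16-bit value" claim above follows from the pigeonhole principle: a block of at most 64KB contains at most 65535 (overlapping) 16-bit byte pairs, while there are 65536 possible values. A sketch of the marker search (function name and bitmap layout are made up):

```python
def find_unused_marker(block):
    """Return a 16-bit value that never occurs in `block` at any byte
    offset (checking overlapping pairs, so the marker can't appear even
    unaligned). Guaranteed to exist for blocks of <= 64KB."""
    assert len(block) <= 65536
    seen = bytearray(8192)                 # bitmap over all 65536 values
    for i in range(len(block) - 1):
        v = block[i] << 8 | block[i + 1]   # big-endian pair at offset i
        seen[v >> 3] |= 1 << (v & 7)
    for v in range(65536):
        if not seen[v >> 3] & (1 << (v & 7)):
            return v
    raise AssertionError("unreachable for blocks <= 64KB")
```

The marker would then be written before and after the block; the cost is 2 bytes of marker value plus 2x2 marker bytes per block, which is indeed more overhead than a plain length prefix.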
But then arithmetic coding makes it all much easier to process, and removes quite a few sources of bugs, and seek support is best implemented with an index table anyway... though in this case we'd have to store the whole rc state in the index entries.

1.2. deflate/EOF problem

Frequently the recompressor is just a custom CM coder, in which case explicit EOF coding via rangecoding is okay. But deflate recompression requires processing "skipped bytes" and bytes decoded from deflate with the same coder. However, somehow that seems okay too, if there's a control stream for deflate recompression (aside from the main codec, which would process "skipped bytes" and deflate data). I guess we'd just have to mix rc calls related to multiple streams within the same code, but that's intended anyway (psrc and all).

2. recompressed stream termination

This problem somehow reappears once in a while. For example, when file creation (and thus index parsing) lagged behind decoding, an EOF had to be added to the stream to let the decoder know where to stop. Thus redundant information (the EOF flag) had to be added to the code, while similar information was already stored in another place (the index). I guess in that case the better way was to unsync index parsing from file creation and sync it with decoding instead.

Now, a similar thing happened in lzmarec - the stream already had a terminator (a marker), but we needed a different terminator (EOF in the model) to properly terminate the output stream. It's a little different though (here EOF really encodes some unique information, while in the uncompressed case the output size was simply equal to the input size), but it's certainly possible to reduce the overhead (and maybe improve speed somehow) by only enabling EOF coding after input EOF is reached. Additionally, the rc flush could be cut down. But then again, if the whole output stream is produced by the same rc instance, no flush optimizations are necessary anyway.

3.
coroutine linking

- there's a problem with the coroutine implementation, specifically with the stack - stack copying (presumably) adds noticeable extra complexity, so the current workaround is to explicitly set stack offsets for different coroutine instances.
- one case is related to coroutine/thread pairing, like what's done in psrc. There's already a class for thread objects, and the coroutine interface is used to enhance simple coroutines with buffer queues, synchronization with other threads, etc. The concept seems surprisingly solid even now - there's really no place for buffer management code within the processing functions. But it's still clearly necessary to rework the buffer ring implementation (move buffer content sizes into a separate table, instead of interleaving them with the buffers; also deal with buffer index wraps; actually it's probably a good idea to use buffer pointers here - that might allow avoiding data copying somewhere). However, the main question is still how to implement a generalized "coroutine manager" object, which would pass input buffers from the input ring to the coroutine, accept filled output buffers from the coroutine, wait for i/o, etc. The 4 thread implementations in psrc had a lot in common, but there were subtle differences too. Threads surely solve the coroutine stack problem, though...
- it's also a question how to make an efficient single-threaded version from a bunch of linked "coroutine manager thread" objects. Well, I guess they have to be "coroutine manager routines" first, with threads added when necessary. It would be cool if MT could be enabled/disabled just by renaming a base class or something like that.
- btw, the right way is probably to make the thread interface look like a coroutine call somehow. I guess it's possible to do stuff like queue buffers for the coroutine and immediately return r1 instead, same with output, and only wait when there's no i/o.
  Also, coroutine states can actually be checked better with r&mask instead of r==x (because r&(4|1) is faster to test than (r==2)||(r==0)), so it's better to take that into account.
- so the "coroutine manager thread" class should overload the coroutine interface (process, addinp, addout), to make it easy to replace a simple coroutine with it. Then maybe it also makes sense to provide a coroutine linker class, so that two coroutines wrapped in it would look like one. Plus a third "coroutine-like" class, which would just always return normally (without any setjmp overhead though), but look like a coroutine too. It's a bit hard to design tests for that, though.

4. input buffering

lzmarec required having 64k ahead and behind; the lzmarec implementation seems quite ok actually, but those memcpys are not very nice. It's an important thing though, because in most cases start detection requires some lookahead, and end detection requires some unget. Lookahead/lookbehind can probably be implemented with a special buffer ring with a "fake" last buffer which mirrors the first one (* check whether it's possible to map pages like that *), same with the first buffer, I guess (the current pos pointer can't be in the first or last buffer, except maybe at file start/EOF). It requires buffer copying, but that should be ok if the ring is large enough. Also an important point is that such a buffer sometimes has to work as an LZ window.

5. resync after stream termination

Sometimes, like with lzma streams, the end of stream is defined by a length field which is not part of the stream. In this case, decoding would continue past the actual end of stream, until some invalid code occurs. It's uncertain whether this applies to deflate (probably not), and it doesn't apply to jpeg, but it maybe does for mp3 (eg. in a video container, the data following an mp3 chunk may start with something that looks like a valid mp3 header), and probably for other formats. Also there's always the case of incomplete streams.
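The "fake" mirrored-buffer ring from the input buffering point above could be sketched roughly like this (the slot layout and sizes are assumptions; with page mapping, the explicit mirror copy would disappear):

```python
class LookaheadRing:
    """Ring of nbuf fixed-size buffers plus one extra 'fake' slot that
    mirrors slot 0, so a read near the wrap point stays contiguous
    (no split into two memcpys). Sketch only."""
    def __init__(self, nbuf=4, bufsize=65536):
        self.nbuf, self.bufsize = nbuf, bufsize
        # nbuf real slots + 1 mirror slot in one flat array
        self.data = bytearray((nbuf + 1) * bufsize)
        self.head = 0          # index of the next slot to fill

    def fill(self, chunk):
        """Store one full buffer of input into the next ring slot."""
        assert len(chunk) == self.bufsize
        slot = self.head % self.nbuf
        off = slot * self.bufsize
        self.data[off:off + self.bufsize] = chunk
        if slot == 0:          # mirror slot 0 into the fake last slot
            off2 = self.nbuf * self.bufsize
            self.data[off2:off2 + self.bufsize] = chunk
        self.head += 1

    def read(self, pos, n):
        """Contiguous read of n (<= bufsize) bytes at absolute stream
        position pos; may cross the wrap boundary via the mirror slot."""
        off = pos % (self.nbuf * self.bufsize)
        return bytes(self.data[off:off + n])
```

The same structure can serve as the LZ window, as long as the ring capacity covers the required match distance; the remaining copying is one bufsize memcpy per wrap.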
It should be easier with huffman streams though, because huffman is a fixed-delay code, unlike arithmetic coding with carries, where the delay can be infinite.

So here's what had to be done in lzmarec to stabilize tail handling:
- encoder simulation, just to precisely count the number of bytes which the rc encoder would produce at each point of decoding. Ideally there should be a way to calculate that from the decoder state alone, somehow.
- keeping an 8-byte fragment of code (with 4-byte lookahead) and matching it against the possible flush code at each iteration.
- atomic output per iteration. It would be much harder to capture and reproduce all the possible partial codes and decoding errors otherwise.
- caching of a large enough chunk of recent iteration outputs (2048 atm), with corresponding encoder output sizes, and dumping it only when a flush is detected (because at that point it's certain that the part up to it can be correctly reproduced). Caching is mainly required because all streams terminate with an error. Even after flush detection, the decoder proceeds further (because there are accidental flush matches too), and if we'd actually pass the data after the flush to the coder, there's no going back.

So this trick would probably apply to deflate as well. On the other hand, lzmarec stores decoder output in a relatively large buffer, and only passes it to the coder after checking for minimum stream size. So it could probably be merged with the cache somehow... though the rcount values (encoder output sizes) are not present in the output.

Anyway, it looks like the decoder's output handling is not that simple either - it should be possible to cleanly discard all effects of a single decoder call, and to have enough fixed delay to fall back to the flush point after termination on error. It's tough, and afaik it all applies to deflate too.
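A heavily simplified sketch of the flush-match caching idea (the real lzmarec version also keeps per-iteration encoder output sizes and a 2048-entry cache, and accidental flush matches need smarter handling; the flush pattern here is a placeholder):

```python
FLUSH = b"\x00\x00\x00\x00"   # hypothetical rc flush pattern, not lzmarec's

def recover_tail(iterations, window=8):
    """Cache per-iteration decoder outputs and commit only up to the last
    point where recent code bytes matched the expected flush pattern.
    `iterations` yields (decoded_bytes, encoder_code_bytes) pairs."""
    cache = []       # decoded output per iteration (atomic units)
    committed = 0    # number of iterations confirmed reproducible
    tail = b""       # sliding window of recent code bytes
    for out, code in iterations:
        cache.append(out)
        tail = (tail + code)[-window:]
        if FLUSH in tail:
            committed = len(cache)   # flush seen: data so far is safe
            tail = b""               # don't re-match the same flush bytes
    # everything decoded after the last flush match is discarded,
    # since streams normally terminate with a decoding error
    return b"".join(cache[:committed])
```

The key property is that output stays atomic per iteration, so discarding everything after the last confirmed flush cleanly undoes the error tail.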
An interesting possibility is measurement of the coder output rate - to prevent recompression from expanding the data, and also to improve tail detection (if the coding rate drops after some point, then it's likely random data from that point on). The problem is that ideally this requires the coder model to support undo to some extent... ie logging of statistics updates etc, which is slow and ugly. On the other hand though, the coder won't be resumed anyway, in which case it's only necessary to delay the coder output and to track an input-to-output-size log for recent input. This way it might actually work (for deflate etc), although the additional tracking would still slow down encoding (a little).

Even if the backend is psrc, the encoder would still write quant bytes for it sequentially, so it would be possible to discard a part of the output even more easily than with normal rc. In fact, maybe this whole entropy control thing only has to be implemented with psrc, because the quant output from the model also helps with entropy control (it can be implemented simply by summing codelengths looked up by quants) - it would be imprecise, but expansion control seems like a nice asymmetric feature.
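The codelength-summing idea could look like this (a sketch: the quant count and the fixed-point scaling are assumptions, not psrc's actual format). Each quant indexes a precomputed codelength table, so the size estimate is just a table lookup and an add per coded bit:

```python
import math

NQUANT = 64   # assumed number of probability quants
# codelength per quant, in fixed-point 1/256-bit units;
# (q + 0.5) / NQUANT approximates the probability of the coded symbol
CODELEN = [round(-math.log2((q + 0.5) / NQUANT) * 256) for q in range(NQUANT)]

def estimate_bits(quants):
    """Approximate encoder output size (in bits) by summing table
    codelengths for the quantized probabilities of the coded symbols."""
    return sum(CODELEN[q] for q in quants) / 256.0

def looks_random(quants, nbytes):
    """Expansion control: if the estimated coded size approaches the raw
    input size, the data from this point is likely incompressible."""
    return estimate_bits(quants) >= 8 * nbytes
```

This is imprecise (quantization error accumulates), but as the note says, the check only has to be good enough to stop recompression from expanding the data, and it costs almost nothing on the encoding side.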