---------------------------------------------------------------------------
Backup service / Archiver implementation notes
2009-07-27 - v0 - initial version

1. Far Match Analysis (FMA; aka REP filter) / file sorting

- Blockwise hash calculation is required both for far-match finding and
  version difference analysis.
- It seems reasonable to use crc32 as the main hash, as it's fairly fast
  (2 xors + 1 shift + a small table lookup per byte), standard (there are
  releases with the crc32 in the name, so a per-file crc32 is useful to
  have), and a "sliding" version is also available (the raw crc32 is
  linear, so the crc32 of a string's suffix equals the crc32 of the whole
  string xor the crc32 of its zero-extended prefix).
  > It might be sensible to extend the crc32 hash with a few bytes of the
    block data (to avoid collisions).
- The bounds of a match should be extended by comparison with the actual
  occurrence(s).
  > ...and then encoded as RLE if it's an overlapped match?
  > Seems like there'd be two match lengths - before and after the base
    (hashed) block.
- So it seems that a match/literal structure file would be required, in
  addition to the filtered file content with matches removed.
- Parallel analysis might be possible, as the filtered data are still
  usable for secondary processing (so by using overlapped blocks it would
  be possible to process the whole data almost in parallel without a
  noticeable compression drop).
  > Although i/o would probably be the bottleneck here.
- It's reasonable to expect that it would be faster to read the data in
  the order of name occurrence in the (depth-first) directory scan. Then,
  while reading the data, it won't be much of a slowdown to also perform
  FMA. Usually the directory structure is scanned and sorted, and files
  with the same size and name are expected to be identical, but that has
  to be checked anyway - which would be done during FMA.
- So we can have a directory scan + file read thread, then an FMA thread,
  and then a segmentation thread (over FMA output?).
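The sliding property mentioned above holds exactly for the raw crc32 variant (zero initial value, no final xor), which is linear over GF(2). A minimal Python sketch of a rolling block hash built on that property; the window size, table construction, and function names are illustrative, not part of the notes:

```python
# Toy rolling crc32 for far-match block hashing. Assumption: the "raw"
# CRC (init = 0, no final xor) is used, which makes the hash linear, so
# the window hash can slide one byte at a time.

POLY = 0xEDB88320  # reflected CRC-32 polynomial

TABLE = []
for n in range(256):
    c = n
    for _ in range(8):
        c = (c >> 1) ^ (POLY if c & 1 else 0)
    TABLE.append(c)

def crc_raw(data, crc=0):
    """Raw (linear) crc32: zero init, no final xor."""
    for b in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ b) & 0xFF]
    return crc

W = 16  # sliding-window block size (arbitrary here)

# Per-byte-value CRC of "byte followed by W zero bytes" - used to cancel
# the contribution of the byte leaving the window (linearity of raw CRC).
OUT = [crc_raw(bytes([v]) + b"\x00" * W) for v in range(256)]

def slide(crc, byte_out, byte_in):
    """CRC of the window shifted one byte to the right."""
    crc = (crc >> 8) ^ TABLE[(crc ^ byte_in) & 0xFF]  # append new byte
    return crc ^ OUT[byte_out]                        # drop old byte

data = b"some sample data for the far-match analysis sliding hash test"
h = crc_raw(data[:W])
for i in range(1, len(data) - W + 1):
    h = slide(h, data[i - 1], data[i + W - 1])
    assert h == crc_raw(data[i:i + W])  # matches direct computation
```

The same table-driven update works for anchored blocks too; only the points where the window hash is sampled change.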
- It's not necessary to really generate any FMA-filtered data, as it
  would probably be faster to repeat the scan based on the FMA structure
  info than to first write the filtered data and then read and process it
  again (and that's not even considering disk usage issues).
- FMA should also produce the blockwise hashes for file version
  comparison on archive updates - either fixed or (preferably) anchored.

2. Segmentation Analysis (SA)

- It's normally very useful to estimate the data redundancy first, before
  actually compressing the data (and then to process the different data
  types separately).
- Supposing that the especially redundant data (literal runs and long
  matches) would be taken care of by FMA, we only need to pay attention
  to random-like "incompressible" data.
- SA can be performed by processing the data with a reasonably good (but
  really fast) compressor with disabled output. A good candidate would be
  a CM based on a static LUT mapping (without interpolation) of two
  fsm-hashed counters (produced from an optimized DC) - a single counter
  would be faster but much less stable.
  > This analysis can be performed in parallel with FMA / the initial
    scan.
  > Actually it might make sense to run even more counters with different
    contexts in parallel - to determine the optimal PPMd orders for data
    fragments.
- It seems reasonable to use a different approach from Shkarin's: FMA
  references can be used to determine segment similarity, and average
  redundancy fluctuations should allow detecting the borders between
  different data types.
- It also seems reasonable to perform _double_ redundancy estimation
  (from start to end and from end to start) - that would really improve
  the detection precision and might even allow reversing some blocks for
  compression, if that seems beneficial. But a proper implementation of
  such a scan would be considerably slow.
  > Still, it might make sense to do this using blockwise data reversal -
    that's not really the same, but it would still provide the mentioned
    benefits.
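The border-detection idea in section 2 can be illustrated with a much-simplified probe. Here an order-0 entropy estimate stands in for the code length of the fast CM described above, and the block size and jump threshold are placeholder assumptions:

```python
# Toy SA border detection: estimate per-block redundancy and mark a
# segment border where the estimate jumps. Order-0 entropy is a crude
# stand-in for the fast-CM code length; block size and threshold are
# illustrative assumptions.
import math
import random
from collections import Counter

def block_entropy(block):
    """Bits per byte, order-0 estimate (stand-in for CM code length)."""
    n = len(block)
    return -sum(c / n * math.log2(c / n) for c in Counter(block).values())

def find_borders(data, bs=256, jump=2.0):
    """Offsets where estimated redundancy of adjacent blocks differs by
    more than `jump` bits/byte."""
    ents = [block_entropy(data[i:i + bs])
            for i in range(0, len(data) - bs + 1, bs)]
    return [(i + 1) * bs
            for i, (a, b) in enumerate(zip(ents, ents[1:]))
            if abs(a - b) > jump]

random.seed(0)
# text-like (redundant) segment followed by a random-like one
data = b"aaaaabbbbb" * 100 + bytes(random.randrange(256) for _ in range(1000))
print(find_borders(data))  # border near offset 1000, at block granularity
```

A real implementation would run the probe adaptively over the stream (and, per the notes, also in reverse) rather than on independent blocks, but the detection criterion is the same kind of redundancy jump.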
- It should be possible to just use the best of the counter estimations
  in place of "perfect" mixing.

3. Dictionary preprocessing

- In theory it seems like a good idea, but unfortunately it requires a
  lot of research and non-trivial compression algorithms with support for
  large alphabets. So it's hard to say anything about implementation
  details atm.

4. Actual Data Compression

- For backup purposes it's in fact more reasonable to use PPM/CM instead
  of LZ: long matches would be handled by FMA anyway, the compression
  speed of LZMA and PPMd is basically the same (or PPMd is even faster),
  and the compression ratio of PPM/CM is generally better (and it's much
  more flexible with data-specific models etc). Anyway, encoding is the
  main task in a backup/restore application, as it's normally performed
  much more frequently; also internet transfer speed is usually much
  slower than any reasonably fast compression algorithm, so LZ's fast
  decoding has no benefit here (supposing that (de)compression would be
  overlapped with network i/o).
- Thus it seems obvious to use ppmd_sh as the main codec, though the tree
  shrinking functionality probably has to be removed first, and maybe
  other planned modifications applied too (tree and parsing redesign,
  rangecoder reattaching etc).
- Instead of tree shrinking / flushing, memory overflows can be handled
  by overlapped data processing (like in ash).
  > It actually might be a good idea to also use such points to split the
    data into solid blocks (not sure whether overlapped coding can be
    used in that case though).
- So, after FMA+SA+sort on the first pass, we'd start actually
  compressing the data, taking into account the file/segment reordering
  info from the analysis phase. Blocks matched by FMA or known from the
  versioning mechanism would simply be skipped.

5. Archive Format

- So there'd be file system structure info, with data hashes for each
  file, then FMA, SA and solid-block info.
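The overlapped-restart idea from section 4 can be sketched as follows. This is only a toy: the order-1 frequency model, the sizes, and the restart policy are illustrative assumptions, not ppmd_sh or ash internals - the point is just "on overflow, drop the model and re-prime it on the trailing overlap instead of shrinking it":

```python
# Toy sketch: handle model-memory overflow by overlapped restart (like
# in ash) instead of tree shrinking. All names and sizes are
# illustrative assumptions.
import math
import random

OVERLAP = 512        # bytes re-fed into the fresh model after a restart
MAX_CONTEXTS = 2000  # stand-in for the PPM tree memory budget

class Order1Model:
    def __init__(self):
        self.counts = {}   # prev_byte -> {byte: count}
        self.totals = {}   # prev_byte -> total count
        self.distinct = 0  # number of allocated (context, byte) nodes

    def cost_and_update(self, prev, byte):
        """Laplace-smoothed code length estimate in bits, then update."""
        ctx = self.counts.setdefault(prev, {})
        total = self.totals.get(prev, 0)
        n = ctx.get(byte, 0)
        bits = -math.log2((n + 1) / (total + 256))
        if n == 0:
            self.distinct += 1
        ctx[byte] = n + 1
        self.totals[prev] = total + 1
        return bits

def compress_estimate(data):
    """Bits the toy model would need; restarts with overlap on overflow."""
    model, bits, prev, restarts = Order1Model(), 0.0, 0, 0
    for i, b in enumerate(data):
        if model.distinct > MAX_CONTEXTS:
            restarts += 1
            model, p = Order1Model(), 0  # drop the old model entirely...
            for ob in data[max(0, i - OVERLAP):i]:
                model.cost_and_update(p, ob)  # ...and re-prime on overlap
                p = ob
        bits += model.cost_and_update(prev, b)
        prev = b
    return bits, restarts

random.seed(1)
data = (b"the quick brown fox jumps over the lazy dog. " * 200
        + bytes(random.randrange(256) for _ in range(5000)))
bits, restarts = compress_estimate(data)
print(round(bits / 8), "coded bytes,", restarts, "overlapped restarts")
```

The restart points are also natural candidates for the solid-block boundaries mentioned above, since the model state after a restart depends only on the overlap window.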
- FS structure should include the unicode filename, 64-bit size, windows
  FILETIME struct, attribute word, and crc32 file hash.
- In the case of a local archive, the structural info should also be
  compressed and appended to the archive. But for a remote backup it's
  better to keep it in separate files, and probably uncompressed too.
  > Filesystem structures for different users should probably be kept as
    separate files as well (or the whole structural info then?).
- For an archive update, we'd perform FMA of the supplied files, then
  filter the result using the known blockwise hashes from the archive,
  and add the unmatched content to the archive as new file versions.
  > "File versions" are the same as any other files, they just have the
    same name.
- Unfortunately, all the numeric values in the archive would have to be
  64-bit (probably including things like the number of file versions
  etc).

6. Implementation Plan

- FMA has to be implemented first
  - anchored crc32
  - single file processing (filtering, with the filtered data on output)
    - encoding
    - decoding
  - multiple file processing (with a single content stream on output)
  - file archiving utility based on the "scan5" source
    - FMA archive encoding
    - FMA archive decoding
- Segmentation
  - fast CM implementation development / optimization
    > should have a very small memory footprint - like 1M maybe
  - segmentation-specific speed optimization (rc removal, counter
    paralleling)
  - average redundancy analysis efficiency investigation
- Preprocessed data
  - reversed blocks
  - E8
  - text filters?
- ppmd_sh
  - replacement of tree shrinking with overlapped coding
  - removal of functions/structures only necessary for tree shrinking
  - parser redesign using an ash-style context array
  - tree simplification
    - simple new/delete-based allocation
    - suffix pointer removal
    - context node merge with the counter table (counter table pointer
      removal)
    - new custom node allocation (ash-style)
  - rangecoder reattachment
  - separate compression and rangecoding? probability sorting
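The FS structure record described in section 5 could be serialized along these lines. The field order, the length-prefixed UTF-16 name, and the little-endian layout are assumptions for illustration, not a fixed format (FILETIME is the usual 64-bit count of 100 ns intervals since 1601):

```python
import struct

# One FS-structure record: unicode name, 64-bit size, FILETIME,
# attribute word, crc32 file hash. Field order and the 16-bit name
# length prefix are illustrative assumptions, not a defined format.

def pack_fs_record(name, size, filetime, attrs, crc32):
    name_utf16 = name.encode("utf-16-le")       # unicode filename
    return (struct.pack("<H", len(name_utf16))  # name length in bytes
            + name_utf16
            + struct.pack("<QQHI", size, filetime, attrs, crc32))

def unpack_fs_record(buf):
    (nlen,) = struct.unpack_from("<H", buf, 0)
    name = buf[2:2 + nlen].decode("utf-16-le")
    size, filetime, attrs, crc32 = struct.unpack_from("<QQHI", buf, 2 + nlen)
    return name, size, filetime, attrs, crc32

rec = pack_fs_record("docs/report.txt", 123456789,
                     0x01C9FEDCBA987654, 0x20, 0xDEADBEEF)
assert unpack_fs_record(rec) == ("docs/report.txt", 123456789,
                                 0x01C9FEDCBA987654, 0x20, 0xDEADBEEF)
```

Widening the remaining fields to 64-bit (per the note above) would just mean replacing the `H`/`I` format characters with `Q` at the cost of a few bytes per record.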