---------------------------------------------------------------------------
Backup service / Archiver implementation notes
2009-07-27 - v0 - initial version

1. Far Match Analysis (FMA; aka REP filter) / file sorting

- Blockwise hash calculation is required both for far-match finding and
  version difference analysis.
- It seems reasonable to use crc32 as the main hash, as it's fairly fast
  (2 xors + 1 shift + a small table lookup per byte), standard (there are
  releases with the crc32 in the name, so a per-file crc32 is useful to
  have), and a "sliding" version is also available (the raw crc32 is
  linear, so the crc32 of a string's suffix equals the crc32 of the whole
  string xor the crc32 of its zero-extended prefix).
  > It might be sensible to extend the crc32 hash with a few bytes of the
    block data (to avoid collisions).
- The bounds of a match should be extended by comparison with the actual
  occurrence(s).
  > ...and then encoded as RLE if it's an overlapped match?
  > Seems like there'd be two match lengths - before and after the base
    (hashed) block.
- So it seems that a match/literal structure file would be required, in
  addition to the filtered file content with matches removed.
- Parallel analysis might be possible, as the filtered data are still
  usable for secondary processing (so by using overlapped blocks it would
  be possible to process the whole data almost in parallel without a
  noticeable compression drop).
  > Although i/o would probably be the bottleneck here.
- It's reasonable to expect that it would be faster to read the data in
  the order of name occurrence in the (depth-first) directory scan. Then,
  while reading the data, it won't be much of a slowdown to also perform
  FMA. Usually the directory structure is scanned and sorted, and files
  with the same size and name are expected to be identical, but that has
  to be checked anyway - which would be done during FMA.
- So we can have a directory scan + file read thread, then an FMA thread,
  and then a segmentation thread (over FMA output?).
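The sliding property mentioned above holds exactly for the raw crc32 variant (zero initial value, no final xor), which is linear over GF(2). A minimal Python sketch of a rolling block hash built on that property; the window size, table construction, and function names are illustrative, not part of the notes:

```python
# Toy rolling crc32 for far-match block hashing. Assumption: the "raw"
# CRC (init = 0, no final xor) is used, which makes the hash linear, so
# the window hash can slide one byte at a time.

POLY = 0xEDB88320  # reflected CRC-32 polynomial

TABLE = []
for n in range(256):
    c = n
    for _ in range(8):
        c = (c >> 1) ^ (POLY if c & 1 else 0)
    TABLE.append(c)

def crc_raw(data, crc=0):
    """Raw (linear) crc32: zero init, no final xor."""
    for b in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ b) & 0xFF]
    return crc

W = 16  # sliding-window block size (arbitrary here)

# Per-byte-value CRC of "byte followed by W zero bytes" - used to cancel
# the contribution of the byte leaving the window (linearity of raw CRC).
OUT = [crc_raw(bytes([v]) + b"\x00" * W) for v in range(256)]

def slide(crc, byte_out, byte_in):
    """CRC of the window shifted one byte to the right."""
    crc = (crc >> 8) ^ TABLE[(crc ^ byte_in) & 0xFF]  # append new byte
    return crc ^ OUT[byte_out]                        # drop old byte

data = b"some sample data for the far-match analysis sliding hash test"
h = crc_raw(data[:W])
for i in range(1, len(data) - W + 1):
    h = slide(h, data[i - 1], data[i + W - 1])
    assert h == crc_raw(data[i:i + W])  # matches direct computation
```

The same table-driven update works for anchored blocks too; only the points where the window hash is sampled change.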
- It's not necessary to really generate any FMA-filtered data, as it
  would probably be faster to repeat the scan based on the FMA structure
  info than to first write the filtered data and then read and process it
  again (and that's not even considering disk usage issues).
- FMA should also produce the blockwise hashes for file version
  comparison on archive updates - either fixed or (preferably) anchored.

2. Segmentation Analysis (SA)

- It's normally very useful to estimate the data redundancy first, before
  actually compressing the data (and then to process the different data
  types separately).
- Supposing that the especially redundant data (literal runs and long
  matches) would be taken care of by FMA, we only need to pay attention
  to random-like "incompressible" data.
- SA can be performed by processing the data with a reasonably good (but
  really fast) compressor with disabled output. A good candidate would be
  a CM based on a static LUT mapping (without interpolation) of two
  fsm-hashed counters (produced from an optimized DC) - a single counter
  would be faster but much less stable.
  > This analysis can be performed in parallel with FMA / the initial
    scan.
  > Actually it might make sense to run even more counters with different
    contexts in parallel - to determine the optimal PPMd orders for data
    fragments.
- It seems reasonable to use a different approach from Shkarin's: FMA
  references can be used to determine segment similarity, and average
  redundancy fluctuations should allow detecting the borders between
  different data types.
- It also seems reasonable to perform _double_ redundancy estimation
  (from start to end and from end to start) - that would really improve
  the detection precision and might even allow reversing some blocks for
  compression, if that seems beneficial. But a proper implementation of
  such a scan would be considerably slow.
  > Still, it might make sense to do this using blockwise data reversal -
    that's not really the same, but it would still provide the mentioned
    benefits.
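The border-detection idea in section 2 can be illustrated with a much-simplified probe. Here an order-0 entropy estimate stands in for the code length of the fast CM described above, and the block size and jump threshold are placeholder assumptions:

```python
# Toy SA border detection: estimate per-block redundancy and mark a
# segment border where the estimate jumps. Order-0 entropy is a crude
# stand-in for the fast-CM code length; block size and threshold are
# illustrative assumptions.
import math
import random
from collections import Counter

def block_entropy(block):
    """Bits per byte, order-0 estimate (stand-in for CM code length)."""
    n = len(block)
    return -sum(c / n * math.log2(c / n) for c in Counter(block).values())

def find_borders(data, bs=256, jump=2.0):
    """Offsets where estimated redundancy of adjacent blocks differs by
    more than `jump` bits/byte."""
    ents = [block_entropy(data[i:i + bs])
            for i in range(0, len(data) - bs + 1, bs)]
    return [(i + 1) * bs
            for i, (a, b) in enumerate(zip(ents, ents[1:]))
            if abs(a - b) > jump]

random.seed(0)
# text-like (redundant) segment followed by a random-like one
data = b"aaaaabbbbb" * 100 + bytes(random.randrange(256) for _ in range(1000))
print(find_borders(data))  # border near offset 1000, at block granularity
```

A real implementation would run the probe adaptively over the stream (and, per the notes, also in reverse) rather than on independent blocks, but the detection criterion is the same kind of redundancy jump.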
- It should be possible to just use the best of the counter estimations
  in place of "perfect" mixing.

3. Dictionary preprocessing

- In theory it seems like a good idea, but unfortunately it requires a
  lot of research and non-trivial compression algorithms with support for
  large alphabets. So it's hard to say anything about implementation
  details atm.

4. Actual Data Compression

- For backup purposes it's in fact more reasonable to use PPM/CM instead
  of LZ: long matches would be handled by FMA anyway, the compression
  speed of LZMA and PPMd is basically the same (or PPMd is even faster),
  and the compression ratio of PPM/CM is generally better (and it's much
  more flexible with data-specific models etc). Anyway, encoding is the
  main task in a backup/restore application, as it's normally performed
  much more frequently; also internet transfer speed is usually much
  slower than any reasonably fast compression algorithm, so LZ's fast
  decoding has no benefit here (supposing that (de)compression would be
  overlapped with network i/o).
- Thus it seems obvious to use ppmd_sh as the main codec, though the tree
  shrinking functionality probably has to be removed first, and maybe
  other planned modifications applied too (tree and parsing redesign,
  rangecoder reattaching etc).
- Instead of tree shrinking / flushing, memory overflows can be handled
  by overlapped data processing (like in ash).
  > It actually might be a good idea to also use such points to split the
    data into solid blocks (not sure whether overlapped coding can be
    used in that case though).
- So, after FMA+SA+sort on the first pass, we'd start actually
  compressing the data, taking into account the file/segment reordering
  info from the analysis phase. Blocks matched by FMA or known from the
  versioning mechanism would simply be skipped.

5. Archive Format

- So there'd be file system structure info, with data hashes for each
  file, then FMA, SA and solid-block info.
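The overlapped-restart idea from section 4 can be sketched as follows. This is only a toy: the order-1 frequency model, the sizes, and the restart policy are illustrative assumptions, not ppmd_sh or ash internals - the point is just "on overflow, drop the model and re-prime it on the trailing overlap instead of shrinking it":

```python
# Toy sketch: handle model-memory overflow by overlapped restart (like
# in ash) instead of tree shrinking. All names and sizes are
# illustrative assumptions.
import math
import random

OVERLAP = 512        # bytes re-fed into the fresh model after a restart
MAX_CONTEXTS = 2000  # stand-in for the PPM tree memory budget

class Order1Model:
    def __init__(self):
        self.counts = {}   # prev_byte -> {byte: count}
        self.totals = {}   # prev_byte -> total count
        self.distinct = 0  # number of allocated (context, byte) nodes

    def cost_and_update(self, prev, byte):
        """Laplace-smoothed code length estimate in bits, then update."""
        ctx = self.counts.setdefault(prev, {})
        total = self.totals.get(prev, 0)
        n = ctx.get(byte, 0)
        bits = -math.log2((n + 1) / (total + 256))
        if n == 0:
            self.distinct += 1
        ctx[byte] = n + 1
        self.totals[prev] = total + 1
        return bits

def compress_estimate(data):
    """Bits the toy model would need; restarts with overlap on overflow."""
    model, bits, prev, restarts = Order1Model(), 0.0, 0, 0
    for i, b in enumerate(data):
        if model.distinct > MAX_CONTEXTS:
            restarts += 1
            model, p = Order1Model(), 0  # drop the old model entirely...
            for ob in data[max(0, i - OVERLAP):i]:
                model.cost_and_update(p, ob)  # ...and re-prime on overlap
                p = ob
        bits += model.cost_and_update(prev, b)
        prev = b
    return bits, restarts

random.seed(1)
data = (b"the quick brown fox jumps over the lazy dog. " * 200
        + bytes(random.randrange(256) for _ in range(5000)))
bits, restarts = compress_estimate(data)
print(round(bits / 8), "coded bytes,", restarts, "overlapped restarts")
```

The restart points are also natural candidates for the solid-block boundaries mentioned above, since the model state after a restart depends only on the overlap window.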
- FS structure should include the unicode filename, 64-bit size, windows
  FILETIME struct, attribute word, and crc32 file hash.
- In the case of a local archive, the structural info should also be
  compressed and appended to the archive. But for a remote backup it's
  better to keep it in separate files, and probably uncompressed too.
  > Filesystem structures for different users should probably be kept as
    separate files as well (or the whole structural info then?).
- For an archive update, we'd perform FMA of the supplied files, then
  filter the result using the known blockwise hashes from the archive,
  and add the unmatched content to the archive as new file versions.
  > "File versions" are the same as any other files, they just have the
    same name.
- Unfortunately, all the numeric values in the archive would have to be
  64-bit (probably including things like the number of file versions
  etc).

6. Implementation Plan

- FMA has to be implemented first
  - anchored crc32
  - single file processing (filtering, with the filtered data on output)
    - encoding
    - decoding
  - multiple file processing (with a single content stream on output)
  - file archiving utility based on the "scan5" source
    - FMA archive encoding
    - FMA archive decoding
- Segmentation
  - fast CM implementation development / optimization
    > should have a very small memory footprint - like 1M maybe
  - segmentation-specific speed optimization (rc removal, counter
    paralleling)
  - average redundancy analysis efficiency investigation
- Preprocessed data
  - reversed blocks
  - E8
  - text filters?
- ppmd_sh
  - replacement of tree shrinking with overlapped coding
  - removal of functions/structures only necessary for tree shrinking
  - parser redesign using an ash-style context array
  - tree simplification
    - simple new/delete-based allocation
    - suffix pointer removal
    - context node merge with the counter table (counter table pointer
      removal)
    - new custom node allocation (ash-style)
  - rangecoder reattachment
  - separate compression and rangecoding? probability sorting
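The FS structure record described in section 5 could be serialized along these lines. The field order, the length-prefixed UTF-16 name, and the little-endian layout are assumptions for illustration, not a fixed format (FILETIME is the usual 64-bit count of 100 ns intervals since 1601):

```python
import struct

# One FS-structure record: unicode name, 64-bit size, FILETIME,
# attribute word, crc32 file hash. Field order and the 16-bit name
# length prefix are illustrative assumptions, not a defined format.

def pack_fs_record(name, size, filetime, attrs, crc32):
    name_utf16 = name.encode("utf-16-le")       # unicode filename
    return (struct.pack("<H", len(name_utf16))  # name length in bytes
            + name_utf16
            + struct.pack("<QQHI", size, filetime, attrs, crc32))

def unpack_fs_record(buf):
    (nlen,) = struct.unpack_from("<H", buf, 0)
    name = buf[2:2 + nlen].decode("utf-16-le")
    size, filetime, attrs, crc32 = struct.unpack_from("<QQHI", buf, 2 + nlen)
    return name, size, filetime, attrs, crc32

rec = pack_fs_record("docs/report.txt", 123456789,
                     0x01C9FEDCBA987654, 0x20, 0xDEADBEEF)
assert unpack_fs_record(rec) == ("docs/report.txt", 123456789,
                                 0x01C9FEDCBA987654, 0x20, 0xDEADBEEF)
```

Widening the remaining fields to 64-bit (per the note above) would just mean replacing the `H`/`I` format characters with `Q` at the cost of a few bytes per record.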