---------------------------------------------------------------------------
Input compression techniques

Memory access has recently become very slow (compared to arithmetic), so it now makes sense to compress data before storing it - probably even page compression schemes can be useful (again). But the best opportunity for speed optimization appears when it's possible to perform the required operations directly on compressed data, without decoding. We save time on memory access, and in addition any sequential processing (string comparison/sorting, entropy coding) requires fewer iterations. There's a problem though - the compression method can't be spatially adaptive, because it has to provide random access to all the data.

I. Candidate compression methods

1. Static huffman compression (possibly o1 or o2) - optimal bitcode size and good access speed, but it changes the lexicographic order of compressed strings.

2. Optimal ordered code - an optimal bitcode which preserves the original alphabet order; however, compression is noticeably worse than with huffman.

3. Arithmetic code - the best compression and, in theory, preserves the lexicographic order of strings, but the code produced for the same string depends on the initial range value.

4. Arithmetic code with huffman freqs - the huffman code provides the optimal alignment of freqs to 2^n, and the arithmetic code then preserves the lexicographic order.

5. Fritcoding - two-stage coding where the 1st stage produces bits with a known skewed probability distribution, and the 2nd stage encodes fixed-size chunks of skewed bits (aka fractional bits, or "frits").
> tested, but likely not practical

6. Nonprefix bitcoding - it's possible to add special "escape codes" and implement a stable backtracking decoder which is not limited by huffman's prefix property. This method provides higher compression than huffman and has comparable encoding and decoding speeds. Unfortunately the alphabet order is not preserved, and a fast code generation method is unknown.
7. MSB-like higher-order compression. The context order in normal methods (huffman etc.) is limited because of speed/memory issues, but it may also be possible to quantize a longer context to 8-16 bits (the equivalent of o1-o2), which would be more efficient than 8-16 bits of direct context. The main expected benefit of this approach is that the same code can be used to compress any kind of data, so it can be used in sequential compression methods, while huffman coding etc. imply blockwise processing with possible additional redundancy (we may need to encode the huffman tables if they're required for decoding).
> Compression in MSB_v0 is far from impressive

II. Applications

1. BWT sorting - input compression can be easily applied here, and should especially improve the efficiency of radix sorting. Contextual compression significantly reduces the size of the actual input data (and thus of the compared strings etc.), but the contexts have to be stored with the pointers, which actually increases the required memory. Thus, the main current idea is the multiscan approach - compression makes it easy to split the set of strings into chunks of near-equal size (simply by dividing 0..2^32-1 into the necessary number of intervals), which lets us extract fixed-size BWT chunks on each pass - obviously with fixed memory use. This approach seems especially attractive given the current trend of adding more cores to cpus - sequential buffer scans are much cheaper than random access, and it's also possible to pass the generated BWT chunks to the postcoder without waiting for the whole block.

1a. Interval coding - arithmetic coding actually can be used if we don't restrict ourselves to actual compression. It's possible to encode substrings into fixed-size units (eg. 32-bit), which can be done in O(N) time recursively, and then all the possibilities of huffman coding apply, but with improved compression due to entropy coding (it doesn't save memory, but still reduces the iterations of string comparisons).
> Unfortunately with O(N) coding it's hard to determine the capacity of units
> (how many symbols are stored in each AC-coded unit), and when we know the
> capacity, the speed becomes O(N^2) (potentially).

2. LZ matchfinding. LZ with large windows has become pretty important, and input compression allows a significant (easily 2-3x) increase of the window size within the same memory. But LZ is (normally) sequential, and global static bitcoding looks like a very bad idea. Splitting the window into multiple blocks with independent compression can solve the compression ratio issue, but we mostly need a large window to find rare strings, and it's hard to determine whether a string exists anywhere, so with multiple blocks it would be slow. That's where MSB comes into the picture - it's especially compatible with LZ, because in LZ it's possible to look up the short matches directly (in hashtables), so a longer compression context won't be an issue.

3. CM speed optimization. Input compression directly affects the speed of a bitwise CM, because it simply reduces the number of bits to model and encode. But in CM the lower-order contexts are much more important, so it may be hard to apply an o4 MSB or something similar. Also, with hashtables it's impossible to remap the statistics to different codes once per block. Instead, we can use adaptive compression in this case (for symbol decomposition), which may be a good workaround (ppmd's symbol ranking is an example of that). However, a good CM also needs a data window, so it won't be a wonder if we actually need two different kinds of codes. Unfortunately adaptive decomposition is not compatible with probability-based counters, especially fsm counters, so static codes are still important here.

4. Bitmask compression. This is somewhat unrelated to all the previous stuff, but it's still a compression application. The idea is to compress the tree nodes by replacing empty leaves with bit flags.
So to a 256-symbol table we can add a 32-byte bitmask which shows which symbols are actually present.

III. Further research

1a. A precise arithmetic code not limited to 2^n freqs.
1b. An arithmetic code with redundancy, to preserve the order of strings despite the imprecision.
2. A complete BWT compressor with input compression and multipass radix sorting, preferably GPU-based.
3. LZ with window compression.
4. CM with efficient symbol decomposition, bitmask compression of tree nodes, and window compression.