--------------------------------------------------------------------------- On a new LZ codec implementation 1. Match coding LZ coding is redundant, because its possible to describe the same data with multiple LZ codes. But its especially troublesome that LZ77 provides multiple ways for a _single match_ to be redundant in that sense, because it can reference any of multiple previous occurrences of a specific string. Well, there're some possible workarounds. - length masking Presuming the full offset references an _unique_ string, we can exclude all lengths below the length of the longest match of target offset and any string closer than that. That implies that we'd use a shorter offset if it was possible. But there's actually more, if we'd presume greedy parsing. Basically the string at given offset has to be unique, different from any other strings in the data. Now it really reminds of ACB, though. Well, no wonder. Its still a question though, what to do if the "future" data doesn't match any of previous unique strings, but does match a non-unique string (with a unique new suffix). Well, in that sense, I guess only masking newer strings is better. - offset masking Another possibility is to encode the full length, and then the MTF rank of a string of specified length. We even have an MTF implementation which (kinda) can deal with that. Cons are: * it seems harder/slower than length masking * knowing the real offset is good for modelling (not that its actually used though) Pros: * implementation is more obvious Also, rep0-3 values normally have to be excluded from distance values... not sure if lzma encoder would actually encode them via distance though. - literal masking After we encoded a match, we have to mask the next symbol somehow. Basically it has to be different from any of suffixes of previous occurrences of matched string. There was an idea to define the selected interval for literal-after-match value by offset, but its troublesome, because if the closest match is too far, its code can be easily longer than corresponding literal(s). The more interesting idea is to use the symbol after match to further mask the length Its obvious if there's a literal after match, but with match after match it also applies, although in that case the offset of the second match can't include the first match, but that in fact sounds good. In other words, if we know the match offset and the symbol after match... we only can exclude that symbol from lengths. 2. fuzzy matching bsdiff shows much better results than lzma at compression of file versions. Now, what bsdiff suggests is to have a fullsize literal stream (the whole original file) with matches subtracted from it (also remaining zeroes can be runlength-encoded). Separate literal stream is actually an interesting idea w.r.t parallel processing, also we can imply that a match can start only from a zero (or is there any sense in subtracting bytes before the match?), although that causes a problem with real zero bytes. Though, in other words, the bsdiff idea is that subtracting things [...] 3. Levenshtein distance