--------------------------------------------------------------------------- Match finder implementation /2 - So, we need to find the address of last occurrence of any string suffix (where "string" represents the processed data). Its better to describe like that, because plain "last occurrence of given string" is not incremental. One possible algorithm for that works like this: 1. find the next symbol in context of previous longest match 2. if there was only one occurrence of that symbol in that context, create a new tree node (presuming that such unique symbols point to data window) 3. if there exists a tree node corresponding to next symbol, goto 5 4. if next symbol never occured in current context, follow the suffix pointer to shrink the context, then goto 2 5. update last occurrence pointers in all current contexts 6. goto 1 Here it seems that a "suffix pointer" (like in ppmd tree) is not necessary, because we need to always update all contexts anyway. Also we need to delete contexts on other side of window - surprisingly, the last occurrence pointer is helpful for that - if window start is also the last occurrence, then delete the context, or keep it otherwise. Thats good, because with counters it would use up more memory. - It seems possible to actually use a limited-order tree for LZ, instead of full order-272 or so, although that adds additional overhead - a 4N table with maxorder occurrence links. But full order-272 PPM tree for even enwik8 is clearly impossible. (ppmd_sh8 o256 reached ~1900M at 50M of enwik8, o16 took ~1800M, 441M for o8) However, if we have to switch to chains at some relatively low order anyway, isn't it better to use a hashtable instead? Then there're also fma chains, which are certainly interesting, because there hash lookups / updates are only performed on fragment edges. But unfortunately anchored hashing doesn't give a solution for lower orders (at least up to o8), and when these are already indexed, there's kinda no need to do anything special anymore. Anchored hashing still is certainly helpful for longer matches, though. - An interesting option for PPM tree is a branch offset. The occurrence pointer can be used as a base, and an extra field added for branch offset. Then while (base[i]==c), we don't have to do anything about this context. I guess the problem is that a tree w/o that has to be implemented first anyway (for validation and memory usage comparison). But still it seems like a good thing, especially for higher orders. Anyway, the usual problem with hashtables is that there're no certain matches - some string may be in the buffer, but not reachable, because its hash node got overwritten by some other string. I guess this is especially bad for deflate recompression and such, because it can cause mispredictions there. So there's no way around the suffix tree, I guess.