---------------------------------------------------------------------------
Match finder implementation

1. Task spec

Find matching byte strings within a sliding window.
- match length can be limited;
- it's possible to put the "future" data into the window and ensure that string comparison won't wrap.
Ideally, for each byte we need to find the nearest offsets of all matching strings, including overlapped ones.

2. Dumb search / hash chains

The simplest possible method is to compare the current string with the strings at all distances. It requires no extra memory, but its complexity is O(N^2). This can be improved at the cost of 4N+O(1) memory - we can keep linked lists of all strings with the same prefix. However, it's still potentially O(N^2), unless the chain length is strictly limited, which hurts compression.

3. Adding anchor hashing

It's possible to manage short strings (2-4 or 2-5 bytes or so) via a simple hashtable (mapping a string to its last location, instead of all locations). Then the window can be indexed with anchored hashing, which would allow treating longer strings as symbols. Unfortunately it seems hard to see how to implement standard LZ using anchored fragments - follow the chain of the first full fragment within the target length?..
Instead, something similar can be done with usual hash chains, but in a more uniform fashion - a direct hashtable for short matches, then chains for 4+ symbols. Either way, with anchor hashing or without, the potential chain length remains a problem.

4. Ternary tree

It's possible to build a binary or ternary tree with O(N) memory - we add a string at each offset, thus one tree node per offset. But that way we'd only have access to the offsets of unique strings - otherwise the only option would be to recursively process the whole tree branch. Sure, it's possible to maintain linked lists for the nodes of a given prefix length... but that would no longer be O(N) memory.
There are interesting possible optimizations (instead of looking up each string from the tree root), like suffix pointers and hash comparisons (to make binary trees more balanced), but first it's necessary to find a way to map a string to its last offset. Well, it's possible to add a last-prefix-offset pointer to nodes (or use the "equal" pointer for that), but then the n'th prefix occurrence would store that pointer for the n-byte prefix. Which means that when there are fewer than n strings with the same prefix, we'd have to compare them all to find the best match.

5. BWT

Still, it's known to be possible to index all strings with 4N or 5N extra memory, using BWT. The main problem with it is its blockwise nature (nothing can be done until the window is full), and having to skip "future" strings during lookup. There may be a workaround for future-string skipping (e.g. by adding extra pointers), but it's unknown at the moment. It's also possible to redo the sort after reading every few MB of data (to avoid waiting for a 1G read with a 1G window), but that would add significant delays.
Overall, BWT for string indexing really seems interesting, but a sequential method seems preferable, even if it's slower.

6. PPM tree

In the end, it seems that a tree similar to ppmd's would be a "fast" solution - simply because I know how to implement it, and it can provide all the necessary info. But there are issues:
- it's not O(N), so depending on the data, full indexing of an N-byte window may be impossible with a fixed k*N-byte tree buffer;
- with variable-size nodes there's a "garbage collection" problem (which I generally avoid in Ash and the like). At the very least it's necessary to be able to "reformat" memory blocks (from one node size to another), or otherwise the effective window size can deteriorate to nearly 0. For that it might make sense to move tree nodes to new locations even if their size didn't change (otherwise it's impossible to free a block without a full tree traversal, even when there are enough free nodes);
- the main problem is parsing complexity. An order-272 (or so) parser can be pretty slow on redundant data (with lots of such matches), especially with sliding-window support.