<Shelwien> | test | 2019-04-29 22:58:36 |
<FunkyBob> | o/ | 2019-04-30 00:51:48 |
| yay, people came :) | 2019-04-30 00:52:07 |
<Shelwien> | ...and went :) | 2019-04-30 01:48:01 |
| but we have bots! | 2019-04-30 01:48:17 |
<FunkyBob> | :) | 2019-04-30 01:49:10 |
<Shelwien> | btw, what do you think about fuzzy matches and delta matches (like bsdiff)? | 2019-04-30 01:50:20 |
| i think these would allow making something interesting even without entropy coding | 2019-04-30 01:50:42 |
| ...and another thing is secondary LZ | 2019-04-30 01:51:02 |
<FunkyBob> | delta matches? | 2019-04-30 01:51:53 |
| I've been considering doing some sort of entropy-lowering matching | 2019-04-30 01:52:07 |
<Shelwien> | see how bsdiff works | 2019-04-30 01:52:18 |
| http://nishi.dreamhosters.com/u/bsdiff_sh3.rar | 2019-04-30 01:52:29 |
| its basically LZ | 2019-04-30 01:52:42 |
| although its matchfinder is based on BWT SA | 2019-04-30 01:52:54 |
| and instead of replacing strings with references | 2019-04-30 01:53:17 |
| it subtracts the matches from data | 2019-04-30 01:53:30 |
| so full matches just leave lots of zeroes | 2019-04-30 01:53:39 |
| ...which gets removed by a RLE pass later | 2019-04-30 01:54:15 |
<FunkyBob> | hrm | 2019-04-30 01:54:39 |
<Shelwien> | but it also allows bsdiff to work with fuzzy matches | 2019-04-30 01:54:52 |
| if there're some differences in the middle, they just remain in literal buffer | 2019-04-30 01:55:16 |
| and when you subtract code in v2 exe from code in v1 exe | 2019-04-30 01:55:39 |
<FunkyBob> | hrm... similar to my thinking :) | 2019-04-30 01:55:40 |
<Shelwien> | you get lots of same code, but addr differences in it | 2019-04-30 01:56:07 |
| and deltas of these addrs tend to be the same | 2019-04-30 01:56:23 |
| so these match deltas are successfully compressed by 2nd LZ pass | 2019-04-30 01:56:53 |
<FunkyBob> | well, seems you've solved the last problem I had left :) | 2019-04-30 01:56:57 |
<Shelwien> | but what I also found | 2019-04-30 01:57:12 |
* FunkyBob is currently working on a better structure for finding matches for his "basic lz" to make it faster | 2019-04-30 01:57:33 |
| is that bsdiff+lzma works much better than just lzma of same data | 2019-04-30 01:57:57 |
<FunkyBob> | I also need to take some time and work out how to do some sort of "optimal" parsing | 2019-04-30 01:58:02 |
| Shelwien: valuable info | 2019-04-30 01:58:25 |
<Shelwien> | I think its easier to start with dumb bruteforce matchfinding | 2019-04-30 01:58:54 |
<FunkyBob> | I have | 2019-04-30 01:59:02 |
<Shelwien> | like, build a digram hashchain | 2019-04-30 01:59:06 |
<FunkyBob> | and I have a hash -> fixed array system | 2019-04-30 01:59:10 |
<Shelwien> | and do full scans on it | 2019-04-30 01:59:12 |
| uh, not that | 2019-04-30 01:59:19 |
<FunkyBob> | "digram hashchain"? | 2019-04-30 01:59:19 |
<Shelwien> | digram = 2 symbols | 2019-04-30 01:59:48 |
<FunkyBob> | ahhh | 2019-04-30 02:00:04 |
<Shelwien> | and hashchain is the data structure which is used in most LZ libs, like zlib | 2019-04-30 02:00:16 |
<FunkyBob> | yus | 2019-04-30 02:00:27 |
<Shelwien> | where hashtable keeps the address of last context match | 2019-04-30 02:00:34 |
| and then a table parallel to data window (pointer per byte) | 2019-04-30 02:00:54 |
| keeps previous context occurrence addr for each symbol | 2019-04-30 02:01:19 |
| so you get a linked list for all occurrences of these 2 bytes | 2019-04-30 02:02:09 |
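A minimal sketch of the digram hashchain just described, with toy sizes: `head[h]` holds the most recent position of digram hash `h`, and `prev[i]` holds the previous position with the same digram, forming the linked list over all occurrences of those 2 bytes. A real matchfinder (zlib etc.) adds chain-length limits, lazy matching and a sliding window, all omitted here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct HashChain {
    static const uint32_t kHashBits = 16;          // toy table size
    std::vector<int32_t> head, prev;
    const uint8_t* data;

    HashChain(const uint8_t* d, size_t n)
        : head(1u << kHashBits, -1), prev(n, -1), data(d) {}

    static uint32_t hash2(const uint8_t* p) {      // digram = 2 symbols
        return (uint32_t(p[0]) << 8 | p[1]) & ((1u << kHashBits) - 1);
    }

    void insert(size_t pos) {                      // link pos into its chain
        uint32_t h = hash2(data + pos);
        prev[pos] = head[h];
        head[h] = int32_t(pos);
    }

    // Walk the chain of earlier positions and return the longest match.
    size_t best_match(size_t pos, size_t n, int32_t* match_pos) {
        size_t best = 0;
        for (int32_t p = head[hash2(data + pos)]; p >= 0; p = prev[p]) {
            size_t len = 0;
            while (pos + len < n && data[p + len] == data[pos + len]) len++;
            if (len > best) { best = len; *match_pos = p; }
        }
        return best;
    }
};
```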
<FunkyBob> | neat | 2019-04-30 02:02:27 |
| right now, I'm going to get dinner :) | 2019-04-30 02:02:33 |
<Shelwien> | its still faster than a linear scan | 2019-04-30 02:02:35 |
| and doesn't lose any matches | 2019-04-30 02:02:53 |
| though i've been thinking that it might be possible to force linear scans to become practical for small windows | 2019-04-30 02:03:37 |
| like with Boyer-Moore LUTs, or some vector magic | 2019-04-30 02:04:14 |
| btw, you can also look at this: https://encode.ru/threads/3005-TP-LINK-router-config-compression | 2019-04-30 02:06:47 |
| i didn't see a format quite like that anywhere else (with interleaved bit and byte streams) | 2019-04-30 02:07:57 |
| its kinda annoying to encode, because code bytes have to be arranged to follow decoder logic | 2019-04-30 02:08:37 |
| but decoding should have the same speed/complexity like normal LZSS | 2019-04-30 02:09:16 |
| while supporting variable-size bitfields for distance/length | 2019-04-30 02:09:49 |
*** richox has joined the channel | 2019-04-30 02:59:34 |
*** unic0rn has joined the channel | 2019-04-30 03:21:11 |
<FunkyBob> | back | 2019-04-30 03:41:54 |
| well, as I'm sure you've seen from my posts and my code, I'm still new to actually coding compression stuff, despite having studied it for a looong time :) | 2019-04-30 03:45:32 |
<richox> | :) | 2019-04-30 03:57:45 |
<Shelwien> | there're plenty of people who just invent compression from scratch for themselves | 2019-04-30 04:01:42 |
| like mahoney | 2019-04-30 04:01:50 |
| so the only requirement for writing a good compressor is programming in C/C++ :) | 2019-04-30 04:03:26 |
| everything else can be always reinvented, when you have a hard case of "not invented here syndrome" | 2019-04-30 04:04:53 |
<FunkyBob> | well, I'm only one hamming step from mahoney :P | 2019-04-30 04:05:14 |
<Shelwien> | btw, did you see DCE? | 2019-04-30 04:05:25 |
<FunkyBob> | ? | 2019-04-30 04:05:30 |
<Shelwien> | http://mattmahoney.net/dc/dce.html | 2019-04-30 04:05:41 |
<FunkyBob> | oh, right... yes | 2019-04-30 04:06:36 |
| thanks, though | 2019-04-30 04:06:40 |
| I'm partly using this exploration as a way to refresh my C skills | 2019-04-30 04:08:11 |
| but also... exploring how different changes match up with my mental models | 2019-04-30 04:08:24 |
<Shelwien> | even if you don't want STL strings and vectors | 2019-04-30 04:09:56 |
| these days its a good idea to use C++ rather than C | 2019-04-30 04:10:10 |
| most of C code compiles as C++ anyway, except for some typecasts maybe | 2019-04-30 04:10:38 |
| but C++ also has templates | 2019-04-30 04:10:57 |
| class methods that let you avoid typing ptr-> before every name | 2019-04-30 04:11:38 |
<FunkyBob> | most of the good bits of c++ I liked are in C now anyway :P | 2019-04-30 04:11:47 |
| there is that, yes | 2019-04-30 04:11:53 |
<Shelwien> | nope, templates are not | 2019-04-30 04:11:58 |
<FunkyBob> | no, that's true | 2019-04-30 04:12:08 |
<Shelwien> | and they're very, very helpful for performance | 2019-04-30 04:12:12 |
| since without templates | 2019-04-30 04:12:19 |
<FunkyBob> | my everyday language is Python, so classes are natural to me:) | 2019-04-30 04:12:20 |
<Shelwien> | code usually ends up with lots of dynamic memory allocation | 2019-04-30 04:12:33 |
| and that's crazily slow these days | 2019-04-30 04:12:52 |
| and in C++ with templates | 2019-04-30 04:13:07 |
| I can just define everything statically | 2019-04-30 04:13:36 |
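As a hypothetical illustration of the static-definition point (this class is made up, not code from any real codec): with the size as a template parameter, the table lives inside the object, in stack or static storage, so there is no malloc/new on the hot path.

```cpp
#include <cassert>
#include <cstdint>

// Toy frequency table whose size is fixed at compile time via a template
// parameter, so no dynamic memory allocation is ever needed.
template <int kCtxBits>
struct Counter {
    uint16_t freq[1 << kCtxBits];                  // statically sized table
    Counter() { for (auto& f : freq) f = 1; }      // init counts to 1
    void update(uint32_t ctx) { freq[ctx & ((1u << kCtxBits) - 1)]++; }
    uint16_t get(uint32_t ctx) const { return freq[ctx & ((1u << kCtxBits) - 1)]; }
};
```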
| for example, reflate library | 2019-04-30 04:13:50 |
| it has to decode deflate stream, encode it, then encode the diff with original stream | 2019-04-30 04:14:33 |
| and then there's recursion | 2019-04-30 04:14:52 |
| ie a deflate stream can contain a pdf file | 2019-04-30 04:15:02 |
| with its own deflate streams | 2019-04-30 04:15:07 |
| so, the reflate library does stream processing | 2019-04-30 04:15:45 |
<FunkyBob> | I'll look into it | 2019-04-30 04:16:07 |
<Shelwien> | takes input stream, detects deflate in it, unpacks the data + adds diffs for lossless restore | 2019-04-30 04:16:15 |
| and there's no dynamic memory allocation at all | 2019-04-30 04:16:30 |
<FunkyBob> | sounds like what I use AdvanceCOMP for | 2019-04-30 04:16:41 |
<Shelwien> | well, most of office formats these days use zip containers | 2019-04-30 04:19:06 |
| also apk, jar etc | 2019-04-30 04:19:19 |
| so any modern archiver needs something like that | 2019-04-30 04:19:35 |
| otherwise there's kinda nothing to compress :) | 2019-04-30 04:19:45 |
<FunkyBob> | yeah, we use xlsx at work, so I repack that | 2019-04-30 04:21:07 |
<Shelwien> | advancecomp is lossy, right? | 2019-04-30 04:24:01 |
| its a different kind of recompression then | 2019-04-30 04:24:37 |
| i meant something like https://github.com/schnaader/precomp-cpp | 2019-04-30 04:26:15 |
<FunkyBob> | no, advancecomp uses a variety of deflate engines | 2019-04-30 04:26:43 |
| -1, --shrink-fast Compress fast (zlib) | 2019-04-30 04:26:57 |
| -2, --shrink-normal Compress normal (libdeflate) | 2019-04-30 04:26:57 |
| -3, --shrink-extra Compress extra (7z) | 2019-04-30 04:26:57 |
| -4, --shrink-insane Compress extreme (zopfli) | 2019-04-30 04:26:57 |
| -i N, --iter=N Compress iterations | 2019-04-30 04:27:14 |
<Shelwien> | i mean, it can optimize the container, but can't compress it, then restore as it was, right? | 2019-04-30 04:27:20 |
<FunkyBob> | it's solely for recompressing deflate streams | 2019-04-30 04:27:55 |
| either on their own, or in png or zip files | 2019-04-30 04:28:01 |
<Shelwien> | but for a normal archiver, like 7-zip, you need a lossless method | 2019-04-30 04:28:08 |
<FunkyBob> | it's lossless... | 2019-04-30 04:28:24 |
| it's deflate | 2019-04-30 04:28:26 |
<Shelwien> | yes, but processing is lossy | 2019-04-30 04:28:42 |
<FunkyBob> | well, you won't get back the original archive ... but you will get back the original decompressed data | 2019-04-30 04:29:46 |
<Shelwien> | btw http://nishi.dreamhosters.com/u/stegzip_v2.rar :) | 2019-04-30 04:30:04 |
<FunkyBob> | reminded me... I wanted to ping the ULZ thread for a Linux version | 2019-04-30 04:30:30 |
<Shelwien> | just compile it? | 2019-04-30 04:30:45 |
<FunkyBob> | last I saw, no source | 2019-04-30 04:30:53 |
<Shelwien> | https://github.com/encode84/ulz | 2019-04-30 04:31:14 |
<FunkyBob> | sorry, meant https://encode.ru/threads/550-Ultra-fast-LZ/page8 | 2019-04-30 04:31:41 |
*** richox has left the channel | 2019-04-30 04:31:47 |
| must have missed that post | 2019-04-30 04:32:58 |
<Shelwien> | btw, as to optimal parsing | 2019-04-30 04:37:41 |
| its very simple for this kind of coders | 2019-04-30 04:37:52 |
| you just have to make a price/choice array with element per symbol | 2019-04-30 04:38:33 |
| and at each position try all possible choices (literal(s)/matches) | 2019-04-30 04:39:02 |
| pos[i+len].price = min( pos[i+len].price, pos[i].price + price(current_choice) ); | 2019-04-30 04:40:04 |
| so when you reach the end of data | 2019-04-30 04:40:23 |
| at pos[filesize] you would have your optimal compressed size | 2019-04-30 04:40:41 |
| and then you have to do a backward pass to reconstruct the choices required to reach this len | 2019-04-30 04:41:34 |
| (choices are actually just stored too when new min is found) | 2019-04-30 04:41:51 |
| that's it | 2019-04-30 04:41:55 |
<FunkyBob> | yeh, I've read some posts [from cbloom et al] on how... especially since there's no entropy codec | 2019-04-30 04:43:55 |
| right now I don't feel so bad, watching how long ulz is taking | 2019-04-30 04:44:05 |
| sure, at level 9 it beats lz4, but not for the time | 2019-04-30 04:46:13 |
<Shelwien> | encode's usual problem is that he uses getc/putc for i/o :) | 2019-04-30 04:47:09 |
<FunkyBob> | erk... I stopped that some time back :) | 2019-04-30 04:47:25 |
| am reading 8MB at a time... | 2019-04-30 04:47:29 |
| was 16M, but... reasons :) | 2019-04-30 04:47:37 |
<Shelwien> | i checked and ulz has blockwise fread, so i guess i was wrong | 2019-04-30 04:48:16 |
| but still, that's 16M blocks, which is also bad for performance | 2019-04-30 04:48:42 |
| i/o needs to be either async, mmap, or small aligned blocks, at most 64k | 2019-04-30 04:49:19 |
| reading or writing unaligned chunks is slower because the OS has to align it for you - it can only write whole clusters at once to a file | 2019-04-30 04:50:13 |
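The alignment arithmetic behind this point can be sketched as below, using the 64k block size suggested above (power-of-two alignment assumed): to serve an arbitrary read, round the requested range out to block boundaries and transfer whole blocks, so nobody has to fix up partial clusters.

```cpp
#include <cassert>
#include <cstddef>

// Round x down/up to a multiple of a; 'a' must be a power of two.
static size_t align_down(size_t x, size_t a) { return x & ~(a - 1); }
static size_t align_up(size_t x, size_t a)   { return (x + a - 1) & ~(a - 1); }

// For a read of [off, off+len), the aligned transfer would be
// [align_down(off, block), align_up(off + len, block)).
```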
<FunkyBob> | given deflate is, what, 32k blocks, and lz4 is 64k... | 2019-04-30 04:50:19 |
| I should also test if aligning my 8-byte comparison helps | 2019-04-30 04:50:36 |
<Shelwien> | I solved the i/o problem for myself with coroutines | 2019-04-30 04:51:00 |
| so frontend always does aligned i/o | 2019-04-30 04:51:36 |
<FunkyBob> | so which are your tools? | 2019-04-30 04:51:37 |
| yeah, i have a fixed "main.c" I share between my tests | 2019-04-30 04:52:00 |
| reads 8M at a time, | 2019-04-30 04:52:16 |
<Shelwien> | yeah, that's also not too good | 2019-04-30 04:52:30 |
| it blocks while reading that | 2019-04-30 04:52:36 |
<FunkyBob> | point | 2019-04-30 04:52:47 |
| but for now it's consistent | 2019-04-30 04:52:53 |
<Shelwien> | https://github.com/Shelwien/ppmd_sh/blob/master/pmd.cpp#L54 | 2019-04-30 04:53:26 |
| i don't have any bytewise LZs so its hard to compare | 2019-04-30 04:54:03 |
<FunkyBob> | ah, ppmd | 2019-04-30 04:54:30 |
<Shelwien> | found some old pictures | 2019-04-30 05:18:46 |
| https://sites.google.com/site/shelwien/bcm8e25.png | 2019-04-30 05:18:49 |
| (red = memory used, green = bytes read, blue = bytes written) | 2019-04-30 05:19:06 |
| (horizontal axis is time in seconds, vertical is volume in bytes) | 2019-04-30 05:20:25 |
| https://sites.google.com/site/shelwien/ccm_5.png | 2019-04-30 05:20:41 |
| ccm graph is especially nice because it's visible how long it takes to initialize the 512M hashtable at start :) | 2019-04-30 05:21:39 |
*** richox has joined the channel | 2019-04-30 05:50:28 |
<richox> | i no longer use c++ after I learned rust | 2019-04-30 05:51:43 |
| c++ sucks | 2019-04-30 05:51:46 |
<unic0rn> | sadly, it sometimes can't be avoided | 2019-04-30 05:58:09 |
<FunkyBob> | I'd rather avoid language wars in here, if possible :) | 2019-04-30 05:59:01 |
<Shelwien> | rust uses llvm for compiling, right? | 2019-04-30 06:01:08 |
| unfortunately its not good enough yet | 2019-04-30 06:01:25 |
| gcc and intelc are better compilers, quite frequently | 2019-04-30 06:01:43 |
<FunkyBob> | for C, perhaps | 2019-04-30 06:02:08 |
<Shelwien> | for x86/x64, more like | 2019-04-30 06:02:31 |
| llvm is still pretty bad at auto-vectorization | 2019-04-30 06:03:00 |
| sometimes it can be better at scalar code - for example, i've got a faster build for precomp with clang | 2019-04-30 06:03:36 |
| but overall its still worse | 2019-04-30 06:04:04 |
| 1.750s 5.687s: raw2unp_gcc820x.exe enwik9.gz x | 2019-04-30 06:05:14 |
| 1.906s 6.672s: raw2unp_cl601x.exe enwik9.gz x | 2019-04-30 06:05:14 |
| 1.328s 2.578s: raw2unp.exe enwik9.gz x -- IC19/x64 (1-2.578/3.469)*100 = 25.68% | 2019-04-30 06:05:15 |
| for example | 2019-04-30 06:05:17 |
| that's my deflate decoder on two 1gb samples | 2019-04-30 06:05:51 |
| sure, its getting better with time, but gcc also does | 2019-04-30 06:06:53 |
<FunkyBob> | on a practical note, what the world needs is better offline deflate and brotli encoders :) | 2019-04-30 06:07:51 |
<Shelwien> | somehow i dislike brotli a lot | 2019-04-30 06:08:18 |
| its encoder levels are unbalanced, thus useless | 2019-04-30 06:08:43 |
| we actually tested it for possible integration into our archive format | 2019-04-30 06:09:38 |
| but didn't find a use case for it | 2019-04-30 06:09:51 |
| either some other codec has better compression | 2019-04-30 06:10:12 |
| or brotli encoder is so slow, that nobody would use it | 2019-04-30 06:10:24 |
| zstd turned out much more practical | 2019-04-30 06:11:54 |
| but then, zstd level profiles are actually generated by a special tool, based on speed/ratio balance | 2019-04-30 06:12:43 |
<FunkyBob> | yeah, I'm not all that enamoured of it, either... but it's supported | 2019-04-30 06:13:06 |
<Shelwien> | i want to write my own parsing optimizer for deflate though | 2019-04-30 06:13:42 |
<FunkyBob> | please do :) | 2019-04-30 06:14:07 |
<Shelwien> | there won't be any practical use in it though | 2019-04-30 06:14:43 |
| there're plenty of these optimizers already | 2019-04-30 06:15:02 |
| in fact 7-zip only has an optimizing encoder for deflate | 2019-04-30 06:15:20 |
| which is kinda slow | 2019-04-30 06:15:31 |
| so at best I can expect combining 7z+defluff results in one coder | 2019-04-30 06:16:15 |
| so i'm actually more interested in lzma optimization | 2019-04-30 06:17:03 |
| you see, lzma has position context for tokens | 2019-04-30 06:17:56 |
| but its matchfinder only collects one most recent match per length | 2019-04-30 06:18:21 |
| basically it collects distances for matches from len=2 to 273 | 2019-04-30 06:18:42 |
| and passes these to optimizer | 2019-04-30 06:18:56 |
| but taking into account position context, we'd actually need at least a match per len for each alignment | 2019-04-30 06:20:15 |
| and then | 2019-04-30 06:20:26 |
| lzma's rep-match system | 2019-04-30 06:20:37 |
| actually mostly works because of fuzzy matches | 2019-04-30 06:21:00 |
| if you'd look at lzma parsing results | 2019-04-30 06:22:06 |
| rep-matches are mostly used to skip "parameters" in patterns | 2019-04-30 06:23:07 |
| but lzma matchfinder doesn't have any support for fuzzy matches | 2019-04-30 06:23:37 |
| although parsing optimizer does check some specific patterns | 2019-04-30 06:24:19 |
| like match-literal-rep_match | 2019-04-30 06:24:26 |
| but overall it mostly uses them on accident | 2019-04-30 06:24:49 |
| so it won't be surprising if a proper parsing optimizer could improve lzma compression by 10% or so | 2019-04-30 06:26:00 |
| at least on structured data, like exes | 2019-04-30 06:26:08 |
*** Jibz has joined the channel | 2019-04-30 07:00:41 |
| still trying to optimize a counter FSM with SAT | 2019-04-30 07:09:38 |
| CBMC now loads the generated script and translates | 2019-04-30 07:11:11 |
| but can't reach solving stage - it hits 56G memory usage and quits | 2019-04-30 07:11:55 |
| I tried to precalculate some stuff to reduce size of generated source | 2019-04-30 07:12:53 |
| but looks like it somehow complicated things for CBMC | 2019-04-30 07:13:18 |
*** richox has left the channel | 2019-04-30 08:45:09 |
*** schnaader has joined the channel | 2019-04-30 09:28:09 |
*** schnaader has left the channel | 2019-04-30 10:17:30 |
*** Eppie has left the channel | 2019-04-30 11:28:17 |
*** Jibz has left the channel | 2019-04-30 12:00:55 |
*** Eppie has joined the channel | 2019-04-30 13:40:08 |
<Eppie> | I've been working on this for a while, do any of you think anyone will find any value in it? https://github.com/andrew-epstein/paq_history | 2019-04-30 13:41:15 |
*** Eppie has left the channel | 2019-04-30 15:03:03 |
<Shelwien> | "history" actually started from this: http://mattmahoney.net/dc/paq.html#neural | 2019-04-30 20:09:24 |
| i think its a good idea to archive them all on github | 2019-04-30 20:10:03 |
| although i'd appreciate something more detailed | 2019-04-30 20:10:54 |
| like the list of involved ideas (stochastic fsm counters?) | 2019-04-30 20:12:57 |
| and their sources | 2019-04-30 20:13:19 |
| like, SSE was first implemented by Dmitry Shkarin in ppmonstr, based on SEE in C.Bloom's ppmz | 2019-04-30 20:14:42 |
| also, another useful project | 2019-04-30 20:16:25 |
| would be splitting paq into individual components (rather than a single .cpp source) | 2019-04-30 20:16:45 |
| and tracking their individual evolution over paq development | 2019-04-30 20:17:04 |
<FunkyBob> | Shelwien: so, I've taken your advice, and am starting on a new tack... and, wishing TurboDebugger was still a thing :P | 2019-04-30 20:39:35 |
| time to learn me some gdb :P | 2019-04-30 20:40:08 |
<unic0rn> | there are no bugs, only features ;) | 2019-04-30 21:13:22 |
<FunkyBob> | well, this feature means my code never gets past the first match lookup :/ | 2019-04-30 21:13:59 |
<Shelwien> | i'm using ida pro these days, its almost like turbo debugger :) | 2019-04-30 21:44:52 |
| http://nishi.dreamhosters.com/u/ida_dbg1.png | 2019-04-30 21:48:18 |
<FunkyBob> | looks purdy | 2019-04-30 21:58:27 |
| https://www.hex-rays.com/products/ida/debugger/index.shtml ? | 2019-04-30 21:58:48 |
<Shelwien> | yes, it can be found in torrents, latest available version is 7.0 | 2019-04-30 22:35:31 |
| source-level debug needs exes compiled with debug-info | 2019-04-30 22:35:49 |
| also there's the decompiler plugin otherwise | 2019-04-30 22:36:20 |
<FunkyBob> | well, gdb has a built in tui, which is not so bad... | 2019-04-30 22:38:01 |
<Shelwien> | for most of the normal problems its more convenient to write some logs anyway | 2019-04-30 23:01:46 |
| but sometimes there're compiler-related problems | 2019-04-30 23:02:16 |
| where i need to see actual code to understand what happens | 2019-04-30 23:02:57 |
| and in these cases ida is useful | 2019-04-30 23:03:28 |
*** maniscalco_ has joined the channel | 2019-04-30 23:13:07 |
| hi~~ | 2019-04-30 23:13:16 |
<maniscalco_> | hi Eugene. glad to see a channel is back | 2019-04-30 23:17:35 |
<Shelwien> | not much talking here, as usual :) | 2019-04-30 23:18:24 |
| how's your progress btw? | 2019-04-30 23:22:01 |
| your code is kinda scary because of STL though :) | 2019-04-30 23:22:09 |
<maniscalco_> | what, all of that template stuff? | 2019-04-30 23:22:54 |
| definitely get a speed boost though. | 2019-04-30 23:23:10 |
<Shelwien> | MT itself is okay, sure | 2019-04-30 23:23:41 |
| but pthread is one thing, and std::thread is another | 2019-04-30 23:24:01 |
| std::thread is not properly supported by all compilers on all platforms | 2019-04-30 23:24:25 |
<maniscalco_> | which platforms? (I don't get out much) | 2019-04-30 23:25:20 |
<Shelwien> | i'm still on windows | 2019-04-30 23:25:31 |
| gcc/mingw just plain doesn't have std::thread | 2019-04-30 23:25:42 |
<maniscalco_> | I've been toying with a different approach to recency coding lately. It looks like it's never going to be quite able to keep up with MTF because it isn't vectorizable | 2019-04-30 23:25:50 |
| Windows? Why? | 2019-04-30 23:25:56 |
<Shelwien> | there're some unofficial 3rd-party implementations, but when I compiled msufsort with that, it crashed | 2019-04-30 23:26:34 |
<maniscalco_> | really. hmm. I'll try to find some portable alternative perhaps | 2019-04-30 23:27:03 |
<Shelwien> | pthread itself is ok | 2019-04-30 23:27:13 |
<maniscalco_> | ok, well, that wouldn't be too hard to switch to I suppose | 2019-04-30 23:27:32 |
<Shelwien> | that one is not supported by VS/IC, but there're working implementations | 2019-04-30 23:27:51 |
<maniscalco_> | I really have to force myself to make time to finish v4 anyhow | 2019-04-30 23:28:08 |
<Shelwien> | yeah :) | 2019-04-30 23:28:14 |
<maniscalco_> | instead I'm screwing around with MTF alternatives ... | 2019-04-30 23:28:32 |
<Shelwien> | well, BWT certainly needs some kind of solution for compression issue | 2019-04-30 23:28:57 |
| there's certainly plenty of potential for transform speed improvement | 2019-04-30 23:29:24 |
| but what to compress with it? | 2019-04-30 23:29:33 |
<maniscalco_> | the logical next advance would be context based but also building up a model | 2019-04-30 23:29:51 |
| oh, I had some insights into bit wise bwt. | 2019-04-30 23:30:06 |
| I remember that we were discussing it a while back. could it be better than 32N | 2019-04-30 23:30:29 |
| and it definitely can be ... by a lot | 2019-04-30 23:30:39 |
<Shelwien> | we tried to use BWT/qlfc in our archiver (basically a refactored bsc, because it uses random access to files) | 2019-04-30 23:30:45 |
<maniscalco_> | what archiver is this? | 2019-04-30 23:31:04 |
<Shelwien> | i added MT support and everything | 2019-04-30 23:31:04 |
| http://nishi.dreamhosters.com/u/7zdll_vF6.rar | 2019-04-30 23:31:24 |
| .pa format | 2019-04-30 23:31:26 |
<maniscalco_> | I'll take a look | 2019-04-30 23:31:35 |
<Shelwien> | i just added my selection of codecs to .7z basically | 2019-04-30 23:32:02 |
| BWT-wise its nothing interesting really | 2019-04-30 23:32:33 |
| though you can look at this: http://nishi.dreamhosters.com/7zcmd.html | 2019-04-30 23:32:47 |
| one mostly unknown interesting feature of 7-zip is its support for codec trees with multiple output streams | 2019-04-30 23:33:36 |
<maniscalco_> | I'm pretty sure that bitwise BWT can be produced with a worst case of 8N space | 2019-04-30 23:33:50 |
| codec trees? | 2019-04-30 23:34:25 |
<Shelwien> | btw, i have a simple bitwise BWT there | 2019-04-30 23:35:00 |
<maniscalco_> | expand 1 bit to 1 byte? | 2019-04-30 23:35:22 |
<Shelwien> | yes, but still can estimate compression with it | 2019-04-30 23:35:45 |
| actually i needed it for cdm | 2019-04-30 23:36:04 |
| which is a kind of universal recompressor for bitcode formats | 2019-04-30 23:36:19 |
| cdm itself uses optimal parsing to detect and compress areas with slightly skewed distribution of 0/1 | 2019-04-30 23:36:57 |
<maniscalco_> | there are special nuances of bitwise suffix arrays that allow it to be extremely fast for the second stage of the ITS. That insight, plus the fact that binary m99 can encode/decode in the hundreds of MB/sec single threaded is what motivated me to see if I could produce a superior recency transform. | 2019-04-30 23:37:27 |
<Shelwien> | and bitwise BWT creates these for formats based on huffman coding | 2019-04-30 23:37:36 |
| https://encode.ru/threads/2742-Compressed-data-model?p=52493&viewfull=1#post52493 | 2019-04-30 23:37:59 |
<maniscalco_> | I'll take a look. Stepping away, have to put the kids to bed. | 2019-04-30 23:38:58 |
<Shelwien> | ah, ok | 2019-04-30 23:39:04 |
| as to codec trees btw | 2019-04-30 23:39:53 |
| 7-zip has a system designed like directshow system | 2019-04-30 23:40:32 |
| https://en.wikipedia.org/wiki/DirectShow#Architecture | 2019-04-30 23:41:09 |
| video streams have a complex design with lots of replaceable parts | 2019-04-30 23:41:43 |
| container format parser, video/audio/subtitle streams, decoders for these, etc | 2019-04-30 23:42:29 |
| and now we kinda have the same situation in compression | 2019-04-30 23:43:06 |
| because there're various preprocessing methods for different file formats | 2019-04-30 23:43:50 |
| and sometimes preprocessors output multiple streams, which have to be compressed separately, etc | 2019-04-30 23:44:20 |
| for BWT thats even more important | 2019-04-30 23:46:17 |
* FunkyBob still remembers the derision when BWT was first announced | 2019-04-30 23:47:33 |
| derision? | 2019-04-30 23:48:04 |
<FunkyBob> | yeah... people initially thought it was yet another "infinite compression" scam | 2019-04-30 23:48:28 |
<Shelwien> | well, I didn't like how they used text preprocessing and compared the results to straight LZ/PPM | 2019-04-30 23:48:31 |
<FunkyBob> | but upon careful reading, it was clear that's not what they claimed at all | 2019-04-30 23:48:50 |
<Shelwien> | ah, no, we didn't have that, its a pretty obvious transform, with lots of previously known methods leading to it | 2019-04-30 23:49:20 |
| for example, we had static CM | 2019-04-30 23:49:44 |
| which generated order* statistics for a file | 2019-04-30 23:50:02 |
| then compressed these statistics instead of data | 2019-04-30 23:50:11 |
| also, ST2-ST4 are very obvious and easy to make | 2019-04-30 23:51:18 |
| so the only real novelty is how simple inverse BWT can be | 2019-04-30 23:54:36 |
| but that doesn't affect the compression | 2019-04-30 23:54:48 |
| still, there was a lot of excitement for a few years, when it appeared | 2019-04-30 23:55:59 |
| and comparing to compression algorithms like LZ77/8 and PPM, its certainly a step forward | 2019-04-30 23:57:00 |
| but unfortunately it loses to serial LZ/CM/PPM when they start using different contexts | 2019-04-30 23:58:15 |
| like position alignment or fuzzy matches in lzma | 2019-04-30 23:58:55 |
<FunkyBob> | ok, I've fixed the stalling problem... now I just need to fix the "never finds a match" problem | 2019-05-01 00:23:43 |
<maniscalco_> | There are improvements yet to be had in BWT. But it's going to require m03 like parsing and building up context models to make any significant advances. | 2019-05-01 00:24:26 |
| Plus it has other applications outside of compression | 2019-05-01 00:24:52 |
<Shelwien> | sure, it even has applications in compression - for efficient data indexing | 2019-05-01 00:27:56 |
| i want to try making a LZ matchfinder with it | 2019-05-01 00:28:18 |
| by building SAs for small blocks (64k?) and linking them | 2019-05-01 00:29:08 |
| for 64k-or-smaller blocks it would be 2N for SA, 2N for links to prev SA, 1N for prefix lengths, 1N for actual data | 2019-05-01 00:31:03 |
<FunkyBob> | yay, progress! | 2019-05-01 00:31:57 |
| first block got compression... | 2019-05-01 00:32:02 |
<Shelwien> | :) | 2019-05-01 00:32:05 |
<FunkyBob> | and fixed | 2019-05-01 00:32:31 |
<Shelwien> | wanna see the lzma bt4 matchfinder? | 2019-05-01 00:32:59 |
<FunkyBob> | not sure I can brain enough to grok it right now :) | 2019-05-01 00:33:48 |
<Shelwien> | http://nishi.dreamhosters.com/u/mf_bt4_v0.rar | 2019-05-01 00:33:59 |
| it generates readable output | 2019-05-01 00:34:19 |
<FunkyBob> | 51,518,328 | 2019-05-01 00:36:32 |
| is my initial result for enwik8 | 2019-05-01 00:36:40 |
<Shelwien> | yeah, better switch to BWT i guess :) | 2019-05-01 00:37:30 |
| bsc output for enwik8 is 24M _before_ actual entropy coding | 2019-05-01 00:37:58 |
<maniscalco_> | what are the steps for BSC again? BWT->QLFC and anything else? | 2019-05-01 00:42:33 |
<Shelwien> | also LZP, segmentation | 2019-05-01 00:43:01 |
<maniscalco_> | enwik8 is pretty stationary so its not difficult to expect it to be greatly reduced like that before entropy coding | 2019-05-01 00:43:15 |
<Shelwien> | and QLFC is not just reverse-BWT | 2019-05-01 00:43:20 |
<maniscalco_> | I'm not overly familiar with QLFC. only in passing | 2019-05-01 00:43:42 |
<Shelwien> | * reverse-MTF i mean | 2019-05-01 00:43:44 |
| its actually reverse-MTF + RLE + VL bitcode + order1 CM | 2019-05-01 00:44:30 |
<maniscalco_> | reverse mtf? explain please | 2019-05-01 00:45:35 |
<Shelwien> | http://nishi.dreamhosters.com/u/bsc_usage.txt | 2019-05-01 00:45:40 |
<maniscalco_> | jesus Eugene. you have a link for everything, don't you? | 2019-05-01 00:46:06 |
<Shelwien> | that's my file dump, i just upload stuff | 2019-05-01 00:46:22 |
| mtf can be better explained in terms of distance coding | 2019-05-01 00:46:47 |
| mtf turns symbols into counts of different symbols between two occurrences of current symbol | 2019-05-01 00:47:32 |
<maniscalco_> | yeah, i'm familar with that. DC is from early 2000s | 2019-05-01 00:47:37 |
| Binder, I believe was the guy's name. | 2019-05-01 00:48:06 |
<Shelwien> | while DC encodes distance not counting new symbols | 2019-05-01 00:48:17 |
| and QLFC kinda uses MTF ranks | 2019-05-01 00:48:44 |
| but they're counts of new symbols not since last symbol occurrence, like normal | 2019-05-01 00:49:03 |
| but _to_ next occurrence | 2019-05-01 00:49:10 |
| so statistically its the same MTF | 2019-05-01 00:49:19 |
| but decoder knows the actual symbol when reading its rank | 2019-05-01 00:49:41 |
| while normally with MTF we'd read the rank, then decode the symbol from it | 2019-05-01 00:49:59 |
| and apparently using actual symbol as context for its rank | 2019-05-01 00:50:19 |
| can improve entropy coding a lot | 2019-05-01 00:50:26 |
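For reference, here is plain MTF, the baseline of this discussion, as a short sketch: the rank emitted for a symbol equals the number of distinct symbols seen since its previous occurrence, so runs become rank 0. QLFC (as described above) emits statistically the same ranks but arranged so the decoder already knows the symbol when it reads the rank, letting the symbol serve as context.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Plain move-to-front transform over a 256-symbol alphabet.
static std::vector<uint8_t> mtf_encode(const std::vector<uint8_t>& in) {
    uint8_t table[256];
    for (int i = 0; i < 256; i++) table[i] = uint8_t(i);  // identity ranks
    std::vector<uint8_t> out;
    out.reserve(in.size());
    for (uint8_t c : in) {
        uint8_t rank = 0;
        while (table[rank] != c) rank++;           // current rank of c
        out.push_back(rank);
        for (uint8_t r = rank; r > 0; r--)         // move c to the front
            table[r] = table[r - 1];
        table[0] = c;
    }
    return out;
}
```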
| there's been a recent improvement on that, i guess | 2019-05-01 00:58:09 |
| https://github.com/loxxous/Behemoth-Rank-Coding | 2019-05-01 00:58:10 |
| its similar, but also vectorized | 2019-05-01 00:58:28 |
<maniscalco_> | yes, lucas' work. | 2019-05-01 00:58:55 |
<Shelwien> | so, as to m03 | 2019-05-01 01:02:09 |
| i think it extracts the original prefix context for a symbol | 2019-05-01 01:02:31 |
<maniscalco_> | it parses the boundaries of each context at each order, one order at a time. | 2019-05-01 01:03:11 |
<Shelwien> | so it becomes possible to compress as well as the best prefix CM | 2019-05-01 01:03:17 |
| but it still processes symbols not in original order, but rather in order of contexts | 2019-05-01 01:03:47 |
| there're certainly some benefits to that | 2019-05-01 01:04:05 |
| but unfortunately plain 8-bit text is not that popular anymore | 2019-05-01 01:04:32 |
<maniscalco_> | so if you had the input `swiss miss` then the BWT is `swm ssiiss` | 2019-05-01 01:04:33 |
| The order zero context boundaries are trivially easy to locate for both the encoder and decoder: | 2019-05-01 01:04:52 |
<Shelwien> | so the question is how to modify BWT | 2019-05-01 01:05:07 |
| to at least compress 16-bit alphabet for unicode | 2019-05-01 01:05:17 |
| or 2D for pictures | 2019-05-01 01:05:26 |
<maniscalco_> | s|wm| |ssiis|s | 2019-05-01 01:05:44 |
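A quick way to see those order-0 boundaries in Python: compute a naive sentinel-terminated BWT and note that the order-0 context boundaries are just the cumulative symbol counts, which both encoder and decoder can derive from frequencies alone. (This sketch appends an explicit `$` sentinel, so its output differs slightly from the sentinel-free `swm ssiiss` quoted above.)

```python
def bwt(s, sentinel="$"):
    """Naive rotation-sort BWT with an explicit sentinel appended."""
    assert sentinel not in s
    t = s + sentinel
    rots = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rots)

def order0_boundaries(bwt_out):
    """Order-0 context boundaries = cumulative counts of the sorted symbols.

    Both encoder and decoder can compute these from symbol frequencies alone.
    """
    bounds, pos = [], 0
    for sym in sorted(set(bwt_out)):
        pos += bwt_out.count(sym)
        bounds.append(pos)
    return bounds
```

M03 then refines these order-0 intervals into order-1, order-2, ... sub-intervals, one order at a time, on both sides of the codec.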
<Shelwien> | yeah, as i mentioned before, there've been some static CM attempts | 2019-05-01 01:06:23 |
<maniscalco_> | there's no reason this couldn't be used for 16 bit symbols as well, really | 2019-05-01 01:06:46 |
| you do need a bit more space since symbols are now 2 bytes wide but I think that's it | 2019-05-01 01:07:08 |
<Shelwien> | for 16bit i think there'd be scaling issues because of cpu cache size | 2019-05-01 01:07:17 |
| where with 8-bit symbol everything is nicely in L1 | 2019-05-01 01:07:31 |
| with 16-bit symbols it won't fit | 2019-05-01 01:07:41 |
<maniscalco_> | sure, there would be a performance issue but in terms of algorithmic performance, it's all the same | 2019-05-01 01:08:02 |
<Shelwien> | anyway, ideally we'd need a generalized transform | 2019-05-01 01:08:51 |
| which would sort symbols based on whatever function of context | 2019-05-01 01:09:12 |
| its actually possible to implement like ST | 2019-05-01 01:09:36 |
<maniscalco_> | I thought about that in the past. Somehow consider distance in the sort | 2019-05-01 01:09:45 |
<Shelwien> | simply compute the index for the current symbol and output the symbol to stream[index] | 2019-05-01 01:10:12 |
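The "output the symbol to stream[index]" idea can be sketched as an ST-style (sorted transform) pass: bucket each symbol by some function of its preceding context, keeping original text order within each bucket. This is a hypothetical illustration, not the ST or m19 implementation; the point is that `ctx_fn` can be any function of context (a hash, a truncated context, a 2D neighborhood for pictures).

```python
from collections import defaultdict

def sorted_transform(data, order=2, ctx_fn=None):
    """ST-style forward transform: group symbols by (a function of) their
    preceding order-k context, preserving text order inside each bucket."""
    if ctx_fn is None:
        ctx_fn = lambda ctx: ctx            # identity: plain order-k ST
    buckets = defaultdict(list)
    for i, sym in enumerate(data):
        ctx = data[max(0, i - order):i]     # up to `order` preceding symbols
        buckets[ctx_fn(ctx)].append(sym)
    # concatenate buckets in sorted key order
    return "".join("".join(buckets[k]) for k in sorted(buckets))
```

Inversion needs the bucket sizes (the context frequencies), which is the usual ST bookkeeping.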
<maniscalco_> | so that the transform wouldn't attempt to associate 'similar' contexts that are a million miles apart | 2019-05-01 01:10:15 |
<Shelwien> | well, that's a big issue with BWT compression of sorted dictionaries :) | 2019-05-01 01:10:42 |
<maniscalco_> | exactly. and why st4 does better | 2019-05-01 01:10:53 |
| it's 'too stupid' to see the really good correlations that are toxic in reality | 2019-05-01 01:11:41 |
<Shelwien> | also, there was an interesting algorithm for serializing of pictures | 2019-05-01 01:11:59 |
<maniscalco_> | so it chooses whatever is close by, which in the case of dictionaries, IS the most similar | 2019-05-01 01:12:02 |
<Shelwien> | https://www.youtube.com/watch?v=cGGzZY8OYts | 2019-05-01 01:12:24 |
<maniscalco_> | the repo on my github does the full m03 parsing. it wouldn't be hard to modify it to do 16 bit symbols instead. | 2019-05-01 01:13:18 |
| that would give you a really good idea as to how much it would degrade | 2019-05-01 01:13:38 |
<Shelwien> | yeah, but we need it to work not much slower than 8-bit, otherwise there's no point | 2019-05-01 01:13:53 |
<maniscalco_> | only one way to find out | 2019-05-01 01:15:10 |
<Shelwien> | i got reminded of that idea by NNCP | 2019-05-01 01:15:50 |
| it has a self-made preprocessor which extends the alphabet to 16 bits | 2019-05-01 01:16:17 |
| then iteratively finds most frequent strings and assigns them to new symbols | 2019-05-01 01:16:36 |
| https://bellard.org/nncp/ | 2019-05-01 01:17:05 |
| https://encode.ru/threads/2702-Adaptive-Lossless-Prediction-impl-in-C | 2019-05-01 01:21:10 |
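The NNCP preprocessor's idea (iteratively promote the most frequent strings to fresh 16-bit symbols) can be sketched as a greedy BPE-style pair replacement. This is a simplification for illustration, not Bellard's actual preprocess.c; the names and the tuple-based "new symbol" encoding are made up, and the `min_freq` parameter plays the role of the cut-off frequency mentioned below.

```python
from collections import Counter

def extend_alphabet(tokens, max_new_symbols, min_freq=2):
    """Greedy BPE-style alphabet extension: repeatedly replace the most
    frequent adjacent pair with a fresh symbol (here: a ("NEW", id) tuple)."""
    tokens = list(tokens)
    dictionary = {}
    for _ in range(max_new_symbols):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < min_freq:                 # cut-off frequency: quit early
            break
        new_sym = len(dictionary)           # id of the freshly allocated symbol
        dictionary[new_sym] = pair
        out, i = [], 0
        while i < len(tokens):              # rewrite the stream with the new symbol
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(("NEW", new_sym))
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens, dictionary
```

The higher the cut-off, the sooner the loop stops, which matches the behavior described for the real preprocessor.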
<maniscalco_> | so m03 has almost no performance loss for 16 bit symbols | 2019-05-01 02:05:26 |
| at least for the parsing | 2019-05-01 02:05:44 |
| I kept the BWT the same (8 bit) and then widened the BWT output to 16 bits per symbol | 2019-05-01 02:06:09 |
| and process as if they were always 16 bit symbols | 2019-05-01 02:06:17 |
| results for 16 bit enwik8: | 2019-05-01 02:06:47 |
| ./build/release/bin/m19 3 ~/programming/testset/enwik/enwik8 | 2019-05-01 02:06:49 |
| m19 - next generation BWT compressor. Author: M.A. Maniscalco (2018 - 2019) | 2019-05-01 02:06:49 |
| *** This build is pre-alpha and demonstrates both M19 and M03 context parsing only *** | 2019-05-01 02:06:49 |
| max order = 5317 | 2019-05-01 02:06:49 |
| elapsed time: 8288 ms, 11.5067 MB/sec | 2019-05-01 02:06:49 |
| and 8 bit normal enwik8: | 2019-05-01 02:07:14 |
| ./build/release/bin/m19 3 ~/programming/testset/enwik/enwik8 | 2019-05-01 02:07:15 |
| m19 - next generation BWT compressor. Author: M.A. Maniscalco (2018 - 2019) | 2019-05-01 02:07:15 |
| *** This build is pre-alpha and demonstrates both M19 and M03 context parsing only *** | 2019-05-01 02:07:15 |
| max order = 5317 | 2019-05-01 02:07:15 |
| elapsed time: 8035 ms, 11.869 MB/sec | 2019-05-01 02:07:15 |
| +250ms | 2019-05-01 02:07:22 |
| about 2 sec of that time is just the BWT time | 2019-05-01 02:07:53 |
<Shelwien> | can you try it with nncp preprocessor? | 2019-05-01 02:08:23 |
| https://bellard.org/nncp/nncp-2019-04-13.tar.gz | 2019-05-01 02:08:45 |
| just compile preprocess.c | 2019-05-01 02:09:21 |
| then run .\preprocess c words.txt enwik9 enwik9.bin 65500 2 | 2019-05-01 02:09:56 |
<maniscalco_> | and come to think of it, that 250ms is probably the overhead for widening the array | 2019-05-01 02:13:58 |
| will do. just pushing the change to m03/19 first | 2019-05-01 02:14:16 |
| https://github.com/michaelmaniscalco/m19.git just in case anyone wants to confirm for themselves | 2019-05-01 02:22:27 |
<Shelwien> | post-BWT it's less interesting, especially if you just expand 8 to 16 bits | 2019-05-01 02:24:31 |
| but that preprocessor can really use the full 16-bit alphabet | 2019-05-01 02:25:01 |
<maniscalco_> | nncp-2019-04-13$ make | 2019-05-01 02:25:12 |
| gcc -O3 -Wall -Wpointer-arith -g -fno-math-errno -fno-trapping-math -MMD -Wno-format-truncation -c -o nncp.o nncp.c | 2019-05-01 02:25:12 |
| gcc -O3 -Wall -Wpointer-arith -g -fno-math-errno -fno-trapping-math -MMD -Wno-format-truncation -c -o cp_utils.o cp_utils.c | 2019-05-01 02:25:12 |
| gcc -O3 -Wall -Wpointer-arith -g -fno-math-errno -fno-trapping-math -MMD -Wno-format-truncation -c -o arith.o arith.c | 2019-05-01 02:25:12 |
| gcc -o nncp nncp.o cp_utils.o arith.o libnc.a -l | 2019-05-01 02:25:12 |
<Shelwien> | ? | 2019-05-01 02:25:29 |
| no need really | 2019-05-01 02:25:33 |
<maniscalco_> | libnc needs to be compiled with PIC | 2019-05-01 02:25:46 |
| Sure, it's more interesting with a 16 bit BWT. But I can't just write one of those up in 20 minutes (^: | 2019-05-01 02:26:42 |
<Shelwien> | yes, i mean, just type "gcc preprocess.c" | 2019-05-01 02:26:43 |
<maniscalco_> | k | 2019-05-01 02:26:49 |
| gcc preprocess.c | 2019-05-01 02:27:15 |
| /tmp/ccdVrLNV.o: In function `get_n_bits': | 2019-05-01 02:27:15 |
| preprocess.c:(.text+0x11c2): undefined reference to `log2' | 2019-05-01 02:27:15 |
| preprocess.c:(.text+0x12a6): undefined reference to `log2' | 2019-05-01 02:27:15 |
| /tmp/ccdVrLNV.o: In function `compute_score': | 2019-05-01 02:27:15 |
| preprocess.c:(.text+0x13b9): undefined reference to `log2' | 2019-05-01 02:27:15 |
| /tmp/ccdVrLNV.o: In function `compute_entropy': | 2019-05-01 02:27:15 |
| preprocess.c:(.text+0x1b16): undefined reference to `log2' | 2019-05-01 02:27:15 |
| preprocess.c:(.text+0x1c17): undefined | 2019-05-01 02:27:15 |
| log2 as in the math function ? | 2019-05-01 02:27:31 |
<Shelwien> | it's probably in the headers, let me see... | 2019-05-01 02:28:03 |
| well, just define it then | 2019-05-01 02:28:54 |
<maniscalco_> | added -lm at the end and it worked | 2019-05-01 02:29:36 |
<Shelwien> | cool :) | 2019-05-01 02:29:44 |
<maniscalco_> | no words.txt file though | 2019-05-01 02:34:32 |
<Shelwien> | it's generated by the preprocessor | 2019-05-01 02:35:04 |
<maniscalco_> | how long should the preprocessing take? | 2019-05-01 02:40:59 |
<Shelwien> | algorithm is dumb, so not fast | 2019-05-01 02:42:07 |
| depends on the last number - it's a cut-off frequency | 2019-05-01 02:42:18 |
| the higher it is, the earlier it would quit | 2019-05-01 02:42:41 |
<maniscalco_> | regarding m03 and m19 with 16 bit symbols. I can't test it with real 16 bit data because the algorithm only works for *real* bwt data. if I made fake 16 bit input it wouldn't work. | 2019-05-01 02:46:08 |
| To test a real world input I would need a BWT that does 16 bit symbols. Do you know of any? | 2019-05-01 02:46:45 |
<Shelwien> | i can quickly make a qsort one | 2019-05-01 02:46:59 |
<maniscalco_> | it would allow us to test the parsing with real data. although it shouldn't affect m03/19 performance at all in any case | 2019-05-01 02:47:46 |
| right about now we should be hearing the ghost of David Scott saying "you should be using BWTS instead" | 2019-05-01 02:49:32 |
| (^: | 2019-05-01 02:49:34 |
<Shelwien> | http://nishi.dreamhosters.com/u/BWT16.cpp | 2019-05-01 02:51:51 |
| i actually added BWTS and bitwise BWTS to .pa, for cdm | 2019-05-01 02:52:19 |
| turned out to be the perfect transform for that kind of thing | 2019-05-01 02:52:46 |
<maniscalco_> | thanks for the BWT .. still has an issue though. M03/19 needs to know the sentinel position in the BWT | 2019-05-01 03:04:49 |
| otherwise it will fail | 2019-05-01 03:04:56 |
| so it has to be a BWT that uses $ | 2019-05-01 03:05:13 |
| and the preprocessor is still working away ... | 2019-05-01 03:05:38 |
<Shelwien> | yeah, i also started it: | 2019-05-01 03:07:06 |
| Z:\003>timetest .\a c words.txt enwik9 enwik9.bin 65500 256 | 2019-05-01 03:07:08 |
| input: 1000000000 bytes | 2019-05-01 03:07:08 |
| after case/space preprocessing: 1119415035 symbols | 2019-05-01 03:07:08 |
| 4396 280965958 348346311 | 2019-05-01 03:07:08 |
| first number is the number of allocated symbols | 2019-05-01 03:07:24 |
| http://nishi.dreamhosters.com/u/BWT16.rar | 2019-05-01 03:21:55 |
| more complete version with index and unBWT | 2019-05-01 03:22:14 |
| and it kinda uses EOF, if you mean the comparison wrap | 2019-05-01 03:22:54 |
| prev version sorts in reverse order | 2019-05-01 03:29:19 |
| but this one sorts in forward order | 2019-05-01 03:29:29 |
| and both don't wrap around, just stop at end of text buffer | 2019-05-01 03:29:48 |
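The no-wraparound variant described above (sort suffixes in forward order, stop comparisons at the end of the buffer, output each suffix's preceding symbol) can be sketched naively in Python. This is an illustration of the scheme, not the BWT16.cpp code; with a list of 16-bit ints as `data` the same logic covers the 16-bit alphabet case.

```python
def bwt_suffix_sort(data):
    """BWT variant via plain suffix sort: no wraparound -- comparisons just
    stop at the end of the buffer. Outputs the symbol preceding each suffix;
    the suffix at position 0 has no predecessor, so its rank is returned as
    the index needed for inversion."""
    n = len(data)
    order = sorted(range(n), key=lambda i: data[i:])   # forward-order suffix sort
    out = [data[i - 1] for i in order if i > 0]        # preceding symbols
    primary = order.index(0)                           # where the full text landed
    return "".join(out), primary
```

The `key=lambda i: data[i:]` slice comparison is exactly the "stop at end of text buffer" behavior: shorter suffixes that are prefixes of longer ones sort first instead of wrapping around.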
<maniscalco_> | ok. I'm going to sign off for the night and leave the preprocessor running overnight. hopefully it will be done by the morning. I'll let you know | 2019-05-01 03:31:24 |
<Shelwien> | ok :) | 2019-05-01 03:31:39 |
| !next | 2019-05-01 03:34:42 |