*** NCDR has left the channel | 2019-05-04 14:54:09 |
<FunkyBob> | o/ | 2019-05-04 17:21:21 |
*** NCDR has joined the channel | 2019-05-04 20:08:15 |
<Shelwien> | 3. LZ4 supports match length >130. | 2019-05-04 20:08:17 |
| 4. you can skip zero distance | 2019-05-04 20:09:04 |
| 5. you can add rep-matches: a flag to skip repeated distance value | 2019-05-04 20:11:12 |
| 6. literal flag is not always bad. you can encode single literals without length | 2019-05-04 20:12:55 |
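Suggestion 5 (rep-matches) can be sketched in a few lines. This is a minimal illustrative decoder fragment, not code from the actual lzfb source: the names (`rep_state`, `decode_distance`) and the external flag bit are assumptions, but the idea is standard in LZMA-family codecs — keep the last distance and spend one flag bit to reuse it instead of re-encoding 16 bits.

```c
#include <stdint.h>
#include <stddef.h>

/* Rep-match sketch: remember the last distance used and let a single
   flag bit mean "same distance again". Repetitive data (tables, logs,
   XML such as enwik) hits the same distance constantly. */
struct rep_state { uint32_t last_dist; };

static uint32_t decode_distance(struct rep_state *st, int rep_flag,
                                const uint8_t *data, size_t *sptr) {
    if (rep_flag)
        return st->last_dist;            /* 1 flag bit instead of 2 bytes */
    uint32_t d = data[(*sptr)++];        /* otherwise read 16-bit LE distance */
    d |= (uint32_t)data[(*sptr)++] << 8;
    st->last_dist = d;
    return d;
}
```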
<FunkyBob> | skip zero distances? | 2019-05-04 21:04:08 |
| you mean zero length literal or matches? | 2019-05-04 21:05:06 |
<Shelwien> | i mean distances | 2019-05-04 21:54:48 |
| uint16_t offset = src->data[sptr++]; | 2019-05-04 21:55:08 |
| offset |= src->data[sptr++] << 8; | 2019-05-04 21:55:08 |
| what's the meaning of offset=0 here? | 2019-05-04 21:55:34 |
<FunkyBob> | ah | 2019-05-04 21:56:53 |
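Since a match with distance 0 is meaningless, the stored 16-bit offset can be biased by one, which extends the reachable window from 65535 to 65536 bytes for free. A minimal sketch of the decode side, with illustrative names (`read_offset_biased` is not from the lzfb source):

```c
#include <stdint.h>
#include <stddef.h>

/* Zero-distance trick: store (distance - 1) on disk, so the otherwise
   wasted offset value 0x0000 becomes distance 1, and 0xFFFF becomes
   distance 65536. */
static uint32_t read_offset_biased(const uint8_t *data, size_t *sptr) {
    uint32_t offset = data[(*sptr)++];          /* low byte  */
    offset |= (uint32_t)data[(*sptr)++] << 8;   /* high byte */
    return offset + 1;  /* distances 1..65536 instead of 0..65535 */
}
```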
<Shelwien> | btw, an interesting option would be to encode 7-bit literals | 2019-05-04 21:56:59 |
| at least for enwik :) | 2019-05-04 21:57:06 |
<FunkyBob> | :P | 2019-05-04 21:57:25 |
| am trying to stick to byte aligned | 2019-05-04 21:58:11 |
<Shelwien> | also, i've seen some weird LZ recently, with 4k window | 2019-05-04 21:58:11 |
<FunkyBob> | erk | 2019-05-04 21:58:22 |
<Shelwien> | it had shorter distance codes for values near 0 (that's normal), but also near window size | 2019-05-04 21:59:22 |
| i don't think 64k is different in that sense | 2019-05-04 22:00:01 |
<FunkyBob> | I can also get a bit of an improvement by passing 16MB at a time... but, well... | 2019-05-04 22:00:46 |
| enwik8 goes down to 45,812,181 | 2019-05-04 22:01:04 |
<Shelwien> | also where's #define __COMP_H__ ? | 2019-05-04 22:03:16 |
<FunkyBob> | oops | 2019-05-04 22:03:45 |
| my original reason for this project was to brush up on my C, so... :) | 2019-05-04 22:03:54 |
<Shelwien> | In file included from main.c:11: | 2019-05-04 22:04:24 |
| basiclz.inc:120:10: error: conflicting types for 'compress' | 2019-05-04 22:04:24 |
| uint32_t compress(struct buffer *src, struct buffer *dest) { | 2019-05-04 22:04:24 |
| ^~~~~~~~ | 2019-05-04 22:04:24 |
| In file included from main.c:9: | 2019-05-04 22:04:24 |
| comp.h:12:8: note: previous declaration of 'compress' was here | 2019-05-04 22:04:24 |
<FunkyBob> | just pushed it.. forgot it was edited | 2019-05-04 22:04:32 |
<Shelwien> | ok, compiled | 2019-05-04 22:05:00 |
| also use "rb", "wb" for fopen | 2019-05-04 22:05:38 |
| with "r" it won't work on windows | 2019-05-04 22:05:50 |
<FunkyBob> | silly windows :P | 2019-05-04 22:07:01 |
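The `"rb"`/`"wb"` point matters for any codec: on Windows, text-mode streams translate line endings and treat 0x1A as end-of-file, which corrupts binary data. A small sketch (function name is illustrative):

```c
#include <stdio.h>

/* Open both streams in binary mode. The "b" is a no-op on POSIX but
   required on Windows, where text mode ("r"/"w") translates CRLF and
   stops reading at a 0x1A byte -- fatal for compressed data. */
static int open_streams(const char *src, const char *dst,
                        FILE **in, FILE **out) {
    *in  = fopen(src, "rb");
    *out = fopen(dst, "wb");
    return (*in && *out) ? 0 : -1;
}
```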
<Shelwien> | got extra 5 bytes on decoding | 2019-05-04 22:07:13 |
<FunkyBob> | ? | 2019-05-04 22:07:28 |
| you mean from not having rb/wb ? | 2019-05-04 22:07:57 |
<Shelwien> | nope, it simply adds extra 5 zeroes at the end | 2019-05-04 22:13:10 |
<FunkyBob> | hrm | 2019-05-04 22:16:19 |
| what size is the source file? | 2019-05-04 22:16:25 |
<Shelwien> | 1463/5589 | 2019-05-04 22:17:13 |
<FunkyBob> | can you send me the original file to test with, please? | 2019-05-04 22:18:41 |
<Shelwien> | http://nishi.dreamhosters.com/u/lzfb_002.zip | 2019-05-04 22:21:09 |
| "original file" is lzfb.exe | 2019-05-04 22:21:31 |
<FunkyBob> | your numbers don't match what I see | 2019-05-04 22:25:35 |
| 24576 is how big lzfb.exe is, not 5589 | 2019-05-04 22:25:46 |
| that said, I get a segfault trying to decompress | 2019-05-04 22:26:00 |
| oh, no I don't... I get an assertion | 2019-05-04 22:26:18 |
<Shelwien> | i meant source size of main.c and basiclz.c that i downloaded | 2019-05-04 22:27:02 |
<FunkyBob> | ah | 2019-05-04 22:27:49 |
<Shelwien> | http://nishi.dreamhosters.com/u/lzfb_ofs_e8.png | 2019-05-04 22:29:49 |
| http://nishi.dreamhosters.com/u/lzfb_len_e8.png | 2019-05-04 22:31:30 |
| http://nishi.dreamhosters.com/u/lzfb_lenlit_e8.png | 2019-05-04 22:32:33 |
| { 0, 0, 0, 0, 0, 0, 0, 0, 57437, 34432, 26618, 32872, 0, 0, 0, 0, 6520, 2970, 2463, 3949, 0, 0, 0, 0, 1839, 795, 757, 1156, 0, 0, 0, 0, 976, | 2019-05-04 22:34:10 |
| matchlen seems buggy? | 2019-05-04 22:34:30 |
<FunkyBob> | hmm? | 2019-05-04 22:34:35 |
<Shelwien> | that match len occurrence counts for enwik8 | 2019-05-04 22:35:22 |
<FunkyBob> | ah | 2019-05-04 22:35:33 |
| ok, I'm going to go get some dinner, then head back to my hotel... and delve further into this :) | 2019-05-04 22:35:51 |
<Shelwien> | it's in decompress() in the archive that i posted | 2019-05-04 22:35:59 |
| (counting) | 2019-05-04 22:36:01 |
<FunkyBob> | back again | 2019-05-05 00:03:29 |
| Shelwien: are you saying the bug is in decompress? | 2019-05-05 00:06:25 |
<Shelwien> | probably in compress? | 2019-05-05 00:21:45 |
| anyway, it seems like it doesn't use certain ranges of len values, which should make compression worse | 2019-05-05 00:22:32 |
| minlen=8 may be ok, but why skip 12-15 etc? | 2019-05-05 00:23:43 |
<FunkyBob> | sorry? | 2019-05-05 00:25:59 |
<Shelwien> | ? | 2019-05-05 00:26:06 |
<FunkyBob> | skip 12 - 15? | 2019-05-05 00:26:07 |
| when am I doing that? | 2019-05-05 00:26:12 |
<Shelwien> | do you see the { ... } table above? | 2019-05-05 00:26:25 |
| that's match len freqs for enwik8 | 2019-05-05 00:26:49 |
<FunkyBob> | if it's not finding them, it's not finding them. | 2019-05-05 00:26:59 |
<Shelwien> | nope, it's enwik, not some binary structured file | 2019-05-05 00:27:14 |
| it can't have a 4-byte align | 2019-05-05 00:27:36 |
<FunkyBob> | you can see my code, I have not tried to bias it to any particular lengths | 2019-05-05 00:28:17 |
<Shelwien> | ok, lets see with lazy disabled - extra 5 bytes already disappeared btw | 2019-05-05 00:29:42 |
<FunkyBob> | with greedy? | 2019-05-05 00:29:56 |
| or with a git pull? | 2019-05-05 00:30:00 |
<Shelwien> | greedy | 2019-05-05 00:30:41 |
<FunkyBob> | hrm | 2019-05-05 00:31:18 |
<Shelwien> | but same align4 on enwik | 2019-05-05 00:31:34 |
<FunkyBob> | | 0| L,4, | 2019-05-05 00:36:54 |
| < 1< L,21, | 2019-05-05 00:36:54 |
| > 1> L,128, | 2019-05-05 00:36:55 |
| ok... they lose sync almost immediately :/ | 2019-05-05 00:37:03 |
<Shelwien> | it's ctzl -> ctzll | 2019-05-05 00:37:32 |
<FunkyBob> | that'd do it | 2019-05-05 00:38:04 |
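The `ctzl` -> `ctzll` fix explains the skipped length ranges: `__builtin_ctzl` takes `unsigned long`, which is 32 bits on Windows, so it silently truncated the 64-bit XOR word used for match extension; `__builtin_ctzll` (`unsigned long long`) is 64-bit everywhere. A sketch of the 8-bytes-at-a-time match extension (a GCC/Clang builtin and little-endian byte order are assumed; names are illustrative, not the actual lzfb code):

```c
#include <stdint.h>
#include <string.h>

/* Extend a match 8 bytes at a time. On a mismatch, the index of the
   first differing byte is trailing-zero-bits / 8 (little-endian).
   Using __builtin_ctzl here breaks on platforms where long is 32 bits. */
static uint32_t match_length(const uint8_t *a, const uint8_t *b, uint32_t max) {
    uint32_t len = 0;
    while (len + 8 <= max) {
        uint64_t va, vb;
        memcpy(&va, a + len, 8);   /* memcpy avoids unaligned-access UB */
        memcpy(&vb, b + len, 8);
        uint64_t x = va ^ vb;
        if (x)
            return len + ((uint32_t)__builtin_ctzll(x) >> 3);
        len += 8;
    }
    while (len < max && a[len] == b[len]) len++;  /* byte-wise tail */
    return len;
}
```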
| hrm... now my enwik test is failing :/ | 2019-05-05 00:39:32 |
| (I really appreciate your help on this) | 2019-05-05 00:40:35 |
<Shelwien> | { 0, 0, 0, 0, 69632, 62067, 53056, 44525, 35571, 26296, 19629, 14025, 10051, 7404, 5548, 4319, 3215, | 2019-05-05 00:41:03 |
| mine works | 2019-05-05 00:41:06 |
| need to fix minmatchlen to 4 i guess, or more maybe | 2019-05-05 00:41:39 |
<FunkyBob> | oh, I thought I did... must've been in a different version | 2019-05-05 00:42:11 |
<Shelwien> | ... 5, 7, 3, 1, 3, 1, 3, 2, 2, 3, 6, 1, 1, 1, 0, 2, 1, 552 } | 2019-05-05 00:42:28 |
| that's len=128 :) | 2019-05-05 00:42:42 |
<FunkyBob> | mmm? | 2019-05-05 00:43:03 |
| oh | 2019-05-05 00:43:11 |
| yeah, I think I checked before on max len hits | 2019-05-05 00:43:31 |
| I checked how many extra bytes an LZ4-ish "keep emitting 255..." scheme would take; it was ~3.6k | 2019-05-05 00:46:37 |
| that is, that many extra bytes in length counters... | 2019-05-05 00:46:46 |
| so, a lot of savings to be had | 2019-05-05 00:46:49 |
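The LZ4-style length continuation being costed above is tiny to implement: any length that overflows the token field is emitted as a run of 255-bytes plus a final remainder byte. A minimal encode-side sketch:

```c
#include <stdint.h>
#include <stddef.h>

/* LZ4-style extended length: emit 255 while the remaining length is
   >= 255, then one final byte with the remainder. The decoder sums
   bytes until it sees one below 255. */
static size_t put_ext_len(uint8_t *dst, uint32_t len) {
    size_t n = 0;
    while (len >= 255) { dst[n++] = 255; len -= 255; }
    dst[n++] = (uint8_t)len;
    return n;  /* number of bytes written */
}
```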
<Shelwien> | maybe you can make a special case for len<16+min, ofs<256 | 2019-05-05 00:47:10 |
| like I111LLLL OOOOOOOO | 2019-05-05 00:47:42 |
| one byte shorter | 2019-05-05 00:47:48 |
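Shelwien's `I111LLLL OOOOOOOO` proposal packs short nearby matches into two bytes. The exact bit assignment below is an interpretation (I = match flag, bits 6..4 all set to mark the short form, LLLL = len - min), and `SHORT_MIN = 4` is assumed; it is a sketch of the idea, not a spec:

```c
#include <stdint.h>

#define SHORT_MIN 4  /* assumed minimum match length */

/* Short-match token: for len < SHORT_MIN+16 and ofs < 256, emit one
   flag/length byte "I111LLLL" plus a single offset byte -- one byte
   shorter than the general 16-bit-offset case. */
static int put_short_match(uint8_t dst[2], uint32_t len, uint32_t ofs) {
    if (len < SHORT_MIN || len >= SHORT_MIN + 16 || ofs >= 256)
        return 0;                                      /* doesn't fit */
    dst[0] = 0x80 | 0x70 | (uint8_t)(len - SHORT_MIN); /* I=1, 111, LLLL */
    dst[1] = (uint8_t)ofs;
    return 2;                                          /* bytes written */
}
```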
<FunkyBob> | well, I'll just commit the fixes we have so far :) | 2019-05-05 00:48:35 |
| odd... compression got worse: 47146900 | 2019-05-05 00:48:59 |
<Shelwien> | lazy is 45,877,684 here | 2019-05-05 00:49:26 |
<FunkyBob> | did you set MIN_MATCH_LEN to 4 ? | 2019-05-05 00:50:16 |
<Shelwien> | no | 2019-05-05 00:50:36 |
| let's see | 2019-05-05 00:50:38 |
<FunkyBob> | ah, but that also requires changing find_match to test for len >= not len > | 2019-05-05 00:51:00 |
<Shelwien> | 47,146,924 | 2019-05-05 00:51:39 |
| hm | 2019-05-05 00:51:53 |
<FunkyBob> | and now it's way slower :/ | 2019-05-05 00:52:52 |
| oh duh.. | 2019-05-05 00:53:05 |
| ignore the slower git :) | 2019-05-05 00:53:12 |
<Shelwien> | 45,877,524 with min 4 | 2019-05-05 00:55:15 |
<FunkyBob> | 45877500 | 2019-05-05 00:56:47 |
| I'm sure I checked which version of ctz it was emitting, that it was the 64-bit version :/ | 2019-05-05 00:58:18 |
<Shelwien> | maybe on linux it works differently, dunno | 2019-05-05 00:58:52 |
<FunkyBob> | I've added a debug mode that will print out CSV lines of either "L,{len}," or "M,{len},{offset}" | 2019-05-05 01:00:00 |
<Shelwien> | wanna test how lzma parsing works with your format? | 2019-05-05 01:00:55 |
<FunkyBob> | but, yeah, perhaps objdump was confused, or I read the docs wrong, because as your stats showed, it wasn't getting good match lengths | 2019-05-05 01:01:11 |
| umm... sure? | 2019-05-05 01:01:22 |
<Shelwien> | http://nishi.dreamhosters.com/u/lzma_delrep_v1.rar | 2019-05-05 01:01:29 |
| encode a file with lzma -d16 | 2019-05-05 01:02:09 |
| remove rep codes | 2019-05-05 01:02:20 |
| then convert to your format | 2019-05-05 01:02:30 |
<FunkyBob> | blah... lazy still gets the wrong size | 2019-05-05 01:08:29 |
| ok, I see the bug | 2019-05-05 01:11:20 |
| oh, nope | 2019-05-05 01:11:45 |
<unic0rn> | someone's having fun i see | 2019-05-05 01:20:17 |
<FunkyBob> | :) | 2019-05-05 01:22:23 |
<unic0rn> | out of curiosity, what's your memory usage and compression time for enwik8? | 2019-05-05 01:24:32 |
<FunkyBob> | memory usage I haven't measured, but it'd mostly be static... input buffer, output buffer, a 64k x 32bit chain head hash table, and MAX_FRAME_SIZE x 32bit chain links buffer | 2019-05-05 01:26:22 |
| MAX_FRAME_SIZE is currently 4MB | 2019-05-05 01:26:31 |
| so... 16MB + 256k for tables, | 2019-05-05 01:27:03 |
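The memory layout FunkyBob describes is a classic hash-chain match finder. A sketch with the sizes from the chat (64 Ki 32-bit heads = 256 KB, plus one 32-bit link per position in a 4 MB frame = 16 MB); struct and function names are illustrative:

```c
#include <stdint.h>
#include <stdlib.h>

#define HASH_BITS      16          /* 64 Ki chain heads     */
#define MAX_FRAME_SIZE (4u << 20)  /* 4 MB frame            */

/* head[h]  = most recent position whose 16-bit hash is h   (256 KB)
   chain[p] = previous position with the same hash as p     (16 MB)  */
struct matcher {
    uint32_t head[1u << HASH_BITS];
    uint32_t *chain;
};

static struct matcher *matcher_new(void) {
    struct matcher *m = calloc(1, sizeof *m);  /* zeroed heads */
    if (m) m->chain = calloc(MAX_FRAME_SIZE, sizeof *m->chain);
    return m;
}
```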
| M,131,1372 | 2019-05-05 01:29:12 |
| M,131,5 | 2019-05-05 01:29:12 |
| M,24,5 | 2019-05-05 01:29:12 |
| now that's interesting... those are the last 3 actions for the file that's 5 bytes over size | 2019-05-05 01:29:24 |
<unic0rn> | so it's small and fast | 2019-05-05 01:30:29 |
<FunkyBob> | so it seems | 2019-05-05 01:30:48 |
<unic0rn> | not interested in higher ratio? | 2019-05-05 01:31:34 |
<Shelwien> | ok, i got it to correctly decode book1 from lzma parsing | 2019-05-05 01:31:36 |
| worse compression on book1 :) | 2019-05-05 01:32:11 |
<FunkyBob> | unic0rn: in time... | 2019-05-05 01:32:36 |
| unic0rn: this is mostly an exercise in refreshing my C skills :) | 2019-05-05 01:32:49 |
| once I debug this lazy parsing bug, I might move onto something that compresses better | 2019-05-05 01:33:07 |
<Shelwien> | 768,771 BOOK1 | 2019-05-05 01:35:04 |
| 284,257 book1.lzma // lzma.exe e BOOK1 book1.lzma -d16 -fb273 -mc999 -lc0 -lp0 -pb0 -mt1 | 2019-05-05 01:35:05 |
| 840,823 book1.dec // lzma tokens w/o entropy coding | 2019-05-05 01:35:05 |
| 848,751 book1_norep.dec // delrep_v0 to leave only matches at literals | 2019-05-05 01:35:05 |
| 394,403 0.lzfb // lzfb with minmatch=2 | 2019-05-05 01:35:05 |
| 397,409 book1_norep.lzfb // conversion result | 2019-05-05 01:35:05 |
<unic0rn> | hah. i'm going a different route. all work in progress, decompression isn't even started yet, but will be simple. compression is a work in progress, moving forward with it while optimizing stuff on the fly as needed. with current buffers it eats up 150mb ram, speed isn't great but there's a lot of headroom for improvement. as for ratio... we'll see. can't guess yet, gotta decide on a few algorithm details first | 2019-05-05 01:35:26 |
<Shelwien> | decoding is important :) | 2019-05-05 01:35:58 |
<unic0rn> | to sort out bugs, yeah. | 2019-05-05 01:36:26 |
<FunkyBob> | decoding is essential :) | 2019-05-05 01:36:37 |
<unic0rn> | but on its own, how hard can it be to code a damn huffman decoder | 2019-05-05 01:36:43 |
<Shelwien> | depending on speed optimization | 2019-05-05 01:37:04 |
*** NCDR has left the channel | 2019-05-05 01:37:19 |
<unic0rn> | i mean, sure it's mandatory, but compared to compression, complexity is close to 0 | 2019-05-05 01:37:24 |
<FunkyBob> | As my dad said - "You've turned the avocado into guacamole... now can you turn the guacamole back into avocado?" | 2019-05-05 01:37:29 |
<Shelwien> | https://encode.ru/threads/1183-Huffman-code-generator :) | 2019-05-05 01:37:47 |
| well, i did that with deflate->lzma before | 2019-05-05 01:38:19 |
<unic0rn> | lol | 2019-05-05 01:39:03 |
| that seems bloated for no reason ;) | 2019-05-05 01:39:09 |
<Shelwien> | lzma doesn't have literal runs | 2019-05-05 01:39:24 |
| i converted them, but it didn't optimize for that | 2019-05-05 01:39:43 |
<unic0rn> | and on more serious note, huffman isn't a problem. "what do i encode" is ;) | 2019-05-05 01:41:14 |
<Shelwien> | https://encode.ru/threads/1288-LZMA-markup-tool?p=25481&viewfull=1#post25481 | 2019-05-05 01:41:30 |
| btw, its possible to get better compression than huffman with almost the same decoder | 2019-05-05 01:43:10 |
| well, FSE, but aside from that | 2019-05-05 01:43:27 |
<FunkyBob> | yeah, was thinking I might try an entropy coder next | 2019-05-05 01:44:18 |
<Shelwien> | these are fast: https://encode.ru/threads/3109-How-to-build-Bonfield-s-rANS-coders-on-windows | 2019-05-05 01:45:16 |
<unic0rn> | there's no telling how my "huffman" will work. i certainly won't be generating the tree exactly like the standard variant does | 2019-05-05 01:45:38 |
<FunkyBob> | yeah, I think I've mostly got my head around how to build a tANS table | 2019-05-05 01:45:41 |
<Shelwien> | then it won't be huffman anymore? :) | 2019-05-05 01:47:36 |
<FunkyBob> | :) | 2019-05-05 01:47:57 |
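The tANS table build FunkyBob mentions starts with a symbol-spread step. This sketch uses the spread from Yann Collet's FSE (step = L/2 + L/8 + 3, coprime with the power-of-two table size); it assumes the per-symbol counts have already been normalized to sum to L, and is only the first stage of a full tANS build:

```c
#include <stdint.h>

#define TABLE_LOG 6
#define L (1 << TABLE_LOG)  /* table size, 64 slots here */

/* Scatter each symbol's normalized count of slots across the table
   using a step coprime with L, so every slot is visited exactly once
   and each symbol's slots end up roughly evenly spread. */
static void spread_symbols(const uint16_t *norm_count, int nsyms,
                           uint8_t *table) {
    const uint32_t step = (L >> 1) + (L >> 3) + 3;  /* 43, coprime with 64 */
    uint32_t pos = 0;
    for (int s = 0; s < nsyms; s++)
        for (uint16_t i = 0; i < norm_count[s]; i++) {
            table[pos] = (uint8_t)s;
            pos = (pos + step) & (L - 1);
        }
}
```

The later stages (assigning state ranges and emission bits per slot) build on this table.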
| so, it seems the len fixes and MinMatch = 4... as well as no longer special-casing the first 4 bytes... have improved my silesia scores by up to 600-ish bytes at times | 2019-05-05 01:48:30 |
| now to test enwik9 | 2019-05-05 01:49:02 |
| Shelwien: book1 is ... Calgary corpus? or Canterbury? | 2019-05-05 01:51:29 |
<Shelwien> | calgary | 2019-05-05 01:51:37 |
| there's also this: http://ctxmodel.net/sh_samples_1.rar | 2019-05-05 01:53:01 |
| it has russian texts, finnish dictionary and some binary files | 2019-05-05 01:53:54 |
| gets weird results from some "optimized" compressors | 2019-05-05 01:54:37 |
<FunkyBob> | heh | 2019-05-05 01:55:07 |
| ok, well, I'm tired... thanks for the help, Shelwien | 2019-05-05 02:07:18 |
| will talk tomorrow, I hope | 2019-05-05 02:07:24 |
<Shelwien> | ok :) | 2019-05-05 02:07:38 |
<FunkyBob> | still not sure why that extra 5 bytes is happening | 2019-05-05 02:07:58 |
<Shelwien> | probably match comparison reading past the end of the buffer | 2019-05-05 02:08:53 |
<unic0rn> | or they just have jet lag | 2019-05-05 02:48:32 |
<Shelwien> | !next | 2019-05-05 03:14:56 |