*** NCDR has left the channel2019-05-04 14:54:09
<FunkyBob> o/2019-05-04 17:21:21
*** NCDR has joined the channel2019-05-04 20:08:15
<Shelwien> 3. LZ4 supports match length >130. 2019-05-04 20:08:17
 4. you can skip zero distance2019-05-04 20:09:04
 5. you can add rep-matches: a flag to skip repeated distance value2019-05-04 20:11:12
 6. literal flag is not always bad. you can encode single literals without length2019-05-04 20:12:55
<FunkyBob> skip zero distances?2019-05-04 21:04:08
 you mean zero length literal or matches?2019-05-04 21:05:06
<Shelwien> i mean distances2019-05-04 21:54:48
  uint16_t offset = src->data[sptr++];2019-05-04 21:55:08
  offset |= src->data[sptr++] << 8;2019-05-04 21:55:08
 what's the meaning of offset=0 here?2019-05-04 21:55:34
<FunkyBob> ah2019-05-04 21:56:53
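A distance of 0 would mean the match starts at the current position, which is impossible, so that code point is free; Shelwien's points 4 and 5 amount to reusing it as a rep-match. A minimal decoder-side sketch building on the two quoted lines (last_offset is a hypothetical variable carried across tokens):
  uint16_t offset = src->data[sptr++];
  offset |= src->data[sptr++] << 8;
  if (offset == 0)
      offset = last_offset;   /* rep-match: repeat the previous distance */
  else
      last_offset = offset;   /* remember it for the next rep-match */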
<Shelwien> btw, an interesting option would be to encode 7-bit literals2019-05-04 21:56:59
 at least for enwik :)2019-05-04 21:57:06
<FunkyBob> :P2019-05-04 21:57:25
 am trying to stick to byte aligned2019-05-04 21:58:11
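Shelwien's 7-bit aside works because enwik8 is plain ASCII, so the high bit of every literal byte is zero and eight literals fit in seven bytes. A sketch of such a packer, assuming ASCII-only input (pack7 is a hypothetical helper); it breaks byte alignment, hence the reluctance above:
  #include <stdint.h>
  #include <stddef.h>

  /* Pack n ASCII bytes (high bit clear) into ceil(n*7/8) bytes, LSB-first. */
  size_t pack7(const uint8_t *in, size_t n, uint8_t *out) {
      uint32_t acc = 0;
      int bits = 0;
      size_t o = 0;
      for (size_t i = 0; i < n; i++) {
          acc |= (uint32_t)(in[i] & 0x7F) << bits;  /* append 7 bits */
          bits += 7;
          while (bits >= 8) {                       /* flush whole bytes */
              out[o++] = acc & 0xFF;
              acc >>= 8;
              bits -= 8;
          }
      }
      if (bits)
          out[o++] = acc & 0xFF;                    /* final partial byte */
      return o;
  }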
<Shelwien> also, i've seen some weird LZ recently, with 4k window2019-05-04 21:58:11
<FunkyBob> erk2019-05-04 21:58:22
<Shelwien> it had shorter distance codes for values near 0 (that's normal), but also near window size2019-05-04 21:59:22
 i don't think 64k is different in that sense2019-05-04 22:00:01
<FunkyBob> I can also get a bit of an improvement by passing 16MB at a time... but, well...2019-05-04 22:00:46
 enwik8 goes down to 45,812,1812019-05-04 22:01:04
<Shelwien> also where's #define __COMP_H__ ?2019-05-04 22:03:16
<FunkyBob> oops2019-05-04 22:03:45
 my original reason for this project was to brush up on my C, so... :)2019-05-04 22:03:54
<Shelwien> In file included from main.c:11:2019-05-04 22:04:24
 basiclz.inc:120:10: error: conflicting types for 'compress'2019-05-04 22:04:24
  uint32_t compress(struct buffer *src, struct buffer *dest) {2019-05-04 22:04:24
  ^~~~~~~~2019-05-04 22:04:24
 In file included from main.c:9:2019-05-04 22:04:24
 comp.h:12:8: note: previous declaration of 'compress' was here2019-05-04 22:04:24
<FunkyBob> just pushed it.. forgot it was edited2019-05-04 22:04:32
<Shelwien> ok, compiled2019-05-04 22:05:00
 also use "rb", "wb" for fopen2019-05-04 22:05:38
 with "r" it won't work on windows2019-05-04 22:05:50
<FunkyBob> silly windows :P2019-05-04 22:07:01
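The "b" matters because Windows text mode translates "\r\n" to "\n" on read and treats a 0x1A byte as end-of-file, silently corrupting binary data; on POSIX systems "b" is a no-op, so it is safe to use unconditionally (paths here are placeholders):
  FILE *fin  = fopen(in_path,  "rb");   /* binary mode: no newline/EOF translation */
  FILE *fout = fopen(out_path, "wb");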
<Shelwien> got extra 5 bytes on decoding2019-05-04 22:07:13
<FunkyBob> ?2019-05-04 22:07:28
 you mean from not having rb/wb ?2019-05-04 22:07:57
<Shelwien> nope, it simply adds extra 5 zeroes at the end2019-05-04 22:13:10
<FunkyBob> hrm2019-05-04 22:16:19
 what size is the source file?2019-05-04 22:16:25
<Shelwien> 1463/55892019-05-04 22:17:13
<FunkyBob> can you send me the original file to test with, please?2019-05-04 22:18:41
<Shelwien> http://nishi.dreamhosters.com/u/lzfb_002.zip2019-05-04 22:21:09
 "original file" is lzfb.exe2019-05-04 22:21:31
<FunkyBob> your numbers don't match what I see2019-05-04 22:25:35
 24576 is how big lzfb.exe is, not 55892019-05-04 22:25:46
 that said, I get a segfault trying to decompress2019-05-04 22:26:00
 oh, no I don't... I get an assertion2019-05-04 22:26:18
<Shelwien> i meant source size of main.c and basiclz.c that i downloaded2019-05-04 22:27:02
<FunkyBob> ah2019-05-04 22:27:49
<Shelwien> http://nishi.dreamhosters.com/u/lzfb_ofs_e8.png2019-05-04 22:29:49
 http://nishi.dreamhosters.com/u/lzfb_len_e8.png2019-05-04 22:31:30
 http://nishi.dreamhosters.com/u/lzfb_lenlit_e8.png2019-05-04 22:32:33
 { 0, 0, 0, 0, 0, 0, 0, 0, 57437, 34432, 26618, 32872, 0, 0, 0, 0, 6520, 2970, 2463, 3949, 0, 0, 0, 0, 1839, 795, 757, 1156, 0, 0, 0, 0, 976,2019-05-04 22:34:10
 matchlen seems buggy?2019-05-04 22:34:30
<FunkyBob> hmm?2019-05-04 22:34:35
<Shelwien> that match len occurrence counts for enwik82019-05-04 22:35:22
<FunkyBob> ah2019-05-04 22:35:33
 ok, I'm going to go get some dinner, then head back to my hotel... and delve further into this :)2019-05-04 22:35:51
<Shelwien> its in decompress() in archive that i posted2019-05-04 22:35:59
 (counting)2019-05-04 22:36:01
<FunkyBob> back again2019-05-05 00:03:29
 Shelwien: are you saying the bug is in decompress?2019-05-05 00:06:25
<Shelwien> probably in compress?2019-05-05 00:21:45
 anyway, it seems like it doesn't use certain ranges of len values, which should make compression worse2019-05-05 00:22:32
 minlen=8 may be ok, but why skip 12-15 etc?2019-05-05 00:23:43
<FunkyBob> sorry?2019-05-05 00:25:59
<Shelwien> ?2019-05-05 00:26:06
<FunkyBob> skip 12 - 15?2019-05-05 00:26:07
 when am I doing that?2019-05-05 00:26:12
<Shelwien> do you see the { ... } table above?2019-05-05 00:26:25
 that's match len freqs for enwik82019-05-05 00:26:49
<FunkyBob> if it's not finding them, it's not finding them.2019-05-05 00:26:59
<Shelwien> nope, its enwik, not some binary structured file2019-05-05 00:27:14
 it can't have a 4-byte align2019-05-05 00:27:36
<FunkyBob> you can see my code, I have not tried to bias it to any particular lengths2019-05-05 00:28:17
<Shelwien> ok, lets see with lazy disabled - extra 5 bytes already disappeared btw2019-05-05 00:29:42
<FunkyBob> with greedy?2019-05-05 00:29:56
 or with a git pull?2019-05-05 00:30:00
<Shelwien> greedy2019-05-05 00:30:41
<FunkyBob> hrm2019-05-05 00:31:18
<Shelwien> but same align4 on enwik2019-05-05 00:31:34
<FunkyBob> | 0| L,4,2019-05-05 00:36:54
 < 1< L,21,2019-05-05 00:36:54
 > 1> L,128,2019-05-05 00:36:55
 ok... they lose sync almost immediately :/2019-05-05 00:37:03
<Shelwien> its ctzl -> ctzll2019-05-05 00:37:32
<FunkyBob> that'd do it2019-05-05 00:38:04
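The distinction: __builtin_ctzll takes unsigned long long (64 bits everywhere), while __builtin_ctzl takes unsigned long, which is only 32 bits on LLP64 Windows, so the top half of a 64-bit comparison word gets silently dropped there even though the same code works on LP64 Linux. A sketch of the usual match-extension trick with the wide variant (match_bytes8 is a hypothetical helper; little-endian assumed, caller must guarantee 8 readable bytes):
  #include <stdint.h>
  #include <string.h>

  /* Count how many of the next 8 bytes of a and b are equal. */
  static unsigned match_bytes8(const uint8_t *a, const uint8_t *b) {
      uint64_t x, y;
      memcpy(&x, a, 8);
      memcpy(&y, b, 8);
      uint64_t diff = x ^ y;
      if (diff == 0)
          return 8;                        /* all 8 bytes match */
      return __builtin_ctzll(diff) >> 3;   /* ctz is undefined for 0, hence the check */
  }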
 hrm... now my enwik test is failing :/2019-05-05 00:39:32
 (I really appreciate your help on this)2019-05-05 00:40:35
<Shelwien> { 0, 0, 0, 0, 69632, 62067, 53056, 44525, 35571, 26296, 19629, 14025, 10051, 7404, 5548, 4319, 3215,2019-05-05 00:41:03
 mine works2019-05-05 00:41:06
 need to fix minmatchlen to 4 i guess, or more maybe2019-05-05 00:41:39
<FunkyBob> oh, I thought I did... must've been in a different version2019-05-05 00:42:11
<Shelwien> ... 5, 7, 3, 1, 3, 1, 3, 2, 2, 3, 6, 1, 1, 1, 0, 2, 1, 552 }2019-05-05 00:42:28
 that's len=128 :)2019-05-05 00:42:42
<FunkyBob> mmm?2019-05-05 00:43:03
 oh2019-05-05 00:43:11
 yeah, I think I checked before on max len hits2019-05-05 00:43:31
 I checked how many extra bytes using an LZ4-ish "keep emitting 255..." scheme would take, it was ~3.6k2019-05-05 00:46:37
 that is, that many extra bytes in length counters...2019-05-05 00:46:46
 so, a lot of savings to be had2019-05-05 00:46:49
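For reference, the LZ4-style extension being costed out here: the length field saturates, and every saturated byte means "add the next byte and keep going", so lengths are open-ended at roughly one extra byte per 255. A decoder-side sketch assuming a plain one-byte length field (the real field widths in the project differ):
  /* Read an open-ended length: 255 means "add the next byte and continue". */
  uint32_t read_length(const uint8_t *buf, size_t *pos) {
      uint32_t len = buf[(*pos)++];
      if (len == 255) {
          uint8_t b;
          do {
              b = buf[(*pos)++];
              len += b;
          } while (b == 255);
      }
      return len;
  }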
<Shelwien> maybe you can make a special case for len<16+min, ofs<2562019-05-05 00:47:10
 like I111LLLL OOOOOOOO2019-05-05 00:47:42
 one byte shorter2019-05-05 00:47:48
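That is a two-byte short-match token: a flag bit, a 111 tag, four length bits, then a one-byte offset. An encoder-side sketch of the layout as drawn (MIN_MATCH_LEN is from the project; dptr and the exact tag value are assumptions):
  /* Short match: len in [MIN_MATCH_LEN, MIN_MATCH_LEN+15], ofs in [1,255].
     Token layout: I111LLLL OOOOOOOO - two bytes instead of three. */
  if (len < MIN_MATCH_LEN + 16 && ofs < 256) {
      dest->data[dptr++] = 0xF0 | (uint8_t)(len - MIN_MATCH_LEN);
      dest->data[dptr++] = (uint8_t)ofs;
  }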
<FunkyBob> well, I'll just commit the fixes we have so far :)2019-05-05 00:48:35
 odd... compression got worse: 471469002019-05-05 00:48:59
<Shelwien> lazy is 45,877,684 here2019-05-05 00:49:26
<FunkyBob> did you set MIN_MATCH_LEN to 4 ?2019-05-05 00:50:16
<Shelwien> no2019-05-05 00:50:36
 let's see2019-05-05 00:50:38
<FunkyBob> ah, but that also requires changing find_match to test for len >= not len >2019-05-05 00:51:00
<Shelwien> 47,146,9242019-05-05 00:51:39
 hm2019-05-05 00:51:53
<FunkyBob> and now it's way slower :/2019-05-05 00:52:52
 oh duh..2019-05-05 00:53:05
 ignore the slower git :)2019-05-05 00:53:12
<Shelwien> 45,877,524 with min 42019-05-05 00:55:15
<FunkyBob> 458775002019-05-05 00:56:47
 I'm sure I checked which version of ctzl it was emitting, that it was the 64-bit version :/2019-05-05 00:58:18
<Shelwien> maybe on linux it works differently, dunno2019-05-05 00:58:52
<FunkyBob> I've added a debug mode that will print out CSV lines of either "L,{len}," or "M,{len},{offset}"2019-05-05 01:00:00
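i.e. one trace line per token, so two builds can be diffed token-by-token (as in the sdiff output above). Something along these lines, with a hypothetical flag name:
  #ifdef DEBUG_TRACE
      if (is_literal)
          fprintf(stderr, "L,%u,\n", len);
      else
          fprintf(stderr, "M,%u,%u\n", len, offset);
  #endif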
<Shelwien> wanna test how lzma parsing works with your format?2019-05-05 01:00:55
<FunkyBob> but, yeah, perhaps objdump was confused, or I read the docs wrong, because as your stats showed, it wasn't getting good match lengths2019-05-05 01:01:11
 umm... sure?2019-05-05 01:01:22
<Shelwien> http://nishi.dreamhosters.com/u/lzma_delrep_v1.rar2019-05-05 01:01:29
 encode a file with lzma -d162019-05-05 01:02:09
 remove rep codes2019-05-05 01:02:20
 then convert to your format2019-05-05 01:02:30
<FunkyBob> blah... lazy still gets the wrong size2019-05-05 01:08:29
 ok, I see the bug2019-05-05 01:11:20
 oh, nope2019-05-05 01:11:45
<unic0rn> someone's having fun i see2019-05-05 01:20:17
<FunkyBob> :)2019-05-05 01:22:23
<unic0rn> out of curiosity, what's your memory usage and compression time for enwik8?2019-05-05 01:24:32
<FunkyBob> memory usage I haven't measured, but it'd mostly be static... input buffer, output buffer, a 64k x 32bit chain head hash table, and MAX_FRAME_SIZE x 32bit chain links buffer2019-05-05 01:26:22
 MAX_FRAME_BUFFER is currently 4MB2019-05-05 01:26:31
 so... 16MB + 256k for tables,2019-05-05 01:27:03
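Those two tables are a standard hash-chain match finder: a head table indexed by a hash of the next few bytes, and a link table threading each frame position to the previous position with the same hash. A sketch at the sizes quoted (hash4 and the constants are illustrative, not the project's actual code):
  #include <stdint.h>
  #include <string.h>

  #define HASH_BITS 16                        /* 64K-entry head table   */
  #define MAX_FRAME_SIZE (4u << 20)           /* 4MB frame              */

  static uint32_t head[1 << HASH_BITS];       /* 64K x 32 bits = 256KB  */
  static uint32_t chain[MAX_FRAME_SIZE];      /* 4M  x 32 bits = 16MB   */

  static uint32_t hash4(const uint8_t *p) {   /* hash the next 4 bytes  */
      uint32_t v;
      memcpy(&v, p, 4);
      return (v * 2654435761u) >> (32 - HASH_BITS);
  }

  /* Insert position pos: its chain link points at the previous position
     with the same hash, so candidates are walked via chain[] at match time. */
  static void insert_pos(const uint8_t *buf, uint32_t pos) {
      uint32_t h = hash4(buf + pos);
      chain[pos] = head[h];
      head[h] = pos;
  }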
 M,131,13722019-05-05 01:29:12
 M,131,52019-05-05 01:29:12
 M,24,52019-05-05 01:29:12
 now that's interesting... that's the last 3 actions for the file that's 5 bytes over size2019-05-05 01:29:24
<unic0rn> so it's small and fast2019-05-05 01:30:29
<FunkyBob> so it seems2019-05-05 01:30:48
<unic0rn> not interested in higher ratio?2019-05-05 01:31:34
<Shelwien> ok, i got it to correctly decode book1 from lzma parsing2019-05-05 01:31:36
 worse compression on book1 :)2019-05-05 01:32:11
<FunkyBob> unic0rn: in time...2019-05-05 01:32:36
 unic0rn: this is mostly an exercise in refreshing my C skills :)2019-05-05 01:32:49
 once I debug this lazy parsing bug, I might move onto something that compresses better2019-05-05 01:33:07
<Shelwien> 768,771 BOOK12019-05-05 01:35:04
 284,257 book1.lzma // lzma.exe e BOOK1 book1.lzma -d16 -fb273 -mc999 -lc0 -lp0 -pb0 -mt1 2019-05-05 01:35:05
 840,823 book1.dec // lzma tokens w/o entropy coding2019-05-05 01:35:05
 848,751 book1_norep.dec // delrep_v0 to leave only matches and literals2019-05-05 01:35:05
 394,403 0.lzfb // lzfb with minmatch=22019-05-05 01:35:05
 397,409 book1_norep.lzfb // conversion result2019-05-05 01:35:05
<unic0rn> hah. i'm going a different route. all work in progress, decompression isn't even started yet, but will be simple, compression is work in progress, moving forward with it while optimizing stuff on the fly as needed. with current buffers it eats up 150mb ram, speed isn't great but there's a lot of headroom for improvement, as for ratio... we'll see. can't guess yet, gotta decide on a few algorithm details first2019-05-05 01:35:26
<Shelwien> decoding is important :)2019-05-05 01:35:58
<unic0rn> to sort out bugs, yeah.2019-05-05 01:36:26
<FunkyBob> decoding is essential :)2019-05-05 01:36:37
<unic0rn> but on its own, how hard can it be to code a damn huffman decoder2019-05-05 01:36:43
<Shelwien> depending on speed optimization2019-05-05 01:37:04
*** NCDR has left the channel2019-05-05 01:37:19
<unic0rn> i mean, sure it's mandatory, but compared to compression, complexity is close to 02019-05-05 01:37:24
<FunkyBob> As my dad said - "You've turned the avocado into guacamole... now can you turn the guacamole back into avocado?"2019-05-05 01:37:29
<Shelwien> https://encode.ru/threads/1183-Huffman-code-generator :)2019-05-05 01:37:47
 well, i did that with deflate->lzma before2019-05-05 01:38:19
<unic0rn> lol2019-05-05 01:39:03
 that seems bloated for no reason ;)2019-05-05 01:39:09
<Shelwien> lzma doesn't have literal runs2019-05-05 01:39:24
 i converted them, but it didn't optimize for that2019-05-05 01:39:43
<unic0rn> and on more serious note, huffman isn't a problem. "what do i encode" is ;)2019-05-05 01:41:14
<Shelwien> https://encode.ru/threads/1288-LZMA-markup-tool?p=25481&viewfull=1#post254812019-05-05 01:41:30
 btw, its possible to get better compression than huffman with almost the same decoder2019-05-05 01:43:10
 well, FSE, but aside from that2019-05-05 01:43:27
<FunkyBob> yeah, was thinking I might try an entropy codec next2019-05-05 01:44:18
<Shelwien> these are fast: https://encode.ru/threads/3109-How-to-build-Bonfield-s-rANS-coders-on-windows2019-05-05 01:45:16
<unic0rn> there's no telling how my "huffman" will work. i certainly won't be generating the tree exactly like the standard variant does2019-05-05 01:45:38
<FunkyBob> yeah, I think I've mostly got my head around how to build a tANS table2019-05-05 01:45:41
<Shelwien> then it won't be huffman anymore? :)2019-05-05 01:47:36
<FunkyBob> :)2019-05-05 01:47:57
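For reference, a minimal sketch of the usual tANS decode-table construction (the FSE-style symbol spread; nothing here is the project's actual code, and freq[] is assumed already normalized to sum to 1<<R):
  #include <stdint.h>

  #define R 12
  #define TSIZE (1 << R)

  typedef struct { uint8_t sym; uint8_t nbits; uint16_t base; } DEntry;

  /* freq[s] must sum to TSIZE; nsym <= 256. */
  void build_decode_table(const uint16_t *freq, int nsym, DEntry *dt) {
      uint8_t spread[TSIZE];
      int step = (TSIZE >> 1) + (TSIZE >> 3) + 3, pos = 0;
      for (int s = 0; s < nsym; s++)            /* scatter each symbol's slots */
          for (int i = 0; i < freq[s]; i++) {
              spread[pos] = (uint8_t)s;
              pos = (pos + step) & (TSIZE - 1);
          }
      uint32_t next[256];
      for (int s = 0; s < nsym; s++)
          next[s] = freq[s];                    /* x runs freq[s]..2*freq[s]-1 */
      for (int t = 0; t < TSIZE; t++) {
          int s = spread[t];
          uint32_t x = next[s]++;
          int nbits = R - (31 - __builtin_clz(x));  /* R - floor(log2 x) */
          dt[t].sym   = (uint8_t)s;
          dt[t].nbits = (uint8_t)nbits;
          dt[t].base  = (uint16_t)((x << nbits) - TSIZE);
      }
  }

  /* Decode step, keeping t = state - TSIZE in [0, TSIZE):
     sym = dt[t].sym;  t = dt[t].base + read_bits(dt[t].nbits); */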
 so, it seems the len fixes and the MinMatch = 4 ... as well as no longer special casing the first 4 bytes... have improved my silesia scores by up to 600-ish bytes at times2019-05-05 01:48:30
 now to test enwik92019-05-05 01:49:02
 Shelwien: book1 is ... Calgary corpus? or Canterbury?2019-05-05 01:51:29
<Shelwien> calgary2019-05-05 01:51:37
 there's also this: http://ctxmodel.net/sh_samples_1.rar2019-05-05 01:53:01
 it has russian texts, finnish dictionary and some binary files2019-05-05 01:53:54
 gets weird results from some "optimized" compressors2019-05-05 01:54:37
<FunkyBob> heh2019-05-05 01:55:07
 ok, well, I'm tired... thanks for the help, Shelwien 2019-05-05 02:07:18
 will talk tomorrow, I hope2019-05-05 02:07:24
<Shelwien> ok :)2019-05-05 02:07:38
<FunkyBob> still not sure why that extra 5 bytes is happening2019-05-05 02:07:58
<Shelwien> probably match comparison after buffer2019-05-05 02:08:53
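i.e. the match extender likely compares past the end of the input, counting phantom bytes that then get emitted on decode. Clamping the candidate length to the bytes remaining is the usual fix; a sketch in the style of the quoted struct buffer code (the len field name is an assumption):
  /* Never extend a match past the end of the source buffer. */
  uint32_t match_len(const struct buffer *src, uint32_t prev, uint32_t cur,
                     uint32_t max_len) {
      uint32_t remain = src->len - cur;     /* bytes left at the current pos */
      if (max_len > remain)
          max_len = remain;
      uint32_t n = 0;
      while (n < max_len && src->data[prev + n] == src->data[cur + n])
          n++;
      return n;
  }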
<unic0rn> or they just have jet lag2019-05-05 02:48:32
<Shelwien> !next2019-05-05 03:14:56