*** NCDR has left the channel2019-05-04 14:54:09
<FunkyBob> o/2019-05-04 17:21:21
*** NCDR has joined the channel2019-05-04 20:08:15
<Shelwien> 3. LZ4 supports match length >130. 2019-05-04 20:08:17
 4. you can skip zero distance2019-05-04 20:09:04
 5. you can add rep-matches: a flag to skip repeated distance value2019-05-04 20:11:12
 6. literal flag is not always bad. you can encode single literals without length2019-05-04 20:12:55
<FunkyBob> skip zero distances?2019-05-04 21:04:08
 you mean zero length literal or matches?2019-05-04 21:05:06
<Shelwien> i mean distances2019-05-04 21:54:48
  uint16_t offset = src->data[sptr++];2019-05-04 21:55:08
  offset |= src->data[sptr++] << 8;2019-05-04 21:55:08
 what's the meaning of offset=0 here?2019-05-04 21:55:34
<FunkyBob> ah2019-05-04 21:56:53
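A distance of 0 would mean the match starts at the current position, which is impossible, so that code point is free; Shelwien's points 4 and 5 amount to reusing it as a rep-match. A minimal decoder-side sketch building on the two quoted lines (last_offset is a hypothetical variable carried across tokens):
  uint16_t offset = src->data[sptr++];
  offset |= src->data[sptr++] << 8;
  if (offset == 0)
      offset = last_offset;   /* rep-match: repeat the previous distance */
  else
      last_offset = offset;   /* remember it for the next rep-match */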
<Shelwien> btw, an interesting option would be to encode 7-bit literals2019-05-04 21:56:59
 at least for enwik :)2019-05-04 21:57:06
<FunkyBob> :P2019-05-04 21:57:25
 am trying to stick to byte aligned2019-05-04 21:58:11
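Shelwien's 7-bit aside works because enwik8 is plain ASCII, so the high bit of every literal byte is zero and eight literals fit in seven bytes. A sketch of such a packer, assuming ASCII-only input (pack7 is a hypothetical helper); it breaks byte alignment, hence the reluctance above:
  #include <stdint.h>
  #include <stddef.h>

  /* Pack n ASCII bytes (high bit clear) into ceil(n*7/8) bytes, LSB-first. */
  size_t pack7(const uint8_t *in, size_t n, uint8_t *out) {
      uint32_t acc = 0;
      int bits = 0;
      size_t o = 0;
      for (size_t i = 0; i < n; i++) {
          acc |= (uint32_t)(in[i] & 0x7F) << bits;  /* append 7 bits */
          bits += 7;
          while (bits >= 8) {                       /* flush whole bytes */
              out[o++] = acc & 0xFF;
              acc >>= 8;
              bits -= 8;
          }
      }
      if (bits)
          out[o++] = acc & 0xFF;                    /* final partial byte */
      return o;
  }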
<Shelwien> also, i've seen some weird LZ recently, with 4k window2019-05-04 21:58:11
<FunkyBob> erk2019-05-04 21:58:22
<Shelwien> it had shorter distance codes for values near 0 (that's normal), but also near window size2019-05-04 21:59:22
 i don't think 64k is different in that sense2019-05-04 22:00:01
<FunkyBob> I can also get a bit of an improvement by passing 16MB at a time... but, well...2019-05-04 22:00:46
 enwik8 goes down to 45,812,1812019-05-04 22:01:04
<Shelwien> also where's #define __COMP_H__ ?2019-05-04 22:03:16
<FunkyBob> oops2019-05-04 22:03:45
 my original reason for this project was to brush up on my C, so... :)2019-05-04 22:03:54
<Shelwien> In file included from main.c:11:2019-05-04 22:04:24
 basiclz.inc:120:10: error: conflicting types for 'compress'2019-05-04 22:04:24
  uint32_t compress(struct buffer *src, struct buffer *dest) {2019-05-04 22:04:24
  ^~~~~~~~2019-05-04 22:04:24
 In file included from main.c:9:2019-05-04 22:04:24
 comp.h:12:8: note: previous declaration of 'compress' was here2019-05-04 22:04:24
<FunkyBob> just pushed it.. forgot it was edited2019-05-04 22:04:32
<Shelwien> ok, compiled2019-05-04 22:05:00
 also use "rb", "wb" for fopen2019-05-04 22:05:38
 with "r" it won't work on windows2019-05-04 22:05:50
<FunkyBob> silly windows :P2019-05-04 22:07:01
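The "b" matters because Windows text mode translates "\r\n" to "\n" on read and treats a 0x1A byte as end-of-file, silently corrupting binary data; on POSIX systems "b" is a no-op, so it is safe to use unconditionally (paths here are placeholders):
  FILE *fin  = fopen(in_path,  "rb");   /* binary mode: no newline/EOF translation */
  FILE *fout = fopen(out_path, "wb");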
<Shelwien> got extra 5 bytes on decoding2019-05-04 22:07:13
<FunkyBob> ?2019-05-04 22:07:28
 you mean from not having rb/wb ?2019-05-04 22:07:57
<Shelwien> nope, it simply adds extra 5 zeroes at the end2019-05-04 22:13:10
<FunkyBob> hrm2019-05-04 22:16:19
 what size is the source file?2019-05-04 22:16:25
<Shelwien> 1463/55892019-05-04 22:17:13
<FunkyBob> can you send me the original file to test with, please?2019-05-04 22:18:41
<Shelwien> http://nishi.dreamhosters.com/u/lzfb_002.zip2019-05-04 22:21:09
 "original file" is lzfb.exe2019-05-04 22:21:31
<FunkyBob> your numbers don't match what I see2019-05-04 22:25:35
 24576 is how big lzfb.exe is, not 55892019-05-04 22:25:46
 that said, I get a segfault trying to decompress2019-05-04 22:26:00
 oh, no I don't... I get an assertion2019-05-04 22:26:18
<Shelwien> i meant source size of main.c and basiclz.c that i downloaded2019-05-04 22:27:02
<FunkyBob> ah2019-05-04 22:27:49
<Shelwien> http://nishi.dreamhosters.com/u/lzfb_ofs_e8.png2019-05-04 22:29:49
 http://nishi.dreamhosters.com/u/lzfb_len_e8.png2019-05-04 22:31:30
 http://nishi.dreamhosters.com/u/lzfb_lenlit_e8.png2019-05-04 22:32:33
 { 0, 0, 0, 0, 0, 0, 0, 0, 57437, 34432, 26618, 32872, 0, 0, 0, 0, 6520, 2970, 2463, 3949, 0, 0, 0, 0, 1839, 795, 757, 1156, 0, 0, 0, 0, 976,2019-05-04 22:34:10
 matchlen seems buggy?2019-05-04 22:34:30
<FunkyBob> hmm?2019-05-04 22:34:35
<Shelwien> that match len occurrence counts for enwik82019-05-04 22:35:22
<FunkyBob> ah2019-05-04 22:35:33
 ok, I'm going to go get some dinner, then head back to my hotel... and delve further into this :)2019-05-04 22:35:51
<Shelwien> its in decompress() in archive that i posted2019-05-04 22:35:59
 (counting)2019-05-04 22:36:01
<FunkyBob> back again2019-05-05 00:03:29
 Shelwien: are you saying the bug is in decompress?2019-05-05 00:06:25
<Shelwien> probably in compress?2019-05-05 00:21:45
 anyway, it seems like it doesn't use certain ranges of len values, which should make compression worse2019-05-05 00:22:32
 minlen=8 may be ok, but why skip 12-15 etc?2019-05-05 00:23:43
<FunkyBob> sorry?2019-05-05 00:25:59
<Shelwien> ?2019-05-05 00:26:06
<FunkyBob> skip 12 - 15?2019-05-05 00:26:07
 when am I doing that?2019-05-05 00:26:12
<Shelwien> do you see the { ... } table above?2019-05-05 00:26:25
 that's match len freqs for enwik82019-05-05 00:26:49
<FunkyBob> if it's not finding them, it's not finding them.2019-05-05 00:26:59
<Shelwien> nope, its enwik, not some binary structured file2019-05-05 00:27:14
 it can't have a 4-byte align2019-05-05 00:27:36
<FunkyBob> you can see my code, I have not tried to bias it to any particular lengths2019-05-05 00:28:17
<Shelwien> ok, lets see with lazy disabled - extra 5 bytes already disappeared btw2019-05-05 00:29:42
<FunkyBob> with greedy?2019-05-05 00:29:56
 or with a git pull?2019-05-05 00:30:00
<Shelwien> greedy2019-05-05 00:30:41
<FunkyBob> hrm2019-05-05 00:31:18
<Shelwien> but same align4 on enwik2019-05-05 00:31:34
<FunkyBob> | 0| L,4,2019-05-05 00:36:54
 < 1< L,21,2019-05-05 00:36:54
 > 1> L,128,2019-05-05 00:36:55
 ok... they lose sync almost immediately :/2019-05-05 00:37:03
<Shelwien> its ctzl -> ctzll2019-05-05 00:37:32
<FunkyBob> that'd do it2019-05-05 00:38:04
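The distinction: __builtin_ctzll takes unsigned long long (64 bits everywhere), while __builtin_ctzl takes unsigned long, which is only 32 bits on LLP64 Windows, so the top half of a 64-bit comparison word gets silently dropped there even though the same code works on LP64 Linux. A sketch of the usual match-extension trick with the wide variant (match_bytes8 is a hypothetical helper; little-endian assumed, caller must guarantee 8 readable bytes):
  #include <stdint.h>
  #include <string.h>

  /* Count how many of the next 8 bytes of a and b are equal. */
  static unsigned match_bytes8(const uint8_t *a, const uint8_t *b) {
      uint64_t x, y;
      memcpy(&x, a, 8);
      memcpy(&y, b, 8);
      uint64_t diff = x ^ y;
      if (diff == 0)
          return 8;                        /* all 8 bytes match */
      return __builtin_ctzll(diff) >> 3;   /* ctz is undefined for 0, hence the check */
  }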
 hrm... now my enwik test is failing :/2019-05-05 00:39:32
 (I really appreciate your help on this)2019-05-05 00:40:35
<Shelwien> { 0, 0, 0, 0, 69632, 62067, 53056, 44525, 35571, 26296, 19629, 14025, 10051, 7404, 5548, 4319, 3215,2019-05-05 00:41:03
 mine works2019-05-05 00:41:06
 need to fix minmatchlen to 4 i guess, or more maybe2019-05-05 00:41:39
<FunkyBob> oh, I thought I did... must've been in a different version2019-05-05 00:42:11
<Shelwien> ... 5, 7, 3, 1, 3, 1, 3, 2, 2, 3, 6, 1, 1, 1, 0, 2, 1, 552 }2019-05-05 00:42:28
 that's len=128 :)2019-05-05 00:42:42
<FunkyBob> mmm?2019-05-05 00:43:03
 oh2019-05-05 00:43:11
 yeah, I think I checked before on max len hits2019-05-05 00:43:31
 I checked how many extra bytes using an LZ4-ish "keep emitting 255..." scheme would take, it was ~3.6k2019-05-05 00:46:37
 that is, that many extra bytes in length counters...2019-05-05 00:46:46
 so, a lot of savings to be had2019-05-05 00:46:49
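For reference, the LZ4-style extension being costed out here: the length field saturates, and every saturated byte means "add the next byte and keep going", so lengths are open-ended at roughly one extra byte per 255. A decoder-side sketch assuming a plain one-byte length field (the real field widths in the project differ):
  /* Read an open-ended length: 255 means "add the next byte and continue". */
  uint32_t read_length(const uint8_t *buf, size_t *pos) {
      uint32_t len = buf[(*pos)++];
      if (len == 255) {
          uint8_t b;
          do {
              b = buf[(*pos)++];
              len += b;
          } while (b == 255);
      }
      return len;
  }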
<Shelwien> maybe you can make a special case for len<16+min, ofs<2562019-05-05 00:47:10
 like I111LLLL OOOOOOOO2019-05-05 00:47:42
 one byte shorter2019-05-05 00:47:48
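That is a two-byte short-match token: a flag bit, a 111 tag, four length bits, then a one-byte offset. An encoder-side sketch of the layout as drawn (MIN_MATCH_LEN is from the project; dptr and the exact tag value are assumptions):
  /* Short match: len in [MIN_MATCH_LEN, MIN_MATCH_LEN+15], ofs in [1,255].
     Token layout: I111LLLL OOOOOOOO - two bytes instead of three. */
  if (len < MIN_MATCH_LEN + 16 && ofs < 256) {
      dest->data[dptr++] = 0xF0 | (uint8_t)(len - MIN_MATCH_LEN);
      dest->data[dptr++] = (uint8_t)ofs;
  }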
<FunkyBob> well, I'll just commit the fixes we have so far :)2019-05-05 00:48:35
 odd... compression got worse: 471469002019-05-05 00:48:59
<Shelwien> lazy is 45,877,684 here2019-05-05 00:49:26
<FunkyBob> did you set MIN_MATCH_LEN to 4 ?2019-05-05 00:50:16
<Shelwien> no2019-05-05 00:50:36
 let's see2019-05-05 00:50:38
<FunkyBob> ah, but that also requires changing find_match to test for len >= not len >2019-05-05 00:51:00
<Shelwien> 47,146,9242019-05-05 00:51:39
 hm2019-05-05 00:51:53
<FunkyBob> and now it's way slower :/2019-05-05 00:52:52
 oh duh..2019-05-05 00:53:05
 ignore the slower git :)2019-05-05 00:53:12
<Shelwien> 45,877,524 with min 42019-05-05 00:55:15
<FunkyBob> 458775002019-05-05 00:56:47
 I'm sure I checked which version of ctzl it was emitting, that it was the 64-bit version :/2019-05-05 00:58:18
<Shelwien> maybe on linux it works differently, dunno2019-05-05 00:58:52
<FunkyBob> I've added a debug mode that will print out CSV lines of either "L,{len}," or "M,{len},{offset}"2019-05-05 01:00:00
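i.e. one trace line per token, so two builds can be diffed token-by-token (as in the sdiff output above). Something along these lines, with a hypothetical flag name:
  #ifdef DEBUG_TRACE
      if (is_literal)
          fprintf(stderr, "L,%u,\n", len);
      else
          fprintf(stderr, "M,%u,%u\n", len, offset);
  #endif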
<Shelwien> wanna test how lzma parsing works with your format?2019-05-05 01:00:55
<FunkyBob> but, yeah, perhaps objdump was confused, or I read the docs wrong, because as your stats showed, it wasn't getting good match lengths2019-05-05 01:01:11
 umm... sure?2019-05-05 01:01:22
<Shelwien> http://nishi.dreamhosters.com/u/lzma_delrep_v1.rar2019-05-05 01:01:29
 encode a file with lzma -d162019-05-05 01:02:09
 remove rep codes2019-05-05 01:02:20
 then convert to your format2019-05-05 01:02:30
<FunkyBob> blah... lazy still gets the wrong size2019-05-05 01:08:29
 ok, I see the bug2019-05-05 01:11:20
 oh, nope2019-05-05 01:11:45
<unic0rn> someone's having fun i see2019-05-05 01:20:17
<FunkyBob> :)2019-05-05 01:22:23
<unic0rn> out of curiosity, what's your memory usage and compression time for enwik8?2019-05-05 01:24:32
<FunkyBob> memory usage I haven't measured, but it'd mostly be static... input buffer, output buffer, a 64k x 32bit chain head hash table, and MAX_FRAME_SIZE x 32bit chain links buffer2019-05-05 01:26:22
 MAX_FRAME_BUFFER is currently 4MB2019-05-05 01:26:31
 so... 16MB + 256k for tables,2019-05-05 01:27:03
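Those two tables are a standard hash-chain match finder: a head table indexed by a hash of the next few bytes, and a link table threading each frame position to the previous position with the same hash. A sketch at the sizes quoted (hash4 and the constants are illustrative, not the project's actual code):
  #include <stdint.h>
  #include <string.h>

  #define HASH_BITS 16                        /* 64K-entry head table   */
  #define MAX_FRAME_SIZE (4u << 20)           /* 4MB frame              */

  static uint32_t head[1 << HASH_BITS];       /* 64K x 32 bits = 256KB  */
  static uint32_t chain[MAX_FRAME_SIZE];      /* 4M  x 32 bits = 16MB   */

  static uint32_t hash4(const uint8_t *p) {   /* hash the next 4 bytes  */
      uint32_t v;
      memcpy(&v, p, 4);
      return (v * 2654435761u) >> (32 - HASH_BITS);
  }

  /* Insert position pos: its chain link points at the previous position
     with the same hash, so candidates are walked via chain[] at match time. */
  static void insert_pos(const uint8_t *buf, uint32_t pos) {
      uint32_t h = hash4(buf + pos);
      chain[pos] = head[h];
      head[h] = pos;
  }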
 M,131,13722019-05-05 01:29:12
 M,131,52019-05-05 01:29:12
 M,24,52019-05-05 01:29:12
 now that's interesting... that's the last 3 actions for the file that's 5 bytes over size2019-05-05 01:29:24
<unic0rn> so it's small and fast2019-05-05 01:30:29
<FunkyBob> so it seems2019-05-05 01:30:48
<unic0rn> not interested in higher ratio?2019-05-05 01:31:34
<Shelwien> ok, i got it to correctly decode book1 from lzma parsing2019-05-05 01:31:36
 worse compression on book1 :)2019-05-05 01:32:11
<FunkyBob> unic0rn: in time...2019-05-05 01:32:36
 unic0rn: this is mostly an exercise in refreshing my C skills :)2019-05-05 01:32:49
 once I debug this lazy parsing bug, I might move onto something that compresses better2019-05-05 01:33:07
<Shelwien> 768,771 BOOK12019-05-05 01:35:04
 284,257 book1.lzma // lzma.exe e BOOK1 book1.lzma -d16 -fb273 -mc999 -lc0 -lp0 -pb0 -mt1 2019-05-05 01:35:05
 840,823 book1.dec // lzma tokens w/o entropy coding2019-05-05 01:35:05
 848,751 book1_norep.dec // delrep_v0 to leave only matches and literals2019-05-05 01:35:05
 394,403 0.lzfb // lzfb with minmatch=22019-05-05 01:35:05
 397,409 book1_norep.lzfb // conversion result2019-05-05 01:35:05
<unic0rn> hah. i'm going a different route. all work in progress, decompression isn't even started yet, but will be simple, compression is work in progress, moving forward with it while optimizing stuff on the fly as needed. with current buffers it eats up 150mb ram, speed isn't great but there's a lot of headroom for improvement, as for ratio... we'll see. can't guess yet, gotta decide on a few algorithm details first2019-05-05 01:35:26
<Shelwien> decoding is important :)2019-05-05 01:35:58
<unic0rn> to sort out bugs, yeah.2019-05-05 01:36:26
<FunkyBob> decoding is essential :)2019-05-05 01:36:37
<unic0rn> but on its own, how hard can it be to code a damn huffman decoder2019-05-05 01:36:43
<Shelwien> depending on speed optimization2019-05-05 01:37:04
*** NCDR has left the channel2019-05-05 01:37:19
<unic0rn> i mean, sure it's mandatory, but compared to compression, complexity is close to 02019-05-05 01:37:24
<FunkyBob> As my dad said - "You've turned the avocado into guacamole... now can you turn the guacamole back into avocado?"2019-05-05 01:37:29
<Shelwien> https://encode.ru/threads/1183-Huffman-code-generator :)2019-05-05 01:37:47
 well, i did that with deflate->lzma before2019-05-05 01:38:19
<unic0rn> lol2019-05-05 01:39:03
 that seems bloated for no reason ;)2019-05-05 01:39:09
<Shelwien> lzma doesn't have literal runs2019-05-05 01:39:24
 i converted them, but it didn't optimize for that2019-05-05 01:39:43
<unic0rn> and on more serious note, huffman isn't a problem. "what do i encode" is ;)2019-05-05 01:41:14
<Shelwien> https://encode.ru/threads/1288-LZMA-markup-tool?p=25481&viewfull=1#post254812019-05-05 01:41:30
 btw, its possible to get better compression than huffman with almost the same decoder2019-05-05 01:43:10
 well, FSE, but aside from that2019-05-05 01:43:27
<FunkyBob> yeah, was thinking I might try an entropy codec next2019-05-05 01:44:18
<Shelwien> these are fast: https://encode.ru/threads/3109-How-to-build-Bonfield-s-rANS-coders-on-windows2019-05-05 01:45:16
<unic0rn> there's no telling how my "huffman" will work. i certainly won't be generating the tree exactly like the standard variant does2019-05-05 01:45:38
<FunkyBob> yeah, I think I've mostly got my head around how to build a tANS table2019-05-05 01:45:41
<Shelwien> then it won't be huffman anymore? :)2019-05-05 01:47:36
<FunkyBob> :)2019-05-05 01:47:57
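For reference, a minimal sketch of the usual tANS decode-table construction (the FSE-style symbol spread; nothing here is the project's actual code, and freq[] is assumed already normalized to sum to 1<<R):
  #include <stdint.h>

  #define R 12
  #define TSIZE (1 << R)

  typedef struct { uint8_t sym; uint8_t nbits; uint16_t base; } DEntry;

  /* freq[s] must sum to TSIZE; nsym <= 256. */
  void build_decode_table(const uint16_t *freq, int nsym, DEntry *dt) {
      uint8_t spread[TSIZE];
      int step = (TSIZE >> 1) + (TSIZE >> 3) + 3, pos = 0;
      for (int s = 0; s < nsym; s++)            /* scatter each symbol's slots */
          for (int i = 0; i < freq[s]; i++) {
              spread[pos] = (uint8_t)s;
              pos = (pos + step) & (TSIZE - 1);
          }
      uint32_t next[256];
      for (int s = 0; s < nsym; s++)
          next[s] = freq[s];                    /* x runs freq[s]..2*freq[s]-1 */
      for (int t = 0; t < TSIZE; t++) {
          int s = spread[t];
          uint32_t x = next[s]++;
          int nbits = R - (31 - __builtin_clz(x));  /* R - floor(log2 x) */
          dt[t].sym   = (uint8_t)s;
          dt[t].nbits = (uint8_t)nbits;
          dt[t].base  = (uint16_t)((x << nbits) - TSIZE);
      }
  }

  /* Decode step, keeping t = state - TSIZE in [0, TSIZE):
     sym = dt[t].sym;  t = dt[t].base + read_bits(dt[t].nbits); */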
 so, it seems the len fixes and the MinMatch = 4 ... as well as no longer special casing the first 4 bytes... have improved my silesia scores by up to 600-ish bytes at times2019-05-05 01:48:30
 now to test enwik92019-05-05 01:49:02
 Shelwien: book1 is ... Calgary corpus? or Canterbury?2019-05-05 01:51:29
<Shelwien> calgary2019-05-05 01:51:37
 there's also this: http://ctxmodel.net/sh_samples_1.rar2019-05-05 01:53:01
 it has russian texts, finnish dictionary and some binary files2019-05-05 01:53:54
 gets weird results from some "optimized" compressors2019-05-05 01:54:37
<FunkyBob> heh2019-05-05 01:55:07
 ok, well, I'm tired... thanks for the help, Shelwien 2019-05-05 02:07:18
 will talk tomorrow, I hope2019-05-05 02:07:24
<Shelwien> ok :)2019-05-05 02:07:38
<FunkyBob> still not sure why that extra 5 bytes is happening2019-05-05 02:07:58
<Shelwien> probably match comparison after buffer2019-05-05 02:08:53
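i.e. the match extender likely compares past the end of the input, counting phantom bytes that then get emitted on decode. Clamping the candidate length to the bytes remaining is the usual fix; a sketch in the style of the quoted struct buffer code (the len field name is an assumption):
  /* Never extend a match past the end of the source buffer. */
  uint32_t match_len(const struct buffer *src, uint32_t prev, uint32_t cur,
                     uint32_t max_len) {
      uint32_t remain = src->len - cur;     /* bytes left at the current pos */
      if (max_len > remain)
          max_len = remain;
      uint32_t n = 0;
      while (n < max_len && src->data[prev + n] == src->data[cur + n])
          n++;
      return n;
  }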
<unic0rn> or they just have jet lag2019-05-05 02:48:32
<Shelwien> !next2019-05-05 03:14:56