<Shelwien> test2019-04-29 22:58:36
<FunkyBob> o/2019-04-30 00:51:48
 yay, people came :)2019-04-30 00:52:07
<Shelwien> ...and went :)2019-04-30 01:48:01
 but we have bots!2019-04-30 01:48:17
<FunkyBob> :)2019-04-30 01:49:10
<Shelwien> btw, what do you think about fuzzy matches and delta matches (like bsdiff)?2019-04-30 01:50:20
 i think these would allow to make something interesting even without entropy coding2019-04-30 01:50:42
 ...and another thing is secondary LZ2019-04-30 01:51:02
<FunkyBob> delta matches?2019-04-30 01:51:53
 I've been considering doing some sort of entropy-lowering matching2019-04-30 01:52:07
<Shelwien> see how bsdiff works2019-04-30 01:52:18
 http://nishi.dreamhosters.com/u/bsdiff_sh3.rar2019-04-30 01:52:29
 its basically LZ2019-04-30 01:52:42
 although its matchfinder is based on BWT SA2019-04-30 01:52:54
 and instead of replacing strings with references2019-04-30 01:53:17
 it subtracts the matches from data2019-04-30 01:53:30
 so full matches just leave lots of zeroes2019-04-30 01:53:39
 ...which gets removed by a RLE pass later2019-04-30 01:54:15
<FunkyBob> hrm2019-04-30 01:54:39
<Shelwien> but it also allows bsdiff to work with fuzzy matches2019-04-30 01:54:52
 if there're some differences in the middle, they just remain in literal buffer2019-04-30 01:55:16
 and when you subtract code in v2 exe from code in v1 exe2019-04-30 01:55:39
<FunkyBob> hrm... similar to my thinking :)2019-04-30 01:55:40
<Shelwien> you get lots of same code, but addr differences in it2019-04-30 01:56:07
 and deltas of these addrs tend to be the same2019-04-30 01:56:23
 so these match deltas are successfully compressed by 2nd LZ pass2019-04-30 01:56:53
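A minimal sketch of the subtraction idea above (illustrative only, not bsdiff's actual code): instead of replacing a match with a reference, subtract the matched old bytes from the new bytes. Exact matches become runs of zeroes (removed by the RLE pass), fuzzy matches leave small residuals in place, and shifted addresses in exe code become repeating deltas that the second LZ pass can pick up.

    // delta_match: emit byte-wise differences for a match region (mod 256).
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    std::vector<uint8_t> delta_match(const std::vector<uint8_t>& oldv,
                                     const std::vector<uint8_t>& newv,
                                     size_t old_pos, size_t new_pos, size_t len) {
        std::vector<uint8_t> out(len);
        for (size_t i = 0; i < len; i++)
            out[i] = uint8_t(newv[new_pos + i] - oldv[old_pos + i]);
        return out;
    }

    int main() {
        // "v2" differs from "v1" in a single address-like byte.
        std::vector<uint8_t> v1 = {0x55, 0x89, 0xE5, 0x10, 0x90, 0xC3};
        std::vector<uint8_t> v2 = {0x55, 0x89, 0xE5, 0x14, 0x90, 0xC3};
        for (uint8_t b : delta_match(v1, v2, 0, 0, v1.size()))
            printf("%02X ", b);              // prints: 00 00 00 04 00 00
        printf("\n");
        return 0;
    }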
<FunkyBob> well, seems you've solved the last problem I had left :)2019-04-30 01:56:57
<Shelwien> but what I also found2019-04-30 01:57:12
* FunkyBob is currently working on a better structure for finding matches for his "basic lz" to make it faster2019-04-30 01:57:33
 is that bsdiff+lzma works much better than just lzma of same data2019-04-30 01:57:57
<FunkyBob> I also need to take some time and work out how to do some sort of "optimal" parsing2019-04-30 01:58:02
 Shelwien: valuable info2019-04-30 01:58:25
<Shelwien> I think its easier to start with dumb bruteforce matchfinding2019-04-30 01:58:54
<FunkyBob> I have2019-04-30 01:59:02
<Shelwien> like, build a digram hashchain2019-04-30 01:59:06
<FunkyBob> and I have a hash -> fixed array system2019-04-30 01:59:10
<Shelwien> and do full scans on it2019-04-30 01:59:12
 uh, not that2019-04-30 01:59:19
<FunkyBob> "digram hashchain"?2019-04-30 01:59:19
<Shelwien> digram = 2 symbols2019-04-30 01:59:48
<FunkyBob> ahhh2019-04-30 02:00:04
<Shelwien> and hashchain is the data structure which is used in most LZ libs, like zlib2019-04-30 02:00:16
<FunkyBob> yus2019-04-30 02:00:27
<Shelwien> where hashtable keeps the address of last context match2019-04-30 02:00:34
 and then a table parallel to data window (pointer per byte)2019-04-30 02:00:54
 keeps the previous context occurrence addr for each symbol2019-04-30 02:01:19
 so you get a linked list of all occurrences of these 2 bytes2019-04-30 02:02:09
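A minimal sketch of such a digram hashchain (names and layout are mine, not zlib's): head[] maps a 2-byte context to its most recent position, prev[] holds one back-pointer per window byte, and walking prev[] enumerates every earlier occurrence of the same 2 bytes.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct HashChain {
        std::vector<int> head, prev;
        explicit HashChain(size_t window) : head(65536, -1), prev(window, -1) {}

        // Index position i (data[i+1] must be valid).
        void insert(const uint8_t* data, int i) {
            unsigned h = data[i] | (data[i + 1] << 8);  // digram used as hash
            prev[i] = head[h];
            head[h] = i;
        }
        // Count earlier positions sharing the digram at i (match candidates).
        int candidates(const uint8_t* data, int i) const {
            unsigned h = data[i] | (data[i + 1] << 8);
            int n = 0;
            for (int p = head[h]; p >= 0; p = prev[p]) n++;
            return n;
        }
    };

    int main() {
        const uint8_t text[] = "abababab";   // "ab" occurs at 0, 2, 4, 6
        HashChain hc(sizeof(text));
        for (int i = 0; i < 6; i++) hc.insert(text, i);
        printf("%d\n", hc.candidates(text, 6));  // 3 earlier "ab" occurrences
        return 0;
    }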
<FunkyBob> neat2019-04-30 02:02:27
 right now, I'm going to get dinner :)2019-04-30 02:02:33
<Shelwien> its still faster than a linear scan2019-04-30 02:02:35
 and doesn't lose any matches2019-04-30 02:02:53
 though i've been thinking that it might be possible to force linear scans to become practical for small windows2019-04-30 02:03:37
 like with Boyer-Moore LUTs, or some vector magic2019-04-30 02:04:14
 btw, you can also look at this: https://encode.ru/threads/3005-TP-LINK-router-config-compression2019-04-30 02:06:47
 i didn't see format quite like that anywhere else (with interleaved bit and byte streams)2019-04-30 02:07:57
 its kinda annoying to encode, because code bytes have to be arranged to follow decoder logic2019-04-30 02:08:37
 but decoding should have the same speed/complexity as normal LZSS2019-04-30 02:09:16
 while supporting variable-size bitfields for distance/length2019-04-30 02:09:49
*** richox has joined the channel2019-04-30 02:59:34
*** unic0rn has joined the channel2019-04-30 03:21:11
<FunkyBob> back2019-04-30 03:41:54
 well, as I'm sure you've seen from my posts and my code, I'm still new to actually coding compression stuff, despite having studied it for a looong time :)2019-04-30 03:45:32
<richox> :)2019-04-30 03:57:45
<Shelwien> there're plenty of people who just invent compression from scratch for themselves2019-04-30 04:01:42
 like mahoney2019-04-30 04:01:50
 so the only requirement for writing a good compressor is programming in C/C++ :)2019-04-30 04:03:26
 everything else can be always reinvented, when you have a hard case of "not invented here syndrome" 2019-04-30 04:04:53
<FunkyBob> well, I'm only one hamming step from mahoney :P2019-04-30 04:05:14
<Shelwien> btw, did you see DCE?2019-04-30 04:05:25
<FunkyBob> ?2019-04-30 04:05:30
<Shelwien> http://mattmahoney.net/dc/dce.html2019-04-30 04:05:41
<FunkyBob> oh, right... yes2019-04-30 04:06:36
 thanks, though2019-04-30 04:06:40
 I'm partly using this exploration as a way to refresh my C skills2019-04-30 04:08:11
 but also... exploring how different changes match up with my mental models2019-04-30 04:08:24
<Shelwien> even if you don't want STL strings and vectors2019-04-30 04:09:56
 these days its a good idea to use C++ rather than C2019-04-30 04:10:10
 most of C code compiles as C++ anyway, except for some typecasts maybe2019-04-30 04:10:38
 but C++ also has templates2019-04-30 04:10:57
 class methods that let you avoid typing ptr-> before every name2019-04-30 04:11:38
<FunkyBob> most of the good bits of c++ I liked are in C now anyway :P2019-04-30 04:11:47
 there is that, yes2019-04-30 04:11:53
<Shelwien> nope, templates are not2019-04-30 04:11:58
<FunkyBob> no, that's true2019-04-30 04:12:08
<Shelwien> and they're very, very helpful for performance2019-04-30 04:12:12
 since without templates2019-04-30 04:12:19
<FunkyBob> my everyday language is Python, so classes are natural to me:)2019-04-30 04:12:20
<Shelwien> code usually ends up with lots of dynamic memory allocation2019-04-30 04:12:33
 and that's crazily slow these days2019-04-30 04:12:52
 and in C++ with templates2019-04-30 04:13:07
 I can just define everything statically2019-04-30 04:13:36
 for example, reflate library2019-04-30 04:13:50
 it has to decode deflate stream, encode it, then encode the diff with original stream2019-04-30 04:14:33
 and then there's recursion2019-04-30 04:14:52
 ie a deflate stream can contain a pdf file2019-04-30 04:15:02
 with its own deflate streams2019-04-30 04:15:07
 so, the reflate library does stream processing2019-04-30 04:15:45
<FunkyBob> I'll look into it2019-04-30 04:16:07
<Shelwien> takes an input stream, detects deflate in it, unpacks the data + adds diffs for lossless restore2019-04-30 04:16:15
 and there's no dynamic memory allocation at all2019-04-30 04:16:30
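A toy illustration of the templates-for-static-allocation point (my example, not reflate's actual code): with the buffer size as a compile-time parameter, all state can live in fixed storage, so there is no heap allocation on the hot path.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    template <size_t WINDOW>
    struct StreamState {
        uint8_t win[WINDOW];                 // fixed window, no new/malloc
        size_t  pos = 0;
        void put(uint8_t c) { win[pos++ % WINDOW] = c; }
    };

    static StreamState<1 << 16> state;       // statically allocated, 64K window

    int main() {
        for (int i = 0; i < 100000; i++) state.put(uint8_t(i));
        printf("pos=%zu\n", state.pos);
        return 0;
    }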
<FunkyBob> sounds like what I use AdvanceCOMP for2019-04-30 04:16:41
<Shelwien> well, most of office formats these days use zip containers2019-04-30 04:19:06
 also apk, jar etc2019-04-30 04:19:19
 so any modern archiver needs something like that2019-04-30 04:19:35
 otherwise there's kinda nothing to compress :)2019-04-30 04:19:45
<FunkyBob> yeah, we use xlsx at work, so I repack that2019-04-30 04:21:07
<Shelwien> advancecomp is lossy, right?2019-04-30 04:24:01
 its a different kind of recompression then2019-04-30 04:24:37
 i meant something like https://github.com/schnaader/precomp-cpp2019-04-30 04:26:15
<FunkyBob> no, advancecomp uses a variety of deflate engines2019-04-30 04:26:43
  -1, --shrink-fast Compress fast (zlib)2019-04-30 04:26:57
  -2, --shrink-normal Compress normal (libdeflate)2019-04-30 04:26:57
  -3, --shrink-extra Compress extra (7z)2019-04-30 04:26:57
  -4, --shrink-insane Compress extreme (zopfli)2019-04-30 04:26:57
  -i N, --iter=N Compress iterations2019-04-30 04:27:14
<Shelwien> i mean, it can optimize the container, but can't compress it, then restore as it was, right?2019-04-30 04:27:20
<FunkyBob> it's solely for recompressing deflate streams2019-04-30 04:27:55
 either on their own, or in png or zip files2019-04-30 04:28:01
<Shelwien> but for a normal archiver, like 7-zip, you need a lossless method2019-04-30 04:28:08
<FunkyBob> it's lossless...2019-04-30 04:28:24
 it's deflate2019-04-30 04:28:26
<Shelwien> yes, but processing is lossy2019-04-30 04:28:42
<FunkyBob> well, you won't get back the original archive ... but you will get back the original decompressed data2019-04-30 04:29:46
<Shelwien> btw http://nishi.dreamhosters.com/u/stegzip_v2.rar :)2019-04-30 04:30:04
<FunkyBob> reminded me... I wanted to ping the ULZ thread for a Linux version2019-04-30 04:30:30
<Shelwien> just compile it?2019-04-30 04:30:45
<FunkyBob> last I saw, no source2019-04-30 04:30:53
<Shelwien> https://github.com/encode84/ulz2019-04-30 04:31:14
<FunkyBob> sorry, meant https://encode.ru/threads/550-Ultra-fast-LZ/page82019-04-30 04:31:41
*** richox has left the channel2019-04-30 04:31:47
 must have missed that post2019-04-30 04:32:58
<Shelwien> btw, as to optimal parsing2019-04-30 04:37:41
 its very simple for this kind of coders2019-04-30 04:37:52
 you just have to make a price/choice array with element per symbol2019-04-30 04:38:33
 and at each position try all possible choices (literal(s)/matches)2019-04-30 04:39:02
 pos[i+len].price = min( pos[i+len].price, pos[i].price + price(current_choice) );2019-04-30 04:40:04
 so when you reach the end of data2019-04-30 04:40:23
 at pos[filesize] you would have your optimal compressed size2019-04-30 04:40:41
 and then you have to do a backward pass to reconstruct the choices required to reach this len2019-04-30 04:41:34
 (choices are actually just stored too when new min is found)2019-04-30 04:41:51
 that's it2019-04-30 04:41:55
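A minimal sketch of that price/choice array (the bit costs are invented for illustration: 9 bits per literal, 17 per match; a real coder prices each choice with its model and tries every match length, not just the longest):

    #include <cstdint>
    #include <cstdio>
    #include <limits>
    #include <vector>

    struct Node { uint32_t price; int from; bool match; };
    const uint32_t INF = std::numeric_limits<uint32_t>::max();

    // matches[i] = longest match length available at position i (0 = none).
    std::vector<Node> parse(const std::vector<int>& matches) {
        size_t n = matches.size();
        std::vector<Node> pos(n + 1, {INF, -1, false});
        pos[0].price = 0;
        for (size_t i = 0; i < n; i++) {
            if (pos[i].price == INF) continue;
            if (pos[i].price + 9 < pos[i + 1].price)           // choice: literal
                pos[i + 1] = {pos[i].price + 9, (int)i, false};
            int len = matches[i];                              // choice: match
            if (len >= 2 && pos[i].price + 17 < pos[i + len].price)
                pos[i + len] = {pos[i].price + 17, (int)i, true};
        }
        return pos;  // pos[n].price = optimal size; choices stored via .from
    }

    int main() {
        std::vector<int> m = {0, 0, 0, 0, 4, 0, 0, 0};  // one len-4 match at pos 4
        auto pos = parse(m);
        printf("optimal size: %u bits\n", pos[m.size()].price);  // 4*9+17 = 53
        for (int i = (int)m.size(); i > 0; i = pos[i].from)      // backward pass
            printf("%s ends at %d\n", pos[i].match ? "match" : "literal", i);
        return 0;
    }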
<FunkyBob> yeh, I've readt some posts [from cbloom et al] on how... especially since there's no entropy codec2019-04-30 04:43:55
 right now I don't feel so bad, watching how long ulz is taking2019-04-30 04:44:05
 sure, at level 9 it beats lz4, but not for the time2019-04-30 04:46:13
<Shelwien> encode's usual problem is that he uses getc/putc for i/o :)2019-04-30 04:47:09
<FunkyBob> erk... I stopped that some time back :)2019-04-30 04:47:25
 am reading 8MB at a time...2019-04-30 04:47:29
 was 16M, but... reasons :)2019-04-30 04:47:37
<Shelwien> i checked and ulz has blockwise fread, so i guess i was wrong2019-04-30 04:48:16
 but still, that's 16M blocks, which is also bad for performance2019-04-30 04:48:42
 i/o needs to be either async, mmap, or small aligned blocks, at most 64k2019-04-30 04:49:19
 reading or writing unaligned chunks is slower because OS has to align it for you - it only can write whole clusters at once to file2019-04-30 04:50:13
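A trivial sketch of the small-block reading advice (64k is the figure given above; truly aligned/async i/o would go further, e.g. page-aligned buffers or overlapped reads):

    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        FILE* f = fopen(argv[1], "rb");
        if (!f) return 1;
        enum { BLOCK = 64 * 1024 };          // small block, multiple of cluster size
        unsigned char* buf = (unsigned char*)malloc(BLOCK);
        size_t n, total = 0;
        while ((n = fread(buf, 1, BLOCK, f)) > 0)
            total += n;                      // ...process the block here...
        printf("%zu bytes\n", total);
        free(buf);
        fclose(f);
        return 0;
    }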
<FunkyBob> given deflate is, what, 32k blocks, and lz4 is 64k...2019-04-30 04:50:19
 I should also test if aligning my 8-byte comparison helps2019-04-30 04:50:36
<Shelwien> I solved the i/o problem for myself with coroutines2019-04-30 04:51:00
 so frontend always does aligned i/o2019-04-30 04:51:36
<FunkyBob> so which are your tools?2019-04-30 04:51:37
 yeah, i have a fixed "main.c" I share between my tests2019-04-30 04:52:00
 reads 8M at a time,2019-04-30 04:52:16
<Shelwien> yeah, that's also not too good2019-04-30 04:52:30
 it blocks while reading that2019-04-30 04:52:36
<FunkyBob> point2019-04-30 04:52:47
 but for now it's consistent2019-04-30 04:52:53
<Shelwien> https://github.com/Shelwien/ppmd_sh/blob/master/pmd.cpp#L542019-04-30 04:53:26
 i don't have any bytewise LZs so its hard to compare2019-04-30 04:54:03
<FunkyBob> ah, ppmd2019-04-30 04:54:30
<Shelwien> found some old pictures2019-04-30 05:18:46
 https://sites.google.com/site/shelwien/bcm8e25.png2019-04-30 05:18:49
 (red = memory used, green = bytes read, blue = bytes written)2019-04-30 05:19:06
 (horizontal axis is time in seconds, vertical is volume in bytes)2019-04-30 05:20:25
 https://sites.google.com/site/shelwien/ccm_5.png2019-04-30 05:20:41
 ccm graph is especially nice because it's visible how it takes time to initialize 512M hashtable at start :)2019-04-30 05:21:39
*** richox has joined the channel2019-04-30 05:50:28
<richox> i no longer use c++ after a learned rust2019-04-30 05:51:43
 c++ sucks2019-04-30 05:51:46
<unic0rn> sadly, it sometimes can't be avoided2019-04-30 05:58:09
<FunkyBob> I'd rather avoid language wars in here, if possible :)2019-04-30 05:59:01
<Shelwien> rust uses llvm for compiling, right?2019-04-30 06:01:08
 unfortunately its not good enough yet2019-04-30 06:01:25
 gcc and intelc are better compilers, quite frequently2019-04-30 06:01:43
<FunkyBob> for C, perhaps2019-04-30 06:02:08
<Shelwien> for x86/x64, more like2019-04-30 06:02:31
 llvm is still pretty bad at auto-vectorization2019-04-30 06:03:00
 sometimes it can be better at scalar code - for example, i've got a faster build for precomp with clang2019-04-30 06:03:36
 but overall its still worse2019-04-30 06:04:04
  1.750s 5.687s: raw2unp_gcc820x.exe enwik9.gz x2019-04-30 06:05:14
  1.906s 6.672s: raw2unp_cl601x.exe enwik9.gz x2019-04-30 06:05:14
  1.328s 2.578s: raw2unp.exe enwik9.gz x -- IC19/x64 (1-2.578/3.469)*100 = 25.68%2019-04-30 06:05:15
 for example2019-04-30 06:05:17
 that's my deflate decoder on two 1gb samples2019-04-30 06:05:51
 sure, its getting better with time, but gcc also does2019-04-30 06:06:53
<FunkyBob> on a practical note, what the world needs is better offline deflate and brotle encoders :)2019-04-30 06:07:51
<Shelwien> somehow i dislike brotli a lot2019-04-30 06:08:18
 its encoder levels are unbalanced, thus useless2019-04-30 06:08:43
 we actually tested it for possible integration into our archive format2019-04-30 06:09:38
 but didn't find a use case for it2019-04-30 06:09:51
 either some other codec has better compression2019-04-30 06:10:12
 or brotli encoder is so slow, that nobody would use it2019-04-30 06:10:24
 zstd turned out much more practical2019-04-30 06:11:54
 but then, zstd level profiles are actually generated by a special tool, based on speed/ratio balance2019-04-30 06:12:43
<FunkyBob> yeah, I'm not all that enamoured of it, either... but it's supported2019-04-30 06:13:06
<Shelwien> i want to write my own parsing optimizer for deflate though2019-04-30 06:13:42
<FunkyBob> please do :)2019-04-30 06:14:07
<Shelwien> there won't be any practical use in it though2019-04-30 06:14:43
 there're plenty of these optimizers already2019-04-30 06:15:02
 in fact 7-zip only has an optimizing encoder for deflate2019-04-30 06:15:20
 which is kinda slow2019-04-30 06:15:31
 so at best I can expect combining 7z+defluff results in one coder2019-04-30 06:16:15
 so i'm actually more interested in lzma optimization2019-04-30 06:17:03
 you see, lzma has position context for tokens2019-04-30 06:17:56
 but its matchfinder only collects one most recent match per length2019-04-30 06:18:21
 basically it collects distances for matches from len=2 to 2732019-04-30 06:18:42
 and passes these to optimizer2019-04-30 06:18:56
 but taking into account position context, we'd actually need at least a match per len for each alignment2019-04-30 06:20:15
 and then2019-04-30 06:20:26
 lzma's rep-match system2019-04-30 06:20:37
 actually mostly works because of fuzzy matches2019-04-30 06:21:00
 if you'd look at lzma parsing results2019-04-30 06:22:06
 rep-matches are mostly used to skip "parameters" in patterns2019-04-30 06:23:07
 but lzma matchfinder doesn't have any support for fuzzy matches2019-04-30 06:23:37
 although parsing optimizer does check some specific patterns2019-04-30 06:24:19
 like match-literal-rep_match2019-04-30 06:24:26
 but overall it mostly uses them by accident2019-04-30 06:24:49
 so it won't be surprising if a proper parsing optimizer could improve lzma compression by 10% or so2019-04-30 06:26:00
 at least on structured data, like exes2019-04-30 06:26:08
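For reference, lzma's position context ("posState") is just the low pb bits of the stream position (pb=2 by default), which is why the same candidate match prices differently at different alignments:

    #include <cstdio>

    int main() {
        const unsigned pb = 2;                   // lzma default
        const unsigned posMask = (1u << pb) - 1;
        for (unsigned pos = 100; pos < 104; pos++)
            printf("pos %u -> posState %u\n", pos, pos & posMask);  // 0,1,2,3
        return 0;
    }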
*** Jibz has joined the channel2019-04-30 07:00:41
 still trying to optimize a counter FSM with SAT2019-04-30 07:09:38
 CBMC now loads the generated script and translates2019-04-30 07:11:11
 but can't reach solving stage - it hits 56G memory usage and quits2019-04-30 07:11:55
 I tried to precalculate some stuff to reduce size of generated source2019-04-30 07:12:53
 but looks like it somehow complicated things for CBMC2019-04-30 07:13:18
*** richox has left the channel2019-04-30 08:45:09
*** schnaader has joined the channel2019-04-30 09:28:09
*** schnaader has left the channel2019-04-30 10:17:30
*** Eppie has left the channel2019-04-30 11:28:17
*** Jibz has left the channel2019-04-30 12:00:55
*** Eppie has joined the channel2019-04-30 13:40:08
<Eppie> I've been working on this for a while, do any of you think anyone will find any value in it? https://github.com/andrew-epstein/paq_history2019-04-30 13:41:15
*** Eppie has left the channel2019-04-30 15:03:03
<Shelwien> "history" actually started from this: http://mattmahoney.net/dc/paq.html#neural2019-04-30 20:09:24
 i think its a good idea to archive them all on github2019-04-30 20:10:03
 although i'd appreciate something more detailed2019-04-30 20:10:54
 like the list of involved ideas (stochastic fsm counters?)2019-04-30 20:12:57
 and their sources2019-04-30 20:13:19
 like, SSE was first implemented by Dmitry Shkarin in ppmonstr, based on SEE in C.Bloom's ppmz2019-04-30 20:14:42
 also, another useful project2019-04-30 20:16:25
 would be splitting paq into individual components (rather than single .cpp source)2019-04-30 20:16:45
 and tracking their individual evolution over paq development2019-04-30 20:17:04
<FunkyBob> Shelwien: so, I've taken your advice, and am starting on a new tac... and, wishing TurboDebugger was still a thing :P2019-04-30 20:39:35
 time to learn me some gdb :P2019-04-30 20:40:08
<unic0rn> there are no bugs, only features ;)2019-04-30 21:13:22
<FunkyBob> well, this feature means my code never gets past the first match lookup :/2019-04-30 21:13:59
<Shelwien> i'm using ida pro these days, its almost like turbo debugger :)2019-04-30 21:44:52
 http://nishi.dreamhosters.com/u/ida_dbg1.png2019-04-30 21:48:18
<FunkyBob> looks purdy2019-04-30 21:58:27
 https://www.hex-rays.com/products/ida/debugger/index.shtml ?2019-04-30 21:58:48
<Shelwien> yes, it can be found in torrents, latest available version is 7.02019-04-30 22:35:31
 source-level debug needs exes compiled with debug-info2019-04-30 22:35:49
 also there's the decompiler plugin otherwise2019-04-30 22:36:20
<FunkyBob> well, gdb has a built in tui, which is not so bad...2019-04-30 22:38:01
<Shelwien> for most of the normal problems its more convenient to write some logs anyway2019-04-30 23:01:46
 but sometimes there're compiler-related problems2019-04-30 23:02:16
 where i need to see actual code to understand what happens2019-04-30 23:02:57
 and in these cases ida is useful2019-04-30 23:03:28
*** maniscalco_ has joined the channel2019-04-30 23:13:07
 hi~~2019-04-30 23:13:16
<maniscalco_> hi Eugene. glad to see a channel is back2019-04-30 23:17:35
<Shelwien> not much talking here, as usual :)2019-04-30 23:18:24
 how's your progress btw?2019-04-30 23:22:01
 your code is kinda scary because of STL though :)2019-04-30 23:22:09
<maniscalco_> what, all of that template stuff? 2019-04-30 23:22:54
 definitely get a speed boost though.2019-04-30 23:23:10
<Shelwien> MT itself is okay, sure2019-04-30 23:23:41
 but pthread is one thing, and std::thread is another2019-04-30 23:24:01
 std::thread is not properly supported by all compilers on all platforms2019-04-30 23:24:25
<maniscalco_> which platforms? (I don't get out much)2019-04-30 23:25:20
<Shelwien> i'm still on windows2019-04-30 23:25:31
 gcc/mingw just plain doesn't have std::thread2019-04-30 23:25:42
<maniscalco_> I've been toying with a different approach to recency coding lately. It looks like it's never going to be quiet able to keep up with MTF because it isn't vectorizable 2019-04-30 23:25:50
 Windows? Why?2019-04-30 23:25:56
<Shelwien> there're some unofficial 3rd-party implementations, but when I compiled msufsort with that, it crashed2019-04-30 23:26:34
<maniscalco_> really. hmm. I'll try to find some portable alternative perhaps2019-04-30 23:27:03
<Shelwien> pthread itself is ok2019-04-30 23:27:13
<maniscalco_> ok, well, that wouldn't be too hard to switch to I suppose2019-04-30 23:27:32
<Shelwien> that one is not supported by VS/IC, but there're working implementations2019-04-30 23:27:51
<maniscalco_> I really have to force myself to make time to finish v4 anyhow2019-04-30 23:28:08
<Shelwien> yeah :)2019-04-30 23:28:14
<maniscalco_> instead I'm screwing around with MTF alternatives ...2019-04-30 23:28:32
<Shelwien> well, BWT certainly needs some kind of solution for compression issue2019-04-30 23:28:57
 there's certainly plenty of potential for transform speed improvement2019-04-30 23:29:24
 but what to compress with it?2019-04-30 23:29:33
<maniscalco_> the logical next advance would be context based but also building up a model2019-04-30 23:29:51
 oh, I had some insights into bit wise bwt. 2019-04-30 23:30:06
 I remember that we were discussing it a while back. could it be better than 32N?2019-04-30 23:30:29
 and it definitely can be ... by a lot2019-04-30 23:30:39
<Shelwien> we tried to use BWT/qlfc in our archiver (basically a refactored bsc, because it uses random access to files)2019-04-30 23:30:45
<maniscalco_> what archiver is this?2019-04-30 23:31:04
<Shelwien> i added MT support and everything2019-04-30 23:31:04
 http://nishi.dreamhosters.com/u/7zdll_vF6.rar2019-04-30 23:31:24
 .pa format2019-04-30 23:31:26
<maniscalco_> I'll take a look2019-04-30 23:31:35
<Shelwien> i just added my selection of codecs to .7z basically2019-04-30 23:32:02
 BWT-wise its nothing interesting really2019-04-30 23:32:33
 though you can look at this: http://nishi.dreamhosters.com/7zcmd.html2019-04-30 23:32:47
 one mostly unknown interesting feature of 7-zip is its support for codec trees with multiple output streams2019-04-30 23:33:36
<maniscalco_> I'm pretty sure that bitwise BWT can be produced with a worst case of 8N space2019-04-30 23:33:50
 codec trees? 2019-04-30 23:34:25
<Shelwien> btw, i have a simple bitwise BWT there2019-04-30 23:35:00
<maniscalco_> expand 1 bit to 1 byte?2019-04-30 23:35:22
<Shelwien> yes, but still can estimate compression with it2019-04-30 23:35:45
 actually i needed it for cdm2019-04-30 23:36:04
 which is a kind of universal recompressor for bitcode formats2019-04-30 23:36:19
 cdm itself uses optimal parsing to detect and compress areas with slightly skewed distribution of 0/12019-04-30 23:36:57
<maniscalco_> there are special nuances of bitwise suffix arrays that allow it to be extremely fast for the second stage of the ITS. That insight, plus the fact that binary m99 can encode/decode in the hundreds of MB/sec single threaded is what motivated me to see if I could produce a superior recency transform.2019-04-30 23:37:27
<Shelwien> and bitwise BWT creates these for formats based on huffman coding2019-04-30 23:37:36
 https://encode.ru/threads/2742-Compressed-data-model?p=52493&viewfull=1#post524932019-04-30 23:37:59
<maniscalco_> I'll take a look. Stepping away, have to put the kids to bed.2019-04-30 23:38:58
<Shelwien> ah, ok2019-04-30 23:39:04
 as to codec trees btw2019-04-30 23:39:53
 7-zip has a system designed like directshow system2019-04-30 23:40:32
 https://en.wikipedia.org/wiki/DirectShow#Architecture2019-04-30 23:41:09
 video streams have a complex design with lots of replaceable parts2019-04-30 23:41:43
 container format parser, video/audio/subtitle streams, decoders for these, etc2019-04-30 23:42:29
 and now we kinda have the same situation in compression2019-04-30 23:43:06
 because there're various preprocessing methods for different file formats2019-04-30 23:43:50
 and sometimes preprocessors output multiple streams, which have to be compressed separately, etc2019-04-30 23:44:20
 for BWT thats even more important2019-04-30 23:46:17
* FunkyBob still remembers the derision when BWT was first announced2019-04-30 23:47:33
 derision?2019-04-30 23:48:04
<FunkyBob> yeah... people initially thought it was yet another "infinite compression" scan2019-04-30 23:48:28
<Shelwien> well, I didn't like how they used text preprocessing and compared the results to straight LZ/PPM2019-04-30 23:48:31
<FunkyBob> but upon careful reading, it was clear that's not what they claimed at all2019-04-30 23:48:50
<Shelwien> ah, no, we didn't have that, its a pretty obvious transform, with lots of previously known methods leading to it2019-04-30 23:49:20
 for example, we had static CM2019-04-30 23:49:44
 which generated order* statistics for a file2019-04-30 23:50:02
 then compressed these statistics instead of data2019-04-30 23:50:11
 also, ST2-ST4 are very obvious and easy to make2019-04-30 23:51:18
 so the only real novelty is how simple inverse BWT can be2019-04-30 23:54:36
 but that doesn't affect the compression2019-04-30 23:54:48
 still, there was a lot of excitement for a few years, when it appeared2019-04-30 23:55:59
 and comparing to compression algorithms like LZ77/8 and PPM, its certainly a step forward2019-04-30 23:57:00
 but unfortunately it loses to serial LZ/CM/PPM when they start using different contexts2019-04-30 23:58:15
 like position alignment or fuzzy matches in lzma2019-04-30 23:58:55
<FunkyBob> ok, I've fixed the stalling problem... now I just need to fix the "never finds a match" problem2019-05-01 00:23:43
<maniscalco_> There are improvements yet to be had in BWT. But it's going to require m03 like parsing and building up context models to make any significant advances.2019-05-01 00:24:26
 Plus It has other applications outside of compression2019-05-01 00:24:52
<Shelwien> sure, it even has applications in compression - for efficient data indexing2019-05-01 00:27:56
 i want to try making a LZ matchfinder with it2019-05-01 00:28:18
 by building SAs for small blocks (64k?) and linking them2019-05-01 00:29:08
 for 64k-or-smaller blocks it would be 2N for SA, 2N for links to prev SA, 1N for prefix lengths, 1N for actual data2019-05-01 00:31:03
<FunkyBob> yay, progress!2019-05-01 00:31:57
 first block got compression... 2019-05-01 00:32:02
<Shelwien> :)2019-05-01 00:32:05
<FunkyBob> and fixed2019-05-01 00:32:31
<Shelwien> wanna see the lzma bt4 matchfinder?2019-05-01 00:32:59
<FunkyBob> not sure I can brain enough to grok it right now :)2019-05-01 00:33:48
<Shelwien> http://nishi.dreamhosters.com/u/mf_bt4_v0.rar2019-05-01 00:33:59
 it generates readable output2019-05-01 00:34:19
<FunkyBob> 51,518,3282019-05-01 00:36:32
 is my initial result for enwik82019-05-01 00:36:40
<Shelwien> yeah, better switch to BWT i guess :)2019-05-01 00:37:30
 bsc output for enwik8 is 24M _before_ actual entropy coding2019-05-01 00:37:58
<maniscalco_> what are the steps for BSC again. BWT->QLFC and aynthing else?2019-05-01 00:42:33
<Shelwien> also LZP, segmentation2019-05-01 00:43:01
<maniscalco_> enwik8 is pretty stationary so its not difficult to expect it to be greatly reduced like that before entropy coding2019-05-01 00:43:15
<Shelwien> and QLFC is not just reverse-BWT2019-05-01 00:43:20
<maniscalco_> I'm not overly familiar with QLFC. only in passing2019-05-01 00:43:42
<Shelwien> * reverse-MTF i mean2019-05-01 00:43:44
 its actually reverse-MTF + RLE + VL bitcode + order1 CM2019-05-01 00:44:30
<maniscalco_> reverse mtf? explain please2019-05-01 00:45:35
<Shelwien> http://nishi.dreamhosters.com/u/bsc_usage.txt2019-05-01 00:45:40
<maniscalco_> jesus Eugine. you have a link for everything, don't you?2019-05-01 00:46:06
<Shelwien> that's my file dump, i just upload stuff2019-05-01 00:46:22
 mtf can be better explained in terms of distance coding2019-05-01 00:46:47
 mtf turns symbols into counts of different symbols between two occurrences of current symbol2019-05-01 00:47:32
<maniscalco_> yeah, i'm familar with that. DC is from early 2000s2019-05-01 00:47:37
 Binder, I believe was the guy's name.2019-05-01 00:48:06
<Shelwien> while DC encodes distance not counting new symbols2019-05-01 00:48:17
 and QLFC kinda uses MTF ranks2019-05-01 00:48:44
 but they're counts of new symbols not since last symbol occurrence, like normal2019-05-01 00:49:03
 but _to_ next occurrence2019-05-01 00:49:10
 so statistically its the same MTF2019-05-01 00:49:19
 but decoder knowns the actual symbol when reading its rank2019-05-01 00:49:41
 while normally with MTF we'd read the rank, then decode the symbol from it2019-05-01 00:49:59
 and apparently using actual symbol as context for its rank2019-05-01 00:50:19
 can improve entropy coding a lot2019-05-01 00:50:26
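For contrast, a minimal plain-MTF encoder (this is ordinary MTF, not bsc's QLFC): each symbol is replaced by its rank in a recency list and moved to the front. QLFC keeps the same ranks statistically, but as described above it arranges them so the decoder knows the symbol before reading its rank.

    #include <cstdint>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    std::vector<uint8_t> mtf_encode(const std::vector<uint8_t>& in) {
        std::vector<uint8_t> order(256), out;
        std::iota(order.begin(), order.end(), 0);       // recency list 0..255
        for (uint8_t c : in) {
            int r = 0;
            while (order[r] != c) r++;                  // rank = list position
            out.push_back((uint8_t)r);
            for (; r > 0; r--) order[r] = order[r - 1]; // move-to-front
            order[0] = c;
        }
        return out;
    }

    int main() {
        std::vector<uint8_t> bwtish = {'s','s','s','s','i','i','s','s'};
        for (uint8_t r : mtf_encode(bwtish))
            printf("%d ", r);    // runs of a symbol become runs of zero ranks
        printf("\n");
        return 0;
    }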
 there's been a recent improvement on that, i guess2019-05-01 00:58:09
 https://github.com/loxxous/Behemoth-Rank-Coding2019-05-01 00:58:10
 its similar, but also vectorized2019-05-01 00:58:28
<maniscalco_> yes, lucas' work.2019-05-01 00:58:55
<Shelwien> so, as to m032019-05-01 01:02:09
 i think it extracts the original prefix context for a symbol2019-05-01 01:02:31
<maniscalco_> it parses the boundaries of each context at each order on order at a time.2019-05-01 01:03:11
<Shelwien> so it becomes possible to compress as good as best prefix CM2019-05-01 01:03:17
 but it still processes symbols not in original order, but rather in order of contexts2019-05-01 01:03:47
 there're certainly some benefits to that2019-05-01 01:04:05
 but unfortunately plain 8-bit text is not that popular anymore2019-05-01 01:04:32
<maniscalco_> so if you had the input `swiss miss` then the BWT is `swm ssiiss` 2019-05-01 01:04:33
 The order zero context boundaries are trivially easy to locate for both the encoder and decoder:2019-05-01 01:04:52
<Shelwien> so the question is how to modify BWT2019-05-01 01:05:07
 to at least compress 16-bit alphabet for unicode2019-05-01 01:05:17
 or 2D for pictures2019-05-01 01:05:26
<maniscalco_> s|wm| |ssiis|s2019-05-01 01:05:44
<Shelwien> yeah, as i mentioned before, there've been some static CM attempts2019-05-01 01:06:23
<maniscalco_> there's no reason this couldn't be used for 16 bit symbols as well, really2019-05-01 01:06:46
 you do need a bit more space since symbols are now 2 bytes wide but I think that's it2019-05-01 01:07:08
<Shelwien> for 16bit i think there'd be scaling issues because of cpu cache size2019-05-01 01:07:17
 where with 8-bit symbol everything is nicely in L12019-05-01 01:07:31
 with 16-bit symbols it won't fit2019-05-01 01:07:41
<maniscalco_> sure, there would be a performance issue but in terms of algorithmic performance, its all the same2019-05-01 01:08:02
<Shelwien> anyway, ideally we'd need a generalized transform2019-05-01 01:08:51
 which would sort symbols based on whatever function of context2019-05-01 01:09:12
 its actually possible to implement like ST2019-05-01 01:09:36
<maniscalco_> I thought about that in the past. Somehow consider distance in the sort2019-05-01 01:09:45
<Shelwien> simple compute the index for current symbol and output the symbol to stream[index]2019-05-01 01:10:12
<maniscalco_> so that the transform wouldn't attempt to associate 'similar' contexts that are a million miles appart2019-05-01 01:10:15
<Shelwien> well, that's an big issue with BWT compression of sorted dictionaries :)2019-05-01 01:10:42
<maniscalco_> exactly. and why st4 does better2019-05-01 01:10:53
 it's 'too stupid' to see the really good correlations that are toxic in reality2019-05-01 01:11:41
<Shelwien> also, there was an interesting algorithm for serializing of pictures2019-05-01 01:11:59
<maniscalco_> so it chooses whatever is close by, which in the case of dictionaries, IS the most similar2019-05-01 01:12:02
<Shelwien> https://www.youtube.com/watch?v=cGGzZY8OYts2019-05-01 01:12:24
<maniscalco_> the repo on my github does the full m03 parsing. it wouldn't be hard to modify it to do 16 bit symbols instead.2019-05-01 01:13:18
 that would give you a really good idea as to how much it would degrade2019-05-01 01:13:38
<Shelwien> yeah, but we need it to work not much slower than 8-bit, otherwise there's no point2019-05-01 01:13:53
<maniscalco_> only one way to find out2019-05-01 01:15:10
<Shelwien> i got reminded of that idea by NNCP2019-05-01 01:15:50
 it has a self-made preprocessor which extends the alphabet to 16 bits2019-05-01 01:16:17
 then iteratively finds most frequent strings and assigns them to new symbols2019-05-01 01:16:36
 https://bellard.org/nncp/2019-05-01 01:17:05
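A hedged sketch of that alphabet-extension loop in byte-pair-encoding style (the spirit of the idea, not NNCP's actual preprocess.c): repeatedly give the most frequent adjacent pair a fresh symbol past the byte range, until the cut-off frequency is reached.

    #include <algorithm>
    #include <cstdio>
    #include <map>
    #include <vector>

    int main() {
        std::vector<int> s = {'a','b','a','b','c','a','b'};
        int next_sym = 256;                   // first symbol past the byte range
        for (;;) {
            std::map<std::pair<int,int>, int> freq;
            for (size_t i = 0; i + 1 < s.size(); i++)
                freq[{s[i], s[i + 1]}]++;
            auto best = std::max_element(freq.begin(), freq.end(),
                [](const auto& a, const auto& b) { return a.second < b.second; });
            if (best == freq.end() || best->second < 2) break;  // cut-off frequency
            std::vector<int> t;               // rewrite, replacing the best pair
            for (size_t i = 0; i < s.size(); ) {
                if (i + 1 < s.size() && s[i] == best->first.first
                                     && s[i + 1] == best->first.second) {
                    t.push_back(next_sym); i += 2;
                } else {
                    t.push_back(s[i++]);
                }
            }
            s.swap(t);
            next_sym++;
        }
        for (int c : s) printf("%d ", c);     // "ababcab" -> 256 256 99 256
        printf("\n");
        return 0;
    }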
 https://encode.ru/threads/2702-Adaptive-Lossless-Prediction-impl-in-C2019-05-01 01:21:10
<maniscalco_> so m03 has almost no performance loss for 16 bit symbols2019-05-01 02:05:26
 at least for the parsing2019-05-01 02:05:44
 I kept the BWT the same (8 bit) and then widened the BWT output to 16 bits per symbol2019-05-01 02:06:09
 and process as if they were always 16 bit symbols2019-05-01 02:06:17
 results for 16 bit enwik8: 2019-05-01 02:06:47
 ./build/release/bin/m19 3 ~/programming/testset/enwik/enwik8
 m19 - next generation BWT compressor. Author: M.A. Maniscalco (2018 - 2019)
 *** This build is pre-alpha and demonstrates both M19 and M03 context parsing only ***
 max order = 5317
 elapsed time: 8288 ms, 11.5067 MB/sec2019-05-01 02:06:49
 and 8 bit normal enwik8:2019-05-01 02:07:14
 ./build/release/bin/m19 3 ~/programming/testset/enwik/enwik8
 m19 - next generation BWT compressor. Author: M.A. Maniscalco (2018 - 2019)
 *** This build is pre-alpha and demonstrates both M19 and M03 context parsing only ***
 max order = 5317
 elapsed time: 8035 ms, 11.869 MB/sec2019-05-01 02:07:15
 +250ms2019-05-01 02:07:22
 about 2 sec of that time is just the BWT time2019-05-01 02:07:53
<Shelwien> can you try it with nncp preprocessor?2019-05-01 02:08:23
 https://bellard.org/nncp/nncp-2019-04-13.tar.gz2019-05-01 02:08:45
 just compile preprocess.c2019-05-01 02:09:21
 then run .\preprocess c words.txt enwik9 enwik9.bin 65500 22019-05-01 02:09:56
<maniscalco_> and come to think of it, that 250ms is probably the overhead for widening the array2019-05-01 02:13:58
 will do. just pushing the change to m03/19 first2019-05-01 02:14:16
 https://github.com/michaelmaniscalco/m19.git just in case anyone wants to confirm for themselves2019-05-01 02:22:27
<Shelwien> post bwt its less interesting, especially if you just expand 8 to 16 bits2019-05-01 02:24:31
 but that preprocessor can really use the full 16-bit alphabet2019-05-01 02:25:01
<maniscalco_> nncp-2019-04-13$ make
 gcc -O3 -Wall -Wpointer-arith -g -fno-math-errno -fno-trapping-math -MMD -Wno-format-truncation -c -o nncp.o nncp.c
 gcc -O3 -Wall -Wpointer-arith -g -fno-math-errno -fno-trapping-math -MMD -Wno-format-truncation -c -o cp_utils.o cp_utils.c
 gcc -O3 -Wall -Wpointer-arith -g -fno-math-errno -fno-trapping-math -MMD -Wno-format-truncation -c -o arith.o arith.c
 gcc -o nncp nncp.o cp_utils.o arith.o libnc.a -l2019-05-01 02:25:12
<Shelwien> ?2019-05-01 02:25:29
 no need really2019-05-01 02:25:33
<maniscalco_> libnc needs to be compiled with PIC2019-05-01 02:25:46
 Sure, its more interesting with 16 bit BWT. But I can't just write one of those up in 20 minutes (^:2019-05-01 02:26:42
<Shelwien> yes, i mean, just type "gcc preprocess.c"2019-05-01 02:26:43
<maniscalco_> k2019-05-01 02:26:49
 gcc preprocess.c
 /tmp/ccdVrLNV.o: In function `get_n_bits':
 preprocess.c:(.text+0x11c2): undefined reference to `log2'
 preprocess.c:(.text+0x12a6): undefined reference to `log2'
 /tmp/ccdVrLNV.o: In function `compute_score':
 preprocess.c:(.text+0x13b9): undefined reference to `log2'
 /tmp/ccdVrLNV.o: In function `compute_entropy':
 preprocess.c:(.text+0x1b16): undefined reference to `log2'
 preprocess.c:(.text+0x1c17): undefined2019-05-01 02:27:15
 log2 as in the math function ?2019-05-01 02:27:31
<Shelwien> its probably in headers, let me see...2019-05-01 02:28:03
 well, just define it then2019-05-01 02:28:54
<maniscalco_> added -lm at the end and it worked2019-05-01 02:29:36
<Shelwien> cool :)2019-05-01 02:29:44
<maniscalco_> no words.txt file though2019-05-01 02:34:32
<Shelwien> its generated by preprocessor2019-05-01 02:35:04
<maniscalco_> how long should the preprocessing take?2019-05-01 02:40:59
<Shelwien> algorithm is dumb, so not fast2019-05-01 02:42:07
 depends on last number - its a cut-off frequency2019-05-01 02:42:18
 the higher it is, the earlier it would quit2019-05-01 02:42:41
<maniscalco_> regarding m03 and m19 with 16 bit symbols. I can't test it with real 16 bit data because the algorithm only works for *real* bwt data. if I made fake 16 bit input it wouldn't work.2019-05-01 02:46:08
 To test a real world input I would need a BWT that does 16 bit symbols. Do you know of any?2019-05-01 02:46:45
<Shelwien> i can quickly make a qsort one2019-05-01 02:46:59
<maniscalco_> it would allow us to test the parsing with real data. although it shouldn't affect m03/19 performance at all in any case2019-05-01 02:47:46
 right about now we should be hearing the ghost of David Scott saying "you should be using BWTS instead"2019-05-01 02:49:32
 (^:2019-05-01 02:49:34
<Shelwien> http://nishi.dreamhosters.com/u/BWT16.cpp2019-05-01 02:51:51
 i actually added BWTS and bitwise BWTS to .pa, for cdm2019-05-01 02:52:19
 turned out to be the perfect transform for that kind of thing2019-05-01 02:52:46
<maniscalco_> thanks for the BWT .. still has an issue though. M03/19 needs to know the sentinel position in the BWT2019-05-01 03:04:49
 otherwise it will fail2019-05-01 03:04:56
 so it has to be a BWT that uses $ 2019-05-01 03:05:13
 and the preprocessor is still working away ...2019-05-01 03:05:38
<Shelwien> yeah, i also started it:2019-05-01 03:07:06
 Z:\003>timetest .\a c words.txt enwik9 enwik9.bin 65500 2562019-05-01 03:07:08
 input: 1000000000 bytes2019-05-01 03:07:08
 after case/space preprocessing: 1119415035 symbols2019-05-01 03:07:08
  4396 280965958 3483463112019-05-01 03:07:08
 first number is the number of allocated symbols2019-05-01 03:07:24
 http://nishi.dreamhosters.com/u/BWT16.rar2019-05-01 03:21:55
 more complete version with index and unBWT2019-05-01 03:22:14
 and it kinda uses EOF, if you mean the comparison wrap2019-05-01 03:22:54
 prev version sorts in reverse order2019-05-01 03:29:19
 but this one sorts in forward order2019-05-01 03:29:29
 and both don't wrap around, just stop at end of text buffer2019-05-01 03:29:48
<maniscalco_> ok. I'm going to sign off for the night and leave the preprocessor running overnight. hopefully it will be done by the morning. I'll let you know2019-05-01 03:31:24
<Shelwien> ok :)2019-05-01 03:31:39
 !next2019-05-01 03:34:42