* freenode-connect Welcome to freenode. To protect the network all new connections will be scanned for vulnerabilities. This will not harm your computer, and vulnerable hosts will be notified.2019-05-19 09:16:28
*** complogger has joined the channel2019-05-19 09:16:34
<A barely detectible trace of hope> Who knows - it just could be possible... </A barely detectible trace of hope> ||| IF YOU ASK A QUESTION, PLEASE REMAIN IN THE CHANNEL, IT MAY TAKE HOURS TO REPLY (AS TheStar IS OFTEN AFK), BUT HE WILL DO HIS BEST TO ANSWER... ||| Remembe".2019-05-19 09:16:34
<Shelwien> test2019-05-19 09:17:30
<TheWolf> Woof2019-05-19 09:25:10
<FunkyBob> o/2019-05-19 23:45:06
*** schnaader has joined the channel2019-05-20 04:45:08
<FunkyBob> huh... small win... just edged out lz4 for compressing a 10MB file of 00 :P2019-05-20 04:45:58
<schnaader> sounds trivial for most, but when coding a matchfinder, this case is the worst - matches everywhere ^^2019-05-20 04:48:06
 or do you mean you optimized ratio?2019-05-20 04:48:33
<FunkyBob> LZ4 got 41178 bytes2019-05-20 04:48:52
<schnaader> btw, forum is down?2019-05-20 04:49:01
<FunkyBob> I got 41137 bytes2019-05-20 04:49:02
 oh, is it?2019-05-20 04:49:07
<schnaader> perhaps it's just me, but I can't access atm2019-05-20 04:49:21
<FunkyBob> seems down from here, too2019-05-20 04:49:31
<schnaader> hm, 40K still sounds like a lot. is the result still compressible?2019-05-20 04:50:33
 result of some block size, perhaps? 64K * 4 * 40000 = 10 MB or something like this2019-05-20 04:52:16
<FunkyBob> oh, sure... I mean, bzip2 is the current winner, achieving just 49 bytes2019-05-20 04:52:19
<schnaader> Reminds me of the FLIF blog about encoding single color images: https://cloudinary.com/blog/a_one_color_image_is_worth_two_thousand_words2019-05-20 04:53:18
<FunkyBob> for comparison: https://dpaste.de/hgh92019-05-20 04:53:49
 since neither mine nor lz4 have an entropy coder, they really can't measure up2019-05-20 04:54:59
<schnaader> ah ok, that explains things2019-05-20 04:55:48
 on the other hand, worst case is 0.075%, so who cares ^^ any other real data will be much bigger2019-05-20 04:55:49
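The ~41 KB figures above can be sanity-checked from the LZ4 block format itself: a match length beyond the 4-bit token field is extended one byte at a time, each extension byte adding up to 255, so one giant match over a file of zeros needs roughly one extension byte per 255 matched bytes. A back-of-envelope sketch (assuming "10MB" means 10 MiB):

```python
# Without an entropy coder, the 0xFF length-extension bytes dominate the
# compressed size of a single maximal match over a file of zeros.
file_size = 10 * 1024 * 1024            # 10 MiB of 0x00
extension_bytes = file_size // 255      # one extension byte per 255 matched bytes
print(extension_bytes)                  # 41120, in the ballpark of 41137/41178
```

If the file were exactly 10,000,000 bytes the estimate drops to ~39,215, so the reported sizes suggest a 10 MiB file.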
<FunkyBob> am just looking back at my various experiments and attempts2019-05-20 05:01:31
 I really need to make a better parser :)2019-05-20 05:02:53
 or, stop making silly mistakes... just gained a small boost2019-05-20 05:12:24
 by removing something that I should have removed when I stopped doing ROLZ2019-05-20 05:12:40
*** schnaader has left the channel2019-05-20 05:13:04
 improved enwik8 from 43375365 to 43371274 2019-05-20 05:14:42
*** danlock has joined the channel2019-05-20 16:08:07
<danlock> What's new today, friends?2019-05-20 16:09:31
*** danlock has left the channel2019-05-20 16:18:26
*** danlock has joined the channel2019-05-20 18:23:56
 "A barely detectible trace of hope" is all that is necessary for an expansion of hope, a spark of inspiration, a desire to keep trying to develop the next improvement on anything, whether it's an idea or algorithm or thought or merely a desire to survive.2019-05-20 18:32:29
 I might be popping in and out as I set up this client to become a replacement for the WebChat in-browser method of connection I was using before. </unnecessary comment>2019-05-20 18:43:11
<Shelwien> LZ4 -3 is 43,893,379 2019-05-20 19:57:14
 so next is -4? :)2019-05-20 19:57:36
 looks like forum is still down :(2019-05-20 19:58:32
<danlock> I noticed :(2019-05-20 20:04:22
<FunkyBob> Shelwien: yeah, well, so I've played with your ideas on rep-offset... and, at least for enwik8, it's a net loss because of the prevalence of short runs2019-05-20 21:30:06
 ISTM better/smarter parsing is the way forward2019-05-20 21:30:23
<Shelwien> well, in enwik8 lzma found these:2019-05-20 22:05:53
  215: <T1?:??:>2019-05-20 22:05:56
  172: <T0?:??:>2019-05-20 22:05:56
  151: <\x3C/id\x3E\n \x3Ctimestamp\x3E2006-0?->2019-05-20 22:05:56
  145: <гѓ?г>2019-05-20 22:05:56
  143: <T1?:>2019-05-20 22:05:56
  142: <\x3C/id\x3E\n \x3Crevision\x3E\n \x3Cid\x3E??9>2019-05-20 22:05:56
  123: <\x3C/id\x3E\n \x3Ctimestamp\x3E2006-03-0?T??:??:>2019-05-20 22:05:57
  120: <:5?:>2019-05-20 22:05:57
  119: <\x3C/id\x3E\n \x3Ctimestamp\x3E2006-02-??T??:??:>2019-05-20 22:05:58
  118: <\x3C/id\x3E\n \x3Ctimestamp\x3E200?->2019-05-20 22:05:58
  104: <-0?->2019-05-20 22:05:59
  103: <.1??.>2019-05-20 22:05:59
 It's the output from my fuzzy match detector; "?" are literals2019-05-20 22:06:54
 so yeah, you probably forgot to reduce minmatchlen for rep-matches, or something2019-05-20 22:07:34
<FunkyBob> I did adjust my code to calculate the "gain" for each match2019-05-20 22:10:19
 [something I used to have, but removed when I went to fixed length encoding]2019-05-20 22:10:35
 so... tips on improved parsing? :)2019-05-20 22:18:24
 (in case you hadn't figured it out, I'm back on UTC+10)2019-05-20 22:19:56
<Shelwien> well, just implement optimal parsing? its pretty simple for LZ42019-05-20 22:38:07
 unlike formats with entropy coding2019-05-20 22:38:21
<FunkyBob> yes, but which optimal parsing? :)2019-05-20 22:38:51
 true, it's simpler because of no entropy coding2019-05-20 22:39:02
 I expect it also to be much slower2019-05-20 22:39:11
 since it has to do a match hunt at every byte2019-05-20 22:39:19
 I can possibly still get some wins for edge cases between match length encodings2019-05-20 22:39:37
* danlock thinks, "Entropy...."2019-05-20 22:58:25
*** danlock has left the channel2019-05-20 23:45:46
*** danlock has joined the channel2019-05-21 00:38:02
*** danlock has left the channel2019-05-21 00:40:09
*** danlock has joined the channel2019-05-21 00:40:09
*** danlock has left the channel2019-05-21 00:44:15
<Shelwien> well, hashchain may be too slow for optimal parsing, yes2019-05-21 01:03:58
 basically you have to find nearest matches of all lengths at each byte2019-05-21 01:04:41
 but there're more efficient solutions for this2019-05-21 01:06:15
 like suffix tree, lzma's bt4 or BWT suffix array2019-05-21 01:06:49
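For reference, the hash-chain structure mentioned above (the one that tends to be too slow for optimal parsing) can be sketched like this. A minimal illustration, not any particular codec's implementation; all names (`HashChainMF`, `head`, `prev`) are conventional, not taken from a real library:

```python
class HashChainMF:
    """Toy hash-chain matchfinder: head maps a min_match-byte prefix to its
    most recent position; prev links earlier positions with the same prefix."""

    def __init__(self, data, min_match=4, max_chain=64, window=1 << 16):
        self.data = data
        self.min_match = min_match
        self.max_chain = max_chain          # cap on chain walk length
        self.window = window                # max allowed match distance
        self.head = {}                      # prefix -> most recent position
        self.prev = [-1] * len(data)        # previous position with same prefix

    def _key(self, pos):
        return self.data[pos:pos + self.min_match]

    def insert(self, pos):
        k = self._key(pos)
        self.prev[pos] = self.head.get(k, -1)
        self.head[k] = pos

    def find(self, pos):
        """Longest (length, dist) match at pos among inserted positions, or None."""
        data, n = self.data, len(self.data)
        if pos + self.min_match > n:
            return None
        cand = self.head.get(self._key(pos), -1)
        best_len, best_dist, steps = 0, 0, 0
        while cand >= 0 and pos - cand <= self.window and steps < self.max_chain:
            l = 0
            while pos + l < n and data[cand + l] == data[pos + l]:
                l += 1
            if l > best_len:
                best_len, best_dist = l, pos - cand
            cand = self.prev[cand]
            steps += 1
        return (best_len, best_dist) if best_len >= self.min_match else None
```

Typical use is find-then-insert as the encoder walks the buffer; the `max_chain` cap is the usual speed/ratio knob in this kind of structure.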
 also, its quite possible (and probably the best) to implement optimal parsing with a sliding window, instead of fixed blocks2019-05-21 01:08:21
<FunkyBob> odd... thought I was doing sliding window... at least, to a degree... since I can only reference offsets up to 16bits away2019-05-21 01:14:29
<Shelwien> but your coder works with fixed memory buffers i think?2019-05-21 01:18:09
 yes, distances limited to window size is convenient for LZ with a sliding window, but there's no strict dependency2019-05-21 01:19:01
 like, its quite possible to make a sliding window LZ with support for distances longer than window size (via hashtable)2019-05-21 01:19:49
 and matchfinder with a sliding window is not the one which stops when distance becomes too big2019-05-21 01:22:00
 but the one which adjusts its structures while shifting the "window"2019-05-21 01:23:22
 so that same tables can be used for data stream of any size, potentially infinite2019-05-21 01:24:13
 and parsing optimizer is a separate part, it has its own data structures and all2019-05-21 01:25:51
 so its also possible to use a different design for it2019-05-21 01:26:23
 and most implementations actually use a fixed-block optimizer2019-05-21 01:27:05
 for example, LZMA optimizer has a fixed-size array (default size is 4k)2019-05-21 01:27:37
 which stores optimizer info per data byte (including possible matches of different lengths)2019-05-21 01:28:10
 so either when this array fills up2019-05-21 01:28:48
 or when there's matchlen > threshold2019-05-21 01:29:14
 lzma encoder stops and does the optimizer's backward pass2019-05-21 01:29:49
<FunkyBob> so what you're saying is LZMA is "optimal" only over a small window, though that window slides?2019-05-21 01:30:40
 or not?2019-05-21 01:31:37
<Shelwien> no, LZMA parsing optimizer doesn't really implement sliding window either2019-05-21 01:31:46
 though yes, its only approximately optimal for a small window2019-05-21 01:32:02
<FunkyBob> so far the more I look for a clear, detailed example of any of these algorithms, the more I wind up in a land of vague, hand-wavy stuff written by people for people who already know the same terms as them2019-05-21 01:32:34
<Shelwien> and although LZMA matchfinder _does_ work with an actual sliding window2019-05-21 01:32:35
<FunkyBob> making all of this very impenetrable2019-05-21 01:32:40
<Shelwien> true2019-05-21 01:32:50
<FunkyBob> instead of just a clear description of an algorithm2019-05-21 01:32:50
<Shelwien> LZMA is open-source, but completely undocumented2019-05-21 01:33:07
<FunkyBob> yeah, and I've seen the comments about its code by various people2019-05-21 01:33:27
<Shelwien> so only people who managed to write something similar would know how its done :)2019-05-21 01:33:32
<FunkyBob> but I'm just talking about _any_ optimal parsing algorithm... or, even anything beyond basic greedy and "one look ahead lazy"2019-05-21 01:34:11
 (as it is, my current engine uses N look-ahead lazy)2019-05-21 01:34:26
 it will keep looking ahead until the next match isn't "better"2019-05-21 01:34:47
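The one-step version of that lazy scheme can be sketched as follows. A minimal illustration: `find_match` stands in for whatever matchfinder the engine uses, and `brute_find` is a deliberately naive O(n^2) stand-in, not a real implementation:

```python
def brute_find(data, pos):
    # illustrative matchfinder: longest (length, dist) at pos, or None
    best = None
    for cand in range(pos):
        l = 0
        while pos + l < len(data) and data[cand + l] == data[pos + l]:
            l += 1
        if l > 0 and (best is None or l > best[0]):
            best = (l, pos - cand)
    return best

def lazy_parse(data, find_match, min_match=4):
    """One-step lazy parse: defer a match when the next position offers a longer one."""
    tokens, pos, n = [], 0, len(data)
    while pos < n:
        m = find_match(pos)
        if m is None or m[0] < min_match:
            tokens.append(("lit", data[pos]))
            pos += 1
            continue
        nxt = find_match(pos + 1) if pos + 1 < n else None
        if nxt is not None and nxt[0] > m[0]:
            tokens.append(("lit", data[pos]))   # defer: next match is better
            pos += 1
        else:
            tokens.append(("match", m[0], m[1]))
            pos += m[0]
    return tokens
```

The N-look-ahead variant described above just repeats the deferral check until the next match stops improving, instead of checking only once.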
<Shelwien> well, "any" is simpler, and i already explained it before2019-05-21 01:34:52
<FunkyBob> where?2019-05-21 01:35:04
<Shelwien> here, when you just made the channel, first day or so2019-05-21 01:35:23
<FunkyBob> what do you mean "any" is simpler? you mean any algo is simpler than LZMA?2019-05-21 01:35:24
<Shelwien> no, i mean optimizing for whole file2019-05-21 01:35:37
<FunkyBob> well, my history buffer doesn't go back that far2019-05-21 01:35:56
<Shelwien> is simpler than blockwise optimization like in lzma2019-05-21 01:35:57
 and optimizer with sliding window is yet more complicated2019-05-21 01:36:23
 anyway, for the whole file its simple2019-05-21 01:36:40
 1) find nearest match of each allowed length at each data byte2019-05-21 01:37:09
<FunkyBob> mmm?2019-05-21 01:37:09
 so at each point, I need to find match lengths, say, 4 to infinity?2019-05-21 01:37:51
 [obviously a 'practical' limit on upper bound]2019-05-21 01:38:03
<Shelwien> 2) update codelengths in optimizer array based on possible matches2019-05-21 01:38:10
<FunkyBob> so in step one that's the N-wide state vector cbloom speaks of?2019-05-21 01:38:48
<Shelwien> like price[pos+matchlen]=min(price[pos+matchlen],price[pos]+codelen(matchlen,dist))2019-05-21 01:39:09
 for each match2019-05-21 01:39:12
<FunkyBob> by step 2, do you basically mean calculate the price or gain for each possible match?2019-05-21 01:39:22
<Shelwien> and also remember the len, if new price is better2019-05-21 01:39:36
 2a) of course same has to be done for literal run prices in this case; normally its simpler since there's only one literal2019-05-21 01:40:24
 3) when whole file is processed, you look at price[end_of_file].len2019-05-21 01:41:18
 and generate tokens in backward order2019-05-21 01:41:58
<FunkyBob> so this feels like a slightly more detailed description of LZSS optimal coding that I've read elsewhere...2019-05-21 01:42:04
<Shelwien> like token[end_of_file-price[end_of_file].len] = { price[end_of_file].len, price[end_of_file].dist }2019-05-21 01:42:27
 and when you reach start of file, you have it all parsed2019-05-21 01:43:00
 and can encode the tokens2019-05-21 01:43:13
 ---2019-05-21 01:43:25
 and yes, its the same as LZSS optimal coding2019-05-21 01:43:33
 because the idea is just basic dynamic programming2019-05-21 01:43:45
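The three steps above (find matches, update prices forward, recover tokens backward) can be sketched end-to-end. A minimal illustration with made-up byte costs (1 byte per literal, 3 bytes per match; a real coder substitutes its own codelen, and `brute_matches` is a naive stand-in for a real matchfinder):

```python
def brute_matches(data, pos, window=1 << 16):
    # illustrative matchfinder: candidate (length, dist) pairs at pos
    out, best = [], 0
    for cand in range(max(0, pos - window), pos):
        l = 0
        while pos + l < len(data) and data[cand + l] == data[pos + l]:
            l += 1
        if l > best:
            out.append((l, pos - cand))
            best = l
    return out

def optimal_parse(data, find_matches, literal_cost=1, match_cost=3, min_match=4):
    n = len(data)
    INF = float("inf")
    price = [INF] * (n + 1)    # price[i] = cheapest encoding of data[:i]
    back = [None] * (n + 1)    # back[i] = (length, dist) of last token; dist 0 = literal
    price[0] = 0
    for pos in range(n):
        if price[pos] == INF:
            continue
        # step 2a: literal
        if price[pos] + literal_cost < price[pos + 1]:
            price[pos + 1] = price[pos] + literal_cost
            back[pos + 1] = (1, 0)
        # step 2: matches of all allowed lengths (a long match permits shorter uses)
        for length, dist in find_matches(pos):
            for l in range(min_match, length + 1):
                if price[pos] + match_cost < price[pos + l]:
                    price[pos + l] = price[pos] + match_cost
                    back[pos + l] = (l, dist)
    # step 3: backward pass from price[end_of_file]
    tokens, i = [], n
    while i > 0:
        length, dist = back[i]
        tokens.append((i - length, length, dist))
        i -= length
    tokens.reverse()
    return tokens, price[n]
```

Each token comes out as `(start, length, dist)`; the backward pass is exactly the `token[end_of_file - price[end_of_file].len] = ...` step described above.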
<FunkyBob> well, it certainly is clearer than most descriptions I've read before2019-05-21 01:43:52
<Shelwien> instead of trying all possible combinations of tokens2019-05-21 01:44:15
 we see that any pattern of tokens which is optimal to reach [pos]2019-05-21 01:45:14
 stays optimal for any longer file too2019-05-21 01:45:42
<FunkyBob> the part of the description I hadn't seen clearly mentioned before was storing matches for different lengths for a given position2019-05-21 01:45:51
<Shelwien> well, there're multiple ways to implement it2019-05-21 01:46:47
 like, it may be more clear to write a recursive function2019-05-21 01:46:59
 which would start at end-of-file2019-05-21 01:47:11
 and call itself to optimize part of it2019-05-21 01:47:32
 like, there's a limited choice of possible last token2019-05-21 01:47:59
 so we can write 2019-05-21 01:48:06
 opt(endoffile) = min( opt(endoffile-maxmatchlen..endoffile-minmatchlen) )2019-05-21 01:49:02
 and then2019-05-21 01:49:25
 an obvious optimization would be to cache known values of opt[x]2019-05-21 01:49:39
 since data is static2019-05-21 01:49:54
 so optimal compressed size for substrings data[0..x] only depends on x2019-05-21 01:50:24
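The recursive, memoized view sketched above can be written out directly. Same illustrative costs as before; `match_exists` is a hypothetical brute-force stand-in for a real matchfinder, and caching `opt(x)` is valid precisely because the data is static, so the answer depends only on x:

```python
from functools import lru_cache

def make_opt(data, min_match=4, literal_cost=1, match_cost=3):
    def match_exists(pos, length):
        # is there an earlier source for data[pos:pos+length]? (overlap allowed)
        return any(
            all(data[pos + i] == data[pos - d + i] for i in range(length))
            for d in range(1, pos + 1)
        )

    @lru_cache(maxsize=None)
    def opt(x):
        """Cheapest cost of encoding data[:x] under the toy cost model."""
        if x == 0:
            return 0
        best = opt(x - 1) + literal_cost            # last token: a literal
        for length in range(min_match, x + 1):      # last token: a match
            if match_exists(x - length, length):
                best = min(best, opt(x - length) + match_cost)
        return best

    return opt
```

This is the `opt(endoffile) = min(opt(endoffile - len) + codelen)` recurrence with `lru_cache` doing the "cache known values of opt[x]" step; the forward price-array version is the same DP run iteratively.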
<FunkyBob> ok... well, I'm going to have to put this aside for now, until my duties for $DAYJOB are over2019-05-21 01:51:59
<Shelwien> :)2019-05-21 01:55:12
 !next2019-05-21 01:55:29