*** complogger has joined the channel | 2019-05-19 09:16:34 |
<A barely detectible trace of hope> | Who knows - it just could be possible... </A barely detectible trace of hope> ||| IF YOU ASK A QUESTION, PLEASE REMAIN IN THE CHANNEL, IT MAY TAKE HOURS TO REPLY (AS TheStar IS OFTEN AFK), BUT HE WILL DO HIS BEST TO ANSWER... ||| Remembe". | 2019-05-19 09:16:34 |
<Shelwien> | test | 2019-05-19 09:17:30 |
<TheWolf> | Woof | 2019-05-19 09:25:10 |
<FunkyBob> | o/ | 2019-05-19 23:45:06 |
*** schnaader has joined the channel | 2019-05-20 04:45:08 |
| huh... small win... just edged out lz4 for compressing a 10MB file of 00 :P | 2019-05-20 04:45:58 |
<schnaader> | sounds trivial for most, but when coding a matchfinder, this case is the worst - matches everywhere ^^ | 2019-05-20 04:48:06 |
| or do you mean you optimized ratio? | 2019-05-20 04:48:33 |
<FunkyBob> | LZ4 got 41178 bytes | 2019-05-20 04:48:52 |
<schnaader> | btw, forum is down? | 2019-05-20 04:49:01 |
<FunkyBob> | I got 41137 bytes | 2019-05-20 04:49:02 |
| oh, is it? | 2019-05-20 04:49:07 |
<schnaader> | perhaps it's just me, but I can't access atm | 2019-05-20 04:49:21 |
<FunkyBob> | seems down from here, too | 2019-05-20 04:49:31 |
<schnaader> | hm, 40K still sounds a lot. is the result still compressible? | 2019-05-20 04:50:33 |
| result of some block size, perhaps? 64K * 4 * 40000 = 10 MB or something like this | 2019-05-20 04:52:16 |
<FunkyBob> | oh, sure... I mean, bzip2 is the current winner, achieving just 49 bytes | 2019-05-20 04:52:19 |
<schnaader> | Reminds me of the FLIF blog about encoding single color images: https://cloudinary.com/blog/a_one_color_image_is_worth_two_thousand_words | 2019-05-20 04:53:18 |
<FunkyBob> | for comparison: https://dpaste.de/hgh9 | 2019-05-20 04:53:49 |
| since neither mine nor lz4 have an entropy coder, they really can't measure up | 2019-05-20 04:54:59 |
<schnaader> | ah ok, that explains things | 2019-05-20 04:55:48 |
| on the other hand, worst case is 0.075%, so who cares ^^ any other real data will be much bigger | 2019-05-20 04:55:49 |
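As a rough back-of-envelope check of those numbers (assuming "10MB" here means 10,485,760 bytes, and the standard LZ4 sequence layout where match lengths past the token's 4-bit field grow by one extension byte per 255 bytes of match): encoding one giant all-zero run takes roughly 10,485,760 / 255 ≈ 41,120 length-extension bytes, which lines up with the ~41K results above. Without an entropy coder that long run of near-identical 0xFF length bytes is stored verbatim, whereas bzip2's run-length and entropy stages collapse the whole file to a few dozen bytes.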
<FunkyBob> | am just looking back at my various experiments and attempts | 2019-05-20 05:01:31 |
| I really need to make a better parser :) | 2019-05-20 05:02:53 |
| or, stop making silly mistakes... just gained a small boost | 2019-05-20 05:12:24 |
| by removing something that I should have removed when I stopped doing ROLZ | 2019-05-20 05:12:40 |
*** schnaader has left the channel | 2019-05-20 05:13:04 |
| improved enwik8 from 43375365 to 43371274 | 2019-05-20 05:14:42 |
*** danlock has joined the channel | 2019-05-20 16:08:07 |
<danlock> | What's new today, friends? | 2019-05-20 16:09:31 |
*** danlock has left the channel | 2019-05-20 16:18:26 |
*** danlock has joined the channel | 2019-05-20 18:23:56 |
| "A barely detectible trace of hope" is all that is necessary for an expansion of hope, a spark of inspiration, a desire to keep trying to develop the next improvement on anything, whether it's an idea or algorithm or thought or merely a desire to survive. | 2019-05-20 18:32:29 |
| I might be popping in and out as I set up this client to become a replacement for the WebChat in-browser method of connection I was using before. </unnecessary comment> | 2019-05-20 18:43:11 |
<Shelwien> | LZ4 -3 is 43,893,379 | 2019-05-20 19:57:14 |
| so next is -4? :) | 2019-05-20 19:57:36 |
| looks like forum is still down :( | 2019-05-20 19:58:32 |
<danlock> | I noticed :( | 2019-05-20 20:04:22 |
<FunkyBob> | Shelwien: yeah, well, so I've played with your ideas on rep-offset... and, at least for enwik8, it's a net loss because of the prevalence of short runs | 2019-05-20 21:30:06 |
| ISTM better/smarter parsing is the way forward | 2019-05-20 21:30:23 |
<Shelwien> | well, in enwik8 lzma found these: | 2019-05-20 22:05:53 |
| 215: <T1?:??:> | 2019-05-20 22:05:56 |
| 172: <T0?:??:> | 2019-05-20 22:05:56 |
| 151: <\x3C/id\x3E\n \x3Ctimestamp\x3E2006-0?-> | 2019-05-20 22:05:56 |
| 145: <гѓ?г> | 2019-05-20 22:05:56 |
| 143: <T1?:> | 2019-05-20 22:05:56 |
| 142: <\x3C/id\x3E\n \x3Crevision\x3E\n \x3Cid\x3E??9> | 2019-05-20 22:05:56 |
| 123: <\x3C/id\x3E\n \x3Ctimestamp\x3E2006-03-0?T??:??:> | 2019-05-20 22:05:57 |
| 120: <:5?:> | 2019-05-20 22:05:57 |
| 119: <\x3C/id\x3E\n \x3Ctimestamp\x3E2006-02-??T??:??:> | 2019-05-20 22:05:58 |
| 118: <\x3C/id\x3E\n \x3Ctimestamp\x3E200?-> | 2019-05-20 22:05:58 |
| 104: <-0?-> | 2019-05-20 22:05:59 |
| 103: <.1??.> | 2019-05-20 22:05:59 |
| It's output from my fuzzy match detector, "?" are literals | 2019-05-20 22:06:54 |
| so yeah, you probably forgot to reduce minmatchlen for rep-matches, or something | 2019-05-20 22:07:34 |
<FunkyBob> | I did adjust my code to calculate the "gain" for each match | 2019-05-20 22:10:19 |
| [something I used to have, but removed when I went to fixed length encoding] | 2019-05-20 22:10:35 |
| so... tips on improved parsing? :) | 2019-05-20 22:18:24 |
| (in case you hadn't figured it out, I'm back on UTC+10) | 2019-05-20 22:19:56 |
<Shelwien> | well, just implement optimal parsing? its pretty simple for LZ4 | 2019-05-20 22:38:07 |
| unlike formats with entropy coding | 2019-05-20 22:38:21 |
<FunkyBob> | yes, but which optimal parsing? :) | 2019-05-20 22:38:51 |
| true, it's simpler because of no entropy coding | 2019-05-20 22:39:02 |
| I expect it also to be much slower | 2019-05-20 22:39:11 |
| since it has to do a match hunt at every byte | 2019-05-20 22:39:19 |
| I can possibly still get some wins for edge cases between match length encodings | 2019-05-20 22:39:37 |
* danlock thinks, "Entropy...." | 2019-05-20 22:58:25 |
*** danlock has left the channel | 2019-05-20 23:45:46 |
*** danlock has joined the channel | 2019-05-21 00:38:02 |
*** danlock has left the channel | 2019-05-21 00:40:09 |
*** danlock has joined the channel | 2019-05-21 00:40:09 |
*** danlock has left the channel | 2019-05-21 00:44:15 |
<Shelwien> | well, hashchain may be too slow for optimal parsing, yes | 2019-05-21 01:03:58 |
| basically you have to find nearest matches of all lengths at each byte | 2019-05-21 01:04:41 |
| but there're more efficient solutions for this | 2019-05-21 01:06:15 |
| like suffix tree, lzma's bt4 or BWT suffix array | 2019-05-21 01:06:49 |
| also, its quite possible (and probably the best) to implement optimal parsing with a sliding window, instead of fixed blocks | 2019-05-21 01:08:21 |
<FunkyBob> | odd... thought I was doing sliding window... at least, to a degree... since I can only reference offsets up to 16bits away | 2019-05-21 01:14:29 |
<Shelwien> | but your coder works with fixed memory buffers i think? | 2019-05-21 01:18:09 |
| yes, distances limited to window size is convenient for LZ with a sliding window, but there's no strict dependency | 2019-05-21 01:19:01 |
| like, its quite possible to make a sliding window LZ with support for distances longer than window size (via hashtable) | 2019-05-21 01:19:49 |
| and matchfinder with a sliding window is not the one which stops when distance becomes too big | 2019-05-21 01:22:00 |
| but the one which adjusts its structures while shifting the "window" | 2019-05-21 01:23:22 |
| so that same tables can be used for data stream of any size, potentially infinite | 2019-05-21 01:24:13 |
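A minimal sketch (in C, with illustrative names and sizes, not the code under discussion) of a hash-chain matchfinder arranged the way described here: the tables are indexed modulo the window, so old entries simply expire as the stream advances and the same fixed-size structures keep working for arbitrarily long input. For simplicity it assumes the whole input sits in buf and 32-bit positions don't wrap; a true streaming version would read from a ring buffer and periodically rebase positions.

    #include <stdint.h>
    #include <string.h>

    #define WINDOW_BITS 16
    #define WINDOW_SIZE (1u << WINDOW_BITS)   /* 64 KiB sliding window */
    #define HASH_BITS   15
    #define HASH_SIZE   (1u << HASH_BITS)
    #define MIN_MATCH   4
    #define NO_POS      0xFFFFFFFFu

    typedef struct {
        uint32_t head[HASH_SIZE];    /* most recent position seen for each hash   */
        uint32_t prev[WINDOW_SIZE];  /* chain links, indexed by position % window */
    } MatchFinder;

    static uint32_t hash4(const uint8_t *p) {
        uint32_t v;
        memcpy(&v, p, 4);
        return (v * 2654435761u) >> (32 - HASH_BITS);
    }

    static void mf_init(MatchFinder *mf) {
        for (uint32_t i = 0; i < HASH_SIZE; i++)   mf->head[i] = NO_POS;
        for (uint32_t i = 0; i < WINDOW_SIZE; i++) mf->prev[i] = NO_POS;
    }

    /* Insert position 'pos' (absolute stream offset). Because prev[] is indexed
       by pos % WINDOW_SIZE, entries older than one window get overwritten as the
       stream advances; nothing is ever reset at block boundaries. */
    static void mf_insert(MatchFinder *mf, const uint8_t *buf, uint32_t pos) {
        uint32_t h = hash4(buf + pos);
        mf->prev[pos & (WINDOW_SIZE - 1)] = mf->head[h];
        mf->head[h] = pos;
    }

    /* Longest match for buf[pos..pos+avail) within the window (caller ensures
       avail >= MIN_MATCH). Call this before mf_insert() for the same position. */
    static uint32_t mf_find(const MatchFinder *mf, const uint8_t *buf,
                            uint32_t pos, uint32_t avail, uint32_t *best_dist) {
        uint32_t best_len = 0;
        uint32_t cand = mf->head[hash4(buf + pos)];
        while (cand != NO_POS && pos - cand < WINDOW_SIZE) {
            uint32_t len = 0;
            while (len < avail && buf[cand + len] == buf[pos + len]) len++;
            if (len >= MIN_MATCH && len > best_len) {
                best_len   = len;
                *best_dist = pos - cand;
            }
            cand = mf->prev[cand & (WINDOW_SIZE - 1)];  /* strictly older position */
        }
        return best_len;
    }

Typical use is mf_find() followed by mf_insert() at each position (optionally skipping inserts inside long matches for speed); distances come out bounded by WINDOW_SIZE, matching the 16-bit offsets mentioned above.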
| and parsing optimizer is a separate part, it has its own data structures and all | 2019-05-21 01:25:51 |
| so it's also possible to use a different design for it | 2019-05-21 01:26:23 |
| and most implementations actually use a fixed-block optimizer | 2019-05-21 01:27:05 |
| for example, LZMA optimizer has a fixed-size array (default size is 4k) | 2019-05-21 01:27:37 |
| which stores optimizer info per data byte (including possible matches of different lengths) | 2019-05-21 01:28:10 |
| so either when this array fills up | 2019-05-21 01:28:48 |
| or when there's matchlen > threshold | 2019-05-21 01:29:14 |
| lzma encoder stops and does the optimizer's backward pass | 2019-05-21 01:29:49 |
<FunkyBob> | so what you're saying is LZMA is "optimal" only over a small window, though that window slides? | 2019-05-21 01:30:40 |
| or not? | 2019-05-21 01:31:37 |
<Shelwien> | no, LZMA parsing optimizer doesn't really implement sliding window either | 2019-05-21 01:31:46 |
| though yes, its only approximately optimal for a small window | 2019-05-21 01:32:02 |
<FunkyBob> | so far the more I look for a clear, detailed example of any of these algorithms, the more I wind up in a land of vague, hand-wavy stuff written by people for people who already know the same terms as them | 2019-05-21 01:32:34 |
<Shelwien> | and although LZMA matchfinder _does_ work with an actual sliding window | 2019-05-21 01:32:35 |
<FunkyBob> | making all of this very impenetrable | 2019-05-21 01:32:40 |
<Shelwien> | true | 2019-05-21 01:32:50 |
<FunkyBob> | instead of just a clear description of an algorithm | 2019-05-21 01:32:50 |
<Shelwien> | LZMA is open-source, but completely undocumented | 2019-05-21 01:33:07 |
<FunkyBob> | yeah, and I've seen the comments about its code by various people | 2019-05-21 01:33:27 |
<Shelwien> | so only people who managed to write something similar would know how its done :) | 2019-05-21 01:33:32 |
<FunkyBob> | but I'm just talking about _any_ optimal parsing algorithm... or, even anything beyond basic greedy and "one look ahead lazy" | 2019-05-21 01:34:11 |
| (as it is, my current engine uses N look-ahead lazy) | 2019-05-21 01:34:26 |
| it will keep looking ahead until the next match isn't "better" | 2019-05-21 01:34:47 |
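For reference, a generic sketch of the "keep deferring while the match one byte later is better" strategy described here; this is not FunkyBob's engine, and MIN_MATCH, MAX_DEFER, the brute-force matchfinder and the emit functions are all illustrative placeholders.

    #include <stdint.h>
    #include <stdio.h>

    #define MIN_MATCH 4
    #define MAX_DEFER 8            /* the "N" in N-step look-ahead, chosen arbitrarily */

    typedef struct { uint32_t len, dist; } Match;

    /* Brute-force longest-match search, only so the sketch is self-contained;
       a real engine would use a hash chain or similar. */
    static Match find_match(const uint8_t *buf, uint32_t pos, uint32_t n) {
        Match best = {0, 0};
        for (uint32_t cand = 0; cand < pos; cand++) {
            uint32_t len = 0;
            while (pos + len < n && buf[cand + len] == buf[pos + len]) len++;
            if (len > best.len) { best.len = len; best.dist = pos - cand; }
        }
        return best;
    }

    static void emit_literal(uint8_t b)                 { printf("LIT  %02x\n", b); }
    static void emit_match(uint32_t len, uint32_t dist) { printf("MATCH len=%u dist=%u\n", len, dist); }

    void lazy_parse(const uint8_t *buf, uint32_t n) {
        uint32_t pos = 0;
        while (pos < n) {
            Match cur = find_match(buf, pos, n);
            if (cur.len < MIN_MATCH) {        /* nothing usable here: plain literal */
                emit_literal(buf[pos++]);
                continue;
            }
            /* Lazy step: while the match starting one byte later is strictly
               better (up to MAX_DEFER deferrals), emit a literal and adopt the
               later match instead. */
            uint32_t deferred = 0;
            while (pos + 1 < n && deferred < MAX_DEFER) {
                Match next = find_match(buf, pos + 1, n);
                if (next.len <= cur.len) break;
                emit_literal(buf[pos++]);
                cur = next;
                deferred++;
            }
            emit_match(cur.len, cur.dist);
            pos += cur.len;
        }
    }

Greedy parsing is the MAX_DEFER = 0 case; classic one-step lazy matching is MAX_DEFER = 1.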
<Shelwien> | well, "any" is simpler, and i already explained it before | 2019-05-21 01:34:52 |
<FunkyBob> | where? | 2019-05-21 01:35:04 |
<Shelwien> | here, when you just made the channel, first day or so | 2019-05-21 01:35:23 |
<FunkyBob> | what do you mean "any" is simpler? you mean any algo is simpler than LZMA? | 2019-05-21 01:35:24 |
<Shelwien> | no, i mean optimizing for whole file | 2019-05-21 01:35:37 |
<FunkyBob> | well, my history buffer doesn't go back that far | 2019-05-21 01:35:56 |
<Shelwien> | is simpler than blockwise optimization like in lzma | 2019-05-21 01:35:57 |
| and optimizer with sliding window is yet more complicated | 2019-05-21 01:36:23 |
| anyway, for the whole file its simple | 2019-05-21 01:36:40 |
| 1) find nearest match of each allowed length at each data byte | 2019-05-21 01:37:09 |
<FunkyBob> | mmm? | 2019-05-21 01:37:09 |
| so at each point, I need to find match lengths, say, 4 to infinity? | 2019-05-21 01:37:51 |
| [obviously a 'practical' limit on upper bound] | 2019-05-21 01:38:03 |
<Shelwien> | 2) update codelengths in optimizer array based on possible matches | 2019-05-21 01:38:10 |
<FunkyBob> | so in step one that's the N-wide state vector cbloom speaks of? | 2019-05-21 01:38:48 |
<Shelwien> | like price[pos+matchlen]=min(price[pos+matchlen],price[pos]+codelen(matchlen,dist)) | 2019-05-21 01:39:09 |
| for each match | 2019-05-21 01:39:12 |
<FunkyBob> | by step 2, do you basically mean calculate the price or gain for each possible match? | 2019-05-21 01:39:22 |
<Shelwien> | and also remember the len, if new price is better | 2019-05-21 01:39:36 |
| 2a) of course same has to be done for literal run prices in this case; normally its simpler since there's only one literal | 2019-05-21 01:40:24 |
| 3) when whole file is processed, you look at price[end_of_file].len | 2019-05-21 01:41:18 |
| and generate tokens in backward order | 2019-05-21 01:41:58 |
<FunkyBob> | so this feels like a slightly more detailed description of LZSS optimal coding that I've read elsewhere... | 2019-05-21 01:42:04 |
<Shelwien> | like token[end_of_file-price[end_of_file].len] = { price[end_of_file].len, price[end_of_file].dist } | 2019-05-21 01:42:27 |
| and when you reach start of file, you have it all parsed | 2019-05-21 01:43:00 |
| and can encode the tokens | 2019-05-21 01:43:13 |
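Putting those three steps together, a self-contained sketch of the whole-file optimal parse in C, with a brute-force matchfinder and a made-up flat cost model (1 byte per literal, 3 + len/255 bytes per match, roughly LZ4-shaped). With flat byte costs the distance doesn't affect the price, so the longest match's distance is reused for all shorter lengths, and literal-run headers are ignored; this is the dynamic program spelled out, not anyone's production code.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MIN_MATCH 4

    typedef struct {
        uint32_t cost;   /* cheapest cost (bytes) to encode buf[0..pos)            */
        uint32_t len;    /* length of the last token on that path (1 for literal)  */
        uint32_t dist;   /* distance of that token, 0 for a literal                */
    } Node;

    /* illustrative flat cost model: 3-byte match token plus one extension
       byte per 255 bytes of match length */
    static uint32_t match_cost(uint32_t len) { return 3 + len / 255; }

    void optimal_parse(const uint8_t *buf, uint32_t n) {
        Node *price = calloc(n + 1, sizeof(Node));
        for (uint32_t i = 1; i <= n; i++) price[i].cost = UINT32_MAX;

        /* forward pass: relax the literal edge and every match edge leaving pos */
        for (uint32_t pos = 0; pos < n; pos++) {
            if (price[pos].cost + 1 < price[pos + 1].cost) {          /* literal */
                price[pos + 1].cost = price[pos].cost + 1;
                price[pos + 1].len  = 1;
                price[pos + 1].dist = 0;
            }
            /* step 1: matches at pos (brute force here; any matchfinder works) */
            uint32_t best_len = 0, best_dist = 0;
            for (uint32_t cand = 0; cand < pos; cand++) {
                uint32_t len = 0;
                while (pos + len < n && buf[cand + len] == buf[pos + len]) len++;
                if (len > best_len) { best_len = len; best_dist = pos - cand; }
            }
            /* step 2: price[pos+len] = min(price[pos+len], price[pos] + codelen) */
            for (uint32_t len = MIN_MATCH; len <= best_len; len++) {
                uint32_t c = price[pos].cost + match_cost(len);
                if (c < price[pos + len].cost) {
                    price[pos + len].cost = c;
                    price[pos + len].len  = len;
                    price[pos + len].dist = best_dist;
                }
            }
        }

        /* step 3: backward pass from price[n], recovering tokens end-to-start */
        for (uint32_t pos = n; pos > 0; pos -= price[pos].len) {
            uint32_t start = pos - price[pos].len;
            if (price[pos].dist)
                printf("token at %u: match len=%u dist=%u\n",
                       start, price[pos].len, price[pos].dist);
            else
                printf("token at %u: literal\n", start);
        }
        free(price);
    }

With an entropy coder the shape stays the same; the difference is that codelen() has to come from (estimated) model statistics instead of a fixed byte count, which is why the LZ4 case is called simple above.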
| --- | 2019-05-21 01:43:25 |
| and yes, its the same as LZSS optimal coding | 2019-05-21 01:43:33 |
| because the idea is just basic dynamic programming | 2019-05-21 01:43:45 |
<FunkyBob> | well, it certainly is clearer than most descriptions I've read before | 2019-05-21 01:43:52 |
<Shelwien> | instead of trying all possible combinations of tokens | 2019-05-21 01:44:15 |
| we see that any pattern of tokens which is optimal to reach [pos] | 2019-05-21 01:45:14 |
| stays optimal for any longer file too | 2019-05-21 01:45:42 |
<FunkyBob> | the part of the description I hadn't seen clearly mentioned before was storing matches for different lengths for a given position | 2019-05-21 01:45:51 |
<Shelwien> | well, there're multiple ways to implement it | 2019-05-21 01:46:47 |
| like, it may be more clear to write a recursive function | 2019-05-21 01:46:59 |
| which would start at end-of-file | 2019-05-21 01:47:11 |
| and call itself to optimize part of it | 2019-05-21 01:47:32 |
| like, there's a limited choice of possible last token | 2019-05-21 01:47:59 |
| so we can write | 2019-05-21 01:48:06 |
| opt(endoffile) = min( opt(endoffile-maxmatchlen..endoffile-minmatchlen) ) | 2019-05-21 01:49:02 |
| and then | 2019-05-21 01:49:25 |
| an obvious optimization would be to cache known values of opt[x] | 2019-05-21 01:49:39 |
| since data is static | 2019-05-21 01:49:54 |
| so optimal compressed size for substrings data[0..x] only depends on x | 2019-05-21 01:50:24 |
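The same idea written the recursive way just described: opt(x) is the cheapest encoding of the first x bytes, found by trying every possible last token and caching the result, since for static data it depends only on x. The cost model and the brute-force match helper are the same illustrative assumptions as in the sketch above; a real coder would iterate rather than recurse to avoid deep stacks.

    #include <stdint.h>
    #include <stdlib.h>

    #define MIN_MATCH 4
    #define UNKNOWN   UINT32_MAX

    /* longest match available for buf[start..n) against earlier data
       (brute force, just to keep the sketch self-contained) */
    static uint32_t longest_match_at(const uint8_t *buf, uint32_t start, uint32_t n) {
        uint32_t best = 0;
        for (uint32_t cand = 0; cand < start; cand++) {
            uint32_t len = 0;
            while (start + len < n && buf[cand + len] == buf[start + len]) len++;
            if (len > best) best = len;
        }
        return best;
    }

    /* opt(x): cheapest size of the first x bytes; memo[] caches known values */
    static uint32_t opt(const uint8_t *buf, uint32_t n, uint32_t x, uint32_t *memo) {
        if (x == 0) return 0;
        if (memo[x] != UNKNOWN) return memo[x];

        /* the last token is either a single literal ... */
        uint32_t best = opt(buf, n, x - 1, memo) + 1;

        /* ... or a match of some allowed length ending exactly at x */
        for (uint32_t len = MIN_MATCH; len <= x; len++) {
            if (longest_match_at(buf, x - len, n) < len) continue;  /* no such match */
            uint32_t c = opt(buf, n, x - len, memo) + 3 + len / 255;
            if (c < best) best = c;
        }
        return memo[x] = best;
    }

    uint32_t optimal_size(const uint8_t *buf, uint32_t n) {
        uint32_t *memo = malloc((n + 1) * sizeof(uint32_t));
        for (uint32_t i = 0; i <= n; i++) memo[i] = UNKNOWN;
        uint32_t total = opt(buf, n, n, memo);
        free(memo);
        return total;
    }

Recovering the actual tokens just needs the winning len (and dist) remembered per x, exactly like the price[] array in the iterative version.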
<FunkyBob> | ok... well, I'm going to have to put this aside for now, until my duties for $DAYJOB are over | 2019-05-21 01:51:59 |
<Shelwien> | :) | 2019-05-21 01:55:12 |
| !next | 2019-05-21 01:55:29 |