*** pinc has left the channel | 2009-12-05 17:49:20 |
*** schnaader has left the channel | 2009-12-05 17:52:19 |
*** toffer_ has joined the channel | 2009-12-05 17:59:29 |
*** toffer has left the channel | 2009-12-05 18:01:58 |
*** mike_____ has joined the channel | 2009-12-05 18:19:39 |
*** schnaader has joined the channel | 2009-12-05 18:31:10 |
*** schnaader has left the channel | 2009-12-05 19:13:55 |
<mike_____> | toffer: here is the result of ccmx on enwik8 on my system: | 2009-12-05 19:28:14 |
| 97656.25 KiB -> 21138.73 KiB (ratio 21.65%, speed 871 KiB/s) | 2009-12-05 19:28:16 |
<Shelwien> | and m1? | 2009-12-05 19:29:09 |
| with similar memory setting? | 2009-12-05 19:29:24 |
<mike_____> | Allocated 196999 kB. | 2009-12-05 19:32:19 |
| Encoding: 21286803/ 100000000 bytes (1.703 bpc), 76.68 s (1304 Kb/s) | 2009-12-05 19:32:19 |
| ccmx allocated 146MB | 2009-12-05 19:32:42 |
<Shelwien> | still, it seems better and faster which is good | 2009-12-05 19:33:09 |
*** Shelwien has left the channel | 2009-12-05 23:03:59 |
*** Shelwien has joined the channel | 2009-12-05 23:04:42 |
<mike_____> | btw, what does compbooks do? | 2009-12-05 23:05:51 |
*** compbooks has left the channel | 2009-12-05 23:23:36 |
*** mike_____ has left the channel | 2009-12-05 23:23:37 |
*** compbooks has joined the channel | 2009-12-05 23:24:14 |
* compbooks eats people's trouts | 2009-12-05 23:26:52 |
* Krugz slaps compbooks around a bit with a large trout | 2009-12-05 23:28:22 |
* compbooks eats the trout | 2009-12-05 23:59:53 |
<Krugz> | lol slow | 2009-12-06 00:00:05 |
* compbooks slept | 2009-12-06 00:00:18 |
| pff sleep | 2009-12-06 00:00:34 |
<Shelwien> | actually its an iroffer bot | 2009-12-06 00:00:43 |
| xdcc list and whatever else | 2009-12-06 00:01:13 |
<Krugz> | oic | 2009-12-06 00:01:19 |
*** Krugz has left the channel | 2009-12-06 00:13:40 |
*** Krugz has joined the channel | 2009-12-06 00:43:31 |
*** Skymmer has joined the channel | 2009-12-06 01:27:53 |
<Skymmer> | Hi dudes | 2009-12-06 01:30:17 |
<Shelwien> | hi | 2009-12-06 01:30:22 |
<Skymmer> | Ah :) You won't spoof me. It's not you, it's your bot :)) | 2009-12-06 01:31:40 |
| Fast reaction | 2009-12-06 01:31:57 |
* Shelwien is thinking that it means that he's a "dude" now | 2009-12-06 01:32:51 |
| Nice ;} | 2009-12-06 01:33:30 |
| Anyway, I'm here to do the thing that I don't like | 2009-12-06 01:33:40 |
<Shelwien> | ? | 2009-12-06 01:33:58 |
<Skymmer> | To ask you to do something... well... listen | 2009-12-06 01:34:55 |
| There is one program which I use. It's kinda slow, but there are sources for it. So I thought somebody experienced with the Intel Compiler could compile it. It could also be a good test of IC's capabilities. | 2009-12-06 01:38:04 |
<Shelwien> | well, i can try probably | 2009-12-06 01:38:27 |
<Skymmer> | Damn... I feel bad asking. Maybe you're busy? *shame* | 2009-12-06 01:39:31 |
<Shelwien> | not atm | 2009-12-06 01:39:41 |
<Skymmer> | OK | 2009-12-06 01:39:46 |
<Shelwien> | i mean not busy atm | 2009-12-06 01:39:58 |
<Skymmer> | http://omion.dyndns.org/mp3packer/mp3packer-1.20_src.zip | 2009-12-06 01:40:29 |
| More details (if needed) at: | 2009-12-06 01:40:29 |
| http://www.hydrogenaudio.org/forums/index.php?showtopic=32379 | 2009-12-06 01:40:29 |
<Shelwien> | that's ocaml, not C | 2009-12-06 01:41:30 |
| so its unrelated to IntelC or whatever | 2009-12-06 01:43:44 |
| but btw mp3zip can do something similar too | 2009-12-06 01:43:55 |
<Skymmer> | Sad. How about this: | 2009-12-06 01:45:28 |
| http://www.fftw.org/fftw-3.2.2.tar.gz | 2009-12-06 01:45:28 |
| http://www.fftw.org/fftw-3.3alpha1.tar.gz | 2009-12-06 01:45:28 |
<Shelwien> | not sure what you want to get out of that | 2009-12-06 01:48:33 |
<Skymmer> | libfftw3-3.dll | 2009-12-06 01:48:41 |
<Shelwien> | well, i'd better not, i guess | 2009-12-06 01:49:07 |
| its probably possible, but i'd need something to check it with etc - issues are possible | 2009-12-06 01:49:44 |
| "Intel C: you can also use the Intel compilers under VC++ (see below). This may produce marginally faster code than the GNU C compiler, but is probably not worth it for most users. Be cautious with the compiler flags—turning on every optimization under the sun usually makes FFTW slower." | 2009-12-06 01:50:54 |
| http://www.ece.cmu.edu/~franzf/fftw.org/ | 2009-12-06 01:51:19 |
| ;) | 2009-12-06 01:51:20 |
<Skymmer> | Damn :) | 2009-12-06 01:54:05 |
| Ok, last trial | 2009-12-06 01:54:20 |
| http://files.monkeysaudio.com/MAC_SDK_406.zip | 2009-12-06 01:54:24 |
<Shelwien> | dll again or what? | 2009-12-06 01:54:56 |
<Skymmer> | mac.exe | 2009-12-06 01:55:09 |
| Ehhh... Source\Console\ I presume | 2009-12-06 01:59:22 |
| Lame with it | 2009-12-06 01:59:32 |
<Shelwien> | well, dunno | 2009-12-06 02:14:16 |
| i tried but there're syntax errors now | 2009-12-06 02:14:32 |
| "pointer to incomplete class not allowed" etc | 2009-12-06 02:14:44 |
<Skymmer> | No problem. | 2009-12-06 02:18:52 |
| BTW, what does your phrase mean: "but btw mp3zip can do something similar too"? | 2009-12-06 02:19:32 |
<Shelwien> | CBR->VBR | 2009-12-06 02:19:50 |
| you have the console mp3zip, right? | 2009-12-06 02:19:59 |
<Skymmer> | sure | 2009-12-06 02:20:04 |
| -c ? | 2009-12-06 02:20:15 |
<Shelwien> | you can run it like mp3zip -c 1.mp3 1.mpx; mp3zip -d 1.mpx 1unp.mp3 | 2009-12-06 02:20:16 |
| yeah | 2009-12-06 02:20:19 |
<Skymmer> | Oh no... | 2009-12-06 02:20:27 |
<Shelwien> | its not that smart though | 2009-12-06 02:20:40 |
| would just discard the LAME tag or something | 2009-12-06 02:20:52 |
| well, i can make it do it in a single pass too | 2009-12-06 02:21:22 |
| not that there's any sense to | 2009-12-06 02:22:02 |
<Skymmer> | not only. The problem is that the OUT file has no correct Xing VBR info, so its length is shown incorrectly: | 2009-12-06 02:25:46 |
| Processed: 6:02:10 | 2009-12-06 02:25:46 |
| Original: 0:54:12 | 2009-12-06 02:25:46 |
| and more: | 2009-12-06 02:26:07 |
<Shelwien> | well, as i said, it wasn't an intentional feature anyway ;) | 2009-12-06 02:26:43 |
<Skymmer> | foobar's "Differences found in 1 out of 1 track pairs. | 2009-12-06 02:26:51 |
| Comparing: | 2009-12-06 02:26:51 |
| "C:\SHIT\SOFT\ARC\_Shelwien\mp3zip\out.mp3" | 2009-12-06 02:26:51 |
| "C:\SHIT\SOFT\ARC\_Shelwien\mp3zip\Test.mp3" | 2009-12-06 02:26:51 |
| Length mismatch : 54:12.520249 vs 54:12.453333, 143436143 vs 143433192 samples | 2009-12-06 02:26:51 |
| " | 2009-12-06 02:26:51 |
| its foobar's "bit-compare" results | 2009-12-06 02:27:24 |
*** Shelwien has left the channel | 2009-12-06 02:28:23 |
*** Shelwien has joined the channel | 2009-12-06 02:28:28 |
<Shelwien> | but still, my packed mp3s might be different in size from mp3pack | 2009-12-06 02:28:36 |
<Skymmer> | Ehhh... don't look at the "SHIT" folder name. It's my name for the Download folder :)) | 2009-12-06 02:29:13 |
<STalKer-Y> | i thought it was an abbreviation | 2009-12-06 02:30:41 |
<Skymmer> | I don't mean the size. I mean the content. | 2009-12-06 02:31:14 |
<Shelwien> | the content is the same | 2009-12-06 02:31:45 |
| that "length mismatch" is probably due to fixed frame size in mp3 | 2009-12-06 02:32:07 |
| so actual PCM size is in framesize increments | 2009-12-06 02:32:52 |
| but maybe the LAME tag contains the precise source file length | 2009-12-06 02:33:09 |
| and mp3zip discards that | 2009-12-06 02:33:16 |
<Skymmer> | I think its because the LAME gapless info introduced in 3.90.3 is missing in the OUT file | 2009-12-06 02:33:23 |
| ah yes | 2009-12-06 02:33:28 |
<Shelwien> | anyway it keeps the tag in default mode | 2009-12-06 02:33:56 |
| so its ok ;) | 2009-12-06 02:34:00 |
<Skymmer> | out.mp3 86 513 998 | 2009-12-06 02:34:24 |
| Test.mp3 85 225 904 | 2009-12-06 02:34:24 |
<Shelwien> | but i guess i'd have to find the description for that tag the next time | 2009-12-06 02:34:32 |
<Skymmer> | damn | 2009-12-06 02:34:47 |
| out.mp3 86 513 998 | 2009-12-06 02:34:47 |
| Test.mp3 85 225 904 | 2009-12-06 02:34:48 |
<Shelwien> | expanded it? | 2009-12-06 02:35:00 |
<toffer> | hi & gn8 guys | 2009-12-06 02:37:04 |
<Skymmer> | sorry, what do you mean by "expanding it"? | 2009-12-06 02:37:09 |
<toffer> | just came home again from a club | 2009-12-06 02:37:12 |
<Shelwien> | hi ;) | 2009-12-06 02:37:28 |
<Skymmer> | Aloha! | 2009-12-06 02:37:54 |
<Shelwien> | skymmer: i mean that mp3zip's output is apparently larger than the input | 2009-12-06 02:38:16 |
<Skymmer> | yes. | 2009-12-06 02:38:28 |
<Shelwien> | no luck ;) | 2009-12-06 02:38:52 |
<toffer> | mhhh spaghetti | 2009-12-06 02:39:24 |
<Shelwien> | what? | 2009-12-06 02:39:36 |
<Skymmer> | spaghetti is the new lossless audio compressor | 2009-12-06 02:40:06 |
| :)) | 2009-12-06 02:40:13 |
<toffer> | ^^ | 2009-12-06 02:40:14 |
<Skymmer> | Toffer, why are you so amazed about spaghetti? Have you smoked something? ;) | 2009-12-06 02:42:23 |
<Shelwien> | spaghetti? | 2009-12-06 02:42:53 |
<toffer> | no, not at all | 2009-12-06 02:42:59 |
| but i'm really hungry right now. and my girlfriend wants to eat that stuff, too | 2009-12-06 02:43:18 |
<Skymmer> | What kind of music was in the club? | 2009-12-06 02:43:57 |
<toffer> | well it was some kind of "huge hall" they played everything. i mostly like back musikc | 2009-12-06 02:48:35 |
| musci | 2009-12-06 02:48:36 |
| music | 2009-12-06 02:48:38 |
<Skymmer> | :)) Haaa.. You're drunk probably. No offense, just curious ;) | 2009-12-06 02:52:37 |
<toffer> | partially | 2009-12-06 02:55:05 |
<Skymmer> | I'm pretty sure that there wasn't music like this one *devil* | 2009-12-06 03:01:16 |
| http://skymmer.narod.ru/misc/Glenn.mp3 | 2009-12-06 03:01:18 |
<toffer> | not gonna listen to anything right now | 2009-12-06 03:06:31 |
| otherwise it'll wake everybody up | 2009-12-06 03:06:39 |
<Skymmer> | Bye people. Gonna sleep... | 2009-12-06 03:14:49 |
*** Skymmer has left the channel | 2009-12-06 03:15:28 |
<toffer> | gn8 from me, too | 2009-12-06 03:15:45 |
*** toffer has left the channel | 2009-12-06 03:15:49 |
*** STalKer-X has joined the channel | 2009-12-06 05:00:24 |
*** STalKer-Y has left the channel | 2009-12-06 05:03:45 |
*** Krugz has left the channel | 2009-12-06 06:03:56 |
*** mike_____ has joined the channel | 2009-12-06 11:38:36 |
<Shelwien> | http://encode.dreamhosters.com/showthread.php?t=511 | 2009-12-06 14:54:25 |
*** mike_____ has left the channel | 2009-12-06 15:48:52 |
*** pinc has joined the channel | 2009-12-06 19:13:00 |
*** Krugz has joined the channel | 2009-12-06 20:23:11 |
*** Krugz has left the channel | 2009-12-06 20:27:44 |
*** Krugz has joined the channel | 2009-12-06 20:39:12 |
*** pinc has left the channel | 2009-12-06 20:40:30 |
*** Krugz has left the channel | 2009-12-06 21:04:08 |
*** Krugz has joined the channel | 2009-12-06 21:06:52 |
<Shelwien> | there's a conspiracy! | 2009-12-06 21:11:11 |
| compilers store a timestamp into exe header | 2009-12-06 21:11:29 |
| so that i won't be able to determine whether two exes are equal by comparing their hashes | 2009-12-06 21:11:56 |
* Shelwien is bruteforcing compiler options | 2009-12-06 21:12:22 |
<Krugz> | lol | 2009-12-06 21:13:16 |
*** Krugz has left the channel | 2009-12-06 21:14:16 |
*** Krugz has joined the channel | 2009-12-06 21:14:55 |
*** Krugz has left the channel | 2009-12-06 21:16:22 |
*** Krugz has joined the channel | 2009-12-06 21:17:01 |
*** STalKer-X has left the channel | 2009-12-06 22:56:42 |
*** Shelwien has left the channel | 2009-12-06 23:04:08 |
*** Shelwien has joined the channel | 2009-12-06 23:04:13 |
*** STalKer-X has joined the channel | 2009-12-06 23:19:03 |
*** schnaader has joined the channel | 2009-12-07 00:44:54 |
<schnaader> | hi @ all - kinda late, I know, but I thought I could at least have a look who's there | 2009-12-07 00:45:20 |
<Shelwien> | hi | 2009-12-07 00:45:35 |
| i have my usual problem with gcc and templates here | 2009-12-07 00:46:09 |
<schnaader> | what was the compiler you talked about that has timestamps? I did a bit of research because I had similar issues sometimes and found that at least GCC doesn't seem to include timestamps. | 2009-12-07 00:46:55 |
<Shelwien> | MS linker does | 2009-12-07 00:47:10 |
<schnaader> | hm... never did that much template things, so I fear I can't help you there | 2009-12-07 00:47:16 |
<Shelwien> | well, gcc is annoying as hell here | 2009-12-07 00:47:30 |
<schnaader> | ah, I guess that was where I got the problem, too, MSVC | 2009-12-07 00:47:36 |
<Shelwien> | it can't compile some code with which MSC/IntelC don't have any problems | 2009-12-07 00:47:48 |
| and i don't quite know how to solve it | 2009-12-07 00:48:10 |
| its not the first time too... | 2009-12-07 00:48:20 |
| to be specific | 2009-12-07 00:49:07 |
| if i try to use something like | 2009-12-07 00:49:13 |
| template< int flag > class A : public B<flag> { ... } | 2009-12-07 00:49:39 |
<schnaader> | you could try to post a question or search for similar problems at http://stackoverflow.com/ - they are really quick in giving very good answers especially to C questions there | 2009-12-07 00:49:42 |
<Shelwien> | gcc doesn't see anything from template B there | 2009-12-07 00:50:09 |
| dumb thing | 2009-12-07 00:50:12 |
| and i don't know how to search for it | 2009-12-07 00:50:30 |
<schnaader> | If you can strip it down to a short example and some text describing your problem, I could also post it for you on SO, so you wouldn't need to register | 2009-12-07 00:53:11 |
| there seem to be some template FAQs around, but they all seem to handle different things as far as I can tell... | 2009-12-07 00:53:39 |
<Shelwien> | ...i've also got another internal error from IntelC while trying to make it portable ;) | 2009-12-07 00:54:55 |
<schnaader> | internal errors suck :) I had to do a workaround once where the Delphi command-line compiler would throw one, but the IDE would compile the same code just fine. rewriting the code a bit solved it, although both versions seemed to be perfectly valid code... | 2009-12-07 00:57:24 |
<Shelwien> | well, i managed to make a workaround with macros etc here | 2009-12-07 00:58:27 |
| now have some strange linking problems though | 2009-12-07 00:58:52 |
| ... | 2009-12-07 00:59:14 |
| i made a new rangecoder using a coroutine template | 2009-12-07 00:59:34 |
| seems pretty nice | 2009-12-07 00:59:51 |
| now trying to compare gcc vs intelc | 2009-12-07 01:00:07 |
| at least gcc version worked after compiling | 2009-12-07 01:00:35 |
| but now i have to do something about the compiler options... | 2009-12-07 01:00:47 |
| any suggestions about gcc options btw? | 2009-12-07 01:03:02 |
<schnaader> | do you need templates here for speed or just for easier changes/readability? | 2009-12-07 01:03:16 |
| hm.. -O2/-O3 -Os -s -march=... are those I usually use, didn't care about it much so far | 2009-12-07 01:03:59 |
| I really enjoyed the forum discussion about those GCC automatic profiling things, these could be handy to gain the last percents of speed out of some code :) | 2009-12-07 01:04:38 |
<Shelwien> | for speed mainly | 2009-12-07 01:04:59 |
| ok, testing | 2009-12-07 01:05:21 |
| intelc time was ~42.5s for enwik9 | 2009-12-07 01:05:45 |
<schnaader> | ah, that's bad.. would've recommended not using templates if it weren't for speed :) | 2009-12-07 01:05:49 |
<Shelwien> | for readability its important too | 2009-12-07 01:06:07 |
| well, i know how people usually solve these problems though | 2009-12-07 01:07:16 |
| they copy-paste stuff | 2009-12-07 01:07:27 |
| ...wow | 2009-12-07 01:07:35 |
| 131s with gcc 4.3/mingw | 2009-12-07 01:07:51 |
| crazy | 2009-12-07 01:07:53 |
<schnaader> | The PGO optimization should be easy: -fprofile-generate, run, recompile with -fprofile-use | 2009-12-07 01:08:09 |
<Shelwien> | sure... its not a PGO problem yet though | 2009-12-07 01:08:38 |
<schnaader> | whoa, that's 3 times faster with IntelC... either GCC is really bad here or IntelC has some neat tricks :) | 2009-12-07 01:08:50 |
<Shelwien> | guess gcc is being crazy about that int64 multiplication | 2009-12-07 01:09:07 |
<schnaader> | int64 = long long, or something homemade? | 2009-12-07 01:09:29 |
<Shelwien> | unsigned long long | 2009-12-07 01:09:43 |
<schnaader> | hm.. never experienced any major speed decreases with it and I had to use it for some Project Euler programs... | 2009-12-07 01:10:14 |
<Shelwien> | thing is, that qword version is actually faster with intelc than alternative 32-bit multiplication | 2009-12-07 01:10:24 |
| 148s decoding | 2009-12-07 01:10:42 |
| something is majorly wrong here... | 2009-12-07 01:10:58 |
<schnaader> | so how does the 32-bit version perform with gcc, then? | 2009-12-07 01:11:07 |
<Shelwien> | i'd try again with different options | 2009-12-07 01:12:12 |
| maybe it was because of inlining or unrolling | 2009-12-07 01:12:20 |
| this time the exe is half the size | 2009-12-07 01:12:34 |
<schnaader> | I would blame unrolling in that case :) | 2009-12-07 01:12:56 |
<Shelwien> | ...but doesn't seem to be faster... still works | 2009-12-07 01:13:04 |
| ...it also can be a problem with i/o i guess | 2009-12-07 01:13:57 |
| 129s this time too | 2009-12-07 01:14:21 |
| ok, running with 32-bit mult | 2009-12-07 01:15:52 |
| ...again, i guess no luck | 2009-12-07 01:16:48 |
| 119s this time | 2009-12-07 01:17:38 |
<schnaader> | you could also try to disable some of the IntelC optimizations, perhaps it's one of those that just really helps | 2009-12-07 01:18:22 |
<Shelwien> | no | 2009-12-07 01:18:35 |
| it's never been slower than 60s | 2009-12-07 01:18:57 |
| and i don't even use PGO with IntelC atm | 2009-12-07 01:19:06 |
<schnaader> | I've got g++ 3.4.5 here, could test it with this one :)) | 2009-12-07 01:19:28 |
<Shelwien> | no problem... would you be able to run IC version there? | 2009-12-07 01:20:06 |
<schnaader> | would be able to run, but not to compile :) | 2009-12-07 01:20:21 |
<Shelwien> | ...119s again with different i/o... dunno | 2009-12-07 01:20:21 |
| thats ok | 2009-12-07 01:20:29 |
| ok, let me test IC version again first... | 2009-12-07 01:22:34 |
| 44.2s encoding | 2009-12-07 01:23:19 |
| 47s decoding | 2009-12-07 01:24:00 |
| guess i need to fix that mult back | 2009-12-07 01:24:12 |
| attempt #2 | 2009-12-07 01:24:41 |
| 41.891s | 2009-12-07 01:25:20 |
| 41.469s | 2009-12-07 01:26:02 |
| seems like i've got some improvement from replacing some templates with macros ;) | 2009-12-07 01:26:28 |
<schnaader> | :) | 2009-12-07 01:26:44 |
| btw, under a minute is quite fast for enwik9 (~20 MB/s, isn't it?), which could indeed run into I/O limits | 2009-12-07 01:28:02 |
<Shelwien> | i'm running it on ramdrive | 2009-12-07 01:28:19 |
<schnaader> | OK, that's odd. perhaps gcc has problems with ramdrives, but I don't think so... | 2009-12-07 01:29:13 |
<Shelwien> | no, i tested different buffers and that multiplication | 2009-12-07 01:29:37 |
| its something more general | 2009-12-07 01:29:47 |
| http://www.ctxmodel.net/files/newbrc/newbrc_0.rar | 2009-12-07 01:30:07 |
| btw, its from "new bitwise rc" ;) | 2009-12-07 01:30:28 |
| meanwhile, got a new IC version here, would try to install and test | 2009-12-07 01:31:15 |
<schnaader> | my old gcc version doesn't like -fwhole-program and gives some warnings about alignment of C0/C1, but compiling works apart from that | 2009-12-07 01:33:13 |
<Shelwien> | yeah, and you can compare the speed ;) | 2009-12-07 01:33:39 |
<schnaader> | guess the original mtf.exe is compiled with IntelC, right? btw, sizes are 25088 for gcc, 76288 for the other one which is quite a difference | 2009-12-07 01:34:52 |
<Shelwien> | that's a static build | 2009-12-07 01:35:16 |
<schnaader> | ah, OK | 2009-12-07 01:35:28 |
<Shelwien> | it'd be around 30-40k with /MD | 2009-12-07 01:35:29 |
| but static is a bit faster so i usually compile it like that | 2009-12-07 01:36:04 |
| btw, there's no model | 2009-12-07 01:36:29 |
| it encodes bits with a fixed probability, a little skewed towards 0 bits | 2009-12-07 01:36:59 |
<schnaader> | guess that's why it's so fast :) | 2009-12-07 01:37:14 |
<Shelwien> | not really | 2009-12-07 01:37:21 |
| fpaq0pv4b is still somewhat faster | 2009-12-07 01:37:34 |
| but there're more restrictions and rangecoder is different... a little redundant too | 2009-12-07 01:38:56 |
<schnaader> | guess I should try with enwik8 here instead, enwik9 seems to take some minutes... | 2009-12-07 01:42:20 |
<Shelwien> | sure ;) | 2009-12-07 01:43:32 |
<schnaader> | That would be a funny abuse of your p2p thing: adding enwik8 to the download list just to quickly shorten enwik9 :) | 2009-12-07 01:46:12 |
<Shelwien> | btw, that file still continues after 1G... dunno why nobody uses the whole file ;) | 2009-12-07 01:47:56 |
<schnaader> | yes, I know, it's 4.8 GB or something, think they just don't care because you won't get listed on LTCB or Hutter that way :) | 2009-12-07 01:48:35 |
| Strange... I think I'll try a second run... c 40.43/d 45.07 for gcc, c 10.17/d 22.22 for intel | 2009-12-07 01:51:57 |
<Shelwien> | ;) | 2009-12-07 01:52:07 |
| decoding seems kinda slow for intel too, though | 2009-12-07 01:52:36 |
<schnaader> | Ah, it wasn't 10.17, it was 20.17... my mistake | 2009-12-07 01:54:49 |
| makes more sense now :) | 2009-12-07 01:54:56 |
<Shelwien> | yeah | 2009-12-07 01:55:05 |
| its 3x difference here though | 2009-12-07 01:55:10 |
<schnaader> | CPU here is quite slow and I/O should be limited to around 30-50 MB/s, perhaps the difference just can't show that much | 2009-12-07 01:56:16 |
| although 20 seconds is 5 MB/s, so I/O shouldn't be a problem | 2009-12-07 01:57:03 |
| f.e. fastest THOR mode gives 4 seconds and 23.75 MB/s :) | 2009-12-07 01:57:51 |
<Shelwien> | well, it doesn't have to do a multiplication per data bit | 2009-12-07 01:58:33 |
| if anything, you can compare it to this - http://www.ctxmodel.net/files/fpaq0pv4b3.rar | 2009-12-07 01:59:18 |
<schnaader> | fpaq0pv4B_O3_xi.exe gives c 12.63/d 15.97 | 2009-12-07 02:02:40 |
<Shelwien> | kinda weird but ok | 2009-12-07 02:03:41 |
<schnaader> | btw, mtf's output for enwik8 is quite large (99.6 MB), but I guess that's normal | 2009-12-07 02:03:58 |
<Shelwien> | yes, that's intentional | 2009-12-07 02:04:12 |
| and probably the main reason for speed difference with fpaq0p too | 2009-12-07 02:04:29 |
| btw, i'd probably finally add some async i/o to the new coder | 2009-12-07 02:07:16 |
| the coroutine framework made it really easy | 2009-12-07 02:07:41 |
<schnaader> | talking about enwik, I saw you commented on my ISBN precompression, that was one of many items in a list I started when LTCB came out, but I stopped working on it as I realized that my PC is just too slow for experiments with enwik9 | 2009-12-07 02:07:49 |
<Shelwien> | i don't think thats really a problem | 2009-12-07 02:08:30 |
| you just don't have to run paq, that's all | 2009-12-07 02:08:39 |
| paq8 is not really a good compressor, though it may sound weird | 2009-12-07 02:10:00 |
<schnaader> | yes, think I could retry with something else like 7-Zip | 2009-12-07 02:10:03 |
<Shelwien> | not 7z, but ppmd/ppmonstr would be ok | 2009-12-07 02:10:24 |
<schnaader> | it's just that I didn't want to optimize size for some compressor and see that results will get worse for PAQ | 2009-12-07 02:10:37 |
<Shelwien> | lzma (as any LZ) is really bad at text compression | 2009-12-07 02:10:50 |
<schnaader> | I switched to calgary corpus after that. was fun, first I did most of the preprocessors as little COM files using ASM, later included them into the PAQ source directly | 2009-12-07 02:13:00 |
<Shelwien> | you know that, right? http://www.mailcom.com/challenge/ | 2009-12-07 02:13:27 |
<schnaader> | yes, that was why I did it :) | 2009-12-07 02:13:44 |
<Shelwien> | btw, wanna hear my idea about enwik compression? | 2009-12-07 02:14:13 |
<schnaader> | also had a look at SHA-1, but I guess it's not worth the effort - at least you have a better chance to improve compression instead :) | 2009-12-07 02:14:21 |
| sure | 2009-12-07 02:14:29 |
<Shelwien> | its very different from "general purpose" | 2009-12-07 02:15:02 |
| basically, multipass compression | 2009-12-07 02:15:25 |
| and btw kinda related to my approach to recompression too - which i described before | 2009-12-07 02:15:46 |
| so, a multipass lossy filter with coding of extra information to make it lossless | 2009-12-07 02:16:55 |
| there're many specific cases where compression can be improved by little tweaking | 2009-12-07 02:17:48 |
| like capital conversion text filters etc | 2009-12-07 02:18:02 |
<schnaader> | OK, that's an interesting approach. Some of my ideas also included very basic "lossy" things like "insert the same text here every time and afterwards just change some of the words so they get correct" | 2009-12-07 02:18:07 |
<Shelwien> | well, now to examples | 2009-12-07 02:18:42 |
| another popular text filter is "syntax stuffing" | 2009-12-07 02:19:04 |
| like we usually write "word," | 2009-12-07 02:19:29 |
| but the most common symbol in word context is usually space | 2009-12-07 02:19:58 |
| so alternatives are kinda bad - they mess up predictions | 2009-12-07 02:20:25 |
| so, we can insert a space into each place like that | 2009-12-07 02:21:10 |
| like s/([\w])([,.;])/$1 $2/g in regexp form | 2009-12-07 02:21:56 |
<schnaader> | yes, even just doing so and thus having a separate "word stream" and a symbol stream like " , . ," should give better compression | 2009-12-07 02:22:09 |
<Shelwien> | not with paq | 2009-12-07 02:22:27 |
| but if we'd do just this space padding | 2009-12-07 02:22:46 |
| and encode the information to revert it back into a separate stream | 2009-12-07 02:23:03 |
| that can get us an improvement even with paq | 2009-12-07 02:23:21 |
| now, that "information encoding" is the point | 2009-12-07 02:23:43 |
| we'd need a "backward regexp" like s/([\w]) ([,.;])/$1$2/g | 2009-12-07 02:24:36 |
| but we can't just apply it as is and restore the data | 2009-12-07 02:25:04 |
| because there might be cases like that from before | 2009-12-07 02:25:22 |
| so, for each case | 2009-12-07 02:25:30 |
| we'd need to encode a flag - whether to perform the replacement or not | 2009-12-07 02:26:03 |
| also, such cases are not unrelated to context | 2009-12-07 02:26:35 |
| so ideally we'd need a full context model | 2009-12-07 02:26:49 |
| taking into account both data before and after the replacement point | 2009-12-07 02:27:06 |
| bi-directional context ;) | 2009-12-07 02:27:14 |
<schnaader> | :) | 2009-12-07 02:27:29 |
<Shelwien> | thus, there's still a place for heavy CM | 2009-12-07 02:27:36 |
| but there's a difference from paq approach | 2009-12-07 02:27:53 |
| passes are independent | 2009-12-07 02:28:02 |
| and there're usually not that many flags (compared to the whole of enwik) | 2009-12-07 02:28:36 |
| so we should be able to collect more detailed statistics than paq8 | 2009-12-07 02:29:07 |
| and still not care about memory overflows etc | 2009-12-07 02:29:29 |
| so, as i see it | 2009-12-07 02:30:05 |
| to solve this task | 2009-12-07 02:30:09 |
| i have to write such a reversible regexp implementation | 2009-12-07 02:30:22 |
| and then optimize a model for each regexp (automatically) | 2009-12-07 02:30:56 |
<schnaader> | good luck with that - sounds promising | 2009-12-07 02:31:14 |
<Shelwien> | yeah | 2009-12-07 02:31:21 |
| for example, at some point | 2009-12-07 02:31:31 |
| we can start replacing words | 2009-12-07 02:31:41 |
| like with synonyms | 2009-12-07 02:31:49 |
| or just with a "<word>" tag | 2009-12-07 02:32:00 |
| thus, it would be possible to not only handle the simple direct contexts | 2009-12-07 02:32:34 |
| but also higher-level language dependencies | 2009-12-07 02:33:00 |
| like sentence structure etc | 2009-12-07 02:33:08 |
| also, it would be possible to take into account word distances and stuff like that | 2009-12-07 02:33:46 |
| which a sequential model can't handle because of memory problems | 2009-12-07 02:34:02 |
| ...but there's a small problem ;) | 2009-12-07 02:34:33 |
| such an enwik model would be very specific | 2009-12-07 02:34:56 |
| it won't be really applicable to anything else | 2009-12-07 02:35:06 |
| and doing it just for the prize | 2009-12-07 02:36:31 |
<schnaader> | yes, but this is the case with everything that works well on enwik. the data is a nice example of text and a good mix, but there are some things, like those redundant city parts, that are very specific | 2009-12-07 02:36:54 |
<Shelwien> | would mean working for $3/hour (very optimistically) ;) | 2009-12-07 02:36:58 |
| well, unfortunately there's much more stuff beside "city parts" (which are afair not in enwik8 anyway) | 2009-12-07 02:37:48 |
<schnaader> | yes, they are mainly in enwik9; there's 1 or 2 entries of it in enwik8, I think, but I'm not sure | 2009-12-07 02:38:17 |
<Shelwien> | like xml markup, html markup, wiki markup etc | 2009-12-07 02:38:19 |
| including ISBN too ;) | 2009-12-07 02:38:29 |
<schnaader> | it really bothers me that removing XML tags doesn't improve compression for PAQ (there's something about it stated on the site) which seems just weird... | 2009-12-07 02:39:09 |
| and AFAIR, it's even about completely removing tags, not replace/optimize them like using xmlwrt | 2009-12-07 02:39:46 |
<Shelwien> | well, articles are rather big | 2009-12-07 02:39:56 |
| but afaik it still helped in my experiments | 2009-12-07 02:40:16 |
| if you want, i can post my enwik parser (one of)... its in perl though | 2009-12-07 02:40:47 |
<schnaader> | thanks, but got no perl here :) I've also written some programs to f.e. extract user/ID lists, so it wouldn't be too much work to write my own | 2009-12-07 02:41:57 |
<Shelwien> | in fact, i was thinking about doing it completely in perl ;) | 2009-12-07 02:42:31 |
| like, implementing these reversible regexps somehow | 2009-12-07 02:42:53 |
| and doing compression with an external coder | 2009-12-07 02:43:18 |
<schnaader> | I thought about starting some brute-force compression program (running every possible program either in ASM or some own language) and to have it run in background on things like calgary corpus and enwik. It would take MUCH time and might just not give any results at all in hundred years, but if it would, you'd have a part of your data compressed just perfect :) | 2009-12-07 02:50:32 |
<Shelwien> | well, that's what we're doing in a way (me and toffer at least) | 2009-12-07 02:51:37 |
<schnaader> | Although I doubt there are useful things smaller than 16 bytes even in ASM, and bruteforcing up to that size will take some time ;) | 2009-12-07 02:51:47 |
<Shelwien> | well, sure thing that you won't get anywhere with x86 asm bruteforcing ;) | 2009-12-07 02:52:25 |
| you can try zpaq though ;) | 2009-12-07 02:52:38 |
<schnaader> | yes, that could actually be a nice try, although I still haven't managed to find time for reading the specifications and writing some own config files for it | 2009-12-07 02:56:09 |
*** dagdsg has joined the channel | 2009-12-07 02:56:24 |
<Shelwien> | well, i have mixed feelings about zpaq | 2009-12-07 02:57:09 |
| in a way, i wanted to make something similar for a long time | 2009-12-07 02:57:43 |
| but zpaq is completely different from what i wanted, even though it seems very similar if i'd try to describe it ;) | 2009-12-07 02:58:52 |
| for example, i have some parameter description syntax | 2009-12-07 02:59:51 |
| there're some .idx files with parameter types and values | 2009-12-07 03:00:12 |
| and a preprocessor which generates C++ from .idx files | 2009-12-07 03:00:39 |
| two different kinds of C++ in fact - one version for optimization and another for "release builds" | 2009-12-07 03:01:19 |
<schnaader> | :) | 2009-12-07 03:01:25 |
<Shelwien> | and i'd like to further extend this - by adding also some model description syntax | 2009-12-07 03:02:15 |
| as it's commonly redundant in C++ - i have to copy-paste the same stuff with modified numbers/letters in a few places | 2009-12-07 03:03:01 |
| when i want to add a model component or something like that | 2009-12-07 03:03:25 |
| and now | 2009-12-07 03:03:40 |
| zpaq kinda has exactly that - the model description syntax | 2009-12-07 03:04:12 |
| but its completely useless for me | 2009-12-07 03:04:27 |
| quite a shock ;) | 2009-12-07 03:04:37 |
<schnaader> | OK, because it's different to your approach or just because it's useless :) ? | 2009-12-07 03:05:16 |
<Shelwien> | as far as i can see, just because its useless :( | 2009-12-07 03:06:08 |
<schnaader> | yes, I also have a very general concept that I was reminded of when zpaq appeared. It's basically about just describing your input data in a scripting way, e.g. you could just tell it "there's a byte at the next position that can take values 0, 5 and 10-129, and will be 0 most of the time" in a very user-friendly way, and the compression implementation would just look for the best way to compress the data for you. The scripts could | 2009-12-07 03:06:21 |
| get compiled, added to the compressed data and used for decompression. | 2009-12-07 03:06:21 |
<Shelwien> | http://sweetscape.com/010editor/templates.html | 2009-12-07 03:07:27 |
<schnaader> | yes, pretty much that way, just used for compression | 2009-12-07 03:09:43 |
| would be nice because if you'd release a new file format, just release the script for it and everybody can use it for analysing, detecting, preprocessing or directly compressing your data. | 2009-12-07 03:11:14 |
<Shelwien> | well, a structure definition can be used for compression directly | 2009-12-07 03:11:23 |
| even more, in fact, the compression is not that important | 2009-12-07 03:12:02 |
| its possible to write a filter | 2009-12-07 03:12:22 |
| which would parse the syntax and produce streams compressible by "universal" compressors | 2009-12-07 03:12:58 |
| Shkarin's durilca is the best example of that | 2009-12-07 03:13:16 |
| especially its x86 parser/disassembler | 2009-12-07 03:13:33 |
| ...but that's a different direction from what i talked about before | 2009-12-07 03:14:29 |
| there's also always some choice of model design elements | 2009-12-07 03:15:23 |
| ideally, heavier models would produce better predictions | 2009-12-07 03:16:10 |
| but usually we have to take speed into account | 2009-12-07 03:16:24 |
| and its normal to discard small improvements in compression which hurt the speed | 2009-12-07 03:17:02 |
| also, heavier models don't even really guarantee an improvement | 2009-12-07 03:17:40 |
| because it all works with limited precision | 2009-12-07 03:17:49 |
| and errors accumulate | 2009-12-07 03:18:05 |
| so, structure parsing is one thing | 2009-12-07 03:18:23 |
| but we also need readable model definitions for structure elements | 2009-12-07 03:19:00 |
| and a proper support for parameters in these models | 2009-12-07 03:19:43 |
| and i'm paying more attention to that side, i guess | 2009-12-07 03:20:23 |
| because its usually better to write format parsers directly in C/C++ | 2009-12-07 03:21:00 |
| (faster etc) | 2009-12-07 03:21:06 |
<schnaader> | yes, if you're a programmer, there's no need for that user-friendly shit :) | 2009-12-07 03:21:56 |
<Shelwien> | well, in fact, i've got much closer to getting my model definition syntax lately | 2009-12-07 03:23:04 |
| the main problem was always about selection of basic components | 2009-12-07 03:23:32 |
| and things I use now have much better mathematical foundations than before ;) | 2009-12-07 03:25:39 |
<schnaader> | hehe, guess there have been some calculations and experiments in the meantime :) | 2009-12-07 03:26:21 |
<Shelwien> | well, for example, I now understand how the paq mixer works ;) | 2009-12-07 03:27:31 |
| which was a problem for a while, because Matt doesn't know that ;) | 2009-12-07 03:28:32 |
| he just took some formulas from neural networks and rederived the gradient for update formula | 2009-12-07 03:29:52 |
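For reference, a minimal Python sketch of the logistic mixing being discussed (names and the learning rate are mine, not from paq; paq itself does this in fixed-point arithmetic). The weight update is the gradient step on the coding cost -log p(bit), which is presumably the "rederived gradient" mentioned above:

```python
import math

def stretch(p):
    # logit: map a probability in (0,1) to the real line
    return math.log(p / (1.0 - p))

def squash(x):
    # logistic function, the inverse of stretch
    return 1.0 / (1.0 + math.exp(-x))

class Mixer:
    # Logistic mixing: combine model predictions in the stretched
    # (logit) domain with a weighted sum, then squash back to a
    # probability. The update is gradient descent on -log p(bit).
    def __init__(self, n, lr=0.01):
        self.w = [0.0] * n   # mixing weights, one per model
        self.lr = lr         # learning rate (arbitrary choice here)
        self.st = [0.0] * n  # stretched inputs from the last mix()

    def mix(self, probs):
        self.st = [stretch(p) for p in probs]
        return squash(sum(w * s for w, s in zip(self.w, self.st)))

    def update(self, p_mixed, bit):
        # d(-log p(bit))/dw_i = -(bit - p_mixed) * stretch(p_i)
        err = bit - p_mixed
        for i, s in enumerate(self.st):
            self.w[i] += self.lr * err * s
```

With zero initial weights the mixed prediction starts at 0.5 regardless of the inputs; training on a stream of 1-bits then pushes the weights of models that predicted high toward positive values.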
<schnaader> | :) | 2009-12-07 03:30:18 |
| well, sometimes you're content if it just works and don't care why, although knowing why would be better :) | 2009-12-07 03:31:39 |
<Shelwien> | well, yeah | 2009-12-07 03:32:10 |
| but you see, as i want to do better than paq | 2009-12-07 03:32:30 |
| so i have to understand how it works, even if Matt doesn't ;) | 2009-12-07 03:32:48 |
<schnaader> | yeah, searching for such things one didn't completely understand and improving them seems like the best way to get better :) | 2009-12-07 03:34:30 |
| and after that, you can add your own ideas ;) | 2009-12-07 03:35:04 |
<Shelwien> | the most interesting thing in paq was something different though | 2009-12-07 03:35:51 |
| well, its kinda obvious, but looks surprising when you see it used in a compressor | 2009-12-07 03:36:20 |
| i mean the use of PRNG in counter updates | 2009-12-07 03:36:51 |
<schnaader> | yes, that was the first thing I changed when first seeing the PAQ code: setting the PRNG to always output 0 and seeing how it hurts the compression :) | 2009-12-07 03:37:51 |
<Shelwien> | well, its reasonable that adding 0.5 to an integer is the same as adding 1 with probability 0.5 | 2009-12-07 03:38:58 |
| but somehow surprising when it really works ;) | 2009-12-07 03:39:07 |
<schnaader> | :) | 2009-12-07 03:39:42 |
<Shelwien> | as a consequence, though, paq now can compress a block of zeroes into a few kb of random data ;) | 2009-12-07 03:39:49 |
<schnaader> | although you wouldn't need a PRNG here, you could just do some static approach like with image dithering, couldn't you? | 2009-12-07 03:40:08 |
<Shelwien> | that might require to keep some state somewhere | 2009-12-07 03:41:16 |
<schnaader> | ah, I see | 2009-12-07 03:41:28 |
<Shelwien> | PRNG is more universal in this case, yeah | 2009-12-07 03:41:37 |
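The trick being discussed, sketched in Python (hypothetical names; paq's actual counters are fixed-point state machines, this only illustrates the principle):

```python
import random

def stochastic_add(counter, delta, rng=random.random):
    # "adding 0.5 to an integer is the same as adding 1 with
    # probability 0.5": add the whole part of delta, then add 1
    # with probability equal to the fractional part. The expected
    # increment is exactly delta, with no extra state kept.
    whole = int(delta)
    frac = delta - whole
    counter += whole
    if rng() < frac:
        counter += 1
    return counter

class DitheredCounter:
    # the static, dithering-like alternative schnaader suggests:
    # deterministic error diffusion. It gives the same average,
    # but has to keep the fractional residual as state somewhere,
    # which is exactly the objection raised above.
    def __init__(self):
        self.value = 0
        self.residual = 0.0

    def add(self, delta):
        self.residual += delta
        whole = int(self.residual)
        self.value += whole
        self.residual -= whole
```

Both counters grow by 0.5 per step on average; the PRNG version trades exactness for statelessness, which is also why paq can turn a block of zeroes into "random"-looking compressed output.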
<schnaader> | almost forgot that PRNG in PAQ... had wasted some time with it brute-forcing seeds to get the output a bit smaller (I think 10 bytes smaller was the best result I had)... | 2009-12-07 03:45:02 |
<Shelwien> | i tried to replace the rangecoder there instead | 2009-12-07 03:46:07 |
| got somewhat better results, especially with redundant data | 2009-12-07 03:46:45 |
| but still winning even 1000 bytes at enwik is kinda disappointing ;) | 2009-12-07 03:47:16 |
<schnaader> | :) | 2009-12-07 03:47:29 |
| What about an AI approach, btw? I doubt there is a useful AI approach on enwik9; perhaps on the whole enwik file it will make more of a difference. But as it was one of the main intentions of LTCB, it's quite sad there haven't been any (successful) attempts... | 2009-12-07 03:49:39 |
| Although you could say that some of the dictionary sorting and grammatical things that were done are somewhat external AI attempts | 2009-12-07 03:50:32 |
<Shelwien> | there's no such thing as AI approach imho | 2009-12-07 03:50:49 |
| well, i guess we can take the cyc database | 2009-12-07 03:51:10 |
| and try to somehow use it for enwik prediction | 2009-12-07 03:51:24 |
| but that won't be compatible with problem restrictions | 2009-12-07 03:52:35 |
| (decoder size etc) | 2009-12-07 03:52:46 |
| so, if anything, the "AI approach" would be to take into account more correlations in the data | 2009-12-07 03:53:57 |
| not only the direct sequential contexts | 2009-12-07 03:54:51 |
| but a lot of other things too, up to semantics if possible | 2009-12-07 03:55:22 |
| but on other hand, there's nothing "AI" in that | 2009-12-07 03:55:59 |
| analyzing sentence templates which i mentioned before is like that, for example, but there's nothing that unique in it | 2009-12-07 03:57:24 |
| the main problem is that there's no magical universal function | 2009-12-07 03:58:04 |
| so instead, we have to collect lots of different dependencies in data | 2009-12-07 03:59:02 |
| and remove the redundancy corresponding to each of them | 2009-12-07 03:59:28 |
| ...and as if there's not enough of these in plain english | 2009-12-07 04:01:57 |
| enwik also has lots of artificial markup, which is relatively easy to interpret, but still requires writing specific parsers etc | 2009-12-07 04:03:06 |
<schnaader> | yes, actually the first items on my enwik list are just about separating the articles or preprocessing the data to remove HTML characters etc. | 2009-12-07 04:05:44 |
<Shelwien> | there're lots of masked html stuff in there | 2009-12-07 04:06:34 |
| like &lt;html&gt; ;) | 2009-12-07 04:06:42 |
<schnaader> | Yes, and have you seen that ASCII table? Could just generate it straightforward if there wouldn't be that HTML shit :) | 2009-12-07 04:07:25 |
<Shelwien> | the problem is that its apparently impossible to just replace stuff everywhere | 2009-12-07 04:08:19 |
<schnaader> | I somehow still expect enwik to contain some more data that can be generated. I already found some tables, numbers and things like that, but it's quite time consuming to search for things like that. | 2009-12-07 04:08:30 |
<Shelwien> | i mean, exporting articles from xml | 2009-12-07 04:08:35 |
| and doing s/&lt;/</g is not completely reversible | 2009-12-07 04:08:54 |
| yeah | 2009-12-07 04:09:33 |
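A toy illustration of why a blind s/&lt;/</g pass is not reversible (hypothetical example text, not from enwik): once the entity is decoded, a "<" that came from "&lt;" is indistinguishable from a raw "<" that was already there, e.g. in the surrounding xml tags.

```python
def forward(s):
    # preprocessing pass: decode the entity
    return s.replace("&lt;", "<")

def backward(s):
    # naive inverse: re-encode every "<"
    return s.replace("<", "&lt;")

# a string mixing raw markup with masked markup
original = "<comment>the masked form is &lt;html&gt;</comment>"
# backward(forward(original)) escapes the raw tags too, so the
# round trip does not restore the original string
```

On text that contains only the masked form the round trip works fine; the problem is exactly the mixed case, which is why the replacement can't just be applied everywhere.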
<schnaader> | Well, there are some unused byte values I thought about using for that, although there aren't enough to completely replace all HTML entities. | 2009-12-07 04:09:57 |
<Shelwien> | i think that the right way would be to do it incrementally | 2009-12-07 04:10:01 |
| like, take the first article and properly compress it | 2009-12-07 04:10:22 |
| then generalize the rules to include the second article | 2009-12-07 04:10:41 |
| etc | 2009-12-07 04:10:42 |
| a lot of manual work either way | 2009-12-07 04:10:55 |
<schnaader> | btw, major problem for ISBN are the different formats. Most of the time there will be "ISBN xxxxxxxxxxx", but there are variations like "ISBN x xxxxxx xx x", ISBN "x-xxx-xxxxx-x"... | 2009-12-07 04:14:32 |
| And if detection gets too general, you'll change some numbers that aren't ISBNs | 2009-12-07 04:15:23 |
<Shelwien> | well, i doubt that ISBN add that much redundancy there ;) | 2009-12-07 04:15:29 |
<schnaader> | No, not really :) Didn't count them, but I doubt it's more than 10000 of them there | 2009-12-07 04:15:53 |
<Shelwien> | and with controlled regexps like i described | 2009-12-07 04:16:05 |
| its not really a problem even if there'd be some mismatches | 2009-12-07 04:16:17 |
| ...we'd also need some tricky algorithms there too, though | 2009-12-07 04:19:43 |
| like optimal parsing, context clustering etc | 2009-12-07 04:20:07 |
| and also dictionary compression | 2009-12-07 04:20:14 |
| its not really necessary to compress a standalone dictionary there | 2009-12-07 04:20:56 |
| but its a good simple testfile for a morphology model | 2009-12-07 04:21:21 |
<schnaader> | :) | 2009-12-07 04:21:37 |
<Shelwien> | and i don't quite understand how to build that | 2009-12-07 04:21:44 |
| ...i guess its still good that its english though | 2009-12-07 04:23:08 |
| because it'd be even more complex with eg. russian | 2009-12-07 04:23:33 |
<schnaader> | finnish would be most extreme, I guess :) | 2009-12-07 04:25:24 |
<Shelwien> | i'd say chinese ;) | 2009-12-07 04:25:51 |
<schnaader> | OK, you won :) | 2009-12-07 04:25:58 |
* Shelwien recently suggested using chinese wiki dump (zhwiki) to Sami in his new benchmark | 2009-12-07 04:26:34 |
| btw, have you seen that german Wikipedia DVD result for Precomp? | 2009-12-07 04:27:39 |
<Shelwien> | ...guess not | 2009-12-07 04:29:11 |
| not sure what you're talking about even ;) | 2009-12-07 04:29:24 |
<schnaader> | http://schnaader.info/precomp_wiki_dvd_04dev.html | 2009-12-07 04:29:27 |
<Shelwien> | still don't quite understand what's that DVD | 2009-12-07 04:31:22 |
<schnaader> | They recently switched the "zeno" format they used there to something (hopefully) more efficient than zLib, I was quite shocked when I saw you could compress the DVD to almost half of its size | 2009-12-07 04:31:22 |
| AFAIK it's the whole Wikipedia with reduced images and without discussion entries so it fits on a DVD | 2009-12-07 04:32:09 |
<Shelwien> | the usual problem with DVDs is that they're fixed-size | 2009-12-07 04:32:23 |
| so people sometimes even hide something unnecessary there | 2009-12-07 04:32:52 |
| just to make the software to use up all the DVD space | 2009-12-07 04:33:15 |
<schnaader> | yeah, that's right. Although additional 2 GB could be used for better image quality or something similar in that case, I suppose. | 2009-12-07 04:34:10 |
<Shelwien> | does precomp find anything in enwik btw? ;) | 2009-12-07 04:35:32 |
<schnaader> | And the DVD isn't sold or anything like that, I think; it's primarily a download. | 2009-12-07 04:35:41 |
<Shelwien> | huh. then its really surprising | 2009-12-07 04:36:08 |
| they could use bzip at least | 2009-12-07 04:36:16 |
<schnaader> | I think that's what they use for the new "zeno" format, could've been LZMA, too, I don't remember | 2009-12-07 04:36:46 |
| Hehe, there are some GIF mismatches in enwik9 because GIF detection is looking for "GIF87"/"GIF89", but nothing relevant :) | 2009-12-07 04:37:17 |
<Shelwien> | i'm kinda not sure about LZMA being better than bzip2 for text compression | 2009-12-07 04:37:25 |
<schnaader> | Ah, found it, the new format is called "ZIM" - http://openzim.org/Main_Page - I also found a quote that says "article took 3 GB before, 1.4 GB with ZIM" | 2009-12-07 04:39:51 |
| They're using bzip2, lzma is an option, but not implemented | 2009-12-07 04:40:46 |
| http://openzim.org/ZIM_File_Format#Clusters | 2009-12-07 04:40:59 |
<Shelwien> | wonder why they don't participate in hutter challenge ;) | 2009-12-07 04:42:03 |
<schnaader> | http://openzim.org/ZIMwriter says "coming soon...", would've been nice to let it run over enwik9 and compare the result with plain bZip2 :) | 2009-12-07 04:44:56 |
| Although they seem to create some search indexes there, too, which isn't that helpful :) | 2009-12-07 04:45:36 |
<Shelwien> | Zim is the surname of one of my friends here, i'd ask him about it ;) | 2009-12-07 04:48:23 |
| meanwhile, the question is how to find the reverse regexps | 2009-12-07 04:53:00 |
| i mean, like, automatically derive s/([\w]) ([,.;])/$1$2/g from s/([\w])([,.;])/$1 $2/g | 2009-12-07 04:53:56 |
<schnaader> | Just to make sure I understand the regexp: This changes "bla, bla.bla;" to "bla , bla .bla ;", right? | 2009-12-07 04:55:51 |
*** STalKer-Y has joined the channel | 2009-12-07 04:56:14 |
| it could be helpful to use some simpler intermediate format that you can transform into a regexp, and that can be reversed more easily. | 2009-12-07 04:59:24 |
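The pair of regexps from the example above, sketched in Python. The reverse rule is derived mechanically by moving the literal space that the forward replacement inserts from the replacement string into the pattern:

```python
import re

# forward rule: split punctuation off the preceding word character
fwd_pat, fwd_rep = r"(\w)([,.;])", r"\1 \2"
# reverse rule, obtained by moving the inserted literal space
# from the replacement into the pattern:
rev_pat, rev_rep = r"(\w) ([,.;])", r"\1\2"

def fwd(s):
    return re.sub(fwd_pat, fwd_rep, s)

def rev(s):
    return re.sub(rev_pat, rev_rep, s)
```

This round-trips on "bla, bla.bla;" exactly as schnaader reads it. But the derivation is only safe when "word, space, punctuation" never occurs in the untransformed text; any pre-existing occurrence gets collapsed by rev(), and detecting that is the hard part of automating the reversal.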
*** STalKer-X has left the channel | 2009-12-07 05:01:11 |
*** schnaader has left the channel | 2009-12-07 05:03:07 |
<Shelwien> | !next | 2009-12-07 05:07:32 |