*** pinc has left the channel2009-12-05 17:49:20
*** schnaader has left the channel2009-12-05 17:52:19
*** toffer_ has joined the channel2009-12-05 17:59:29
*** toffer has left the channel2009-12-05 18:01:58
*** mike_____ has joined the channel2009-12-05 18:19:39
*** schnaader has joined the channel2009-12-05 18:31:10
*** schnaader has left the channel2009-12-05 19:13:55
<mike_____> toffer: here is the result of ccmx on enwik8 on my system:2009-12-05 19:28:14
  97656.25 KiB -> 21138.73 KiB (ratio 21.65%, speed 871 KiB/s) 2009-12-05 19:28:16
<Shelwien> and m1?2009-12-05 19:29:09
 with similar memory setting?2009-12-05 19:29:24
<mike_____> Allocated 196999 kB.2009-12-05 19:32:19
 Encoding: 21286803/ 100000000 bytes (1.703 bpc), 76.68 s (1304 Kb/s)2009-12-05 19:32:19
 ccmx allocated 146MB2009-12-05 19:32:42
<Shelwien> still, it seems better and faster which is good2009-12-05 19:33:09
*** Shelwien has left the channel2009-12-05 23:03:59
*** Shelwien has joined the channel2009-12-05 23:04:42
<mike_____> btw, what does compbooks do?2009-12-05 23:05:51
*** compbooks has left the channel2009-12-05 23:23:36
*** mike_____ has left the channel2009-12-05 23:23:37
*** compbooks has joined the channel2009-12-05 23:24:14
* compbooks eats people's trouts2009-12-05 23:26:52
* Krugz slaps compbooks around a bit with a large trout2009-12-05 23:28:22
* compbooks eats the trout2009-12-05 23:59:53
<Krugz> lol slow 2009-12-06 00:00:05
* compbooks slept2009-12-06 00:00:18
 pff sleep2009-12-06 00:00:34
<Shelwien> actually its an iroffer bot2009-12-06 00:00:43
 xdcc list and whatever else2009-12-06 00:01:13
<Krugz> oic2009-12-06 00:01:19
*** Krugz has left the channel2009-12-06 00:13:40
*** Krugz has joined the channel2009-12-06 00:43:31
*** Skymmer has joined the channel2009-12-06 01:27:53
<Skymmer> Hi dudes2009-12-06 01:30:17
<Shelwien> hi2009-12-06 01:30:22
<Skymmer> Ah :) You'll not spoof me. It's not you, its your bot :))2009-12-06 01:31:40
 Fast reaction2009-12-06 01:31:57
* Shelwien is thinking that it means that he's a "dude" now2009-12-06 01:32:51
 Nice ;}2009-12-06 01:33:30
 Anyway, I'm here to do the thing that I don't like2009-12-06 01:33:40
<Shelwien> ?2009-12-06 01:33:58
<Skymmer> To ask something to do... well... listen2009-12-06 01:34:55
 There is one program which I using. It kinda slow but there are sources for it. So I thought if somebody experienced with Intel Compiler can compile it. It also can be a good test for IC capabilities.2009-12-06 01:38:04
<Shelwien> well, i can try probably2009-12-06 01:38:27
<Skymmer> Damn... I'll feeling bad to ask it. Maybe you're busy? *shame*2009-12-06 01:39:31
<Shelwien> not atm2009-12-06 01:39:41
<Skymmer> OK2009-12-06 01:39:46
<Shelwien> i mean not busy atm2009-12-06 01:39:58
<Skymmer> http://omion.dyndns.org/mp3packer/mp3packer-1.20_src.zip 2009-12-06 01:40:29
 More details (if needed) at:2009-12-06 01:40:29
 http://www.hydrogenaudio.org/forums/index.php?showtopic=32379 2009-12-06 01:40:29
<Shelwien> that's ocaml, not C2009-12-06 01:41:30
 so its unrelated to IntelC or whatever2009-12-06 01:43:44
 but btw mp3zip can do something similar too2009-12-06 01:43:55
<Skymmer> Sad. How about this:2009-12-06 01:45:28
 http://www.fftw.org/fftw-3.2.2.tar.gz 2009-12-06 01:45:28
 http://www.fftw.org/fftw-3.3alpha1.tar.gz 2009-12-06 01:45:28
<Shelwien> not sure what you want to get out of that2009-12-06 01:48:33
<Skymmer> libfftw3-3.dll2009-12-06 01:48:41
<Shelwien> well, i'd better not, i guess2009-12-06 01:49:07
 its probably possible, but i'd need something to check it with etc - issues are possible2009-12-06 01:49:44
 "Intel C: you can also use the Intel compilers under VC++ (see below). This may produce marginally faster code than the GNU C compiler, but is probably not worth it for most users. Be cautious with the compiler flags—turning on every optimization under the sun usually makes FFTW slower."2009-12-06 01:50:54
 http://www.ece.cmu.edu/~franzf/fftw.org/ 2009-12-06 01:51:19
 ;)2009-12-06 01:51:20
<Skymmer> Damn :)2009-12-06 01:54:05
 Ok, last trial2009-12-06 01:54:20
 http://files.monkeysaudio.com/MAC_SDK_406.zip 2009-12-06 01:54:24
<Shelwien> dll again or what?2009-12-06 01:54:56
<Skymmer> mac.exe2009-12-06 01:55:09
 Ehhh... Source\Console\ I presume2009-12-06 01:59:22
 Lame with it2009-12-06 01:59:32
<Shelwien> well, dunno2009-12-06 02:14:16
 i tried but there're syntax errors now2009-12-06 02:14:32
 "pointer to incomplete class not allowed" etc2009-12-06 02:14:44
<Skymmer> No problem.2009-12-06 02:18:52
 BTW, what does your phrase mean: but btw mp3zip can do something similar too2009-12-06 02:19:32
<Shelwien> CBR->VBR2009-12-06 02:19:50
 you have the console mp3zip, right?2009-12-06 02:19:59
<Skymmer> sure2009-12-06 02:20:04
 -c ?2009-12-06 02:20:15
<Shelwien> you can run it like mp3zip -c 1.mp3 1.mpx; mp3zip -d 1.mpx 1unp.mp3 2009-12-06 02:20:16
 yeah2009-12-06 02:20:19
<Skymmer> Oh no...2009-12-06 02:20:27
<Shelwien> its not that smart though2009-12-06 02:20:40
 would just discard the LAME tag or something2009-12-06 02:20:52
 well, i can make it do it in a single pass too2009-12-06 02:21:22
 not that there's any sense to2009-12-06 02:22:02
<Skymmer> not only. The problem is that OUT file has no correct Xing VBR info so its length is shown incorrectly: 2009-12-06 02:25:46
 Processed: 6:02:10 2009-12-06 02:25:46
 Original: 0:54:12 2009-12-06 02:25:46
 and more:2009-12-06 02:26:07
<Shelwien> well, as i said, it wasn't an intentional feature anyway ;)2009-12-06 02:26:43
<Skymmer> foobar's "Differences found in 1 out of 1 track pairs.2009-12-06 02:26:51
 Comparing:2009-12-06 02:26:51
 "C:\SHIT\SOFT\ARC\_Shelwien\mp3zip\out.mp3"2009-12-06 02:26:51
 "C:\SHIT\SOFT\ARC\_Shelwien\mp3zip\Test.mp3"2009-12-06 02:26:51
 Length mismatch : 54:12.520249 vs 54:12.453333, 143436143 vs 143433192 samples2009-12-06 02:26:51
 "2009-12-06 02:26:51
 its foobar's "bit-compare" results2009-12-06 02:27:24
*** Shelwien has left the channel2009-12-06 02:28:23
*** Shelwien has joined the channel2009-12-06 02:28:28
<Shelwien> but still, my packed mp3s might be different in size from mp3pack2009-12-06 02:28:36
<Skymmer> Ehhh... don't look at the "SHIT" name of folder. its my name of Download folder :))2009-12-06 02:29:13
<STalKer-Y> i thought it was an abbreviation2009-12-06 02:30:41
<Skymmer> I don't mean the size. I mean the content.2009-12-06 02:31:14
<Shelwien> the content is the same2009-12-06 02:31:45
 that "length mismatch" is probably due to fixed frame size in mp3 2009-12-06 02:32:07
 so actual PCM size is in framesize increments2009-12-06 02:32:52
 but maybe the LAME tag contains the precise source file length2009-12-06 02:33:09
 and mp3zip discards that2009-12-06 02:33:16
<Skymmer> I think its because LAME gapless info introduced in 3.90.3 is missing in OUT file2009-12-06 02:33:23
 ah yes2009-12-06 02:33:28
<Shelwien> anyway it keeps the tag in default mode2009-12-06 02:33:56
 so its ok ;)2009-12-06 02:34:00
<Skymmer> out.mp3 86 513 998 2009-12-06 02:34:24
 Test.mp3 85 225 904 2009-12-06 02:34:24
<Shelwien> but i guess i'd have to find the description for that tag the next time2009-12-06 02:34:32
<Skymmer> damn2009-12-06 02:34:47
 out.mp3 86 513 998 2009-12-06 02:34:47
 Test.mp3 85 225 904 2009-12-06 02:34:48
<Shelwien> expanded it?2009-12-06 02:35:00
<toffer> hi & gn8 guys2009-12-06 02:37:04
<Skymmer> sorry, what you mean by "expanding it"?2009-12-06 02:37:09
<toffer> just came home again from a club2009-12-06 02:37:12
<Shelwien> hi ;)2009-12-06 02:37:28
<Skymmer> Aloha!2009-12-06 02:37:54
<Shelwien> skymmer: i mean that mp3zip output apparently larger than input2009-12-06 02:38:16
<Skymmer> yes.2009-12-06 02:38:28
<Shelwien> no luck ;)2009-12-06 02:38:52
<toffer> mhhh spaghetti2009-12-06 02:39:24
<Shelwien> what?2009-12-06 02:39:36
<Skymmer> spaghetti its the new lossless audio compressor2009-12-06 02:40:06
 :))2009-12-06 02:40:13
<toffer> ^^2009-12-06 02:40:14
<Skymmer> Toffer, why you so amazed about spaghetti? have you smoked something? ;)2009-12-06 02:42:23
<Shelwien> spaghetti?2009-12-06 02:42:53
<toffer> no, not at all2009-12-06 02:42:59
 but i'm really hungry right now. and my girlfriend wants to eat that stuff, too2009-12-06 02:43:18
<Skymmer> What kind of music was in the club?2009-12-06 02:43:57
<toffer> well it was some kind of "huge hall" they played everything. i mostly like back musikc2009-12-06 02:48:35
 musci2009-12-06 02:48:36
 music2009-12-06 02:48:38
<Skymmer> :)) Haaa.. You drunk probably. Not offensive. Just curious ;)2009-12-06 02:52:37
<toffer> partially2009-12-06 02:55:05
<Skymmer> I'm pretty sure that there wasn't music like this one *devil*2009-12-06 03:01:16
 http://skymmer.narod.ru/misc/Glenn.mp3 2009-12-06 03:01:18
<toffer> not gonna listen to anything right now2009-12-06 03:06:31
 otherwise everybody gets awake2009-12-06 03:06:39
<Skymmer> Bye people. Gonna sleep...2009-12-06 03:14:49
*** Skymmer has left the channel2009-12-06 03:15:28
<toffer> gn8 from me, too2009-12-06 03:15:45
*** toffer has left the channel2009-12-06 03:15:49
*** STalKer-X has joined the channel2009-12-06 05:00:24
*** STalKer-Y has left the channel2009-12-06 05:03:45
*** Krugz has left the channel2009-12-06 06:03:56
*** mike_____ has joined the channel2009-12-06 11:38:36
<Shelwien> http://encode.dreamhosters.com/showthread.php?t=511 2009-12-06 14:54:25
*** mike_____ has left the channel2009-12-06 15:48:52
*** pinc has joined the channel2009-12-06 19:13:00
*** Krugz has joined the channel2009-12-06 20:23:11
*** Krugz has left the channel2009-12-06 20:27:44
*** Krugz has joined the channel2009-12-06 20:39:12
*** pinc has left the channel2009-12-06 20:40:30
*** Krugz has left the channel2009-12-06 21:04:08
*** Krugz has joined the channel2009-12-06 21:06:52
 there's a conspiracy!2009-12-06 21:11:11
 compilers store a timestamp into exe header2009-12-06 21:11:29
 so that i won't be able to determine whether two exes are equal by comparing their hashes2009-12-06 21:11:56
* Shelwien is bruteforcing compiler options2009-12-06 21:12:22
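A minimal sketch of one way around the timestamp problem: zero the COFF `TimeDateStamp` field before hashing, so two otherwise identical exes compare equal. The field offset (4 bytes at `e_lfanew + 8`) follows the PE/COFF layout; the function names and the use of FNV-1a as the hash are assumptions made for illustration, not anything used in the channel.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reads a little-endian 32-bit value; the caller guarantees off + 4 <= d.size().
static uint32_t read_le32(const std::vector<uint8_t>& d, std::size_t off) {
    return uint32_t(d[off]) | (uint32_t(d[off + 1]) << 8) |
           (uint32_t(d[off + 2]) << 16) | (uint32_t(d[off + 3]) << 24);
}

// Returns a copy of the image with the COFF TimeDateStamp zeroed.
// Anything that does not look like a PE file is returned unchanged.
std::vector<uint8_t> strip_pe_timestamp(std::vector<uint8_t> image) {
    if (image.size() < 0x40 || image[0] != 'M' || image[1] != 'Z') return image;
    const std::size_t pe = read_le32(image, 0x3C);   // e_lfanew
    // Need "PE\0\0" (4) + Machine (2) + NumberOfSections (2) + TimeDateStamp (4).
    if (pe > image.size() || image.size() - pe < 12) return image;
    if (image[pe] != 'P' || image[pe + 1] != 'E' ||
        image[pe + 2] != 0 || image[pe + 3] != 0) return image;
    for (std::size_t i = 0; i < 4; ++i) image[pe + 8 + i] = 0;
    return image;
}

// FNV-1a (64-bit), used here only as a stand-in for whatever hash is preferred.
uint64_t fnv1a64(const std::vector<uint8_t>& d) {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (uint8_t b : d) { h ^= b; h *= 0x100000001b3ULL; }
    return h;
}

// Hypothetical helper: compares two exe images while ignoring the header timestamp.
bool same_exe_ignoring_timestamp(const std::vector<uint8_t>& a,
                                 const std::vector<uint8_t>& b) {
    return fnv1a64(strip_pe_timestamp(a)) == fnv1a64(strip_pe_timestamp(b));
}
```

This only covers the header field. Linkers may also write timestamps into other places (for example debug or export directories), so two builds can still differ after this step.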
<Krugz> lol2009-12-06 21:13:16
*** Krugz has left the channel2009-12-06 21:14:16
*** Krugz has joined the channel2009-12-06 21:14:55
*** Krugz has left the channel2009-12-06 21:16:22
*** Krugz has joined the channel2009-12-06 21:17:01
*** STalKer-X has left the channel2009-12-06 22:56:42
*** Shelwien has left the channel2009-12-06 23:04:08
*** Shelwien has joined the channel2009-12-06 23:04:13
*** STalKer-X has joined the channel2009-12-06 23:19:03
*** schnaader has joined the channel2009-12-07 00:44:54
<schnaader> hi @ all - kinda late, I know, but I thought I could at least have a look who's there2009-12-07 00:45:20
<Shelwien> hi2009-12-07 00:45:35
 i have my usual problem with gcc and templates here2009-12-07 00:46:09
<schnaader> what was the compiler you talked about that has timestamps? I did a bit of research because I had similar issues sometimes and found that at least GCC doesn't seem to include timestamps.2009-12-07 00:46:55
<Shelwien> MS linker does2009-12-07 00:47:10
<schnaader> hm... never did that much template things, so I fear I can't help you there2009-12-07 00:47:16
<Shelwien> well, gcc is annoying as hell here2009-12-07 00:47:30
<schnaader> ah, I guess that was where I got the problem, too, MSVC2009-12-07 00:47:36
<Shelwien> it can't compile some code with which MSC/IntelC don't have any problems2009-12-07 00:47:48
 and i don't quite know how to solve it2009-12-07 00:48:10
 its not the first time too...2009-12-07 00:48:20
 to be specific2009-12-07 00:49:07
 if i try to use something like2009-12-07 00:49:13
 template< int flag > class A : public B<flag> { ... } 2009-12-07 00:49:39
<schnaader> you could try to post a question or search for similar problems at http://stackoverflow.com/ - they are really quick in giving very good answers especially to C questions there2009-12-07 00:49:42
<Shelwien> gcc doesn't see anything from template B there2009-12-07 00:50:09
 dumb thing2009-12-07 00:50:12
 and i don't know how to search for it2009-12-07 00:50:30
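The actual code is not shown in the log, so the following is only a guess at the cause: with a base class that depends on a template parameter, standard C++ does not look up unqualified names in that base (two-phase name lookup), which gcc enforces and which the MS-style compilers of that time tended to let through. A minimal sketch with hypothetical members `value` and `twice`, showing three common ways to make the names visible:

```cpp
template <int flag>
struct B {
    int value = flag;
    int twice() const { return 2 * value; }
};

template <int flag>
struct A : public B<flag> {
    // Option 2: a using-declaration pulls one base member into scope.
    using B<flag>::twice;

    int sum() const {
        // Option 1: `this->` makes the name dependent, so lookup is deferred
        // until A<flag> is instantiated and B<flag> is known.
        return this->value + twice();
    }

    int qualified() const {
        // Option 3: qualify the member with the base class explicitly.
        return B<flag>::value;
    }

    // int broken() const { return value; }   // rejected by gcc: `value` is not
    //                                        // looked up in the dependent base
};
```

For searching, "dependent base class" and "two-phase name lookup" are the usual terms for this behaviour.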
<schnaader> If you can strip it down to a short example and some text describing your problem, I could also post it for you on SO, so you wouldn't need to register2009-12-07 00:53:11
 there seem to be some template FAQs around, but they all seem to handle different things as far as I can tell...2009-12-07 00:53:39
<Shelwien> ...i've also got another internal error from IntelC while trying to make it portable ;)2009-12-07 00:54:55
<schnaader> internal errors suck :) I had to do some workaround once where the Delphi compiler would throw one when code was compiled with the command-line compiler, but the IDE would just compile it fine. rewriting the code a bit solved it, although both versions seemed to be perfectly valid code...2009-12-07 00:57:24
<Shelwien> well, i managed to make a workaround with macros etc here2009-12-07 00:58:27
 now have some strange linking problems though2009-12-07 00:58:52
 ...2009-12-07 00:59:14
 i made a new rangecoder using coroutine template2009-12-07 00:59:34
 seems pretty nice2009-12-07 00:59:51
 now trying to compare gcc vs intelc2009-12-07 01:00:07
 at least gcc version worked after compiling2009-12-07 01:00:35
 but now i have to do something about the compiler options...2009-12-07 01:00:47
 any suggestions about gcc options btw?2009-12-07 01:03:02
<schnaader> do you need templates here for speed or just for easier changes/readability?2009-12-07 01:03:16
 hm.. -O2/-O3 -Os -s -march=... are those I usually use, didn't care about it much so far2009-12-07 01:03:59
 I really enjoyed the forum discussion about those GCC automatic profiling things, these could be handy to gain the last percents of speed out of some code :)2009-12-07 01:04:38
<Shelwien> for speed mainly2009-12-07 01:04:59
 ok, testing2009-12-07 01:05:21
 intelc time was ~42.5s for enwik9 2009-12-07 01:05:45
<schnaader> ah, that's bad.. would've recommended not using templates if it wouldn't have been speed :)2009-12-07 01:05:49
<Shelwien> for readability its important too2009-12-07 01:06:07
 well, i know how people usually solve these problems though2009-12-07 01:07:16
 they copy-paste stuff2009-12-07 01:07:27
 ...wow2009-12-07 01:07:35
 131s with gcc 4.3/mingw2009-12-07 01:07:51
 crazy2009-12-07 01:07:53
<schnaader> The PGO optimization should be easy: -fprofile-generate, run, recompile with -fprofile-use2009-12-07 01:08:09
<Shelwien> sure... its not a PGO problem yet though2009-12-07 01:08:38
<schnaader> whoa, that's 3 times faster with IntelC... either GCC is really bad here or IntelC has some neat tricks :)2009-12-07 01:08:50
<Shelwien> guess gcc is being crazy about that int64 multiplication2009-12-07 01:09:07
<schnaader> int64 = long long, or something homemade?2009-12-07 01:09:29
<Shelwien> unsigned long long2009-12-07 01:09:43
<schnaader> hm.. never experienced any major speed decreases with it and I had to use it for some Project Euler programs...2009-12-07 01:10:14
<Shelwien> thing is, that qword version is actually faster with intelc than alternative 32-bit multiplication2009-12-07 01:10:24
 148s decoding2009-12-07 01:10:42
 something is majorly wrong here...2009-12-07 01:10:58
<schnaader> so how does the 32-bit version perform with gcc, then?2009-12-07 01:11:07
<Shelwien> i'd try again with different options2009-12-07 01:12:12
 maybe it was because of inlining or unrolling2009-12-07 01:12:20
 this time exe is twice smaller2009-12-07 01:12:34
<schnaader> I would blame unrolling in that case :)2009-12-07 01:12:56
<Shelwien> ...but doesn't seem to be faster... still works2009-12-07 01:13:04
 ...it also can be a problem with i/o i guess2009-12-07 01:13:57
 129s this time too2009-12-07 01:14:21
 ok, running with 32-bit mult2009-12-07 01:15:52
 ...again, i guess no luck2009-12-07 01:16:48
 119s this time2009-12-07 01:17:38
<schnaader> you could also try to disable some of the IntelC optimizations, perhaps it's one of those that just really helps2009-12-07 01:18:22
<Shelwien> no2009-12-07 01:18:35
 it's never been slower than 60s2009-12-07 01:18:57
 and i don't even use PGO with IntelC atm2009-12-07 01:19:06
<schnaader> I've got g++ 3.4.5 here, could test it with this one :))2009-12-07 01:19:28
<Shelwien> no problem... would you be able to run IC version there?2009-12-07 01:20:06
<schnaader> would be able to run, but not to compile :)2009-12-07 01:20:21
<Shelwien> ...119s again with different i/o... dunno2009-12-07 01:20:21
 thats ok2009-12-07 01:20:29
 ok, let me test IC version again first...2009-12-07 01:22:34
 44.2s encoding2009-12-07 01:23:19
 47s decoding2009-12-07 01:24:00
 guess i need to fix that mult back2009-12-07 01:24:12
 attempt #2 2009-12-07 01:24:41
 41.891s2009-12-07 01:25:20
 41.469s2009-12-07 01:26:02
 seems like i've got some improvement from replacing some templates with macros ;)2009-12-07 01:26:28
<schnaader> :)2009-12-07 01:26:44
 btw, under a minute is quite fast for enwik9 (~20 MB/s, isn't it?) which would lead to I/O problems indeed2009-12-07 01:28:02
<Shelwien> i'm running it on ramdrive2009-12-07 01:28:19
<schnaader> OK, that's odd. perhaps gcc has problems with ramdrives, but I don't think so...2009-12-07 01:29:13
<Shelwien> no, i tested different buffers and that multiplication2009-12-07 01:29:37
 its something more general2009-12-07 01:29:47
 http://www.ctxmodel.net/files/newbrc/newbrc_0.rar 2009-12-07 01:30:07
 btw, its from "new bitwise rc" ;)2009-12-07 01:30:28
 meanwhile, got a new IC version here, would try to install and test2009-12-07 01:31:15
<schnaader> my old gcc version doesn't like -fwhole-program and gives some warnings about alignment of C0/C1, but compiling works apart from that2009-12-07 01:33:13
<Shelwien> yeah, and you can compare the speed ;)2009-12-07 01:33:39
<schnaader> guess the original mtf.exe is compiled with IntelC, right? btw, sizes are 25088 for gcc, 76288 for the other one which is quite a difference2009-12-07 01:34:52
<Shelwien> that's a static build2009-12-07 01:35:16
<schnaader> ah, OK2009-12-07 01:35:28
<Shelwien> it'd be around 30-40k with /MD 2009-12-07 01:35:29
 but static is a bit faster so i usually compile it like that2009-12-07 01:36:04
 btw, there's no model2009-12-07 01:36:29
 it encodes bits with a fixed probability, a little skewed towards 0 bits2009-12-07 01:36:59
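To make "encodes bits with a fixed probability" concrete, here is a minimal sketch of a binary arithmetic coder with no model. It assumes a carryless 32-bit design in the style of fpaq0, not the coroutine-based newbrc coder from the archive above, whose source is not shown here. `kP1`, `Encoder`, `Decoder`, `encode_bits` and `decode_bits` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed probability of a 1 bit, scaled to 12 bits (about 0.34),
// i.e. a little skewed towards 0 bits.
constexpr uint32_t kP1 = 1400;

struct Encoder {
    uint32_t x1 = 0, x2 = 0xffffffffu;   // current interval [x1, x2]
    std::vector<uint8_t> out;

    void encode(int bit) {
        // One multiplication per coded bit splits the interval.
        const uint32_t xmid = x1 + ((x2 - x1) >> 12) * kP1;
        if (bit) x2 = xmid; else x1 = xmid + 1;
        // Shift out leading bytes on which both bounds already agree.
        while (((x1 ^ x2) & 0xff000000u) == 0) {
            out.push_back(static_cast<uint8_t>(x2 >> 24));
            x1 <<= 8;
            x2 = (x2 << 8) | 255u;
        }
    }

    // Writes all four bytes of the lower bound so the decoder can finish.
    void flush() {
        for (int i = 0; i < 4; ++i) {
            out.push_back(static_cast<uint8_t>(x1 >> 24));
            x1 <<= 8;
        }
    }
};

struct Decoder {
    const std::vector<uint8_t>& in;
    std::size_t pos = 0;
    uint32_t x1 = 0, x2 = 0xffffffffu, x = 0;

    explicit Decoder(const std::vector<uint8_t>& data) : in(data) {
        for (int i = 0; i < 4; ++i) x = (x << 8) | next();
    }

    // Returns the next input byte, or 0 once the input is exhausted.
    uint32_t next() { return pos < in.size() ? in[pos++] : 0u; }

    int decode() {
        const uint32_t xmid = x1 + ((x2 - x1) >> 12) * kP1;
        const int bit = x <= xmid;
        if (bit) x2 = xmid; else x1 = xmid + 1;
        while (((x1 ^ x2) & 0xff000000u) == 0) {
            x1 <<= 8;
            x2 = (x2 << 8) | 255u;
            x = (x << 8) | next();
        }
        return bit;
    }
};

std::vector<uint8_t> encode_bits(const std::vector<int>& bits) {
    Encoder e;
    for (int b : bits) e.encode(b);
    e.flush();
    return e.out;
}

std::vector<int> decode_bits(const std::vector<uint8_t>& data, std::size_t count) {
    Decoder d(data);
    std::vector<int> bits;
    for (std::size_t i = 0; i < count; ++i) bits.push_back(d.decode());
    return bits;
}
```

With a fixed probability the output only shrinks when the input bits happen to match the skew, which fits the near-incompressible output mentioned later in the log.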
<schnaader> guess that's why it's so fast :)2009-12-07 01:37:14
<Shelwien> not really2009-12-07 01:37:21
 fpaq0pv4b is still somewhat waster2009-12-07 01:37:34
 *faster2009-12-07 01:37:36
 but there're more restrictions and rangecoder is different... a little redundant too2009-12-07 01:38:56
<schnaader> guess I should try with enwik8 here instead, enwik9 seems to take some minutes...2009-12-07 01:42:20
<Shelwien> sure ;)2009-12-07 01:43:32
<schnaader> That would be a funny abuse of your p2p thing: adding enwik8 to the download list just to quickly shorten enwik9 :)2009-12-07 01:46:12
<Shelwien> btw, that file still continues after 1G... dunno why nobody uses the whole file ;)2009-12-07 01:47:56
<schnaader> yes, I know, it's 4.8 GB or something, think they just don't care because you won't get listed on LTCB or Hutter that way :)2009-12-07 01:48:35
 Strange... I think I'll try a second run... c 40,43/d 45,07 for gcc, c 10,17/d 22,22 for intel2009-12-07 01:51:57
<Shelwien> ;)2009-12-07 01:52:07
 decoding seems kinda slow for intel too, though2009-12-07 01:52:36
<schnaader> Ah, it wasn't 10,17, it was 20,17... my mistake2009-12-07 01:54:49
 makes more sense now :)2009-12-07 01:54:56
<Shelwien> yeah2009-12-07 01:55:05
 its 3x difference here though2009-12-07 01:55:10
<schnaader> CPU here is quite slow and I/O should be limited to around 30-50 MB/s, perhaps the difference just can't show that much2009-12-07 01:56:16
 although 20 seconds is 5 MB/s, so I/O shouldn't be a problem2009-12-07 01:57:03
 f.e. fastest THOR mode gives 4 seconds and 23.75 MB/s :)2009-12-07 01:57:51
<Shelwien> well, it doesn't have to do a multiplication per data bit2009-12-07 01:58:33
 if anything, you can compare it to this - http://www.ctxmodel.net/files/fpaq0pv4b3.rar 2009-12-07 01:59:18
<schnaader> fpaq0pv4B_O3_xi.exe gives c 12.63/d 15.97 2009-12-07 02:02:40
<Shelwien> kinda weird but ok2009-12-07 02:03:41
<schnaader> btw, mtf's output for enwik8 is quite large (99,6 MB), but I guess that's normal2009-12-07 02:03:58
<Shelwien> yes, that's intentional2009-12-07 02:04:12
 and probably the main reason for speed difference with fpaq0p too2009-12-07 02:04:29
 btw, i'd probably finally add some async i/o to the new coder2009-12-07 02:07:16
 the coroutine framework made it really easy2009-12-07 02:07:41
<schnaader> talking about enwik, I saw you commented on my ISBN precompression, that was one of many items in a list I started when LTCB came out, but I stopped working on it as I realized that my PC is just too slow for experiments with enwik9 2009-12-07 02:07:49
<Shelwien> i don't think thats really a problem2009-12-07 02:08:30
 you just don't have to run paq, that's all2009-12-07 02:08:39
 paq8 is not really a good compressor, though it may sound weird2009-12-07 02:10:00
<schnaader> yes, think I could retry with something else like 7-Zip2009-12-07 02:10:03
<Shelwien> not 7z, but ppmd/ppmonstr would be ok2009-12-07 02:10:24
<schnaader> it's just that I didn't want to optimize size for some compressor and see that results will get worse for PAQ2009-12-07 02:10:37
<Shelwien> lzma (as any LZ) is really bad at text compression2009-12-07 02:10:50
<schnaader> I switched to calgary corpus after that. was fun, first I did most of the preprocessors as little COM files using ASM, later included them into the PAQ source directly2009-12-07 02:13:00
<Shelwien> you know that, right? http://www.mailcom.com/challenge/ 2009-12-07 02:13:27
<schnaader> yes, that was why I did it :)2009-12-07 02:13:44
<Shelwien> btw, wanna hear my idea about enwik compression?2009-12-07 02:14:13
<schnaader> also had a look at SHA-1, but I guess it's not worth the effort - at least you have a better chance to improve compression instead :)2009-12-07 02:14:21
 sure2009-12-07 02:14:29
<Shelwien> its very different from "general purpose"2009-12-07 02:15:02
 basically, multipass compression2009-12-07 02:15:25
 and btw kinda related to my approach to recompression too - which i described before2009-12-07 02:15:46
 so, a multipass lossy filter with coding of extra information to make it lossless2009-12-07 02:16:55
 there're many specific cases where compression can be improved by little tweaking2009-12-07 02:17:48
 like capital conversion text filters etc2009-12-07 02:18:02
<schnaader> OK, that's an interesting approach. Some of my ideas also included very basic "lossy" things like "insert the same text here every time and afterwards just change some of the words so they get correct"2009-12-07 02:18:07
<Shelwien> well, now to examples2009-12-07 02:18:42
 another popular text filter is "syntax stuffing"2009-12-07 02:19:04
 like we usually write "word,"2009-12-07 02:19:29
 but the most common symbol in word context is usually space2009-12-07 02:19:58
 so alternatives are kinda bad - they mess up predictions2009-12-07 02:20:25
 so, we can insert a space into each place like that2009-12-07 02:21:10
 like s/([\w])([,.;])/$1 $2/g in regexp form2009-12-07 02:21:56
<schnaader> yes, even just doing so and thus having a separate "word stream" and a symbol stream like " , . ," should give better compression2009-12-07 02:22:09
<Shelwien> not with paq2009-12-07 02:22:27
 but if we'd do just this space padding2009-12-07 02:22:46
 and encode the information to revert it back into a separate stream2009-12-07 02:23:03
 that can get us an improvement even with paq2009-12-07 02:23:21
 now, that "information encoding" is the point2009-12-07 02:23:43
 we'd need a "backward regexp" like s/([\w]) ([,.;])/$1$2/g 2009-12-07 02:24:36
 but we can't just apply it as is and restore the data2009-12-07 02:25:04
 because there might be cases like that from before 2009-12-07 02:25:22
 so, for each case2009-12-07 02:25:30
 we'd need to encode a flag - whether to perform the replacement or not2009-12-07 02:26:03
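A minimal sketch of that space-padding filter with its flag stream, written as a hand-rolled scan instead of a real regexp engine. `is_word` approximates `\w` for ASCII only, and `Padded`, `pad_forward` and `pad_inverse` are hypothetical names. Flag 1 means "this space was inserted, remove it"; flag 0 means "the original already had `word ,` here, keep it".

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// ASCII approximation of \w and of the [,.;] class from the regexps above.
bool is_word(char c) { return std::isalnum(static_cast<unsigned char>(c)) || c == '_'; }
bool is_punct(char c) { return c == ',' || c == '.' || c == ';'; }

struct Padded {
    std::string text;          // text after s/(\w)([,.;])/$1 $2/g
    std::vector<bool> flags;   // one flag per "\w [,.;]" match in `text`
};

Padded pad_forward(const std::string& s) {
    Padded r;
    for (std::size_t i = 0; i < s.size(); ++i) {
        r.text.push_back(s[i]);
        if (!is_word(s[i])) continue;
        if (i + 1 < s.size() && is_punct(s[i + 1])) {
            r.text.push_back(' ');        // inserted space, must be undone later
            r.flags.push_back(true);
        } else if (i + 2 < s.size() && s[i + 1] == ' ' && is_punct(s[i + 2])) {
            r.flags.push_back(false);     // space was already there, keep it
        }
    }
    return r;
}

std::string pad_inverse(const Padded& p) {
    std::string out;
    std::size_t k = 0;                    // index of the next unread flag
    const std::string& t = p.text;
    for (std::size_t j = 0; j < t.size(); ++j) {
        out.push_back(t[j]);
        const bool candidate = is_word(t[j]) && j + 2 < t.size() &&
                               t[j + 1] == ' ' && is_punct(t[j + 2]);
        if (candidate && k < p.flags.size() && p.flags[k++]) ++j;  // drop the space
    }
    return out;
}
```

In a real coder the flags would go through their own context model rather than being stored raw; most of them are 1 in ordinary text, so they should compress well.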
 also, such cases are not unrelated to context2009-12-07 02:26:35
 so ideally we'd need a full context model2009-12-07 02:26:49
 taking into account both data before and after the replacement point2009-12-07 02:27:06
 bi-directional context ;)2009-12-07 02:27:14
<schnaader> :)2009-12-07 02:27:29
<Shelwien> thus, there's still a place for heavy CM2009-12-07 02:27:36
 but there's a difference from paq approach2009-12-07 02:27:53
 passes are independent2009-12-07 02:28:02
 and there're usually not much of flags (comparing to whole enwik)2009-12-07 02:28:36
 so we should be able to collect more detailed statistics than paq8 2009-12-07 02:29:07
 and still not care about memory overflows etc2009-12-07 02:29:29
 so, as i see it2009-12-07 02:30:05
 to solve this task2009-12-07 02:30:09
 i have to write such a reversible regexp implementation2009-12-07 02:30:22
 and then optimize a model for each regexp (automatically)2009-12-07 02:30:56
<schnaader> good luck with that - sounds promising2009-12-07 02:31:14
<Shelwien> yeah2009-12-07 02:31:21
 for example, at some point2009-12-07 02:31:31
 we can start replacing words2009-12-07 02:31:41
 like with synonyms2009-12-07 02:31:49
 or just with a "<word>" tag2009-12-07 02:32:00
 thus, it would be possible to not only handle the simple direct contexts2009-12-07 02:32:34
 but also higher-level language dependencies2009-12-07 02:33:00
 like sentence structure etc2009-12-07 02:33:08
 also, it would be possible to take into account word distances and stuff like that2009-12-07 02:33:46
 which a sequential model can't handle because of memory problems2009-12-07 02:34:02
 ...but there's a small problem ;)2009-12-07 02:34:33
 such an enwik model would be very specific2009-12-07 02:34:56
 it won't be really applicable to anything else2009-12-07 02:35:06
 and doing it just for the prize2009-12-07 02:36:31
<schnaader> yes, but this is the case with everything that works well on enwik, although the data is a nice example for text and a good mix, but there are some things like those redundant city parts that are very specific2009-12-07 02:36:54
<Shelwien> would mean working for $3/hour (very optimistically) ;)2009-12-07 02:36:58
 well, unfortunately there's much more stuff beside "city parts" (which are afair not in enwik8 anyway)2009-12-07 02:37:48
<schnaader> yes, they are mainly in enwik9, there's 1 or 2 entries of it in enwik8, I think, but I'm not sure2009-12-07 02:38:17
<Shelwien> like xml markup, html markup, wiki markup etc2009-12-07 02:38:19
 including ISBN too ;)2009-12-07 02:38:29
<schnaader> it really bothers me that removing XML tags doesn't improve compression for PAQ (there's something about it stated on the site) which seems just weird...2009-12-07 02:39:09
 and AFAIR, it's even about completely removing tags, not replace/optimize them like using xmlwrt2009-12-07 02:39:46
<Shelwien> well, articles are rather big2009-12-07 02:39:56
 but afaik it still helped in my experiments2009-12-07 02:40:16
 if you want, i can post my enwik parser (one of)... its in perl though2009-12-07 02:40:47
<schnaader> thanks, but got no perl here :) I've also written some programs to f.e. extract user/ID lists, so it wouldn't be too much work to write my own2009-12-07 02:41:57
<Shelwien> in fact, i was thinking about doing it completely in perl ;)2009-12-07 02:42:31
 like, implementing these reversible regexps somehow2009-12-07 02:42:53
 and doing compression with an external coder2009-12-07 02:43:18
<schnaader> I thought about starting some brute-force compression program (running every possible program either in ASM or some own language) and to have it run in background on things like calgary corpus and enwik. It would take MUCH time and might just not give any results at all in hundred years, but if it would, you'd have a part of your data compressed just perfect :)2009-12-07 02:50:32
<Shelwien> well, that's what we're doing in a way (me and toffer at least)2009-12-07 02:51:37
<schnaader> Although I doubt there are useful things smaller than 16 bytes even in ASM and bruteforcing till there will take some time ;)2009-12-07 02:51:47
<Shelwien> well, sure thing that you won't get anywhere with x86 asm bruteforcing ;)2009-12-07 02:52:25
 you can try zpaq though ;)2009-12-07 02:52:38
<schnaader> yes, that could actually be a nice try, although I still haven't managed to find time for reading the specifications and writing some own config files for it2009-12-07 02:56:09
*** dagdsg has joined the channel2009-12-07 02:56:24
<Shelwien> well, i have mixed feelings about zpaq2009-12-07 02:57:09
 in a way, i wanted to make something similar for a long time2009-12-07 02:57:43
 but zpaq is completely different from what i wanted, even though it seems very similar if i'd try to describe it ;)2009-12-07 02:58:52
 for example, i have some parameter description syntax2009-12-07 02:59:51
 there're some .idx files with parameter types and values2009-12-07 03:00:12
 and a preprocessor which generates C++ from .idx files2009-12-07 03:00:39
 two different kinds of C++ in fact - one version for optimization and another for "release builds"2009-12-07 03:01:19
<schnaader> :)2009-12-07 03:01:25
<Shelwien> and i'd like to further extend this - by adding also some model description syntax2009-12-07 03:02:15
 as its commonly redundant in C++ - i have to copy-paste the same stuff with modified numbers/letters in a few places2009-12-07 03:03:01
 when i want to add a model component or something like that2009-12-07 03:03:25
 and now2009-12-07 03:03:40
 zpaq kinda has exactly that - the model description syntax2009-12-07 03:04:12
 but its completely useless for me2009-12-07 03:04:27
 quite a shock ;)2009-12-07 03:04:37
<schnaader> OK, because it's different to your approach or just because it's useless :) ?2009-12-07 03:05:16
<Shelwien> as far as i can see, just because its useless :(2009-12-07 03:06:08
<schnaader> yes, I also have a very general concept that I was reminded of when zpaq appeared. it's basically about just describing your input data in a scripting way, f.e. you could just tell it "there's a byte at the next position that can take values 0, 5 and 10-129, and will be 0 most the time" in a very user-friendly way and the compression implementation would just look for the best way to compress the data for you. The scripts could get compiled, added to the compressed data and used for decompression.2009-12-07 03:06:21
<Shelwien> http://sweetscape.com/010editor/templates.html 2009-12-07 03:07:27
<schnaader> yes, pretty much that way, just used for compression2009-12-07 03:09:43
 would be nice because if you'd release a new file format, just release the script for it and everybody can use it for analysing, detecting, preprocessing or directly compressing your data.2009-12-07 03:11:14
<Shelwien> well, a structure definition can be used for compression directly2009-12-07 03:11:23
 even more, in fact, the compression is not that important2009-12-07 03:12:02
 its possible to write a filter2009-12-07 03:12:22
 which would parse the syntax and produce streams compressible by "universal" compressors2009-12-07 03:12:58
 Shkarin's durilca is the best example of that2009-12-07 03:13:16
 especially its x86 parser/disassembler2009-12-07 03:13:33
 ...but that's a different direction from what i talked about before2009-12-07 03:14:29
 there's also always some choice of model design elements2009-12-07 03:15:23
 ideally, heavier models would produce better predictions2009-12-07 03:16:10
 but usually we have to take speed into account2009-12-07 03:16:24
 and its normal to discard small improvements in compression which hurt the speed2009-12-07 03:17:02
 also the heavier models even don't really guarantee an improvement2009-12-07 03:17:40
 because it all works with limited precision2009-12-07 03:17:49
 and errors accumulate2009-12-07 03:18:05
 so, structure parsing is one thing2009-12-07 03:18:23
 but we also need readable model definitions for structure elements2009-12-07 03:19:00
 and a proper support for parameters in these models2009-12-07 03:19:43
 and i'm paying more attention to that side, i guess2009-12-07 03:20:23
 because its usually better to write format parsers directly in C/C++2009-12-07 03:21:00
 (faster etc)2009-12-07 03:21:06
<schnaader> yes, if you're a programmer, there's no need for that user-friendly shit :)2009-12-07 03:21:56
<Shelwien> well, in fact, i've got much closer to getting my model definition syntax lately2009-12-07 03:23:04
 the main problem was always about selection of basic components2009-12-07 03:23:32
 and things I use now have much better mathematical foundations than before ;)2009-12-07 03:25:39
<schnaader> hehe, guess there have been some calculations and experiments in the meantime :)2009-12-07 03:26:21
<Shelwien> well, for example, I now understand how the paq mixer works ;)2009-12-07 03:27:31
 which was a problem for a while, because Matt doesn't know that ;)2009-12-07 03:28:32
 he just took some formulas from neural networks and rederived the gradient for update formula2009-12-07 03:29:52
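A minimal sketch of the kind of mixer being discussed, assuming the commonly described logistic-mixing form: inputs are mapped through stretch(p) = ln(p/(1-p)), combined as a weighted sum, squashed back to a probability, and the weights move along the gradient of the coding cost. This is floating point for readability; the function names and the learning rate are illustrative assumptions, not paq's actual fixed-point code.

```python
import math

def stretch(p):
    # Logit: maps a probability in (0, 1) to the real line.
    return math.log(p / (1.0 - p))

def squash(x):
    # Logistic function, the inverse of stretch.
    return 1.0 / (1.0 + math.exp(-x))

def mix(weights, probs):
    # Weighted sum in the stretched domain, squashed back to a probability.
    return squash(sum(w * stretch(p) for w, p in zip(weights, probs)))

def update(weights, probs, bit, lr=0.02):
    # Gradient step on coding cost: each weight moves by lr * error * its input.
    # lr is an arbitrary illustrative value.
    err = bit - mix(weights, probs)
    return [w + lr * err * stretch(p) for w, p in zip(weights, probs)]
```

With a single input and weight 1.0 the mixer just returns that input, and repeated updates toward an observed bit raise the mixed probability of that bit.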
<schnaader> :)2009-12-07 03:30:18
 well, sometimes you're content if it just works and don't care why although it would be better :)2009-12-07 03:31:39
<Shelwien> well, yeah2009-12-07 03:32:10
 but you see, as i want to do better than paq2009-12-07 03:32:30
 so i have to understand how it works, even if Matt doesn't ;)2009-12-07 03:32:48
<schnaader> yeah, searching for such things one didn't completely understand and improving them seems like the best way to get better :)2009-12-07 03:34:30
 and after that, you can add your own ideas ;)2009-12-07 03:35:04
<Shelwien> the most interesting thing in paq was something different though2009-12-07 03:35:51
 well, its kinda obvious, but looks surprising when you see it used in a compressor2009-12-07 03:36:20
 i mean the use of PRNG in counter updates2009-12-07 03:36:51
<schnaader> yes, that was the first thing I changed when first seeing the PAQ code, setting the PRNG to output 0 always and see how it hurts the compression :)2009-12-07 03:37:51
<Shelwien> well, its reasonable that adding 0.5 to an integer is the same as adding 1 with probability 0.52009-12-07 03:38:58
 but somehow surprising when it really works ;)2009-12-07 03:39:07
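The equivalence mentioned here (adding 0.5 to an integer versus adding 1 with probability 0.5) can be sketched as stochastic rounding. This is only an illustration of the idea, not paq's counter code; `stochastic_add` is a hypothetical helper name.

```python
import math
import random

def stochastic_add(counter, delta, rng):
    # Add the whole part of delta exactly, then add 1 with probability
    # equal to the fractional part, so the expected increment is delta.
    whole = math.floor(delta)
    frac = delta - whole
    return counter + whole + (1 if rng.random() < frac else 0)

def accumulate(delta, steps, seed=1234):
    # Apply the same fractional increment many times to an integer counter.
    rng = random.Random(seed)
    counter = 0
    for _ in range(steps):
        counter = stochastic_add(counter, delta, rng)
    return counter
```

Over many updates the integer counter tracks steps * delta on average, without storing any fractional bits in the counter itself.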
<schnaader> :)2009-12-07 03:39:42
<Shelwien> as a consequence, though, paq now can compress a block of zeroes into a few kb of random data ;)2009-12-07 03:39:49
<schnaader> although you wouldn't need a PRNG here, you could just do some static approach like with image dithering, couldn't you?2009-12-07 03:40:08
<Shelwien> that might require keeping some state somewhere2009-12-07 03:41:16
<schnaader> ah, I see2009-12-07 03:41:28
<Shelwien> PRNG is more universal in this case, yeah2009-12-07 03:41:37
<schnaader> almost forgot that PRNG in PAQ... had wasted some time with it brute-forcing seeds to get the output a bit smaller (I think 10 bytes smaller was the best result I had)...2009-12-07 03:45:02
<Shelwien> i tried to replace the rangecoder there instead2009-12-07 03:46:07
 got somewhat better results, especially with redundant data2009-12-07 03:46:45
 but still winning even 1000 bytes at enwik is kinda disappointing ;)2009-12-07 03:47:16
<schnaader> :)2009-12-07 03:47:29
 What about an AI approach, btw? I doubt if there is a useful AI approach on enwik9, perhaps on the whole enwik file it will make more of a difference, but as it was one of the main intentions of LTCB, it's quite sad there haven't been (successful) attempts...2009-12-07 03:49:39
 Although you could say that some of the dictionary sorting and grammatical things that were done are somewhat external AI attempts2009-12-07 03:50:32
<Shelwien> there's no such thing as AI approach imho2009-12-07 03:50:49
 well, i guess we can take the cyc database2009-12-07 03:51:10
 and try to somehow use it for enwik prediction2009-12-07 03:51:24
 but that won't be compatible with problem restrictions2009-12-07 03:52:35
 (decoder size etc)2009-12-07 03:52:46
 so, if anything, the "AI approach" would be to take into account more correlations in the data2009-12-07 03:53:57
 not only the direct sequential contexts2009-12-07 03:54:51
 but a lot of other things too, up to semantics if possible2009-12-07 03:55:22
 but on other hand, there's nothing "AI" in that2009-12-07 03:55:59
 analyzing sentence templates which i mentioned before is like that, for example, but there's nothing that unique in it2009-12-07 03:57:24
 the main problem is that there's no magical universal function2009-12-07 03:58:04
 so instead, we have to collect lots of different dependencies in data2009-12-07 03:59:02
 and remove the redundancy corresponding to each of them2009-12-07 03:59:28
 ...and as if there's not enough of these in plain english2009-12-07 04:01:57
 enwik also has lots of artificial markup, which is relatively easy to interpret, but still requires writing specific parsers etc2009-12-07 04:03:06
<schnaader> yes, actually the first items on my enwik list are just about separating the articles or preprocessing the data to remove HTML characters etc.2009-12-07 04:05:44
<Shelwien> there're lots of masked html stuff in there2009-12-07 04:06:34
 like &lt;html&gt; ;)2009-12-07 04:06:42
<schnaader> Yes, and have you seen that ASCII table? Could just generate it straightforward if there wouldn't be that HTML shit :)2009-12-07 04:07:25
<Shelwien> the problem is that its apparently impossible to just replace stuff everywhere2009-12-07 04:08:19
<schnaader> I somehow still expect enwik to contain some more data that can be generated. I already found some tables, numbers and things like that, but it's quite time consuming to search for things like that.2009-12-07 04:08:30
<Shelwien> i mean, exporting articles from xml2009-12-07 04:08:35
 and doing s/&lt;/</g is not completely reversible2009-12-07 04:08:54
 yeah2009-12-07 04:09:33
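A small sketch of why unmasking entities is not reversible on its own: once a masked "&lt;" is turned into "<", it can no longer be told apart from a "<" that was literal in the input (such as a real XML tag). The sample string and function names are made up for illustration.

```python
def unmask(s):
    # Forward transform: turn the masked entity into a literal character.
    return s.replace("&lt;", "<")

def remask(s):
    # Naive inverse: mask every "<" again.
    return s.replace("<", "&lt;")

# A real tag and a masked one in the same string (hypothetical sample).
sample = "<text>a &lt;br&gt; b</text>"
roundtrip = remask(unmask(sample))
# The literal "<" of the real tags gets masked too, so roundtrip != sample.
```

The round trip only works on text that contains no literal "<" to begin with, which is why the exceptions have to be recorded somewhere (or the transform restricted) to keep it lossless.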
<schnaader> Well, there are some unused byte values I thought about using for that, although there aren't enough to completely replace all HTML entities.2009-12-07 04:09:57
<Shelwien> i think that the right way would be to do it incrementally2009-12-07 04:10:01
 like, take the first articles and properly compress it2009-12-07 04:10:22
 then generalized the rules to include the second article2009-12-07 04:10:41
 etc 2009-12-07 04:10:42
 a lot of manual work either way2009-12-07 04:10:55
 *the first article2009-12-07 04:11:34
 * generalize2009-12-07 04:11:37
<schnaader> btw, major problem for ISBN are the different formats. Most of the time there will be "ISBN xxxxxxxxxxx", but there are variations like "ISBN x xxxxxx xx x", ISBN "x-xxx-xxxxx-x"...2009-12-07 04:14:32
 And if detection gets too general, you'll change some numbers that aren't ISBN2009-12-07 04:15:23
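A hedged sketch of how the format variations could be handled: one regexp that tolerates spaces or hyphens between digits, plus the standard ISBN-10 check digit (weights 10 down to 1, sum divisible by 11) to reject numbers that only look like an ISBN. The checksum filter is an added assumption, not something proposed in the chat, and only 10-digit ISBNs are covered.

```python
import re

# "ISBN " followed by 10 digits with optional space/hyphen separators;
# the last character may be X. The lookahead avoids cutting longer numbers.
ISBN10 = re.compile(r"ISBN ((?:\d[ -]?){9}[\dXx])(?!\d)")

def isbn10_ok(digits):
    # ISBN-10 check: sum of digit * weight (10..1) must be divisible by 11.
    total = 0
    for i, ch in enumerate(digits):
        value = 10 if ch in "Xx" else int(ch)
        total += (10 - i) * value
    return total % 11 == 0

def find_isbns(text):
    # Return normalized (separator-free) ISBN-10 strings that pass the check.
    found = []
    for m in ISBN10.finditer(text):
        digits = re.sub(r"[ -]", "", m.group(1))
        if isbn10_ok(digits):
            found.append(digits)
    return found
```

The original separator layout would still have to be stored as side information to restore the text exactly.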
<Shelwien> well, i doubt that ISBN add that much redundancy there ;)2009-12-07 04:15:29
<schnaader> No, not really :) Didn't count them, but I doubt it's more than 10000 of them there2009-12-07 04:15:53
<Shelwien> and with controlled regexps like i described2009-12-07 04:16:05
 its not really a problem even if there'd be some mismatches2009-12-07 04:16:17
 ...we'd also need some tricky algorithms there too, though2009-12-07 04:19:43
 like optimal parsing, context clustering etc2009-12-07 04:20:07
 and also dictionary compression2009-12-07 04:20:14
 its not really necessary to compress a standalone dictionary there2009-12-07 04:20:56
 but its a good simple testfile for a morphology model2009-12-07 04:21:21
<schnaader> :)2009-12-07 04:21:37
<Shelwien> and i don't quite understand how to build that2009-12-07 04:21:44
 ...i guess its still good that its english though2009-12-07 04:23:08
 because it'd be even more complex with eg. russian2009-12-07 04:23:33
<schnaader> finnish would be most extreme, I guess :)2009-12-07 04:25:24
<Shelwien> i'd say chinese ;)2009-12-07 04:25:51
<schnaader> OK, you won :)2009-12-07 04:25:58
* Shelwien recently suggested using chinese wiki dump (zhwiki) to Sami in his new benchmark2009-12-07 04:26:34
 btw, have you seen that german Wikipedia DVD result for Precomp?2009-12-07 04:27:39
<Shelwien> ...guess not2009-12-07 04:29:11
 not even sure what you're talking about ;)2009-12-07 04:29:24
<schnaader> http://schnaader.info/precomp_wiki_dvd_04dev.html2009-12-07 04:29:27
<Shelwien> still don't quite understand what's that DVD2009-12-07 04:31:22
<schnaader> They recently switched the "zeno" format they used there to something (hopefully) more efficient than zLib, I was quite shocked when I saw you could compress the DVD to almost half of its size2009-12-07 04:31:22
 AFAIK it's the whole Wikipedia with reduced images and without discussion entries so it fits on a DVD2009-12-07 04:32:09
<Shelwien> the usual problem with DVDs is that they're fixed-size2009-12-07 04:32:23
 so people sometimes even hide something unnecessary there2009-12-07 04:32:52
 just to make the software use up all the DVD space2009-12-07 04:33:15
<schnaader> yeah, that's right. Although additional 2 GB could be used for better image quality or something similar in that case, I suppose.2009-12-07 04:34:10
<Shelwien> does precomp find anything in enwik btw? ;)2009-12-07 04:35:32
<schnaader> And the DVD isn't sold or anything like that, I think, it's primarily a download.2009-12-07 04:35:41
<Shelwien> huh. then its really surprising2009-12-07 04:36:08
 they could use bzip at least2009-12-07 04:36:16
<schnaader> I think that's what they use for the new "zeno" format, could've been LZMA, too, I don't remember2009-12-07 04:36:46
 Hehe, there are some GIF mismatches in enwik9 because GIF detection is looking for "GIF87"/"GIF89", but nothing relevant :)2009-12-07 04:37:17
<Shelwien> i'm kinda not sure about LZMA being better than bzip2 for text compression2009-12-07 04:37:25
<schnaader> Ah, found it, the new format is called "ZIM" - http://openzim.org/Main_Page - I also found a quote that says "article took 3 GB before, 1.4 GB with ZIM"2009-12-07 04:39:51
 They're using bzip2, lzma is an option, but not implemented2009-12-07 04:40:46
 http://openzim.org/ZIM_File_Format#Clusters2009-12-07 04:40:59
<Shelwien> wonder why they don't participate in hutter challenge ;)2009-12-07 04:42:03
<schnaader> http://openzim.org/ZIMwriter says "coming soon...", would've been nice to let it run over enwik9 and compare the result with plain bZip2 :)2009-12-07 04:44:56
 Although they seem to create some search indexes there, too, which isn't that helpful :)2009-12-07 04:45:36
<Shelwien> Zim is the surname of one of my friends here, i'd ask him about it ;)2009-12-07 04:48:23
 meanwhile, the question is how to find the reverse regexps2009-12-07 04:53:00
 i mean, like, automatically derive s/([\w]) ([,.;])/$1$2/g from s/([\w])([,.;])/$1 $2/g2009-12-07 04:53:56
<schnaader> Just to make sure I understand the regexp: This changes "bla, bla.bla;" to "bla , bla .bla ;", right?2009-12-07 04:55:51
*** STalKer-Y has joined the channel2009-12-07 04:56:14
 it could be helpful to use some easier format you can transform to regexp and that (the easier format) can be reversed easier.2009-12-07 04:59:24
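The two regexps from the discussion can be tried directly, written here with Python's `re` syntax (`\1` instead of `$1`). The sketch shows that the reverse substitution restores the example, but is not a true inverse when the input already contains a space before punctuation.

```python
import re

def forward(s):
    # s/([\w])([,.;])/$1 $2/g : insert a space before punctuation after a word char.
    return re.sub(r"(\w)([,.;])", r"\1 \2", s)

def reverse(s):
    # s/([\w]) ([,.;])/$1$2/g : the hand-derived reverse rule.
    return re.sub(r"(\w) ([,.;])", r"\1\2", s)

example = "bla, bla.bla;"
spaced = forward(example)       # "bla , bla .bla ;"
restored = reverse(spaced)      # back to "bla, bla.bla;"

# Input that already has "word ," is left alone by forward(),
# but reverse() still removes the space, so the round trip is lossy.
tricky = "bla , bla"
lossy = reverse(forward(tricky))  # "bla, bla"
```

So an automatically derived reverse rule would also need the positions (or a count) of pre-existing matches as side information.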
*** STalKer-X has left the channel2009-12-07 05:01:11
*** schnaader has left the channel2009-12-07 05:03:07
<Shelwien> !next2009-12-07 05:07:32