*** pinc has left the channel		2009-12-05 17:49:20
*** schnaader has left the channel		2009-12-05 17:52:19
*** toffer_ has joined the channel		2009-12-05 17:59:29
*** toffer has left the channel		2009-12-05 18:01:58
*** mike_____ has joined the channel		2009-12-05 18:19:39
*** schnaader has joined the channel		2009-12-05 18:31:10
*** schnaader has left the channel		2009-12-05 19:13:55
<mike_____>	toffer: here is the result of ccmx on enwik8 on my system:	2009-12-05 19:28:14
	97656.25 KiB -> 21138.73 KiB (ratio 21.65%, speed 871 KiB/s)	2009-12-05 19:28:16
<Shelwien>	and m1?	2009-12-05 19:29:09
	with similar memory setting?	2009-12-05 19:29:24
<mike_____>	Allocated 196999 kB.	2009-12-05 19:32:19
	Encoding: 21286803/ 100000000 bytes (1.703 bpc), 76.68 s (1304 Kb/s)	2009-12-05 19:32:19
	ccmx allocated 146MB	2009-12-05 19:32:42
<Shelwien>	still, it seems better and faster which is good	2009-12-05 19:33:09
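The two tools report in different units (ccmx in KiB and KiB/s, m1 in bytes and "Kb/s"), so a quick conversion helps when comparing the figures above. A small sketch, assuming ccmx's KiB means 1024 bytes and that m1's "Kb/s" means 1000 bytes per second (an inference from 100000000 bytes / 76.68 s, not something the log states):

```python
ENWIK8 = 100_000_000                 # input size in bytes

ccmx_out = 21138.73 * 1024           # 21138.73 KiB -> bytes
ccmx_speed = 871 * 1024              # 871 KiB/s -> bytes per second
m1_out = 21_286_803                  # already in bytes
m1_speed = ENWIK8 / 76.68            # bytes per second from the reported time


def bpc(out_bytes, in_bytes=ENWIK8):
    """Bits of output per input character (byte)."""
    return 8.0 * out_bytes / in_bytes


for name, out, speed in (("ccmx", ccmx_out, ccmx_speed),
                         ("m1", m1_out, m1_speed)):
    print(f"{name}: {out:,.0f} bytes, {bpc(out):.3f} bpc, {speed / 1000:,.0f} kB/s")
```

On these numbers m1's output is smaller and its throughput higher, which matches the "better and faster" remark, although it also allocated more memory (196999 kB against 146MB).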
*** Shelwien has left the channel		2009-12-05 23:03:59
*** Shelwien has joined the channel		2009-12-05 23:04:42
<mike_____>	btw, what does compbooks do?	2009-12-05 23:05:51
*** compbooks has left the channel		2009-12-05 23:23:36
*** mike_____ has left the channel		2009-12-05 23:23:37
*** compbooks has joined the channel		2009-12-05 23:24:14
* compbooks eats people's trouts		2009-12-05 23:26:52
* Krugz slaps compbooks around a bit with a large trout		2009-12-05 23:28:22
* compbooks eats the trout		2009-12-05 23:59:53
<Krugz>	lol slow	2009-12-06 00:00:05
* compbooks slept		2009-12-06 00:00:18
	pff sleep	2009-12-06 00:00:34
<Shelwien>	actually its an iroffer bot	2009-12-06 00:00:43
	xdcc list and whatever else	2009-12-06 00:01:13
<Krugz>	oic	2009-12-06 00:01:19
*** Krugz has left the channel		2009-12-06 00:13:40
*** Krugz has joined the channel		2009-12-06 00:43:31
*** Skymmer has joined the channel		2009-12-06 01:27:53
<Skymmer>	Hi dudes	2009-12-06 01:30:17
<Shelwien>	hi	2009-12-06 01:30:22
<Skymmer>	Ah :) You'll not spoof me. It's not you, its your bot :))	2009-12-06 01:31:40
	Fast reaction	2009-12-06 01:31:57
* Shelwien is thinking that it means that he's a "dude" now		2009-12-06 01:32:51
	Nice ;}	2009-12-06 01:33:30
	Anyway, I'm here to do the thing that I don't like	2009-12-06 01:33:40
<Shelwien>	?	2009-12-06 01:33:58
<Skymmer>	To ask something to do... well... listen	2009-12-06 01:34:55
	There is one program which I'm using. It's kinda slow but there are sources for it. So I wondered if somebody experienced with the Intel Compiler could compile it. It could also be a good test of IC capabilities.	2009-12-06 01:38:04
<Shelwien>	well, i can try probably	2009-12-06 01:38:27
<Skymmer>	Damn... I feel bad asking. Maybe you're busy? shame	2009-12-06 01:39:31
<Shelwien>	not atm	2009-12-06 01:39:41
<Skymmer>	OK	2009-12-06 01:39:46
<Shelwien>	i mean not busy atm	2009-12-06 01:39:58
<Skymmer>	http://omion.dyndns.org/mp3packer/mp3packer-1.20_src.zip	2009-12-06 01:40:29
	More details (if needed) at:	2009-12-06 01:40:29
	http://www.hydrogenaudio.org/forums/index.php?showtopic=32379	2009-12-06 01:40:29
<Shelwien>	that's ocaml, not C	2009-12-06 01:41:30
	so its unrelated to IntelC or whatever	2009-12-06 01:43:44
	but btw mp3zip can do something similar too	2009-12-06 01:43:55
<Skymmer>	Sad. How about this:	2009-12-06 01:45:28
	http://www.fftw.org/fftw-3.2.2.tar.gz	2009-12-06 01:45:28
	http://www.fftw.org/fftw-3.3alpha1.tar.gz	2009-12-06 01:45:28
<Shelwien>	not sure what you want to get out of that	2009-12-06 01:48:33
<Skymmer>	libfftw3-3.dll	2009-12-06 01:48:41
<Shelwien>	well, i'd better not, i guess	2009-12-06 01:49:07
	its probably possible, but i'd need something to check it with etc - issues are possible	2009-12-06 01:49:44
	"Intel C: you can also use the Intel compilers under VC++ (see below). This may produce marginally faster code than the GNU C compiler, but is probably not worth it for most users. Be cautious with the compiler flags - turning on every optimization under the sun usually makes FFTW slower."	2009-12-06 01:50:54
	http://www.ece.cmu.edu/~franzf/fftw.org/	2009-12-06 01:51:19
	;)	2009-12-06 01:51:20
<Skymmer>	Damn :)	2009-12-06 01:54:05
	Ok, last trial	2009-12-06 01:54:20
	http://files.monkeysaudio.com/MAC_SDK_406.zip	2009-12-06 01:54:24
<Shelwien>	dll again or what?	2009-12-06 01:54:56
<Skymmer>	mac.exe	2009-12-06 01:55:09
	Ehhh... Source\Console\ I presume	2009-12-06 01:59:22
	Lame with it	2009-12-06 01:59:32
<Shelwien>	well, dunno	2009-12-06 02:14:16
	i tried but there're syntax errors now	2009-12-06 02:14:32
	"pointer to incomplete class not allowed" etc	2009-12-06 02:14:44
<Skymmer>	No problem.	2009-12-06 02:18:52
	BTW, what does your phrase mean: "but btw mp3zip can do something similar too"	2009-12-06 02:19:32
<Shelwien>	CBR->VBR	2009-12-06 02:19:50
	you have the console mp3zip, right?	2009-12-06 02:19:59
<Skymmer>	sure	2009-12-06 02:20:04
	-c ?	2009-12-06 02:20:15
<Shelwien>	you can run it like mp3zip -c 1.mp3 1.mpx; mp3zip -d 1.mpx 1unp.mp3	2009-12-06 02:20:16
	yeah	2009-12-06 02:20:19
<Skymmer>	Oh no...	2009-12-06 02:20:27
<Shelwien>	its not that smart though	2009-12-06 02:20:40
	would just discard the LAME tag or something	2009-12-06 02:20:52
	well, i can make it do it in a single pass too	2009-12-06 02:21:22
	not that there's any sense to	2009-12-06 02:22:02
<Skymmer>	not only. The problem is that the OUT file has no correct Xing VBR info so its length is shown incorrectly:	2009-12-06 02:25:46
	Processed: 6:02:10	2009-12-06 02:25:46
	Original: 0:54:12	2009-12-06 02:25:46
	and more:	2009-12-06 02:26:07
<Shelwien>	well, as i said, it wasn't an intentional feature anyway ;)	2009-12-06 02:26:43
<Skymmer>	foobar's "Differences found in 1 out of 1 track pairs.	2009-12-06 02:26:51
	Comparing:	2009-12-06 02:26:51
	"C:\SHIT\SOFT\ARC\_Shelwien\mp3zip\out.mp3"	2009-12-06 02:26:51
	"C:\SHIT\SOFT\ARC\_Shelwien\mp3zip\Test.mp3"	2009-12-06 02:26:51
	Length mismatch : 54:12.520249 vs 54:12.453333, 143436143 vs 143433192 samples	2009-12-06 02:26:51
	"	2009-12-06 02:26:51
	its foobar's "bit-compare" results	2009-12-06 02:27:24
*** Shelwien has left the channel		2009-12-06 02:28:23
*** Shelwien has joined the channel		2009-12-06 02:28:28
<Shelwien>	but still, my packed mp3s might be different in size from mp3pack	2009-12-06 02:28:36
<Skymmer>	Ehhh... don't look at the "SHIT" name of folder. its my name of Download folder :))	2009-12-06 02:29:13
<STalKer-Y>	i thought it was an abbreviation	2009-12-06 02:30:41
<Skymmer>	I don't mean the size. I mean the content.	2009-12-06 02:31:14
<Shelwien>	the content is the same	2009-12-06 02:31:45
	that "length mismatch" is probably due to fixed frame size in mp3	2009-12-06 02:32:07
	so actual PCM size is in framesize increments	2009-12-06 02:32:52
	but maybe the LAME tag contains the precise source file length	2009-12-06 02:33:09
	and mp3zip discards that	2009-12-06 02:33:16
<Skymmer>	I think its because the LAME gapless info introduced in 3.90.3 is missing in the OUT file	2009-12-06 02:33:23
	ah yes	2009-12-06 02:33:28
<Shelwien>	anyway it keeps the tag in default mode	2009-12-06 02:33:56
	so its ok ;)	2009-12-06 02:34:00
<Skymmer>	out.mp3 86 513 998	2009-12-06 02:34:24
	Test.mp3 85 225 904	2009-12-06 02:34:24
<Shelwien>	but i guess i'd have to find the description for that tag the next time	2009-12-06 02:34:32
<Skymmer>	damn	2009-12-06 02:34:47
	out.mp3 86 513 998	2009-12-06 02:34:47
	Test.mp3 85 225 904	2009-12-06 02:34:48
<Shelwien>	expanded it?	2009-12-06 02:35:00
<toffer>	hi & gn8 guys	2009-12-06 02:37:04
<Skymmer>	sorry, what you mean by "expanding it"?	2009-12-06 02:37:09
<toffer>	just came home again from a club	2009-12-06 02:37:12
<Shelwien>	hi ;)	2009-12-06 02:37:28
<Skymmer>	Aloha!	2009-12-06 02:37:54
<Shelwien>	skymmer: i mean that mp3zip output is apparently larger than input	2009-12-06 02:38:16
<Skymmer>	yes.	2009-12-06 02:38:28
<Shelwien>	no luck ;)	2009-12-06 02:38:52
<toffer>	mhhh spaghetti	2009-12-06 02:39:24
<Shelwien>	what?	2009-12-06 02:39:36
<Skymmer>	spaghetti its the new lossless audio compressor	2009-12-06 02:40:06
	:))	2009-12-06 02:40:13
<toffer>	^^	2009-12-06 02:40:14
<Skymmer>	Toffer, why you so amazed about spaghetti? have you smoked something? ;)	2009-12-06 02:42:23
<Shelwien>	spaghetti?	2009-12-06 02:42:53
<toffer>	no, not at all	2009-12-06 02:42:59
	but i'm really hungry right now. and my girlfriend wants to eat that stuff, too	2009-12-06 02:43:18
<Skymmer>	What kind of music was in the club?	2009-12-06 02:43:57
<toffer>	well it was some kind of "huge hall" they played everything. i mostly like back musikc	2009-12-06 02:48:35
	musci	2009-12-06 02:48:36
	music	2009-12-06 02:48:38
<Skymmer>	:)) Haaa.. You drunk probably. Not offensive. Just curious ;)	2009-12-06 02:52:37
<toffer>	partially	2009-12-06 02:55:05
<Skymmer>	I'm pretty sure that there wasn't music like this one devil	2009-12-06 03:01:16
	http://skymmer.narod.ru/misc/Glenn.mp3	2009-12-06 03:01:18
<toffer>	not gonna listen to anything right now	2009-12-06 03:06:31
	otherwise everybody gets awake	2009-12-06 03:06:39
<Skymmer>	Bye people. Gonna sleep...	2009-12-06 03:14:49
*** Skymmer has left the channel		2009-12-06 03:15:28
<toffer>	gn8 from me, too	2009-12-06 03:15:45
*** toffer has left the channel		2009-12-06 03:15:49
*** STalKer-X has joined the channel		2009-12-06 05:00:24
*** STalKer-Y has left the channel		2009-12-06 05:03:45
*** Krugz has left the channel		2009-12-06 06:03:56
*** mike_____ has joined the channel		2009-12-06 11:38:36
<Shelwien>	http://encode.dreamhosters.com/showthread.php?t=511	2009-12-06 14:54:25
*** mike_____ has left the channel		2009-12-06 15:48:52
*** pinc has joined the channel		2009-12-06 19:13:00
*** Krugz has joined the channel		2009-12-06 20:23:11
*** Krugz has left the channel		2009-12-06 20:27:44
*** Krugz has joined the channel		2009-12-06 20:39:12
*** pinc has left the channel		2009-12-06 20:40:30
*** Krugz has left the channel		2009-12-06 21:04:08
*** Krugz has joined the channel		2009-12-06 21:06:52
	there's a conspiracy!	2009-12-06 21:11:11
	compilers store a timestamp into exe header	2009-12-06 21:11:29
	so that i won't be able to determine whether two exes are equal by comparing their hashes	2009-12-06 21:11:56
* Shelwien is bruteforcing compiler options		2009-12-06 21:12:22
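One way around the timestamp problem is to blank the field before hashing. A minimal sketch, assuming the only difference is the TimeDateStamp in the COFF file header (4 bytes, located 8 bytes after the "PE\0\0" signature whose offset is stored at 0x3C); the function name is made up for illustration:

```python
import hashlib
import struct


def hash_ignoring_timestamp(image: bytes) -> str:
    """SHA-1 of a PE image with the COFF TimeDateStamp field zeroed."""
    pe_offset = struct.unpack_from("<I", image, 0x3C)[0]      # e_lfanew
    if image[:2] != b"MZ" or image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE image")
    # Signature (4 bytes) + Machine (2) + NumberOfSections (2) precede the stamp.
    stamp_at = pe_offset + 8
    patched = bytearray(image)
    patched[stamp_at:stamp_at + 4] = b"\0\0\0\0"
    return hashlib.sha1(patched).hexdigest()
```

Usage would be along the lines of `hash_ignoring_timestamp(open("a.exe", "rb").read()) == hash_ignoring_timestamp(open("b.exe", "rb").read())`. Real builds may carry further varying fields (for example timestamps in the debug or export directories, or the optional-header checksum), so a mismatch after this patch does not prove that the code differs.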
<Krugz>	lol	2009-12-06 21:13:16
*** Krugz has left the channel		2009-12-06 21:14:16
*** Krugz has joined the channel		2009-12-06 21:14:55
*** Krugz has left the channel		2009-12-06 21:16:22
*** Krugz has joined the channel		2009-12-06 21:17:01
*** STalKer-X has left the channel		2009-12-06 22:56:42
*** Shelwien has left the channel		2009-12-06 23:04:08
*** Shelwien has joined the channel		2009-12-06 23:04:13
*** STalKer-X has joined the channel		2009-12-06 23:19:03
*** schnaader has joined the channel		2009-12-07 00:44:54
<schnaader>	hi @ all - kinda late, I know, but I thought I could at least have a look who's there	2009-12-07 00:45:20
<Shelwien>	hi	2009-12-07 00:45:35
	i have my usual problem with gcc and templates here	2009-12-07 00:46:09
<schnaader>	what was the compiler you talked about that has timestamps? I did a bit of research because I had similar issues sometimes and found that at least GCC doesn't seem to include timestamps.	2009-12-07 00:46:55
<Shelwien>	MS linker does	2009-12-07 00:47:10
<schnaader>	hm... never did that much template things, so I fear I can't help you there	2009-12-07 00:47:16
<Shelwien>	well, gcc is annoying as hell here	2009-12-07 00:47:30
<schnaader>	ah, I guess that was where I got the problem, too, MSVC	2009-12-07 00:47:36
<Shelwien>	it can't compile some code with which MSC/IntelC don't have any problems	2009-12-07 00:47:48
	and i don't quite know how to solve it	2009-12-07 00:48:10
	its not the first time too...	2009-12-07 00:48:20
	to be specific	2009-12-07 00:49:07
	if i try to use something like	2009-12-07 00:49:13
	template< int flag > class A : public B<flag> { ... }	2009-12-07 00:49:39
<schnaader>	you could try to post a question or search for similar problems at http://stackoverflow.com/ - they are really quick in giving very good answers especially to C questions there	2009-12-07 00:49:42
<Shelwien>	gcc doesn't see anything from template B there	2009-12-07 00:50:09
	dumb thing	2009-12-07 00:50:12
	and i don't know how to search for it	2009-12-07 00:50:30
<schnaader>	If you can strip it down to a short example and some text describing your problem, I could also post it for you on SO, so you wouldn't need to register	2009-12-07 00:53:11
	there seem to be some template FAQs around, but they all seem to handle different things as far as I can tell...	2009-12-07 00:53:39
<Shelwien>	...i've also got another internal error from IntelC while trying to make it portable ;)	2009-12-07 00:54:55
<schnaader>	internal errors suck :) I had to do some workaround once where the Delphi compiler would throw one when code was compiled with the command-line compiler, but the IDE would just compile it fine. rewriting the code a bit solved it, although both versions seemed to be perfectly valid code...	2009-12-07 00:57:24
<Shelwien>	well, i managed to make a workaround with macros etc here	2009-12-07 00:58:27
	now have some strange linking problems though	2009-12-07 00:58:52
	...	2009-12-07 00:59:14
	i made a new rangecoder using coroutine template	2009-12-07 00:59:34
	seems pretty nice	2009-12-07 00:59:51
	now trying to compare gcc vs intelc	2009-12-07 01:00:07
	at least gcc version worked after compiling	2009-12-07 01:00:35
	but now i have to do something about the compiler options...	2009-12-07 01:00:47
	any suggestions about gcc options btw?	2009-12-07 01:03:02
<schnaader>	do you need templates here for speed or just for easier changes/readability?	2009-12-07 01:03:16
	hm.. -O2/-O3 -Os -s -march=... are those I usually use, didn't care about it much so far	2009-12-07 01:03:59
	I really enjoyed the forum discussion about those GCC automatic profiling things, these could be handy to gain the last percents of speed out of some code :)	2009-12-07 01:04:38
<Shelwien>	for speed mainly	2009-12-07 01:04:59
	ok, testing	2009-12-07 01:05:21
	intelc time was ~42.5s for enwik9	2009-12-07 01:05:45
<schnaader>	ah, that's bad.. would've recommended not using templates if it wouldn't have been speed :)	2009-12-07 01:05:49
<Shelwien>	for readability its important too	2009-12-07 01:06:07
	well, i know how people usually solve these problems though	2009-12-07 01:07:16
	they copy-paste stuff	2009-12-07 01:07:27
	...wow	2009-12-07 01:07:35
	131s with gcc 4.3/mingw	2009-12-07 01:07:51
	crazy	2009-12-07 01:07:53
<schnaader>	The PGO optimization should be easy: -fprofile-generate, run, recompile with -fprofile-use	2009-12-07 01:08:09
<Shelwien>	sure... its not a PGO problem yet though	2009-12-07 01:08:38
<schnaader>	whoa, that's 3 times faster with IntelC... either GCC is really bad here or IntelC has some neat tricks :)	2009-12-07 01:08:50
<Shelwien>	guess gcc is being crazy about that int64 multiplication	2009-12-07 01:09:07
<schnaader>	int64 = long long, or something homemade?	2009-12-07 01:09:29
<Shelwien>	unsigned long long	2009-12-07 01:09:43
<schnaader>	hm.. never experienced any major speed decreases with it and I had to use it for some Project Euler programs...	2009-12-07 01:10:14
<Shelwien>	thing is, that qword version is actually faster with intelc than alternative 32-bit multiplication	2009-12-07 01:10:24
	148s decoding	2009-12-07 01:10:42
	something is majorly wrong here...	2009-12-07 01:10:58
<schnaader>	so how does the 32-bit version perform with gcc, then?	2009-12-07 01:11:07
<Shelwien>	i'd try again with different options	2009-12-07 01:12:12
	maybe it was because of inlining or unrolling	2009-12-07 01:12:20
	this time exe is twice smaller	2009-12-07 01:12:34
<schnaader>	I would blame unrolling in that case :)	2009-12-07 01:12:56
<Shelwien>	...but doesn't seem to be faster... still works	2009-12-07 01:13:04
	...it also can be a problem with i/o i guess	2009-12-07 01:13:57
	129s this time too	2009-12-07 01:14:21
	ok, running with 32-bit mult	2009-12-07 01:15:52
	...again, i guess no luck	2009-12-07 01:16:48
	119s this time	2009-12-07 01:17:38
<schnaader>	you could also try to disable some of the IntelC optimizations, perhaps it's one of those that just really helps	2009-12-07 01:18:22
<Shelwien>	no	2009-12-07 01:18:35
	it's never been slower than 60s	2009-12-07 01:18:57
	and i don't even use PGO with IntelC atm	2009-12-07 01:19:06
<schnaader>	I've got g++ 3.4.5 here, could test it with this one :))	2009-12-07 01:19:28
<Shelwien>	no problem... would you be able to run IC version there?	2009-12-07 01:20:06
<schnaader>	would be able to run, but not to compile :)	2009-12-07 01:20:21
<Shelwien>	...119s again with different i/o... dunno	2009-12-07 01:20:21
	thats ok	2009-12-07 01:20:29
	ok, let me test IC version again first...	2009-12-07 01:22:34
	44.2s encoding	2009-12-07 01:23:19
	47s decoding	2009-12-07 01:24:00
	guess i need to fix that mult back	2009-12-07 01:24:12
	attempt #2	2009-12-07 01:24:41
	41.891s	2009-12-07 01:25:20
	41.469s	2009-12-07 01:26:02
	seems like i've got some improvement from replacing some templates with macros ;)	2009-12-07 01:26:28
<schnaader>	:)	2009-12-07 01:26:44
	btw, under a minute is quite fast for enwik9 (~20 MB/s, isn't it?) which would lead to I/O problems indeed	2009-12-07 01:28:02
<Shelwien>	i'm running it on ramdrive	2009-12-07 01:28:19
<schnaader>	OK, that's odd. perhaps gcc has problems with ramdrives, but I don't think so...	2009-12-07 01:29:13
<Shelwien>	no, i tested different buffers and that multiplication	2009-12-07 01:29:37
	its something more general	2009-12-07 01:29:47
	http://www.ctxmodel.net/files/newbrc/newbrc_0.rar	2009-12-07 01:30:07
	btw, its from "new bitwise rc" ;)	2009-12-07 01:30:28
	meanwhile, got a new IC version here, would try to install and test	2009-12-07 01:31:15
<schnaader>	my old gcc version doesn't like -fwhole-program and gives some warnings about alignment of C0/C1, but compiling works apart from that	2009-12-07 01:33:13
<Shelwien>	yeah, and you can compare the speed ;)	2009-12-07 01:33:39
<schnaader>	guess the original mtf.exe is compiled with IntelC, right? btw, sizes are 25088 for gcc, 76288 for the other one which is quite a difference	2009-12-07 01:34:52
<Shelwien>	that's a static build	2009-12-07 01:35:16
<schnaader>	ah, OK	2009-12-07 01:35:28
<Shelwien>	it'd be around 30-40k with /MD	2009-12-07 01:35:29
	but static is a bit faster so i usually compile it like that	2009-12-07 01:36:04
	btw, there's no model	2009-12-07 01:36:29
	it encodes bits with a fixed probability, a little skewed towards 0 bits	2009-12-07 01:36:59
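To make "no model" concrete: a binary arithmetic coder with a fixed probability still does one multiplication per coded bit, it just never updates the probability. The sketch below is a generic carryless 32-bit binary coder in the style of fpaq0, not the coder from the newbrc archive; the constant P1 (a 12-bit probability of a 1 bit, set a little under one half so that 0 bits are slightly cheaper) and the function names are assumptions for illustration.

```python
P1 = 1900            # assumed P(bit == 1) in 1/4096 units, slightly below 0.5
MASK = 0xFFFFFFFF    # keep the interval bounds in 32 bits


def encode(bits):
    """Code a sequence of 0/1 values with the fixed probability P1."""
    x1, x2 = 0, MASK                     # current interval [x1, x2]
    out = bytearray()
    for bit in bits:
        xmid = x1 + ((x2 - x1) >> 12) * P1   # the one multiplication per bit
        if bit:
            x2 = xmid
        else:
            x1 = xmid + 1
        # Shift out leading bytes once both bounds agree on them.
        while ((x1 ^ x2) & 0xFF000000) == 0:
            out.append(x2 >> 24)
            x1 = (x1 << 8) & MASK
            x2 = ((x2 << 8) & MASK) | 0xFF
    out += x1.to_bytes(4, "big")         # flush enough bytes to pin the interval
    return bytes(out)


def decode(data, count):
    """Recover `count` bits from the output of encode()."""
    stream = iter(data)
    x1, x2, x = 0, MASK, 0
    for _ in range(4):
        x = (x << 8) | next(stream, 0)
    bits = []
    for _ in range(count):
        xmid = x1 + ((x2 - x1) >> 12) * P1
        if x <= xmid:
            bits.append(1)
            x2 = xmid
        else:
            bits.append(0)
            x1 = xmid + 1
        while ((x1 ^ x2) & 0xFF000000) == 0:
            x1 = (x1 << 8) & MASK
            x2 = ((x2 << 8) & MASK) | 0xFF
            x = ((x << 8) & MASK) | next(stream, 0)
    return bits
```

The number of bits is passed to the decoder separately here; a real tool would store the length or use an end marker.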
<schnaader>	guess that's why it's so fast :)	2009-12-07 01:37:14
<Shelwien>	not really	2009-12-07 01:37:21
	fpaq0pv4b is still somewhat waster	2009-12-07 01:37:34
	*faster	2009-12-07 01:37:36
	but there're more restrictions and rangecoder is different... a little redundant too	2009-12-07 01:38:56
<schnaader>	guess I should try with enwik8 here instead, enwik9 seems to take some minutes...	2009-12-07 01:42:20
<Shelwien>	sure ;)	2009-12-07 01:43:32
<schnaader>	That would be a funny abuse of your p2p thing: adding enwik8 to the download list just to quickly shorten enwik9 :)	2009-12-07 01:46:12
<Shelwien>	btw, that file still continues after 1G... dunno why nobody uses the whole file ;)	2009-12-07 01:47:56
<schnaader>	yes, I know, it's 4.8 GB or something, think they just don't care because you won't get listed on LTCB or Hutter that way :)	2009-12-07 01:48:35
	Strange... I think I'll try a second run... c 40,43/d 45,07 for gcc, c 10,17/d 22,22 for intel	2009-12-07 01:51:57
<Shelwien>	;)	2009-12-07 01:52:07
	decoding seems kinda slow for intel too, though	2009-12-07 01:52:36
<schnaader>	Ah, it wasn't 10,17, it was 20,17... my mistake	2009-12-07 01:54:49
	makes more sense now :)	2009-12-07 01:54:56
<Shelwien>	yeah	2009-12-07 01:55:05
	its 3x difference here though	2009-12-07 01:55:10
<schnaader>	CPU here is quite slow and I/O should be limited to around 30-50 MB/s, perhaps the difference just can't show that much	2009-12-07 01:56:16
	although 20 seconds is 5 MB/s, so I/O shouldn't be a problem	2009-12-07 01:57:03
	f.e. fastest THOR mode gives 4 seconds and 23.75 MB/s :)	2009-12-07 01:57:51
<Shelwien>	well, it doesn't have to do a multiplication per data bit	2009-12-07 01:58:33
	if anything, you can compare it to this - http://www.ctxmodel.net/files/fpaq0pv4b3.rar	2009-12-07 01:59:18
<schnaader>	fpaq0pv4B_O3_xi.exe gives c 12.63/d 15.97	2009-12-07 02:02:40
<Shelwien>	kinda weird but ok	2009-12-07 02:03:41
<schnaader>	btw, mtf's output for enwik8 is quite large (99,6 MB), but I guess that's normal	2009-12-07 02:03:58
<Shelwien>	yes, that's intentional	2009-12-07 02:04:12
	and probably the main reason for speed difference with fpaq0p too	2009-12-07 02:04:29
	btw, i'd probably finally add some async i/o to the new coder	2009-12-07 02:07:16
	the coroutine framework made it really easy	2009-12-07 02:07:41
<schnaader>	talking about enwik, I saw you commented on my ISBN precompression, that was one of many items in a list I started when LTCB came out, but I stopped working on it as I realized that my PC is just too slow for experiments with enwik9	2009-12-07 02:07:49
<Shelwien>	i don't think thats really a problem	2009-12-07 02:08:30
	you just don't have to run paq, that's all	2009-12-07 02:08:39
	paq8 is not really a good compressor, though it may sound weird	2009-12-07 02:10:00
<schnaader>	yes, think I could retry with something else like 7-Zip	2009-12-07 02:10:03
<Shelwien>	not 7z, but ppmd/ppmonstr would be ok	2009-12-07 02:10:24
<schnaader>	it's just that I didn't want to optimize size for some compressor and see that results will get worse for PAQ	2009-12-07 02:10:37
<Shelwien>	lzma (as any LZ) is really bad at text compression	2009-12-07 02:10:50
<schnaader>	I switched to calgary corpus after that. was fun, first I did most of the preprocessors as little COM files using ASM, later included them into the PAQ source directly	2009-12-07 02:13:00
<Shelwien>	you know that, right? http://www.mailcom.com/challenge/	2009-12-07 02:13:27
<schnaader>	yes, that was why I did it :)	2009-12-07 02:13:44
<Shelwien>	btw, wanna hear my idea about enwik compression?	2009-12-07 02:14:13
<schnaader>	also had a look at SHA-1, but I guess it's not worth the effort - at least you have a better chance to improve compression instead :)	2009-12-07 02:14:21
	sure	2009-12-07 02:14:29
<Shelwien>	its very different from "general purpose"	2009-12-07 02:15:02
	basically, multipass compression	2009-12-07 02:15:25
	and btw kinda related to my approach to recompression too - which i described before	2009-12-07 02:15:46
	so, a multipass lossy filter with coding of extra information to make it lossless	2009-12-07 02:16:55
	there're many specific cases where compression can be improved by little tweaking	2009-12-07 02:17:48
	like capital conversion text filters etc	2009-12-07 02:18:02
<schnaader>	OK, that's an interesting approach. Some of my ideas also included very basic "lossy" things like "insert the same text here every time and afterwards just change some of the words so they get correct"	2009-12-07 02:18:07
<Shelwien>	well, now to examples	2009-12-07 02:18:42
	another popular text filter is "syntax stuffing"	2009-12-07 02:19:04
	like we usually write "word,"	2009-12-07 02:19:29
	but the most common symbol in word context is usually space	2009-12-07 02:19:58
	so alternatives are kinda bad - they mess up predictions	2009-12-07 02:20:25
	so, we can insert a space into each place like that	2009-12-07 02:21:10
	like s/([\w])([,.;])/$1 $2/g in regexp form	2009-12-07 02:21:56
<schnaader>	yes, even just doing so and thus having a separate "word stream" and a symbol stream like " , . ," should give better compression	2009-12-07 02:22:09
<Shelwien>	not with paq	2009-12-07 02:22:27
	but if we'd do just this space padding	2009-12-07 02:22:46
	and encode the information to revert it back into a separate stream	2009-12-07 02:23:03
	that can get us an improvement even with paq	2009-12-07 02:23:21
	now, that "information encoding" is the point	2009-12-07 02:23:43
	we'd need a "backward regexp" like s/([\w]) ([,.;])/$1$2/g	2009-12-07 02:24:36
	but we can't just apply it as is and restore the data	2009-12-07 02:25:04
	because there might be cases like that from before	2009-12-07 02:25:22
	so, for each case	2009-12-07 02:25:30
	we'd need to encode a flag - whether to perform the replacement or not	2009-12-07 02:26:03
	also, such cases are not unrelated to context	2009-12-07 02:26:35
	so ideally we'd need a full context model	2009-12-07 02:26:49
	taking into account both data before and after the replacement point	2009-12-07 02:27:06
	bi-directional context ;)	2009-12-07 02:27:14
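The forward/backward pair with its flag stream can be sketched as follows. This is only an illustration of the idea described above, with the two regexps taken from the chat; the function names are hypothetical, and the flags are returned as a plain list, whereas the proposal is to code them with a context model.

```python
import re

FORWARD = re.compile(r"(\w)([,.;])")     # s/([\w])([,.;])/$1 $2/g
BACKWARD = re.compile(r"(\w) ([,.;])")   # s/([\w]) ([,.;])/$1$2/g


def stuff(text):
    """Insert a space before , . ; after a word char; return text and flags."""
    out, inserted, pos, out_len = [], set(), 0, 0
    for m in FORWARD.finditer(text):
        chunk = text[pos:m.start(2)]     # everything up to the punctuation
        out.append(chunk)
        out_len += len(chunk)
        inserted.add(out_len)            # output index of the space we add
        out.append(" ")
        out_len += 1
        pos = m.start(2)
    out.append(text[pos:])
    stuffed = "".join(out)
    # One flag per place where the backward regexp could fire:
    # 1 = the space was inserted (remove it), 0 = it was in the original.
    flags = [1 if m.start() + 1 in inserted else 0
             for m in BACKWARD.finditer(stuffed)]
    return stuffed, flags


def unstuff(stuffed, flags):
    """Undo stuff(): drop the space only where the flag says it was added."""
    out, pos = [], 0
    for m, flag in zip(BACKWARD.finditer(stuffed), flags):
        out.append(stuffed[pos:m.start() + 1])   # up to and including the word char
        if not flag:
            out.append(" ")                      # original space, keep it
        pos = m.start() + 2                      # continue at the punctuation
    out.append(stuffed[pos:])
    return "".join(out)
```

For example, `stuff("word, and word , too; end.")` yields the padded text plus one flag per "word , " site, and only the site that already had a space in the original gets a 0 flag.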
<schnaader>	:)	2009-12-07 02:27:29
<Shelwien>	thus, there's still a place for heavy CM	2009-12-07 02:27:36
	but there's a difference from paq approach	2009-12-07 02:27:53
	passes are independent	2009-12-07 02:28:02
	and there're usually not many flags (compared to the whole enwik)	2009-12-07 02:28:36
	so we should be able to collect more detailed statistics than paq8	2009-12-07 02:29:07
	and still not care about memory overflows etc	2009-12-07 02:29:29
	so, as i see it	2009-12-07 02:30:05
	to solve this task	2009-12-07 02:30:09
	i have to write such a reversible regexp implementation	2009-12-07 02:30:22
	and then optimize a model for each regexp (automatically)	2009-12-07 02:30:56
<schnaader>	good luck with that - sounds promising	2009-12-07 02:31:14
<Shelwien>	yeah	2009-12-07 02:31:21
	for example, at some point	2009-12-07 02:31:31
	we can start replacing words	2009-12-07 02:31:41
	like with synonyms	2009-12-07 02:31:49
	or just with a "<word>" tag	2009-12-07 02:32:00
	thus, it would be possible to not only handle the simple direct contexts	2009-12-07 02:32:34
	but also higher-level language dependencies	2009-12-07 02:33:00
	like sentence structure etc	2009-12-07 02:33:08
	also, it would be possible to take into account word distances and stuff like that	2009-12-07 02:33:46
	which a sequential model can't handle because of memory problems	2009-12-07 02:34:02
	...but there's a small problem ;)	2009-12-07 02:34:33
	such an enwik model would be very specific	2009-12-07 02:34:56
	it won't be really applicable to anything else	2009-12-07 02:35:06
	and doing it just for the prize	2009-12-07 02:36:31
<schnaader>	yes, but this is the case with everything that works well on enwik, although the data is a nice example for text and a good mix, but there are some things like those redundant city parts that are very specific	2009-12-07 02:36:54
<Shelwien>	would mean working for $3/hour (very optimistically) ;)	2009-12-07 02:36:58
	well, unfortunately there's much more stuff beside "city parts" (which are afair not in enwik8 anyway)	2009-12-07 02:37:48
<schnaader>	yes, they are mainly in enwik9, there's 1 or 2 entries of it in enwik8, I think, but I'm not sure	2009-12-07 02:38:17
<Shelwien>	like xml markup, html markup, wiki markup etc	2009-12-07 02:38:19
	including ISBN too ;)	2009-12-07 02:38:29
<schnaader>	it really bothers me that removing XML tags doesn't improve compression for PAQ (there's something about it stated on the site) which seems just weird...	2009-12-07 02:39:09
	and AFAIR, it's even about completely removing tags, not replace/optimize them like using xmlwrt	2009-12-07 02:39:46
<Shelwien>	well, articles are rather big	2009-12-07 02:39:56
	but afaik it still helped in my experiments	2009-12-07 02:40:16
	if you want, i can post my enwik parser (one of)... its in perl though	2009-12-07 02:40:47
<schnaader>	thanks, but got no perl here :) I've also written some programs to f.e. extract user/ID lists, so it wouldn't be too much work to write my own	2009-12-07 02:41:57
<Shelwien>	in fact, i was thinking about doing it completely in perl ;)	2009-12-07 02:42:31
	like, implementing these reversible regexps somehow	2009-12-07 02:42:53
	and doing compression with an external coder	2009-12-07 02:43:18
<schnaader>	I thought about starting some brute-force compression program (running every possible program either in ASM or some own language) and to have it run in background on things like calgary corpus and enwik. It would take MUCH time and might just not give any results at all in hundred years, but if it would, you'd have a part of your data compressed just perfect :)	2009-12-07 02:50:32
<Shelwien>	well, that's what we're doing in a way (me and toffer at least)	2009-12-07 02:51:37
<schnaader>	Although I doubt there are useful things smaller than 16 bytes even in ASM and bruteforcing till there will take some time ;)	2009-12-07 02:51:47
<Shelwien>	well, sure thing that you won't get anywhere with x86 asm bruteforcing ;)	2009-12-07 02:52:25
	you can try zpaq though ;)	2009-12-07 02:52:38
<schnaader>	yes, that could actually be a nice try, although I still haven't managed to find time for reading the specifications and writing some own config files for it	2009-12-07 02:56:09
*** dagdsg has joined the channel		2009-12-07 02:56:24
<Shelwien>	well, i have mixed feelings about zpaq	2009-12-07 02:57:09
	in a way, i wanted to make something similar for a long time	2009-12-07 02:57:43
	but zpaq is completely different from what i wanted, even though it seems very similar if i'd try to describe it ;)	2009-12-07 02:58:52
	for example, i have some parameter description syntax	2009-12-07 02:59:51
	there're some .idx files with parameter types and values	2009-12-07 03:00:12
	and a preprocessor which generates C++ from .idx files	2009-12-07 03:00:39
	two different kinds of C++ in fact - one version for optimization and another for "release builds"	2009-12-07 03:01:19
<schnaader>	:)	2009-12-07 03:01:25
<Shelwien>	and i'd like to further extend this - by adding also some model description syntax	2009-12-07 03:02:15
	as its commonly redundant in C++ - i have to copy-paste the same stuff with modified numbers/letters in a few places	2009-12-07 03:03:01
	when i want to add a model component or something like that	2009-12-07 03:03:25
	and now	2009-12-07 03:03:40
	zpaq kinda has exactly that - the model description syntax	2009-12-07 03:04:12
	but its completely useless for me	2009-12-07 03:04:27
	quite a shock ;)	2009-12-07 03:04:37
<schnaader>	OK, because it's different to your approach or just because it's useless :) ?	2009-12-07 03:05:16
<Shelwien>	as far as i can see, just because its useless :(	2009-12-07 03:06:08
<schnaader>	yes, I also have a very general concept that I was reminded of when zpaq appeared. it's basically about just describing your input data in a scripting way, f.e. you could just tell it "there's a byte at the next position that can take values 0, 5 and 10-129, and will be 0 most the time" in a very user-friendly way and the compression implementation would just look for the best way to compress the data for you. The scripts could get compiled, added to the compressed data and used for decompression.	2009-12-07 03:06:21
<Shelwien>	http://sweetscape.com/010editor/templates.html	2009-12-07 03:07:27
<schnaader>	yes, pretty much that way, just used for compression	2009-12-07 03:09:43
	would be nice because if you'd release a new file format, just release the script for it and everybody can use it for analysing, detecting, preprocessing or directly compressing your data.	2009-12-07 03:11:14
<Shelwien>	well, a structure definition can be used for compression directly	2009-12-07 03:11:23
	even more, in fact, the compression is not that important	2009-12-07 03:12:02
	its possible to write a filter	2009-12-07 03:12:22
	which would parse the syntax and produce streams compressible by "universal" compressors	2009-12-07 03:12:58
	Shkarin's durilca is the best example of that	2009-12-07 03:13:16
	especially its x86 parser/disassembler	2009-12-07 03:13:33
	...but that's a different direction from what i talked about before	2009-12-07 03:14:29
	there's also always some choice of model design elements	2009-12-07 03:15:23
	ideally, heavier models would produce better predictions	2009-12-07 03:16:10
	but usually we have to take speed into account	2009-12-07 03:16:24
	and its normal to discard small improvements in compression which hurt the speed	2009-12-07 03:17:02
	also, heavier models don't even guarantee an improvement	2009-12-07 03:17:40
	because it all works with limited precision	2009-12-07 03:17:49
	and errors accumulate	2009-12-07 03:18:05
	so, structure parsing is one thing	2009-12-07 03:18:23
	but we also need readable model definitions for structure elements	2009-12-07 03:19:00
	and a proper support for parameters in these models	2009-12-07 03:19:43
	and i'm paying more attention to that side, i guess	2009-12-07 03:20:23
	because its usually better to write format parsers directly in C/C++	2009-12-07 03:21:00
	(faster etc)	2009-12-07 03:21:06
<schnaader>	yes, if you're a programmer, there's no need for that user-friendly shit :)	2009-12-07 03:21:56
<Shelwien>	well, in fact, i've got much closer to getting my model definition syntax lately	2009-12-07 03:23:04
	the main problem was always about selection of basic components	2009-12-07 03:23:32
	and things I use now have much better mathematical foundations than before ;)	2009-12-07 03:25:39
<schnaader>	hehe, guess there have been some calculations and experiments in the meantime :)	2009-12-07 03:26:21
<Shelwien>	well, for example, I now understand how the paq mixer works ;)	2009-12-07 03:27:31
	which was a problem for a while, because Matt doesn't know that ;)	2009-12-07 03:28:32
	he just took some formulas from neural networks and rederived the gradient for update formula	2009-12-07 03:29:52
<schnaader>	:)	2009-12-07 03:30:18
	well, sometimes you're content if it just works and don't care why although it would be better :)	2009-12-07 03:31:39
<Shelwien>	well, yeah	2009-12-07 03:32:10
	but you see, i want to do better than paq	2009-12-07 03:32:30
	so i have to understand how it works, even if Matt doesn't ;)	2009-12-07 03:32:48
<schnaader>	yeah, searching for such things one didn't completely understand and improving them seems like the best way to get better :)	2009-12-07 03:34:30
	and after that, you can add your own ideas ;)	2009-12-07 03:35:04
<Shelwien>	the most interesting thing in paq was something different though	2009-12-07 03:35:51
	well, its kinda obvious, but looks surprising when you see it used in a compressor	2009-12-07 03:36:20
	i mean the use of PRNG in counter updates	2009-12-07 03:36:51
<schnaader>	yes, that was the first thing I changed when first seeing the PAQ code, setting the PRNG to output 0 always and see how it hurts the compression :)	2009-12-07 03:37:51
<Shelwien>	well, its reasonable that adding 0.5 to an integer is the same as adding 1 with probability 0.5	2009-12-07 03:38:58
	but somehow surprising when it really works ;)	2009-12-07 03:39:07
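The "adding 0.5 is the same as adding 1 with probability 0.5" idea can be sketched like this. It is a generic illustration of probabilistic (stochastic) rounding on an integer counter, not PAQ's actual update code; `stochastic_add` and the seed are made up for the example:

```python
import random

def stochastic_add(counter, fraction, rng):
    """Add `fraction` (between 0 and 1) to an integer counter in expectation:
    increment by 1 with probability `fraction`, otherwise leave it unchanged."""
    return counter + (1 if rng.random() < fraction else 0)

rng = random.Random(12345)   # fixed seed, so encoder and decoder would stay in sync
counter = 0
for _ in range(10000):
    counter = stochastic_add(counter, 0.5, rng)

# On average this ends up near 10000 * 0.5 without ever storing a fractional part
print(counter)
```

The counter stays an integer, so no extra fractional bits are stored per context; the cost is some noise, and both sides must use the same PRNG sequence.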
<schnaader>	:)	2009-12-07 03:39:42
<Shelwien>	as a consequence, though, paq now can compress a block of zeroes into a few kb of random data ;)	2009-12-07 03:39:49
<schnaader>	although you wouldn't need a PRNG here, you could just do some static approach like with image dithering, couldn't you?	2009-12-07 03:40:08
<Shelwien>	that might require keeping some state somewhere	2009-12-07 03:41:16
<schnaader>	ah, I see	2009-12-07 03:41:28
<Shelwien>	PRNG is more universal in this case, yeah	2009-12-07 03:41:37
<schnaader>	almost forgot that PRNG in PAQ... had wasted some time with it brute-forcing seeds to get the output a bit smaller (I think 10 bytes smaller was the best result I had)...	2009-12-07 03:45:02
<Shelwien>	i tried to replace the rangecoder there instead	2009-12-07 03:46:07
	got somewhat better results, especially with redundant data	2009-12-07 03:46:45
	but still winning even 1000 bytes at enwik is kinda disappointing ;)	2009-12-07 03:47:16
<schnaader>	:)	2009-12-07 03:47:29
	What about an AI approach, btw? I doubt there is a useful AI approach on enwik9, perhaps on the whole enwik file it will make more of a difference, but as it was one of the main intentions of LTCB, it's quite sad there haven't been (successful) attempts...	2009-12-07 03:49:39
	Although you could say that some of the dictionary sorting and grammatical things that were done are somewhat external AI attempts	2009-12-07 03:50:32
<Shelwien>	there's no such thing as AI approach imho	2009-12-07 03:50:49
	well, i guess we can take the cyc database	2009-12-07 03:51:10
	and try to somehow use it for enwik prediction	2009-12-07 03:51:24
	but that won't be compatible with problem restrictions	2009-12-07 03:52:35
	(decoder size etc)	2009-12-07 03:52:46
	so, if anything, the "AI approach" would be to take into account more correlations in the data	2009-12-07 03:53:57
	not only the direct sequential contexts	2009-12-07 03:54:51
	but a lot of other things too, up to semantics if possible	2009-12-07 03:55:22
	but on other hand, there's nothing "AI" in that	2009-12-07 03:55:59
	analyzing sentence templates which i mentioned before is like that, for example, but there's nothing that unique in it	2009-12-07 03:57:24
	the main problem is that there's no magical universal function	2009-12-07 03:58:04
	so instead, we have to collect lots of different dependencies in data	2009-12-07 03:59:02
	and remove the redundancy corresponding to each of them	2009-12-07 03:59:28
	...and as if there's not enough of these in plain english	2009-12-07 04:01:57
	enwik also has lots of artificial markup, which is relatively easy to interpret, but still requires writing specific parsers etc	2009-12-07 04:03:06
<schnaader>	yes, actually the first items on my enwik list are just about separating the articles or preprocessing the data to remove HTML characters etc.	2009-12-07 04:05:44
<Shelwien>	there're lots of masked html stuff in there	2009-12-07 04:06:34
	like &lt;html&gt; ;)	2009-12-07 04:06:42
<schnaader>	Yes, and have you seen that ASCII table? Could just generate it straightforward if there wouldn't be that HTML shit :)	2009-12-07 04:07:25
<Shelwien>	the problem is that its apparently impossible to just replace stuff everywhere	2009-12-07 04:08:19
<schnaader>	I somehow still expect enwik to contain some more data that can be generated. I already found some tables, numbers and things like that, but it's quite time consuming to search for things like that.	2009-12-07 04:08:30
<Shelwien>	i mean, exporting articles from xml	2009-12-07 04:08:35
	and doing s/&lt;/</g is not completely reversible	2009-12-07 04:08:54
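A small sketch of why a global unescape may not be reversible, assuming the substitution meant here is turning the entity `&lt;` into a literal `<`. If a text contains both the escaped and the raw form, they collapse to the same output and the inverse cannot tell them apart. Whether and where enwik actually mixes both forms is not stated in this log; the helper names are hypothetical:

```python
def unescape_lt(text):
    """Forward step: replace the entity with the literal character."""
    return text.replace("&lt;", "<")

def escape_lt(text):
    """Naive inverse: escape every literal '<' again."""
    return text.replace("<", "&lt;")

def is_reversible(text):
    """True if the naive inverse restores this particular input exactly."""
    return escape_lt(unescape_lt(text)) == text

print(is_reversible("a &lt; b"))          # only the escaped form occurs
print(is_reversible("a < b &lt; c"))      # raw and escaped forms are mixed
```

One way around it would be to record the positions where the round trip fails and store them as exceptions next to the transformed text.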
	yeah	2009-12-07 04:09:33
<schnaader>	Well, there are some unused byte values I thought about using for that, although there aren't enough to completely replace all HTML entities.	2009-12-07 04:09:57
<Shelwien>	i think that the right way would be to do it incrementally	2009-12-07 04:10:01
	like, take the first articles and properly compress it	2009-12-07 04:10:22
	then generalized the rules to include the second article	2009-12-07 04:10:41
	etc	2009-12-07 04:10:42
	a lot of manual work either way	2009-12-07 04:10:55
	*the first article	2009-12-07 04:11:34
	* generalize	2009-12-07 04:11:37
<schnaader>	btw, major problem for ISBN are the different formats. Most of the time there will be "ISBN xxxxxxxxxxx", but there are variations like "ISBN x xxxxxx xx x", ISBN "x-xxx-xxxxx-x"...	2009-12-07 04:14:32
	And if detection gets too general, you'll change some numbers that aren't ISBN	2009-12-07 04:15:23
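One hedged way to handle the format variations while limiting false positives is a loose pattern plus the ISBN-10 check digit test (weighted digit sum divisible by 11). The checksum step is an addition for illustration and is not something proposed in this chat; the pattern covers only 10-digit ISBNs with optional spaces or hyphens, and the names are made up:

```python
import re

# "ISBN", optional colon, then 10 digits with optional single space/hyphen separators
ISBN_RE = re.compile(r"ISBN:?\s*((?:[0-9][- ]?){9}[0-9Xx])")

def isbn10_ok(text):
    """Validate the ISBN-10 check digit: sum of digit * (10 - position) must be 0 mod 11."""
    chars = [c for c in text if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for pos, ch in enumerate(chars):
        if ch in "Xx":
            if pos != 9:          # 'X' (value 10) is only legal as the check digit
                return False
            value = 10
        else:
            value = int(ch)
        total += (10 - pos) * value
    return total % 11 == 0

def find_isbns(text):
    """Return all matches whose check digit is consistent."""
    return [m.group(1) for m in ISBN_RE.finditer(text) if isbn10_ok(m.group(1))]

sample = ("ISBN 0-306-40615-2 and ISBN 0306406152 and "
          "ISBN 0 306 40615 2 and ISBN 1234567890")
print(find_isbns(sample))
```

A random 10-digit number passes the check only about one time in eleven, so the checksum filters most accidental matches but not all of them.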
<Shelwien>	well, i doubt that ISBN add that much redundancy there ;)	2009-12-07 04:15:29
<schnaader>	No, not really :) Didn't count them, but I doubt it's more than 10000 of them there	2009-12-07 04:15:53
<Shelwien>	and with controlled regexps like i described	2009-12-07 04:16:05
	its not really a problem even if there'd be some mismatches	2009-12-07 04:16:17
	...we'd also need some tricky algorithms there too, though	2009-12-07 04:19:43
	like optimal parsing, context clustering etc	2009-12-07 04:20:07
	and also dictionary compression	2009-12-07 04:20:14
	its not really necessary to compress a standalone dictionary there	2009-12-07 04:20:56
	but its a good simple testfile for a morphology model	2009-12-07 04:21:21
<schnaader>	:)	2009-12-07 04:21:37
<Shelwien>	and i don't quite understand how to build that	2009-12-07 04:21:44
	...i guess its still good that its english though	2009-12-07 04:23:08
	because it'd be even more complex with eg. russian	2009-12-07 04:23:33
<schnaader>	finnish would be most extreme, I guess :)	2009-12-07 04:25:24
<Shelwien>	i'd say chinese ;)	2009-12-07 04:25:51
<schnaader>	OK, you won :)	2009-12-07 04:25:58
* Shelwien recently suggested using chinese wiki dump (zhwiki) to Sami in his new benchmark		2009-12-07 04:26:34
	btw, have you seen that german Wikipedia DVD result for Precomp?	2009-12-07 04:27:39
<Shelwien>	...guess not	2009-12-07 04:29:11
	not even sure what you're talking about ;)	2009-12-07 04:29:24
<schnaader>	http://schnaader.info/precomp_wiki_dvd_04dev.html	2009-12-07 04:29:27
<Shelwien>	still don't quite understand what's that DVD	2009-12-07 04:31:22
<schnaader>	They recently switched the "zeno" format they used there to something (hopefully) more efficient than zLib, I was quite shocked when I saw you could compress the DVD to almost half of its size	2009-12-07 04:31:22
	AFAIK it's the whole Wikipedia with reduced images and without discussion entries so it fits on a DVD	2009-12-07 04:32:09
<Shelwien>	the usual problem with DVDs is that they're fixed-size	2009-12-07 04:32:23
	so people sometimes even hide something unnecessary there	2009-12-07 04:32:52
	just to make the software use up all the DVD space	2009-12-07 04:33:15
<schnaader>	yeah, that's right. Although additional 2 GB could be used for better image quality or something similar in that case, I suppose.	2009-12-07 04:34:10
<Shelwien>	does precomp find anything in enwik btw? ;)	2009-12-07 04:35:32
<schnaader>	And the DVD isn't sold or anything like that, I think; it's primarily a download.	2009-12-07 04:35:41
<Shelwien>	huh. then its really surprising	2009-12-07 04:36:08
	they could use bzip at least	2009-12-07 04:36:16
<schnaader>	I think that's what they use for the new "zeno" format, could've been LZMA, too, I don't remember	2009-12-07 04:36:46
	Hehe, there are some GIF mismatches in enwik9 because GIF detection is looking for "GIF87"/"GIF89", but nothing relevant :)	2009-12-07 04:37:17
<Shelwien>	i'm kinda not sure about LZMA being better than bzip2 for text compression	2009-12-07 04:37:25
<schnaader>	Ah, found it, the new format is called "ZIM" - http://openzim.org/Main_Page - I also found a quote that says "article took 3 GB before, 1.4 GB with ZIM"	2009-12-07 04:39:51
	They're using bzip2, lzma is an option, but not implemented	2009-12-07 04:40:46
	http://openzim.org/ZIM_File_Format#Clusters	2009-12-07 04:40:59
<Shelwien>	wonder why they don't participate in hutter challenge ;)	2009-12-07 04:42:03
<schnaader>	http://openzim.org/ZIMwriter says "coming soon...", would've been nice to let it run over enwik9 and compare the result with plain bZip2 :)	2009-12-07 04:44:56
	Although they seem to create some search indexes there, too which isn't that helpful :)	2009-12-07 04:45:36
<Shelwien>	Zim is the surname of one of my friends here, i'd ask him about it ;)	2009-12-07 04:48:23
	meanwhile, the question is how to find the reverse regexps	2009-12-07 04:53:00
	i mean, like, automatically derive s/([\w]) ([,.;])/$1$2/g from s/([\w])([,.;])/$1 $2/g	2009-12-07 04:53:56
<schnaader>	Just to make sure I understand the regexp: This changes "bla, bla.bla;" to "bla , bla .bla ;", right?	2009-12-07 04:55:51
*** STalKer-Y has joined the channel		2009-12-07 04:56:14
	it could be helpful to use some easier format you can transform to regexp and that (the easier format) can be reversed easier.	2009-12-07 04:59:24
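The two substitutions from the chat, translated to Python's `re` as a sketch, together with a round-trip check. It does not derive the inverse automatically; it only shows that the hand-written inverse works for some inputs and fails when the input already contains the "word, space, punctuation" pattern:

```python
import re

def spread(text):
    """Forward rule from the chat: s/([\\w])([,.;])/$1 $2/g"""
    return re.sub(r"(\w)([,.;])", r"\1 \2", text)

def squeeze(text):
    """Candidate inverse: s/([\\w]) ([,.;])/$1$2/g"""
    return re.sub(r"(\w) ([,.;])", r"\1\2", text)

def round_trips(text):
    """True if applying the inverse after the forward rule restores the input."""
    return squeeze(spread(text)) == text

print(spread("bla, bla.bla;"))        # the example asked about in the chat
print(round_trips("bla, bla.bla;"))
print(round_trips("bla , bla"))       # input already has a space before the comma
```

For inputs that fail the check, the mismatching positions would have to be stored separately, or the forward rule restricted so that its output is unambiguous.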
*** STalKer-X has left the channel		2009-12-07 05:01:11
*** schnaader has left the channel		2009-12-07 05:03:07
<Shelwien>	!next	2009-12-07 05:07:32