*** Shelwien has left the channel | 2009-09-09 02:53:59 |
*** pinc has joined the channel | 2009-09-09 07:08:27 |
*** pinc has left the channel | 2009-09-09 08:22:31 |
*** pinc has joined the channel | 2009-09-09 08:45:45 |
*** Shelwien has joined the channel | 2009-09-09 11:01:08 |
<osman> | here is a really lazy programmer: http://imafrogg.com/blog/jpeg-text-compression/ | 2009-09-09 11:15:47 |
| :) | 2009-09-09 11:15:49 |
<Shelwien> | ... | 2009-09-09 11:17:08 |
| as funny as it may sound, there's some sense in using the visual text representation for compression | 2009-09-09 11:21:17 |
| (and other tasks too) | 2009-09-09 11:21:50 |
<osman> | it's another topic i think | 2009-09-09 11:22:09 |
| i know exactly what you mean | 2009-09-09 11:22:18 |
<Shelwien> | like, how spammers write some keywords in their mails? | 2009-09-09 11:22:19 |
| also the same applies to the audio version ;) | 2009-09-09 11:23:18 |
| but both are only usable as contexts, not as the main stream | 2009-09-09 11:23:41 |
<osman> | yep. that's the point IMO | 2009-09-09 11:29:52 |
<Shelwien> | ... | 2009-09-09 11:34:32 |
| btw, what do you think about my static compression idea? | 2009-09-09 11:35:14 |
| i mean, the one with log2(c[i]^c) contexts? | 2009-09-09 11:35:42 |
| i was thinking about fma-delta | 2009-09-09 11:36:47 |
| and, well, if there's a data window, and larger window is better | 2009-09-09 11:37:23 |
| then it might be reasonable to compress the data in there ;) | 2009-09-09 11:37:38 |
| and then, for hashing there's also a sense to use some compression | 2009-09-09 11:39:03 |
| so it seems more practical to use the same coding for both | 2009-09-09 11:39:37 |
| but hashing requires that coding to be completely static | 2009-09-09 11:39:53 |
| because otherwise hashes for different files won't match ;) | 2009-09-09 11:40:14 |
<osman> | looks interesting at least :) | 2009-09-09 11:44:06 |
<Shelwien> | do you understand the idea? | 2009-09-09 11:44:23 |
| basically its like extended RLE | 2009-09-09 11:44:31 |
<osman> | why do you use log2(c[i]^c) as context? | 2009-09-09 11:44:42 |
<Shelwien> | number of matching MSBs actually | 2009-09-09 11:44:58 |
| in context byte and next byte | 2009-09-09 11:45:12 |
<osman> | ah..ok | 2009-09-09 11:45:32 |
<Shelwien> | and of course i mean to use multiple such contexts | 2009-09-09 11:45:35 |
| like 4-5-6 | 2009-09-09 11:45:41 |
<osman> | so, it's somehow an extended REP like coder? | 2009-09-09 11:45:53 |
<Shelwien> | err... what is? | 2009-09-09 11:46:05 |
<osman> | i mean whole idea | 2009-09-09 11:46:24 |
| with over greater distance than actual window size | 2009-09-09 11:46:39 |
<Shelwien> | in a way, more or less | 2009-09-09 11:46:48 |
| its an engine for fast finding of long matches | 2009-09-09 11:47:10 |
| i already posted the remote diff-patch kit based on that | 2009-09-09 11:47:35 |
<osman> | yeah. i remember | 2009-09-09 11:47:49 |
<Shelwien> | and next i'm thinking to write a tool similar to xdelta | 2009-09-09 11:48:08 |
| but that requires to keep a data window for better efficiency | 2009-09-09 11:48:30 |
| and i'm thinking that it might be cool to compress the data in window ;) | 2009-09-09 11:48:53 |
| btw, here's that game of mine - http://shelwien.googlepages.com/hopters.com | 2009-09-09 11:55:42 |
| seems to be ok at 50k in dosbox too | 2009-09-09 11:57:47 |
| arrows/ASWD and left/right shifts | 2009-09-09 11:58:11 |
<osman> | hehe... | 2009-09-09 11:59:23 |
| there is a funny bug | 2009-09-09 11:59:28 |
| even after "exploding" i can still shoot :) | 2009-09-09 11:59:40 |
<Shelwien> | its not a bug ;) | 2009-09-09 11:59:47 |
| its justice ;) | 2009-09-09 11:59:50 |
<osman> | i have a "ultra-futuristic" helicopter now. i can move exploded helicopter with almost no effort 8-) | 2009-09-09 12:00:47 |
<Shelwien> | ;) | 2009-09-09 12:01:05 |
<osman> | i think it's really good. i can't see any technical differences between dangerous dave (afair) or yours | 2009-09-09 12:01:57 |
| and at that time i really like "dave" :) | 2009-09-09 12:02:09 |
<Shelwien> | yeah, its actually even playable | 2009-09-09 12:02:20 |
| and there was even some networking support %) | 2009-09-09 12:02:49 |
| very weird though | 2009-09-09 12:02:57 |
<osman> | it could be good if i could play against the machine | 2009-09-09 12:03:14 |
<Shelwien> | i made a 2nd keyboard emulator TSR ;) | 2009-09-09 12:03:20 |
<osman> | playing with "myself" made thinking | 2009-09-09 12:03:35 |
| cool %) | 2009-09-09 12:03:44 |
<Shelwien> | it was transmitting the keypresses from a different machine | 2009-09-09 12:03:48 |
| and pushing them into local keyboard controller | 2009-09-09 12:04:06 |
| btw, its undocumented | 2009-09-09 12:04:12 |
| but there was a way to store your own value into port 60 | 2009-09-09 12:04:35 |
| and generate IRQ1 even | 2009-09-09 12:04:39 |
| it was originally made for MK3 fights though ;) | 2009-09-09 12:05:18 |
| to avoid keyboard blocking ;) | 2009-09-09 12:05:23 |
<osman> | at win9x time, i have tried to read keyboard, comport, mouse etc | 2009-09-09 12:05:40 |
| and thought as like that "what if i try to read all ports in a specific range" %) | 2009-09-09 12:06:03 |
| voila! i had got a "guarantee" computer freezer :) | 2009-09-09 12:06:23 |
<Shelwien> | well, dunno how to get that with reading | 2009-09-09 12:06:46 |
| but writing would certainly work | 2009-09-09 12:06:58 |
<osman> | with "in" instruction | 2009-09-09 12:06:58 |
<Shelwien> | for example, there was that 8253 timer | 2009-09-09 12:07:08 |
| one channel of which controlled the memory refresh | 2009-09-09 12:07:18 |
| so it was possible to make programs to run a little faster | 2009-09-09 12:07:51 |
| with the risk of memory loss | 2009-09-09 12:08:03 |
<osman> | %) | 2009-09-09 12:08:57 |
*** pinc has left the channel | 2009-09-09 15:07:47 |
<Shelwien> | btw | 2009-09-09 16:23:57 |
| how to print exactly what i want on my printer still remains the question | 2009-09-09 16:24:15 |
| but i've got another idea | 2009-09-09 16:24:23 |
| instead, i can show a picture on the screen ;) | 2009-09-09 16:24:45 |
| and take a photo | 2009-09-09 16:24:56 |
| and then recover the information out of it | 2009-09-09 16:25:09 |
| its a considerably different task | 2009-09-09 16:25:38 |
| but would be still a good application for my error-correction ideas ;) | 2009-09-09 16:26:02 |
*** Simon|B has joined the channel | 2009-09-09 17:44:58 |
*** toffer has joined the channel | 2009-09-09 17:45:24 |
<toffer> | hi | 2009-09-09 17:46:45 |
<Shelwien> | hi | 2009-09-09 17:46:53 |
* Shelwien is writing the log2(c^c[i]) static coder | 2009-09-09 17:47:16 |
<toffer> | sorry that i hardly participate - the deadline for my thesis is approaching ^^ | 2009-09-09 17:47:17 |
| ? | 2009-09-09 17:48:03 |
<Shelwien> | i told you before | 2009-09-09 17:48:18 |
| that i'd like to use some compression before hashing etc | 2009-09-09 17:48:41 |
| to improve randomness etc | 2009-09-09 17:48:47 |
| but it has to be a static model, same for all files | 2009-09-09 17:49:14 |
<toffer> | "before" was some time ago | 2009-09-09 17:49:36 |
<Shelwien> | so i'd invented something like extended RLE | 2009-09-09 17:49:37 |
<toffer> | and how does it work? | 2009-09-09 17:50:58 |
| "in short" | 2009-09-09 17:51:03 |
| since i'd leave in ~20 minutes | 2009-09-09 17:51:12 |
<Shelwien> | as i said... c[i] are previous symbols, and c is current one | 2009-09-09 17:51:37 |
| and context is something like | 2009-09-09 17:51:53 |
<toffer> | well i read the expression differently :) | 2009-09-09 17:52:10 |
<Shelwien> | log2(c^c[0]) | 2009-09-09 17:52:11 |
<toffer> | ok | 2009-09-09 17:52:19 |
<Shelwien> | log2(c[0]^c[1]) | 2009-09-09 17:52:20 |
| etc | 2009-09-09 17:52:20 |
<toffer> | i do remember that | 2009-09-09 17:52:21 |
<Shelwien> | basically the number of matching MSBs in symbols | 2009-09-09 17:52:39 |
| well, it works more or less | 2009-09-09 17:52:58 |
<toffer> | do you have any results already? | 2009-09-09 17:52:59 |
<Shelwien> | with order-4 like that | 2009-09-09 17:53:07 |
| 9*9*9*9 contexts | 2009-09-09 17:53:13 |
<toffer> | just 9 ? | 2009-09-09 17:53:33 |
<Shelwien> | 3.1M->2.1M calgary.tar compression | 2009-09-09 17:53:39 |
| matching bits | 2009-09-09 17:53:49 |
| 0..8 | 2009-09-09 17:53:52 |
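
A minimal sketch of the context described above, for illustration only. The function names are hypothetical, and which byte pairs feed the index (here four adjacent pairs from the last five bytes) is an assumption; the log only says that each component is the number of matching MSBs, 0..8, giving 9*9*9*9 contexts.

```python
def matching_msbs(a, b):
    """Number of equal leading bits (0..8) in two byte values."""
    diff = (a ^ b) & 0xFF
    # bit_length() is the position of the highest differing bit (0 if equal),
    # so 8 - bit_length() counts the most significant bits that agree.
    return 8 - diff.bit_length()


def context_index(history):
    """Mixed-radix index in 0..9**4-1 built from the last five bytes.

    Assumption: the four components come from adjacent pairs
    (c[3]^c[2], c[2]^c[1], ...); the original coder may pair bytes differently.
    """
    if len(history) < 5:
        raise ValueError("need at least five bytes of history")
    last = history[-5:]
    idx = 0
    for older, newer in zip(last, last[1:]):
        idx = idx * 9 + matching_msbs(older, newer)
    return idx
```

A run of identical bytes lands in the highest context (all components equal to 8), which is where the "extended RLE" behaviour would show up.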
<toffer> | ok | 2009-09-09 17:54:39 |
<Shelwien> | have to do more tests | 2009-09-09 17:54:53 |
| and maybe extend the context | 2009-09-09 17:54:58 |
<toffer> | i guess that kind of context quantisation is well suited for redundant data | 2009-09-09 17:55:03 |
<Shelwien> | but i think this would be usable | 2009-09-09 17:55:08 |
<toffer> | i mean directly | 2009-09-09 17:55:25 |
<Shelwien> | the whole point is | 2009-09-09 17:55:26 |
| to compress redundant data | 2009-09-09 17:55:31 |
<toffer> | not as a generator | 2009-09-09 17:55:37 |
<Shelwien> | and to not expand anything | 2009-09-09 17:55:39 |
| and it has to be a static model | 2009-09-09 17:55:48 |
| and that's the idea i've got | 2009-09-09 17:56:02 |
| maybe you can suggest something else to apply in this case? | 2009-09-09 17:56:54 |
<toffer> | well i cannot imagine anything which would be that fast | 2009-09-09 17:57:34 |
| since it's just a lookupp | 2009-09-09 17:57:39 |
| lookup | 2009-09-09 17:57:41 |
<Shelwien> | yeah, but i'm talking about the model | 2009-09-09 17:57:55 |
| do you have any alternative ideas for a model | 2009-09-09 17:58:12 |
| which would be static | 2009-09-09 17:58:20 |
| would allow some compression sometimes | 2009-09-09 17:58:31 |
| and would not significantly expand anything | 2009-09-09 17:58:39 |
| despite being static | 2009-09-09 17:58:43 |
<toffer> | some alphabet decomposition based on prefix codes (e.g. huffman) | 2009-09-09 17:59:27 |
| would hardly expand anything | 2009-09-09 17:59:38 |
| and provide some compression | 2009-09-09 17:59:47 |
<Shelwien> | well, obviously i plan to use huffman with this coding | 2009-09-09 17:59:54 |
| but plain static huffman won't work | 2009-09-09 18:00:05 |
<toffer> | not static | 2009-09-09 18:00:09 |
| dynamic | 2009-09-09 18:00:14 |
<Shelwien> | not static can't be used in this case | 2009-09-09 18:00:21 |
<toffer> | but that's still a two pass process | 2009-09-09 18:00:23 |
| you can store the tree | 2009-09-09 18:00:36 |
<Shelwien> | as i need encoded block hashes in different files | 2009-09-09 18:00:39 |
| to match | 2009-09-09 18:00:41 |
| (for equal substrings) | 2009-09-09 18:01:06 |
<toffer> | it's for your diff? | 2009-09-09 18:01:15 |
<Shelwien> | for all of it | 2009-09-09 18:01:24 |
| i've started writing it now | 2009-09-09 18:01:33 |
<toffer> | well storing a huffman tree would be bad for a diff | 2009-09-09 18:01:40 |
<Shelwien> | because fma-delta needs a data window | 2009-09-09 18:01:45 |
<toffer> | but still be acceptable for compression | 2009-09-09 18:01:47 |
<Shelwien> | and more data would fit into the window in compressed form | 2009-09-09 18:02:01 |
| and i need this for better hashing of redundant data anyway | 2009-09-09 18:02:23 |
<toffer> | what about reusing unused symbols? | 2009-09-09 18:02:51 |
<Shelwien> | diff just won't work with a stored huffman tree ;) | 2009-09-09 18:02:52 |
<toffer> | that's why i asked for the application | 2009-09-09 18:03:05 |
<Shelwien> | and LZ-like algos won't work either ;) | 2009-09-09 18:03:12 |
<toffer> | or extending the alphabet to 9 bit and do some ngram replacement | 2009-09-09 18:03:18 |
<Shelwien> | won't work | 2009-09-09 18:03:30 |
<toffer> | why? | 2009-09-09 18:03:35 |
| mh well ok for diffing it won't | 2009-09-09 18:03:59 |
<Shelwien> | ngram replacement might help, but there're plans to use such filters separately from FMA engine anyway | 2009-09-09 18:05:09 |
| (FMA = far match analysis) | 2009-09-09 18:05:19 |
| and shrinking the alphabet | 2009-09-09 18:05:44 |
| won't work because some files would have full alphabet | 2009-09-09 18:05:56 |
| and the same substrings | 2009-09-09 18:06:00 |
<toffer> | i cannot imagine anything atm | 2009-09-09 18:07:55 |
<Shelwien> | why, there's a lot | 2009-09-09 18:08:10 |
<toffer> | at least nothing which isn't adaptive | 2009-09-09 18:08:11 |
<Shelwien> | for example, MTF can be applicable | 2009-09-09 18:08:16 |
| with some restrictions | 2009-09-09 18:08:29 |
<toffer> | but mtf is adaptive | 2009-09-09 18:08:46 |
<Shelwien> | yeah, but its adaptivity can be contained in a small window | 2009-09-09 18:09:03 |
| and i only need such a coding | 2009-09-09 18:09:36 |
| that in the equal 512-byte blocks in different files | 2009-09-09 18:09:52 |
| hashes of at least one 256-byte substring would match | 2009-09-09 18:10:05 |
| but mtf has a different problem | 2009-09-09 18:10:44 |
| i don't know how to prevent it from being redundant on random data ;) | 2009-09-09 18:11:05 |
<toffer> | maybe you should restate the exact requirements | 2009-09-09 18:12:11 |
<Shelwien> | i need a model, which would provide some compression for redundant data | 2009-09-09 18:13:10 |
| and won't expand random etc data | 2009-09-09 18:13:18 |
| and codes of substrings in different files encodings | 2009-09-09 18:14:02 |
| have to still match if strings match | 2009-09-09 18:14:11 |
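
The requirement above can be illustrated with a small sketch. It is not the coder from the discussion: `xor_delta` is a hypothetical stand-in for any static transform whose output depends only on a fixed number of neighbouring bytes, and `mtf` is a plain move-to-front with unbounded memory. The same substring embedded in two different "files" gets identical static codes, while the adaptive ranks differ.

```python
def xor_delta(data):
    """Static order-1 transform: each output depends only on two adjacent bytes."""
    return [b ^ a for a, b in zip(data, data[1:])]


def mtf(data):
    """Classic move-to-front: each rank depends on everything seen so far."""
    table = list(range(256))
    out = []
    for b in data:
        rank = table.index(b)
        out.append(rank)
        table.insert(0, table.pop(rank))
    return out


shared = b"the same substring in both files"
file_a = b"AAAA header " + shared
file_b = b"zq" + shared

# Static transform: codes for the shared part are identical in both files,
# so hashes computed over those codes would match as well.
static_a = xor_delta(file_a)[-(len(shared) - 1):]
static_b = xor_delta(file_b)[-(len(shared) - 1):]

# Adaptive transform: the table state differs, so the ranks differ too.
mtf_a = mtf(file_a)[-len(shared):]
mtf_b = mtf(file_b)[-len(shared):]
```

Limiting the MTF memory to a small window, as suggested in the discussion, would make the ranks agree again once the window lies entirely inside the shared part.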
*** pinc has joined the channel | 2009-09-09 18:15:48 |
<toffer> | gonna leave now. back again later on | 2009-09-09 18:23:00 |
| bye | 2009-09-09 18:23:02 |
*** toffer has left the channel | 2009-09-09 18:23:06 |
<Shelwien> | ;) | 2009-09-09 18:23:11 |
*** asmodean has left the channel | 2009-09-09 18:48:00 |
*** pinc has left the channel | 2009-09-09 18:48:00 |
*** Simon|B has left the channel | 2009-09-09 18:48:00 |
*** Shelwien has left the channel | 2009-09-09 18:48:00 |
*** osman has left the channel | 2009-09-09 18:48:00 |
*** Shelwien has joined the channel | 2009-09-09 18:48:01 |
*** pinc has joined the channel | 2009-09-09 18:48:01 |
*** Simon|B has joined the channel | 2009-09-09 18:48:01 |
*** osman has joined the channel | 2009-09-09 18:48:01 |
*** asmodean has joined the channel | 2009-09-09 18:48:01 |
* ChanServ This channel has been registered with ChanServ. | 2009-09-09 18:48:01 |
<osman> | hi shelwien | 2009-09-09 19:11:17 |
| seems i have found something weird again :) | 2009-09-09 19:11:27 |
| you know pattern matching is a important part of an archiver | 2009-09-09 19:11:56 |
| so, i've worked on it. | 2009-09-09 19:12:03 |
| but, at some point, i realized that actually we can't easily do it. because unicode encodings are variable-length, we can't treat strings as plain arrays | 2009-09-09 19:12:43 |
| for ensuring my idea, i've looked at sami's fnmatch and 7-zip wildcards source | 2009-09-09 19:13:21 |
| they all "assume" that strings are basically arrays and that each array element represents a single character | 2009-09-09 19:14:15 |
| so, at the end, both 7zip and sami's work should fail on asian languages with "?" wildcards %) | 2009-09-09 19:14:59 |
| what do you think about it? | 2009-09-09 19:29:56 |
<Shelwien> | there's probably a lot of other problems anyway | 2009-09-09 19:39:28 |
| like sami's works imho don't support filename shortcuts like PROGRA~1 for "Program Files" | 2009-09-09 19:40:04 |
| don't remember about nz, but "archiver template" doesn't for sure | 2009-09-09 19:40:29 |
| also, i don't think that console archivers actually need anything more complex than *.exe | 2009-09-09 19:41:37 |
<osman> | imagine if someone tries to only "archive" with 3 letters and they will surely use "???" as pattern | 2009-09-09 19:46:46 |
<Shelwien> | yeah, you can imagine anything, but did you ever use something like that? ;) | 2009-09-09 19:47:31 |
<osman> | but, in asian languages each unicode codepoint sometimes > 0xFFFF, so, both "archiver template" and 7zip will fail to match correctly | 2009-09-09 19:47:52 |
| you are right. i didn't use. but what if some use? ;) | 2009-09-09 19:48:16 |
| i wouldn't call that as "unicode" support | 2009-09-09 19:48:28 |
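
A small sketch of the failure mode being described, under stated assumptions: `wildcard_match` is a hypothetical matcher (not the code from 7-zip or fnmatch) that compares sequences element by element, and the file name uses U+20000 as an example of a character outside the BMP. Fed with code points, "???" matches a three-character name; fed with UTF-16 code units, the same name is four elements long and the match fails.

```python
QM, STAR = ord("?"), ord("*")


def wildcard_match(pattern, name):
    """Match two integer sequences; '?' = any one element, '*' = any run."""
    p = n = 0
    star = -1   # position of the last '*' seen in the pattern
    mark = 0    # position in name where that '*' started matching
    while n < len(name):
        if p < len(pattern) and pattern[p] == STAR:
            star, mark = p, n
            p += 1
        elif p < len(pattern) and (pattern[p] == QM or pattern[p] == name[n]):
            p += 1
            n += 1
        elif star != -1:
            # Backtrack: let the last '*' swallow one more element.
            p = star + 1
            mark += 1
            n = mark
        else:
            return False
    while p < len(pattern) and pattern[p] == STAR:
        p += 1
    return p == len(pattern)


def code_points(s):
    return [ord(ch) for ch in s]


def utf16_units(s):
    raw = s.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


name = "\U00020000\u4e2d\u6587"   # three characters, the first one outside the BMP
by_code_point = wildcard_match(code_points("???"), code_points(name))
by_code_unit = wildcard_match(utf16_units("???"), utf16_units(name))
```

The same mismatch appears with UTF-8 if the matcher walks bytes instead of decoded code points.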
<Shelwien> | there're GUIs etc anyway | 2009-09-09 19:48:29 |
| which normally don't have such features at all ;) | 2009-09-09 19:48:45 |
<osman> | even winrar can fail in that area ;) | 2009-09-09 19:48:49 |
<Shelwien> | whatever | 2009-09-09 19:49:11 |
| i'm just trying to say that building a perfect pattern matcher | 2009-09-09 19:49:20 |
| might be not practical | 2009-09-09 19:49:25 |
<osman> | because i didn't see any special handling of string in unrar source. afair, filename stored as UTF-16 in archiver | 2009-09-09 19:49:50 |
<Shelwien> | at least, if it'd slow down the file enumeration for more common patterns | 2009-09-09 19:49:59 |
| but well | 2009-09-09 19:50:38 |
| if we're gonna work with utf8 anyway | 2009-09-09 19:50:46 |
| then supporting this makes sense ;) | 2009-09-09 19:51:03 |
<osman> | yeah. don't forget. i'm working on both linux and windows simultaneously now. | 2009-09-09 19:51:25 |
| so, i'm considering both utf-8 and utf-16 | 2009-09-09 19:51:38 |
<Shelwien> | why? | 2009-09-09 19:51:49 |
<osman> | for taking some ideas, i have just downloaded linux kernel %) | 2009-09-09 19:51:53 |
<Shelwien> | just convert utf-16 to utf-8 | 2009-09-09 19:51:56 |
<osman> | i realized that working with utf-8 can be a high overload | 2009-09-09 19:52:29 |
| so, i'll use utf-16 under windows and utf-8 under posix compliant OSes | 2009-09-09 19:52:46 |
| for only internal representation | 2009-09-09 19:53:13 |
| but, in archive data etc, i'll always use utf-8 | 2009-09-09 19:53:33 |
| "my heart will go on utf-8" :) | 2009-09-09 19:53:50 |
<Shelwien> | what kind of "overload"? | 2009-09-09 19:54:05 |
| i don't think that utf8-utf16 conversion would be any slower than wstrcpy (or how its called) | 2009-09-09 19:54:54 |
<osman> | conversion on API calls and checking surrogates for ensuring character length | 2009-09-09 19:54:59 |
<Shelwien> | dunno | 2009-09-09 19:55:18 |
| i think that utf8 would be actually faster as it would be more compact | 2009-09-09 19:55:32 |
<osman> | ahhh...actually even my str length function is wrong now %) | 2009-09-09 19:55:44 |
| seems using two different handling could cause a real "headache" %) | 2009-09-09 19:56:13 |
*** pinc|mirror has joined the channel | 2009-09-09 19:56:13 |
<Shelwien> | its very easy to count symbols in utf8 strings | 2009-09-09 19:56:19 |
| as you can just ignore some codes | 2009-09-09 19:56:35 |
<osman> | do you know a "shortcut"? | 2009-09-09 19:56:36 |
<Shelwien> | ? | 2009-09-09 19:56:48 |
<osman> | i mean a easy way | 2009-09-09 19:57:02 |
| without handling surrogates | 2009-09-09 19:57:10 |
<Shelwien> | as i said... in utf8 it seems simple | 2009-09-09 19:57:22 |
<osman> | more precisely, fewer branches | 2009-09-09 19:57:26 |
<Shelwien> | just ignore the 10xxxxxx codes | 2009-09-09 19:57:41 |
*** pinc has left the channel | 2009-09-09 19:59:55 |
<osman> | len += ((c & 128) != 0) or something like that? | 2009-09-09 20:00:28 |
*** pinc|mirror has left the channel | 2009-09-09 20:00:53 |
<Shelwien> | not exactly | 2009-09-09 20:01:29 |
| (c & 0xC0) != 0x80 | 2009-09-09 20:01:41 |
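
A minimal sketch of the counting trick, assuming the input is valid UTF-8. Continuation bytes have the form 10xxxxxx, so a byte starts a new code point exactly when its top two bits are not `10`, i.e. when `(c & 0xC0) != 0x80`. The function name is hypothetical.

```python
def utf8_length(data):
    """Count code points in valid UTF-8 by skipping 10xxxxxx continuation bytes."""
    return sum((c & 0xC0) != 0x80 for c in data)


sample = "h\u00e9llo \u4e2d\u6587 \U00020000"
count = utf8_length(sample.encode("utf-8"))
```

The loop has no per-character branching on sequence length, which is the "shortcut" asked about; it does not validate the input.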
<osman> | 7zip has been frozen while extracting linux kernel %) | 2009-09-09 20:03:27 |
<Shelwien> | ? | 2009-09-09 20:04:02 |
<osman> | i mean did not respond for a long time | 2009-09-09 20:04:36 |
| btw, why do almost all archivers first extract files to temp and then move the actual extraction target? | 2009-09-09 20:32:20 |
<Shelwien> | "all"? | 2009-09-09 20:32:39 |
| freearc maybe, as its weird | 2009-09-09 20:32:50 |
| though as to reasons | 2009-09-09 20:33:22 |
<osman> | 7zip and rar do that too | 2009-09-09 20:33:38 |
<Shelwien> | the destination file might exist | 2009-09-09 20:33:39 |
| and if extracted file has the same name | 2009-09-09 20:33:54 |
| but, for example, is broken | 2009-09-09 20:34:04 |
| they make sure that it won't overwrite anything | 2009-09-09 20:34:15 |
| or something | 2009-09-09 20:34:21 |
<osman> | they can ask at least | 2009-09-09 20:34:29 |
<Shelwien> | anyway, they extract stuff to tempfiles, yeah | 2009-09-09 20:34:35 |
<osman> | this both doubles required time and disk space | 2009-09-09 20:34:46 |
<Shelwien> | but i think they should create these tempfiles on the target drive | 2009-09-09 20:34:49 |
| otherwise it takes too long to move the data | 2009-09-09 20:35:11 |
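
A sketch of the suggestion above: write the temporary file into the destination directory, so that the final step is a rename on the same volume rather than a copy across drives. The function name and the `chunks` iterable are hypothetical; `tempfile.mkstemp` and `os.replace` are standard library calls.

```python
import os
import tempfile


def extract_via_temp(dest_path, chunks):
    """Write chunks to a temp file next to dest_path, then rename it into place."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Creating the temp file in the target directory keeps it on the same
    # filesystem, so os.replace() is a rename and not a second copy of the data.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".extract-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        os.replace(tmp_path, dest_path)
    except BaseException:
        # A broken stream must not leave debris or touch an existing file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If decoding fails midway, an existing destination file is left untouched, which matches the overwrite-safety reason given in the discussion.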
<osman> | all of them create these in the temp directory, which has nothing to do with the target drive. so, i always have to "clean" my C: drive | 2009-09-09 20:35:48 |
<Shelwien> | dunno really | 2009-09-09 20:36:35 |
<osman> | it's really annoying for me | 2009-09-09 20:36:49 |
| i sometimes could not extract some iso files or dvd movies | 2009-09-09 20:37:05 |
<Shelwien> | i still don't think that console rar works like that | 2009-09-09 20:37:23 |
<osman> | it might not be | 2009-09-09 20:37:40 |
<Shelwien> | ...huh?! %) | 2009-09-09 20:42:52 |
| seems that my msb coder compresses archives ;) | 2009-09-09 20:43:25 |
| a little ;) | 2009-09-09 20:43:28 |
<osman> | i mean console rar might not fit "extract to temp" rule | 2009-09-09 20:43:43 |
| you mean even compressed data? | 2009-09-09 20:43:53 |
<Shelwien> | well, original rar 269456 bytes | 2009-09-09 20:44:10 |
| compressed 269003 | 2009-09-09 20:44:15 |
<osman> | for a static coder, it's very good IMO | 2009-09-09 20:44:41 |
<Shelwien> | well, i suspect that's because of statistics | 2009-09-09 20:45:09 |
<osman> | http://www.koders.com/c/fid856C2F4B1D04931B2005712C658E2DC3D181154E.aspx | 2009-09-09 20:57:09 |
| seems everyone is not perfect :/ | 2009-09-09 20:57:21 |
| this source also does not take utf-8 variable property into account | 2009-09-09 20:58:03 |
<Shelwien> | ...and nobody cares ;) | 2009-09-09 20:58:28 |
*** Simon|B has left the channel | 2009-09-09 20:59:21 |
<osman> | are you sure? | 2009-09-09 21:03:23 |
| asian people are really angry with who developed unicode set. because most of their characters are in range > 0xFFFF | 2009-09-09 21:04:04 |
<Shelwien> | not japanese i think ;) | 2009-09-09 21:04:44 |
<osman> | if we consider that there are ~1.3 billion chinese. and considering whole world population is around ~6-7 billion. we should take care IMO :) | 2009-09-09 21:04:47 |
<Shelwien> | its not that bad actually ;) | 2009-09-09 21:05:29 |
<osman> | you know that most spoken language is actually chinese not english :) | 2009-09-09 21:05:31 |
<Shelwien> | sure | 2009-09-09 21:06:08 |
| english is not even second apparently ;) | 2009-09-09 21:06:18 |
*** toffer has joined the channel | 2009-09-09 21:12:40 |
| toffer: i made the coder and it compresses book1 to ~570k | 2009-09-09 21:17:25 |
| and what's more funny, it compresses archives %) | 2009-09-09 21:17:37 |
<toffer> | hi | 2009-09-09 21:19:07 |
| archives still have a header and stuff like this | 2009-09-09 21:19:15 |
<Shelwien> | yeah | 2009-09-09 21:19:26 |
| <osman> you mean even compressed data? | 2009-09-09 21:19:35 |
| <Shelwien> well, original rar 269456 bytes | 2009-09-09 21:19:35 |
| <Shelwien> compressed 269003 | 2009-09-09 21:19:35 |
<toffer> | that's just 400 bytes | 2009-09-09 21:19:51 |
<Shelwien> | yeah, but its not expanded ;) | 2009-09-09 21:20:07 |
| which is good ;) | 2009-09-09 21:20:10 |
<osman> | then try to compress a 7zip or winrk archive :) afair, their headers are also compressed | 2009-09-09 21:20:14 |
<Shelwien> | some m1*.7z | 2009-09-09 21:21:06 |
| 78510 -> 78159 ;) | 2009-09-09 21:21:17 |
<osman> | hehe | 2009-09-09 21:21:27 |
<toffer> | i'd only count that if it scales on large archives | 2009-09-09 21:21:42 |
<Shelwien> | probably does, if there're lots of files | 2009-09-09 21:22:01 |
| there's probably some small redundancy | 2009-09-09 21:22:16 |
<toffer> | (if there're not lots of files in the header) | 2009-09-09 21:22:19 |
<Shelwien> | like and rc stream start/end etc | 2009-09-09 21:22:23 |
<toffer> | file names and stuff like that | 2009-09-09 21:22:26 |
<Shelwien> | scales | 2009-09-09 21:23:10 |
<osman> | what about your mkv video test? it's really hard to compress IMO | 2009-09-09 21:23:30 |
<Shelwien> | 3k difference on 10M zip archive | 2009-09-09 21:23:31 |
| wow... | 2009-09-09 21:24:25 |
<toffer> | and how much kb does zip save if you zip the zipfile again ... zip! :) | 2009-09-09 21:24:31 |
<Shelwien> | 23k on that mkv | 2009-09-09 21:25:01 |
<osman> | hehe. it might outperform at least BIT :) | 2009-09-09 21:25:23 |
<Shelwien> | well, some of that is certainly due to statistics volume | 2009-09-09 21:26:24 |
| its not perfectly static yet | 2009-09-09 21:26:36 |
| but things like 3k and 23k are certainly much larger than stats | 2009-09-09 21:27:04 |
| i think that's because its able to detect compressible substrings | 2009-09-09 21:28:02 |
| i mean, if there're not much msb matches in context, it just leaves it alone | 2009-09-09 21:28:55 |
| seems like not quite bad algo for detection and maybe segmentation | 2009-09-09 21:29:53 |
<osman> | do you use trunc(log2(c[i]^c) * k) or just trunc(log2(c[i]^c)) ? | 2009-09-09 21:31:02 |
| i mean 9 contexts or more? | 2009-09-09 21:31:22 |
<Shelwien> | "just" and i don't really use log2 at all ;) | 2009-09-09 21:31:36 |
| there're 9^4 contexts | 2009-09-09 21:31:47 |
<osman> | yep. last one is actually a bsr instruction :) | 2009-09-09 21:31:55 |
<Shelwien> | LUT in my case | 2009-09-09 21:32:05 |
<osman> | try bsr. it might help.... but maybe not. because, you have a single LUT and it can be highly cached | 2009-09-09 21:32:46 |
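
To make the two options concrete, here is a rough sketch comparing a 256-entry lookup table with a bit-scan computation. Both names are hypothetical; `int.bit_length()` only plays the role that a BSR instruction would play in native code, and nothing here says which variant is faster in the actual coder.

```python
def _count_matching_msbs(diff):
    """Slow reference: shift until the highest differing bit is gone."""
    n = 8
    while diff:
        diff >>= 1
        n -= 1
    return n


# 256-entry table: XOR of two bytes -> number of matching MSBs (0..8).
MSB_LUT = bytes(_count_matching_msbs(x) for x in range(256))


def matching_msbs_lut(a, b):
    """Table lookup; a and b are assumed to be byte values 0..255."""
    return MSB_LUT[a ^ b]


def matching_msbs_scan(a, b):
    """Bit-scan variant; bit_length() stands in for BSR."""
    return 8 - (a ^ b).bit_length()
```

The table fits in 256 bytes, so it would normally stay in cache, which is the point raised about the LUT.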
<Shelwien> | actually i'd have a single LUT per whole context index | 2009-09-09 21:33:11 |
| well, maybe | 2009-09-09 21:33:27 |
| i mean, these *9 are not really good ;) | 2009-09-09 21:34:02 |
| even if they're done via LEA's actually ;) | 2009-09-09 21:34:20 |
<osman> | :) | 2009-09-09 21:34:41 |
<Shelwien> | wonder if i should move the case bit to lsb or something %) | 2009-09-09 21:36:41 |
<osman> | it might scale like before :) | 2009-09-09 21:37:04 |
| because lsbs are mostly noisy | 2009-09-09 21:37:15 |
<Shelwien> | i mean, A/a case | 2009-09-09 21:37:24 |
<osman> | aa...ok. got it | 2009-09-09 21:37:51 |
| it can help :) | 2009-09-09 21:37:55 |
| just optimize your reorder for that :) | 2009-09-09 21:38:16 |
<Shelwien> | i thought that too | 2009-09-09 21:38:26 |
<osman> | it may be more helpful | 2009-09-09 21:38:27 |
<Shelwien> | not reorder, just bit order in the byte ;) | 2009-09-09 21:38:41 |
<osman> | if you are not lazy as me, then why not? :) | 2009-09-09 21:39:09 |
| i would probably start reorder optimization and sleep after that :) | 2009-09-09 21:39:26 |
<Shelwien> | well, i'd do that | 2009-09-09 21:39:29 |
| i'd have to convert it to huffman anyway | 2009-09-09 21:40:00 |
<osman> | btw, i realized that actually GCC comes from another dimension of the space %) it won't compile most of my sources %) | 2009-09-09 21:42:14 |
| *it doesn't compile | 2009-09-09 21:42:24 |
<Shelwien> | yeah | 2009-09-09 21:42:35 |
| the main problem is that it not only has a whole different runtime library | 2009-09-09 21:42:54 |
| but also has some annoying C++ syntax incompatibilities | 2009-09-09 21:43:10 |
<osman> | yep. definitely | 2009-09-09 21:43:24 |
| probably i'll use intelc for posix platforms in the end %) | 2009-09-09 21:43:57 |
<Shelwien> | yeah, might be a good idea | 2009-09-09 21:44:12 |
| though i didn't hear about IC for freebsd | 2009-09-09 21:44:27 |
<osman> | freebsd is posix compliant too. if i could even "execute" some simple command in freebsd, i would test my linux compile in there | 2009-09-09 21:45:09 |
| freebsd is a really nightmare | 2009-09-09 21:45:19 |
| it eventually crashes after starting GUI | 2009-09-09 21:45:39 |
| i can't use it in vmware | 2009-09-09 21:46:15 |
| just i can only see prompt | 2009-09-09 21:46:27 |
<Shelwien> | well, its vmware problem, not freebsd's | 2009-09-09 21:46:41 |
<osman> | most of commands are incompatible with linux distros' | 2009-09-09 21:46:45 |
| if i could not run it, then i can't test it right? :) so, it doesn't matter it's about vmware or not | 2009-09-09 21:47:30 |
| :) | 2009-09-09 21:47:33 |
| seems i'll start to test macos x :) | 2009-09-09 21:48:22 |
| it's posix compliant too :) | 2009-09-09 21:48:34 |
| "In UTF-8, characters outside the basic multilingual plane are not a special case. UTF-16 is often mistaken to be the obsolete constant-length UCS-2 encoding, leading to code that works for most text but suddenly fails for non-BMP characters. It's better to implement support for the entire range of Unicode from the start." | 2009-09-09 22:12:10 |
| from Wikipedia :) | 2009-09-09 22:12:15 |
*** toffer has left the channel | 2009-09-09 22:13:33 |
| "...Japanese and the Korean UTF-8 article on Wikipedia take more space if saved as UTF-16 than the original UTF-8 version" i think this is a really good reason to use utf8 :) | 2009-09-09 22:15:18 |
<Shelwien> | err... i think many things take more space in utf-16 than in utf-8 ;) | 2009-09-09 22:26:15 |
<osman> | but, considering asian languages...it is a bit surprise to see utf-8 is more compact | 2009-09-09 22:26:54 |
<Shelwien> | you know, there're spaces and stuff too | 2009-09-09 22:27:44 |
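
A quick sketch of why that happens, using a made-up fragment of markup rather than a real Wikipedia page: ASCII characters (tags, spaces, URLs) take one byte in UTF-8 and two in UTF-16, while BMP kanji and kana take three bytes in UTF-8 and two in UTF-16. With enough markup around the text, UTF-8 comes out smaller.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes); utf-16-le is used to avoid counting a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical page fragment: ASCII markup wrapped around Japanese text.
markup = '<a href="/wiki/UTF-8" title="UTF-8">'
japanese = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"
page = (markup + japanese + "</a>\n") * 100

utf8_size, utf16_size = encoded_sizes(page)
```

For the Japanese text alone the comparison goes the other way, so the result depends on the ratio of markup to text.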
<osman> | yep. that's the point in here actually :) | 2009-09-09 22:28:14 |
*** toffer has joined the channel | 2009-09-09 22:41:37 |
*** toffer has left the channel | 2009-09-09 23:45:13 |
<Shelwien> | !next | 2009-09-09 23:55:00 |