*** pinc has left the channel | 2009-10-18 12:17:46 |
*** pinc has joined the channel | 2009-10-18 12:23:55 |
*** Simon|B has joined the channel | 2009-10-18 15:04:59 |
<Simon|B> | hi | 2009-10-18 15:05:10 |
<Shelwien> | hi ;) | 2009-10-18 15:05:39 |
| a problem | 2009-10-18 15:20:23 |
| there's a set of integers x[] | 2009-10-18 15:21:12 |
| and we have a suspicion that a large percent of these | 2009-10-18 15:21:45 |
| can be represented as x[i]=a+b[i]*c | 2009-10-18 15:22:15 |
| so, how to efficiently find these? | 2009-10-18 15:22:40 |
<Simon|B> | are doing any interesting things at the moment? | 2009-10-18 15:22:41 |
<Shelwien> | dunno what's insteresting for you ;) | 2009-10-18 15:23:03 |
<Simon|B> | are you talking to me or anyone else? | 2009-10-18 15:23:08 |
| many things ;) | 2009-10-18 15:23:17 |
<Shelwien> | as to "problem" - that's for you ;) | 2009-10-18 15:23:29 |
| othere's here are mostly silent ;) | 2009-10-18 15:23:43 |
<Simon|B> | yeah thought so | 2009-10-18 15:23:51 |
<Shelwien> | and as to things | 2009-10-18 15:24:11 |
| i'm working on my backup for one job | 2009-10-18 15:24:26 |
<Simon|B> | yes I remember | 2009-10-18 15:24:37 |
<Shelwien> | and posted a few versions here | 2009-10-18 15:24:43 |
| and just made an iphone network stream player for my codec for another job | 2009-10-18 15:25:25 |
<Simon|B> | sounds nice. I am thinking at the moment to buy a new handy and have the problem between iphone - N900 :-D | 2009-10-18 15:26:38 |
| Hard decision I guess | 2009-10-18 15:27:17 |
<Shelwien> | well, iphone would be more convenient for me, i guess ;) | 2009-10-18 15:27:37 |
| though actually i can't say i like iphone as a phone or as a mp3 player | 2009-10-18 15:29:01 |
| there's a couple of convenient apps for it though | 2009-10-18 15:29:31 |
| but i really like how it has a complete unix environment | 2009-10-18 15:29:57 |
| so i can develop programs for it on windows (with cygwin toolchain) | 2009-10-18 15:30:32 |
| and run them with sftp/ssh on iphone | 2009-10-18 15:30:55 |
| also, its popular, and there're lots of resources | 2009-10-18 15:31:21 |
<pinc> | could you provide some examples of x[i]? | 2009-10-18 15:40:46 |
<Shelwien> | well, "obviously" i'm talking about structure detection in files | 2009-10-18 15:41:18 |
<pinc> | it seems to be straight linear regression | 2009-10-18 15:41:31 |
| from your words at least )) | 2009-10-18 15:41:47 |
<Shelwien> | so x[i] are basically offsets of some string occurences or something | 2009-10-18 15:42:09 |
| and dunno about it being so straight | 2009-10-18 15:42:18 |
| as they're not necessarily all like that | 2009-10-18 15:42:35 |
| basically we need to find "c" values to do a more thorough test | 2009-10-18 15:43:29 |
| so a straightforward solution | 2009-10-18 15:43:53 |
| would be something like collecting statistics on x[i]%c occurences | 2009-10-18 15:44:15 |
| for some set of c values | 2009-10-18 15:44:21 |
<pinc> | I think you should start from the data | 2009-10-18 15:44:24 |
<Shelwien> | but that doesn't seem like a good idea | 2009-10-18 15:44:44 |
| because division is slow | 2009-10-18 15:44:50 |
<pinc> | you want to predict next offset by linear predictor, right? | 2009-10-18 15:45:27 |
<Shelwien> | no | 2009-10-18 15:45:34 |
| i want to detect the record size in the file | 2009-10-18 15:45:55 |
| if there's a table | 2009-10-18 15:46:04 |
<pinc> | now its clearer )) | 2009-10-18 15:46:23 |
| take a look on method to recognize random generators period | 2009-10-18 15:46:42 |
<Shelwien> | well, but the table can be embedded in the data | 2009-10-18 15:46:57 |
| or there can be multiple tables with different record sizes | 2009-10-18 15:47:13 |
<pinc> | ok, I've got your idea | 2009-10-18 15:47:37 |
<Shelwien> | it includes stuff like detection of symbol alignment | 2009-10-18 15:48:34 |
| for example, ascii/unicode text | 2009-10-18 15:49:04 |
| by relation between offsets of bitstrings with multiple occurences | 2009-10-18 15:49:28 |
<pinc> | xor data with multiple alignments | 2009-10-18 15:50:49 |
| alignment with least amplitude will be probable record size | 2009-10-18 15:51:18 |
<Shelwien> | well, it might be ok with 2^n alignments only | 2009-10-18 15:51:59 |
<pinc> | if on the next step you get amplitude much more than average - probably table has ended | 2009-10-18 15:52:05 |
<Shelwien> | but i'm talking about more general case here | 2009-10-18 15:52:08 |
<pinc> | I can't see any problem with non 2^n offsets | 2009-10-18 15:52:31 |
<Shelwien> | basically periods up to at least 64k are quite usual | 2009-10-18 15:52:37 |
| the problem is trying them all | 2009-10-18 15:52:49 |
<pinc> | 64K for one record? | 2009-10-18 15:52:57 |
<Shelwien> | 64 bits? | 2009-10-18 15:53:11 |
| for an image? | 2009-10-18 15:53:14 |
| 64k | 2009-10-18 15:53:19 |
<pinc> | you mess the things up | 2009-10-18 15:53:39 |
<Shelwien> | not really | 2009-10-18 15:53:46 |
<pinc> | you are talking about DB or images? | 2009-10-18 15:53:55 |
<Shelwien> | i'm talking about universal structure detector here | 2009-10-18 15:53:57 |
<pinc> | images don't have record structure | 2009-10-18 15:54:07 |
<Shelwien> | they do | 2009-10-18 15:54:13 |
| they're still tables | 2009-10-18 15:54:20 |
| and if you collect the stats of string matches for an image | 2009-10-18 15:54:58 |
| you'd likely get a periodic distribution | 2009-10-18 15:55:17 |
<pinc> | tests to the studio! (c) | 2009-10-18 15:55:23 |
<Shelwien> | just look at a raw image in any hex/text viewer | 2009-10-18 15:55:58 |
| (uncompressed i mean) | 2009-10-18 15:56:09 |
<Simon|B> | sorry I had to go out. As to IPhpne vs N900. Yes but N900 has Unix system too and you can easier build/port applications to it I bet | 2009-10-18 15:56:30 |
<pinc> | for me it looks highly improbable to get distinguished periods from the image except of obvious _width_ period | 2009-10-18 15:56:30 |
<Shelwien> | obviously that's what i was talking about | 2009-10-18 15:56:55 |
<pinc> | but you can easily get width from the header instead of calculating it | 2009-10-18 15:56:58 |
<Simon|B> | Additional IPhone has to be cracked and this can be dangerous | 2009-10-18 15:57:06 |
<Shelwien> | for example, you can have an image resource in executable | 2009-10-18 15:57:13 |
| for which you won't know the dimensions | 2009-10-18 15:57:22 |
| anyway, the header is not a solution ;) | 2009-10-18 15:57:39 |
<pinc> | well, the things messed again | 2009-10-18 15:58:30 |
<Shelwien> | dunno why | 2009-10-18 15:58:41 |
<pinc> | are you talking about raster image? | 2009-10-18 15:58:42 |
| or executable image? | 2009-10-18 15:58:48 |
<Shelwien> | i'm talking about compression obviously | 2009-10-18 15:58:49 |
<pinc> | compression is different | 2009-10-18 15:58:57 |
<Shelwien> | and detection of tables in the data | 2009-10-18 15:59:02 |
<pinc> | there is no silver bullet | 2009-10-18 15:59:04 |
<Shelwien> | sure | 2009-10-18 15:59:12 |
<pinc> | which works for every data sample | 2009-10-18 15:59:13 |
<Shelwien> | but as i said | 2009-10-18 15:59:19 |
| i can find some matching strings | 2009-10-18 15:59:32 |
| and analyse their distribution | 2009-10-18 15:59:43 |
<pinc> | lets stick for one type of data and discover it deeper | 2009-10-18 15:59:45 |
<Shelwien> | and if there's a period | 2009-10-18 15:59:51 |
| i can use that | 2009-10-18 15:59:56 |
| also its clear how to do it dumb way | 2009-10-18 16:00:08 |
| by collecting the stats of x[i]%c for all c, as i said | 2009-10-18 16:00:23 |
| but that's slow | 2009-10-18 16:00:28 |
| so i'm asking for a faster solution | 2009-10-18 16:00:35 |
<pinc> | you shouldn't collect it for all c, just for prime positions only )) | 2009-10-18 16:01:03 |
| but it sped up things only to log factor | 2009-10-18 16:01:34 |
<Shelwien> | yeah, that's more like what i'd like to hear ;) | 2009-10-18 16:01:53 |
| though its not that simple | 2009-10-18 16:01:58 |
| as you won't detect byte alignment in text | 2009-10-18 16:02:08 |
| by only testing for 2 ;) | 2009-10-18 16:02:19 |
| one idea i've just got | 2009-10-18 16:02:48 |
<pinc> | what is 'byte alignment'? did we talk about bits before? | 2009-10-18 16:02:53 |
<Shelwien> | is simply rewriting the %c via multiplication | 2009-10-18 16:03:07 |
| and vectorizing the calculations | 2009-10-18 16:03:18 |
| as to bits... | 2009-10-18 16:03:37 |
| i want a universal method | 2009-10-18 16:03:44 |
| and thing with bits is a reasonable example too | 2009-10-18 16:04:21 |
| ... | 2009-10-18 16:05:18 |
<pinc> | I think you should take some corpus of sample data and test ideas there | 2009-10-18 16:05:26 |
<Shelwien> | well, its a very general idea overall | 2009-10-18 16:05:31 |
| i'm just getting tired by these ad hoc "counters" and stuff used for modelling | 2009-10-18 16:06:02 |
| and the correct way to do it | 2009-10-18 16:06:28 |
| should be taking the data | 2009-10-18 16:06:40 |
| and clustering the contexts by some similarity | 2009-10-18 16:06:52 |
| and it seems reasonable | 2009-10-18 16:07:17 |
<pinc> | are you able to do it manually? e.g. for text? | 2009-10-18 16:07:20 |
<Shelwien> | to use the context matches as seeds for the clusters | 2009-10-18 16:07:42 |
| so the question appears about how to detect similarity between these matches | 2009-10-18 16:08:26 |
<pinc> | hamming? )) | 2009-10-18 16:08:57 |
<Shelwien> | and periodicity is part of that | 2009-10-18 16:08:58 |
<pinc> | levenstein | 2009-10-18 16:09:12 |
| minimal squares | 2009-10-18 16:09:16 |
<Shelwien> | i'm talking about a different thing kinda | 2009-10-18 16:09:33 |
| its unreasonable to do heavy comparisons for all pairs of strings etc | 2009-10-18 16:10:05 |
<pinc> | any data can be clustered using some metric | 2009-10-18 16:10:15 |
<Shelwien> | but then, usually there's some amount of full matches | 2009-10-18 16:10:34 |
<pinc> | computational complexity depends on metric choosen | 2009-10-18 16:10:42 |
<Shelwien> | well, atm i'm only asking how to detect periodicity | 2009-10-18 16:12:26 |
| in offsets of some samples | 2009-10-18 16:12:37 |
| preferably at O(N) or O(N*Log(N)) time | 2009-10-18 16:13:06 |
<pinc> | give me example of the data | 2009-10-18 16:13:30 |
<Shelwien> | it can be any data | 2009-10-18 16:13:59 |
| i'm talking about universal analysis | 2009-10-18 16:14:13 |
| but if there's some period in the data | 2009-10-18 16:14:35 |
| it has better to be detected | 2009-10-18 16:14:42 |
| in the case of text | 2009-10-18 16:15:36 |
| it would be 8 or 16 bit alignment of symbols | 2009-10-18 16:15:48 |
| and then lines maybe | 2009-10-18 16:16:07 |
<pinc> | use xor. 8/16bit alignment will be recognized quite quicky | 2009-10-18 16:16:47 |
<Shelwien> | xor of what? | 2009-10-18 16:16:58 |
| and how? | 2009-10-18 16:17:03 |
| as i said, there's an array of offsets | 2009-10-18 16:17:12 |
<pinc> | of data with data+alignment for various alignment | 2009-10-18 16:17:26 |
<Shelwien> | some subsets of which might be in the form a+b[i]*c | 2009-10-18 16:17:41 |
| and you're suggesting bruteforce, which is not a real option | 2009-10-18 16:18:20 |
| ...don't you have some more interesting ideas? like using FFT? ;) | 2009-10-18 16:19:12 |
<pinc> | I have to think )) | 2009-10-18 16:20:09 |
| linear regression could be done in many ways )) | 2009-10-18 16:20:45 |
<Shelwien> | well, current solution looks like only testing all powers of prime numbers | 2009-10-18 16:21:23 |
| and doing it with a vectorized multiplication instead of division | 2009-10-18 16:21:39 |
| btw | 2009-10-18 16:24:48 |
| as to samples of data to apply that | 2009-10-18 16:24:56 |
| its a real problem too, in fact | 2009-10-18 16:25:05 |
| well, i'm thinking to use BWT data as usual | 2009-10-18 16:25:17 |
| but that's not really a perfect sample | 2009-10-18 16:25:40 |
| but using text or executables or audio or images | 2009-10-18 16:26:03 |
| would be even worse | 2009-10-18 16:26:14 |
| because there's a lot of external info about these, especially text | 2009-10-18 16:26:30 |
| so no sense to apply such analysis to raw data | 2009-10-18 16:26:49 |
| but on other hand, parsers to produce the data for which there would be such sense | 2009-10-18 16:27:25 |
| are quite troublesome to write as well | 2009-10-18 16:27:48 |
*** Simon|B has left the channel | 2009-10-18 16:49:45 |
| !next | 2009-10-18 16:59:52 |