*** pinc has left the channel2009-10-18 12:17:46
*** pinc has joined the channel2009-10-18 12:23:55
*** Simon|B has joined the channel2009-10-18 15:04:59
<Simon|B> hi2009-10-18 15:05:10
<Shelwien> hi ;)2009-10-18 15:05:39
 a problem2009-10-18 15:20:23
 there's a set of integers x[]2009-10-18 15:21:12
 and we have a suspicion that a large percent of these2009-10-18 15:21:45
 can be represented as x[i]=a+b[i]*c2009-10-18 15:22:15
 so, how to efficiently find these?2009-10-18 15:22:40
<Simon|B> are doing any interesting things at the moment?2009-10-18 15:22:41
<Shelwien> dunno what's insteresting for you ;)2009-10-18 15:23:03
<Simon|B> are you talking to me or anyone else?2009-10-18 15:23:08
 many things ;)2009-10-18 15:23:17
<Shelwien> as to "problem" - that's for you ;)2009-10-18 15:23:29
 othere's here are mostly silent ;)2009-10-18 15:23:43
<Simon|B> yeah thought so2009-10-18 15:23:51
<Shelwien> and as to things2009-10-18 15:24:11
 i'm working on my backup for one job2009-10-18 15:24:26
<Simon|B> yes I remember2009-10-18 15:24:37
<Shelwien> and posted a few versions here2009-10-18 15:24:43
 and just made an iphone network stream player for my codec for another job2009-10-18 15:25:25
<Simon|B> sounds nice. I am thinking at the moment to buy a new handy and have the problem between iphone - N900 :-D2009-10-18 15:26:38
 Hard decision I guess2009-10-18 15:27:17
<Shelwien> well, iphone would be more convenient for me, i guess ;)2009-10-18 15:27:37
 though actually i can't say i like iphone as a phone or as a mp3 player2009-10-18 15:29:01
 there's a couple of convenient apps for it though2009-10-18 15:29:31
 but i really like how it has a complete unix environment2009-10-18 15:29:57
 so i can develop programs for it on windows (with cygwin toolchain)2009-10-18 15:30:32
 and run them with sftp/ssh on iphone2009-10-18 15:30:55
 also, its popular, and there're lots of resources2009-10-18 15:31:21
<pinc> could you provide some examples of x[i]?2009-10-18 15:40:46
<Shelwien> well, "obviously" i'm talking about structure detection in files2009-10-18 15:41:18
<pinc> it seems to be straight linear regression2009-10-18 15:41:31
 from your words at least ))2009-10-18 15:41:47
<Shelwien> so x[i] are basically offsets of some string occurences or something2009-10-18 15:42:09
 and dunno about it being so straight2009-10-18 15:42:18
 as they're not necessarily all like that2009-10-18 15:42:35
 basically we need to find "c" values to do a more thorough test2009-10-18 15:43:29
 so a straightforward solution2009-10-18 15:43:53
 would be something like collecting statistics on x[i]%c occurences2009-10-18 15:44:15
 for some set of c values2009-10-18 15:44:21
<pinc> I think you should start from the data2009-10-18 15:44:24
<Shelwien> but that doesn't seem like a good idea2009-10-18 15:44:44
 because division is slow2009-10-18 15:44:50
<pinc> you want to predict next offset by linear predictor, right?2009-10-18 15:45:27
<Shelwien> no2009-10-18 15:45:34
 i want to detect the record size in the file2009-10-18 15:45:55
 if there's a table2009-10-18 15:46:04
<pinc> now its clearer ))2009-10-18 15:46:23
 take a look on method to recognize random generators period2009-10-18 15:46:42
<Shelwien> well, but the table can be embedded in the data2009-10-18 15:46:57
 or there can be multiple tables with different record sizes2009-10-18 15:47:13
<pinc> ok, I've got your idea2009-10-18 15:47:37
<Shelwien> it includes stuff like detection of symbol alignment2009-10-18 15:48:34
 for example, ascii/unicode text2009-10-18 15:49:04
 by relation between offsets of bitstrings with multiple occurences2009-10-18 15:49:28
<pinc> xor data with multiple alignments2009-10-18 15:50:49
 alignment with least amplitude will be probable record size2009-10-18 15:51:18
<Shelwien> well, it might be ok with 2^n alignments only2009-10-18 15:51:59
<pinc> if on the next step you get amplitude much more than average - probably table has ended2009-10-18 15:52:05
<Shelwien> but i'm talking about more general case here2009-10-18 15:52:08
<pinc> I can't see any problem with non 2^n offsets2009-10-18 15:52:31
<Shelwien> basically periods up to at least 64k are quite usual2009-10-18 15:52:37
 the problem is trying them all2009-10-18 15:52:49
<pinc> 64K for one record?2009-10-18 15:52:57
<Shelwien> 64 bits?2009-10-18 15:53:11
 for an image?2009-10-18 15:53:14
 64k2009-10-18 15:53:19
<pinc> you mess the things up2009-10-18 15:53:39
<Shelwien> not really2009-10-18 15:53:46
<pinc> you are talking about DB or images?2009-10-18 15:53:55
<Shelwien> i'm talking about universal structure detector here2009-10-18 15:53:57
<pinc> images don't have record structure2009-10-18 15:54:07
<Shelwien> they do2009-10-18 15:54:13
 they're still tables2009-10-18 15:54:20
 and if you collect the stats of string matches for an image2009-10-18 15:54:58
 you'd likely get a periodic distribution2009-10-18 15:55:17
<pinc> tests to the studio! (c)2009-10-18 15:55:23
<Shelwien> just look at a raw image in any hex/text viewer2009-10-18 15:55:58
 (uncompressed i mean)2009-10-18 15:56:09
<Simon|B> sorry I had to go out. As to IPhpne vs N900. Yes but N900 has Unix system too and you can easier build/port applications to it I bet2009-10-18 15:56:30
<pinc> for me it looks highly improbable to get distinguished periods from the image except of obvious _width_ period2009-10-18 15:56:30
<Shelwien> obviously that's what i was talking about2009-10-18 15:56:55
<pinc> but you can easily get width from the header instead of calculating it2009-10-18 15:56:58
<Simon|B> Additional IPhone has to be cracked and this can be dangerous2009-10-18 15:57:06
<Shelwien> for example, you can have an image resource in executable2009-10-18 15:57:13
 for which you won't know the dimensions2009-10-18 15:57:22
 anyway, the header is not a solution ;)2009-10-18 15:57:39
<pinc> well, the things messed again2009-10-18 15:58:30
<Shelwien> dunno why2009-10-18 15:58:41
<pinc> are you talking about raster image?2009-10-18 15:58:42
 or executable image?2009-10-18 15:58:48
<Shelwien> i'm talking about compression obviously2009-10-18 15:58:49
<pinc> compression is different2009-10-18 15:58:57
<Shelwien> and detection of tables in the data2009-10-18 15:59:02
<pinc> there is no silver bullet2009-10-18 15:59:04
<Shelwien> sure2009-10-18 15:59:12
<pinc> which works for every data sample2009-10-18 15:59:13
<Shelwien> but as i said2009-10-18 15:59:19
 i can find some matching strings2009-10-18 15:59:32
 and analyse their distribution2009-10-18 15:59:43
<pinc> lets stick for one type of data and discover it deeper2009-10-18 15:59:45
<Shelwien> and if there's a period 2009-10-18 15:59:51
 i can use that2009-10-18 15:59:56
 also its clear how to do it dumb way2009-10-18 16:00:08
 by collecting the stats of x[i]%c for all c, as i said2009-10-18 16:00:23
 but that's slow2009-10-18 16:00:28
 so i'm asking for a faster solution2009-10-18 16:00:35
<pinc> you shouldn't collect it for all c, just for prime positions only ))2009-10-18 16:01:03
 but it sped up things only to log factor2009-10-18 16:01:34
<Shelwien> yeah, that's more like what i'd like to hear ;)2009-10-18 16:01:53
 though its not that simple2009-10-18 16:01:58
 as you won't detect byte alignment in text2009-10-18 16:02:08
 by only testing for 2 ;)2009-10-18 16:02:19
 one idea i've just got2009-10-18 16:02:48
<pinc> what is 'byte alignment'? did we talk about bits before?2009-10-18 16:02:53
<Shelwien> is simply rewriting the %c via multiplication2009-10-18 16:03:07
 and vectorizing the calculations2009-10-18 16:03:18
 as to bits...2009-10-18 16:03:37
 i want a universal method2009-10-18 16:03:44
 and thing with bits is a reasonable example too2009-10-18 16:04:21
 ...2009-10-18 16:05:18
<pinc> I think you should take some corpus of sample data and test ideas there2009-10-18 16:05:26
<Shelwien> well, its a very general idea overall2009-10-18 16:05:31
 i'm just getting tired by these ad hoc "counters" and stuff used for modelling2009-10-18 16:06:02
 and the correct way to do it2009-10-18 16:06:28
 should be taking the data2009-10-18 16:06:40
 and clustering the contexts by some similarity2009-10-18 16:06:52
 and it seems reasonable2009-10-18 16:07:17
<pinc> are you able to do it manually? e.g. for text?2009-10-18 16:07:20
<Shelwien> to use the context matches as seeds for the clusters2009-10-18 16:07:42
 so the question appears about how to detect similarity between these matches2009-10-18 16:08:26
<pinc> hamming? ))2009-10-18 16:08:57
<Shelwien> and periodicity is part of that2009-10-18 16:08:58
<pinc> levenstein2009-10-18 16:09:12
 minimal squares2009-10-18 16:09:16
<Shelwien> i'm talking about a different thing kinda2009-10-18 16:09:33
 its unreasonable to do heavy comparisons for all pairs of strings etc2009-10-18 16:10:05
<pinc> any data can be clustered using some metric2009-10-18 16:10:15
<Shelwien> but then, usually there's some amount of full matches2009-10-18 16:10:34
<pinc> computational complexity depends on metric choosen2009-10-18 16:10:42
<Shelwien> well, atm i'm only asking how to detect periodicity2009-10-18 16:12:26
 in offsets of some samples2009-10-18 16:12:37
 preferably at O(N) or O(N*Log(N)) time2009-10-18 16:13:06
<pinc> give me example of the data2009-10-18 16:13:30
<Shelwien> it can be any data2009-10-18 16:13:59
 i'm talking about universal analysis2009-10-18 16:14:13
 but if there's some period in the data2009-10-18 16:14:35
 it has better to be detected2009-10-18 16:14:42
 in the case of text2009-10-18 16:15:36
 it would be 8 or 16 bit alignment of symbols2009-10-18 16:15:48
 and then lines maybe2009-10-18 16:16:07
<pinc> use xor. 8/16bit alignment will be recognized quite quicky2009-10-18 16:16:47
<Shelwien> xor of what?2009-10-18 16:16:58
 and how?2009-10-18 16:17:03
 as i said, there's an array of offsets2009-10-18 16:17:12
<pinc> of data with data+alignment for various alignment2009-10-18 16:17:26
<Shelwien> some subsets of which might be in the form a+b[i]*c2009-10-18 16:17:41
 and you're suggesting bruteforce, which is not a real option2009-10-18 16:18:20
 ...don't you have some more interesting ideas? like using FFT? ;)2009-10-18 16:19:12
<pinc> I have to think ))2009-10-18 16:20:09
 linear regression could be done in many ways ))2009-10-18 16:20:45
<Shelwien> well, current solution looks like only testing all powers of prime numbers2009-10-18 16:21:23
 and doing it with a vectorized multiplication instead of division2009-10-18 16:21:39
 btw2009-10-18 16:24:48
 as to samples of data to apply that2009-10-18 16:24:56
 its a real problem too, in fact2009-10-18 16:25:05
 well, i'm thinking to use BWT data as usual2009-10-18 16:25:17
 but that's not really a perfect sample2009-10-18 16:25:40
 but using text or executables or audio or images2009-10-18 16:26:03
 would be even worse2009-10-18 16:26:14
 because there's a lot of external info about these, especially text2009-10-18 16:26:30
 so no sense to apply such analysis to raw data2009-10-18 16:26:49
 but on other hand, parsers to produce the data for which there would be such sense2009-10-18 16:27:25
 are quite troublesome to write as well2009-10-18 16:27:48
*** Simon|B has left the channel2009-10-18 16:49:45
 !next2009-10-18 16:59:52