*** pinc has left the channel		2009-10-18 12:17:46
*** pinc has joined the channel		2009-10-18 12:23:55
*** Simon\|B has joined the channel		2009-10-18 15:04:59
<Simon\|B>	hi	2009-10-18 15:05:10
<Shelwien>	hi ;)	2009-10-18 15:05:39
	a problem	2009-10-18 15:20:23
	there's a set of integers x[]	2009-10-18 15:21:12
	and we have a suspicion that a large percent of these	2009-10-18 15:21:45
	can be represented as x[i]=a+b[i]*c	2009-10-18 15:22:15
	so, how to efficiently find these?	2009-10-18 15:22:40
<Simon\|B>	are doing any interesting things at the moment?	2009-10-18 15:22:41
<Shelwien>	dunno what's insteresting for you ;)	2009-10-18 15:23:03
<Simon\|B>	are you talking to me or anyone else?	2009-10-18 15:23:08
	many things ;)	2009-10-18 15:23:17
<Shelwien>	as to "problem" - that's for you ;)	2009-10-18 15:23:29
	othere's here are mostly silent ;)	2009-10-18 15:23:43
<Simon\|B>	yeah thought so	2009-10-18 15:23:51
<Shelwien>	and as to things	2009-10-18 15:24:11
	i'm working on my backup for one job	2009-10-18 15:24:26
<Simon\|B>	yes I remember	2009-10-18 15:24:37
<Shelwien>	and posted a few versions here	2009-10-18 15:24:43
	and just made an iphone network stream player for my codec for another job	2009-10-18 15:25:25
<Simon\|B>	sounds nice. I am thinking at the moment to buy a new handy and have the problem between iphone - N900 :-D	2009-10-18 15:26:38
	Hard decision I guess	2009-10-18 15:27:17
<Shelwien>	well, iphone would be more convenient for me, i guess ;)	2009-10-18 15:27:37
	though actually i can't say i like iphone as a phone or as a mp3 player	2009-10-18 15:29:01
	there's a couple of convenient apps for it though	2009-10-18 15:29:31
	but i really like how it has a complete unix environment	2009-10-18 15:29:57
	so i can develop programs for it on windows (with cygwin toolchain)	2009-10-18 15:30:32
	and run them with sftp/ssh on iphone	2009-10-18 15:30:55
	also, its popular, and there're lots of resources	2009-10-18 15:31:21
<pinc>	could you provide some examples of x[i]?	2009-10-18 15:40:46
<Shelwien>	well, "obviously" i'm talking about structure detection in files	2009-10-18 15:41:18
<pinc>	it seems to be straight linear regression	2009-10-18 15:41:31
	from your words at least ))	2009-10-18 15:41:47
<Shelwien>	so x[i] are basically offsets of some string occurences or something	2009-10-18 15:42:09
	and dunno about it being so straight	2009-10-18 15:42:18
	as they're not necessarily all like that	2009-10-18 15:42:35
	basically we need to find "c" values to do a more thorough test	2009-10-18 15:43:29
	so a straightforward solution	2009-10-18 15:43:53
	would be something like collecting statistics on x[i]%c occurences	2009-10-18 15:44:15
	for some set of c values	2009-10-18 15:44:21
<pinc>	I think you should start from the data	2009-10-18 15:44:24
<Shelwien>	but that doesn't seem like a good idea	2009-10-18 15:44:44
	because division is slow	2009-10-18 15:44:50
<pinc>	you want to predict next offset by linear predictor, right?	2009-10-18 15:45:27
<Shelwien>	no	2009-10-18 15:45:34
	i want to detect the record size in the file	2009-10-18 15:45:55
	if there's a table	2009-10-18 15:46:04
<pinc>	now its clearer ))	2009-10-18 15:46:23
	take a look on method to recognize random generators period	2009-10-18 15:46:42
<Shelwien>	well, but the table can be embedded in the data	2009-10-18 15:46:57
	or there can be multiple tables with different record sizes	2009-10-18 15:47:13
<pinc>	ok, I've got your idea	2009-10-18 15:47:37
<Shelwien>	it includes stuff like detection of symbol alignment	2009-10-18 15:48:34
	for example, ascii/unicode text	2009-10-18 15:49:04
	by relation between offsets of bitstrings with multiple occurences	2009-10-18 15:49:28
<pinc>	xor data with multiple alignments	2009-10-18 15:50:49
	alignment with least amplitude will be probable record size	2009-10-18 15:51:18
<Shelwien>	well, it might be ok with 2^n alignments only	2009-10-18 15:51:59
<pinc>	if on the next step you get amplitude much more than average - probably table has ended	2009-10-18 15:52:05
<Shelwien>	but i'm talking about more general case here	2009-10-18 15:52:08
<pinc>	I can't see any problem with non 2^n offsets	2009-10-18 15:52:31
<Shelwien>	basically periods up to at least 64k are quite usual	2009-10-18 15:52:37
	the problem is trying them all	2009-10-18 15:52:49
<pinc>	64K for one record?	2009-10-18 15:52:57
<Shelwien>	64 bits?	2009-10-18 15:53:11
	for an image?	2009-10-18 15:53:14
	64k	2009-10-18 15:53:19
<pinc>	you mess the things up	2009-10-18 15:53:39
<Shelwien>	not really	2009-10-18 15:53:46
<pinc>	you are talking about DB or images?	2009-10-18 15:53:55
<Shelwien>	i'm talking about universal structure detector here	2009-10-18 15:53:57
<pinc>	images don't have record structure	2009-10-18 15:54:07
<Shelwien>	they do	2009-10-18 15:54:13
	they're still tables	2009-10-18 15:54:20
	and if you collect the stats of string matches for an image	2009-10-18 15:54:58
	you'd likely get a periodic distribution	2009-10-18 15:55:17
<pinc>	tests to the studio! (c)	2009-10-18 15:55:23
<Shelwien>	just look at a raw image in any hex/text viewer	2009-10-18 15:55:58
	(uncompressed i mean)	2009-10-18 15:56:09
<Simon\|B>	sorry I had to go out. As to IPhpne vs N900. Yes but N900 has Unix system too and you can easier build/port applications to it I bet	2009-10-18 15:56:30
<pinc>	for me it looks highly improbable to get distinguished periods from the image except of obvious _width_ period	2009-10-18 15:56:30
<Shelwien>	obviously that's what i was talking about	2009-10-18 15:56:55
<pinc>	but you can easily get width from the header instead of calculating it	2009-10-18 15:56:58
<Simon\|B>	Additional IPhone has to be cracked and this can be dangerous	2009-10-18 15:57:06
<Shelwien>	for example, you can have an image resource in executable	2009-10-18 15:57:13
	for which you won't know the dimensions	2009-10-18 15:57:22
	anyway, the header is not a solution ;)	2009-10-18 15:57:39
<pinc>	well, the things messed again	2009-10-18 15:58:30
<Shelwien>	dunno why	2009-10-18 15:58:41
<pinc>	are you talking about raster image?	2009-10-18 15:58:42
	or executable image?	2009-10-18 15:58:48
<Shelwien>	i'm talking about compression obviously	2009-10-18 15:58:49
<pinc>	compression is different	2009-10-18 15:58:57
<Shelwien>	and detection of tables in the data	2009-10-18 15:59:02
<pinc>	there is no silver bullet	2009-10-18 15:59:04
<Shelwien>	sure	2009-10-18 15:59:12
<pinc>	which works for every data sample	2009-10-18 15:59:13
<Shelwien>	but as i said	2009-10-18 15:59:19
	i can find some matching strings	2009-10-18 15:59:32
	and analyse their distribution	2009-10-18 15:59:43
<pinc>	lets stick for one type of data and discover it deeper	2009-10-18 15:59:45
<Shelwien>	and if there's a period	2009-10-18 15:59:51
	i can use that	2009-10-18 15:59:56
	also its clear how to do it dumb way	2009-10-18 16:00:08
	by collecting the stats of x[i]%c for all c, as i said	2009-10-18 16:00:23
	but that's slow	2009-10-18 16:00:28
	so i'm asking for a faster solution	2009-10-18 16:00:35
<pinc>	you shouldn't collect it for all c, just for prime positions only ))	2009-10-18 16:01:03
	but it sped up things only to log factor	2009-10-18 16:01:34
<Shelwien>	yeah, that's more like what i'd like to hear ;)	2009-10-18 16:01:53
	though its not that simple	2009-10-18 16:01:58
	as you won't detect byte alignment in text	2009-10-18 16:02:08
	by only testing for 2 ;)	2009-10-18 16:02:19
	one idea i've just got	2009-10-18 16:02:48
<pinc>	what is 'byte alignment'? did we talk about bits before?	2009-10-18 16:02:53
<Shelwien>	is simply rewriting the %c via multiplication	2009-10-18 16:03:07
	and vectorizing the calculations	2009-10-18 16:03:18
	as to bits...	2009-10-18 16:03:37
	i want a universal method	2009-10-18 16:03:44
	and thing with bits is a reasonable example too	2009-10-18 16:04:21
	...	2009-10-18 16:05:18
<pinc>	I think you should take some corpus of sample data and test ideas there	2009-10-18 16:05:26
<Shelwien>	well, its a very general idea overall	2009-10-18 16:05:31
	i'm just getting tired by these ad hoc "counters" and stuff used for modelling	2009-10-18 16:06:02
	and the correct way to do it	2009-10-18 16:06:28
	should be taking the data	2009-10-18 16:06:40
	and clustering the contexts by some similarity	2009-10-18 16:06:52
	and it seems reasonable	2009-10-18 16:07:17
<pinc>	are you able to do it manually? e.g. for text?	2009-10-18 16:07:20
<Shelwien>	to use the context matches as seeds for the clusters	2009-10-18 16:07:42
	so the question appears about how to detect similarity between these matches	2009-10-18 16:08:26
<pinc>	hamming? ))	2009-10-18 16:08:57
<Shelwien>	and periodicity is part of that	2009-10-18 16:08:58
<pinc>	levenstein	2009-10-18 16:09:12
	minimal squares	2009-10-18 16:09:16
<Shelwien>	i'm talking about a different thing kinda	2009-10-18 16:09:33
	its unreasonable to do heavy comparisons for all pairs of strings etc	2009-10-18 16:10:05
<pinc>	any data can be clustered using some metric	2009-10-18 16:10:15
<Shelwien>	but then, usually there's some amount of full matches	2009-10-18 16:10:34
<pinc>	computational complexity depends on metric choosen	2009-10-18 16:10:42
<Shelwien>	well, atm i'm only asking how to detect periodicity	2009-10-18 16:12:26
	in offsets of some samples	2009-10-18 16:12:37
	preferably at O(N) or O(N*Log(N)) time	2009-10-18 16:13:06
<pinc>	give me example of the data	2009-10-18 16:13:30
<Shelwien>	it can be any data	2009-10-18 16:13:59
	i'm talking about universal analysis	2009-10-18 16:14:13
	but if there's some period in the data	2009-10-18 16:14:35
	it has better to be detected	2009-10-18 16:14:42
	in the case of text	2009-10-18 16:15:36
	it would be 8 or 16 bit alignment of symbols	2009-10-18 16:15:48
	and then lines maybe	2009-10-18 16:16:07
<pinc>	use xor. 8/16bit alignment will be recognized quite quicky	2009-10-18 16:16:47
<Shelwien>	xor of what?	2009-10-18 16:16:58
	and how?	2009-10-18 16:17:03
	as i said, there's an array of offsets	2009-10-18 16:17:12
<pinc>	of data with data+alignment for various alignment	2009-10-18 16:17:26
<Shelwien>	some subsets of which might be in the form a+b[i]*c	2009-10-18 16:17:41
	and you're suggesting bruteforce, which is not a real option	2009-10-18 16:18:20
	...don't you have some more interesting ideas? like using FFT? ;)	2009-10-18 16:19:12
<pinc>	I have to think ))	2009-10-18 16:20:09
	linear regression could be done in many ways ))	2009-10-18 16:20:45
<Shelwien>	well, current solution looks like only testing all powers of prime numbers	2009-10-18 16:21:23
	and doing it with a vectorized multiplication instead of division	2009-10-18 16:21:39
	btw	2009-10-18 16:24:48
	as to samples of data to apply that	2009-10-18 16:24:56
	its a real problem too, in fact	2009-10-18 16:25:05
	well, i'm thinking to use BWT data as usual	2009-10-18 16:25:17
	but that's not really a perfect sample	2009-10-18 16:25:40
	but using text or executables or audio or images	2009-10-18 16:26:03
	would be even worse	2009-10-18 16:26:14
	because there's a lot of external info about these, especially text	2009-10-18 16:26:30
	so no sense to apply such analysis to raw data	2009-10-18 16:26:49
	but on other hand, parsers to produce the data for which there would be such sense	2009-10-18 16:27:25
	are quite troublesome to write as well	2009-10-18 16:27:48
*** Simon\|B has left the channel		2009-10-18 16:49:45
	!next	2009-10-18 16:59:52