Home > Archive > Open Source Software > May 2006 > Re: Algorithms for finding weak patterns in a data stream
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Re: Algorithms for finding weak patterns in a data stream
|
|
| Generic Usenet Account 2006-05-10, 7:07 pm |
| Richard Heathfield wrote:
> Generic Usenet Account said:
>
>
> The thing that concerns me most about this article is that either it doesn't
> actually contain the weak pattern that it claims to contain, or it contains
> a colossal number of weak patterns; for example:
>
> 2-4-8-0-2-4-6-0
> 7-1-3-5-7
> 2-7-2-3-5-7
> 0-0
> 0-1-0
> 1-2-3-4-5-6-7
> 2-4-8-1-6
>
> and so on.
>
> So let's try again - what does Generic User Account *really* mean by the
> term "weak pattern"?
Sorry for the confusion. Let me give a "stronger" definition for
"weak" patterns:)
A weak pattern is a set of symbols, not necessarily contiguous, which
repeats itself more than once in a data stream. For example, in the
data stream 2-4-7-8-0-1-2-3-5-4-6-7-0, the weak pattern 2-4-7-0 is
rpeated twice.
I would also like to re-characterize the problem as finding the longest
"weak" pattern in a data stream, given the definition of "weak" pattern
provided above.
Hope this helps!
Song
| |
| Chris Uppal 2006-05-11, 7:05 pm |
| Generic Usenet Account wrote:
> For example, in the data stream 2-4-7-8-0-1-2-3-5-4-6-0-7 (slighly
> modified version of the original data stream i.e.
> 2-4-7-8-0-1-2-3-5-4-6-7-0) the weak pattern 2-4-7-0 does not exist.
> However, the weak patterns 2-4-7 and 2-4-0 exist in the data stream.
More questions:
Is there a bound on how much noise is allowed in a pattern, in
particular can you restrict your attention to symbol pairs that appear
consecutively at least once somewhere ?
Is the alphabet of symbols known in advance ?
Is there a ranking which you should consider ? Do longer patterns
count for more than short ones ? Do more noisy patterns count more or
less than strong ones ? Do patterns which occur often count for more
than ones that do not. For instance in
1-2-3-1-2-3-1-2-3
there is the obvious 1-2-3 pattern, but it actually occurs 10 times,
the pattern 1-2-3-1-2-3 occurs 7 times, and the patterns
1-2-3-1-2-3-1-2-3 and 3-2-1 both occurs once.
-- chris
| |
| Chris Uppal 2006-05-11, 7:05 pm |
| Generic Usenet Account wrote:
> For example, in the data stream 2-4-7-8-0-1-2-3-5-4-6-0-7
> [...] the weak pattern 2-4-7-0 does not exist.
And still more questions ;-)
Are the repeated patterns allowed to overlap ? For instance in the
above example, the longest weakly repeated pattern is
2-3-5-4-6-0-7
which occurs twice, using the two different 2s.
Do you need to know /which/ pattern occurs more than once ? If there
are only N possible symbols in the alphabet, then you know for sure
that there is a weak repeated pattern if the sequence is longer than
2N, but you don't know which it is.
-- chris
| |
| Paul E. Black 2006-05-11, 7:05 pm |
| On Wednesday 10 May 2006 13:15, Generic Usenet Account wrote:
> A weak pattern is a set of symbols, not necessarily contiguous, which
> repeats itself more than once in a data stream. For example, in the
> data stream 2-4-7-8-0-1-2-3-5-4-6-7-0, the weak pattern 2-4-7-0 is
> rpeated twice.
>
> ... the problem [i]s finding the longest
> "weak" pattern in a data stream, ...
You said, "data stream". I will assume you mean a text which is all
available.
If some characters might only occur once, you can start by throwing
them out: they couldn't be in a repeated pattern. (If almost all
characters appear at least twice, it may not be worth the trouble.)
You might adapt longest-common subsequence
http://en.wikipedia.org/wiki/Longes...equence_problem
or minimal longest ascending subsequence
http://home.tiac.net/~cri/2001/mlas.html
You might use dynamic programming or some kind of suffix tree (really,
a subsequence tree) starting with common characters in the text.
-paul-
--
Paul E. Black (p.black@acm.org)
|
|
|
|
|