For Programmers: Free Programming Magazines  


Home > Archive > Open Source Software > May 2006 > Re: Algorithms for finding weak patterns in a data stream









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Re: Algorithms for finding weak patterns in a data stream
wkaras@yahoo.com

2006-05-10, 7:07 pm

Generic Usenet Account wrote:
> Generic Usenet Account wrote:
>
> Before I get slammed again, let me clarify the definition further. A
> weak pattern is a collection (not set, since set implies ordering) of
> symbols, not necessarily contiguous, which repeats itself more than
> once in a data stream. Even though contiguity of symbols within a
> pattern is not important, the order of the symbols within the pattern
> is important.
>
> For example, in the data stream 2-4-7-8-0-1-2-3-5-4-6-0-7 (slighly
> modified version of the original data stream i.e.
> 2-4-7-8-0-1-2-3-5-4-6-7-0) the weak pattern 2-4-7-0 does not exist.
> However, the weak patterns 2-4-7 and 2-4-0 exist in the data stream.
>
> Needless to say, the algorithm does not assume prior knowledge of these
> patterns. The algorithm should be able to discover them.
>
> -Song


Here's a brute-force approach to finding patterns of up to 4 digits.
The next_digit function is called for each digit in the sequence
(with the calls in the same order as the sequence).

typedef struct
{
/* Number of occurences of pattern. */
unsigned count;
/* 1-base position of end of last occurence of pattern */
unsigned last_pos;
}
T_pat_data;

/* Dummy. */
T_pat_data len0[1] = { 0 };
/* Histogram of each digit. */
T_pat_data len1[10] = { 0 };
/* len2[10 * a + b].count = occurences of pattern 'ab' */
T_pat_data len2[10 * 10] = { 0 };
/* len3[100 * a + 10 * b + c].count = occurences of pattern 'abc' */
T_pat_data len3[10 * 10 * 10] = { 0 };
/* len4[1000 * a + 100 * b + 10 * c + d].count = occurences of
** pattern 'abcd' */
T_pat_data len4[10 * 10 * 10 * 10] = { 0 };

T_pat_data *data[5] = { len0, len1, len2, len3, len4 };

unsigned pos = 0;

void next_digit(unsigned dig)
{
unsigned i, end, idx;

pos++;

for (idx = 4, end = 1000; idx > 0; idx--, end /= 10)
for (i = 0; i < end; i++)
if (data[idx][i * 10 + dig].last_pos <= data[idx -
1][i].last_pos)
{
data[idx][i * 10 + dig].count++;
data[idx][i * 10 + dig].last_pos = pos;
}
}

Note that I am counting overlapping occurences, for example,
545454 would count as 2 occurences of the pattern 5454
(but not 3).

Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com