Home > Archive > Compression > February 2005 > Finding longest and most repeated substring
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Finding longest and most repeated substring
|
|
| dayzman@hotmail.com 2005-02-10, 3:55 am |
| Hi,
I'm trying to find the longest and most repeated substring using a
suffix tree. For example, with the string "20120120120120120201201201",
I would like the result to be "201". I've been trying to look at the
immediate children of the root and count their leaves. The result turns
out to be "20", because I'm taking the child with the most leaves as my
result. Am I doing something wrong? Is there a better solution for my
problem?
Cheers,
Michael
| |
| dayzman@hotmail.com 2005-02-10, 3:55 am |
| Let me clarify one thing first, I'm trying to find the substring that
has the maximal len(str) * num_repeats among all other substrings.
Any help will be much appreciated.
Cheers,
Michael
dayzman@hotmail.com wrote:
> Hi,
>
> I'm trying to find the longest and most repeated substring using a
> suffix tree. For example, with the string
"20120120120120120201201201",
> I would like the result to be "201". I've been trying to look at the
> immediate children of the root and count their leaves. The result
turns
> out to be "20", because I'm taking the child with the most leaves as
my
> result. Am I doing something wrong? Is there a better solution for my
> problem?
>
> Cheers,
> Michael
| |
| Alf P. Steinbach 2005-02-10, 3:55 am |
| * dayzman@hotmail.com:
> Hi,
>
> I'm trying to find the longest and most repeated substring using a
> suffix tree. For example, with the string "20120120120120120201201201",
> I would like the result to be "201". I've been trying to look at the
> immediate children of the root and count their leaves. The result turns
> out to be "20", because I'm taking the child with the most leaves as my
> result. Am I doing something wrong?
Obviously, since you get an incorrect answer.
> Is there a better solution for my problem?
Don't know. O(n^2) isn't good, but I don't see how this could be done
much more efficiently _if you want an exact answer_. Perhaps someone
else can help.
Pseudo, sketch (details may be wrong):
struct posAndCount{ int position; int count; }
struct maxInfo: posAndCount { int length };
int n = s.length();
maxInfo max = {0, 0, 0};
for( int len = 1; len <= n; ++len )
{
hashtable<posAndCount> hash( /*initial capacity:*/ 1.7*(n-len+1) );
hashkey key = /* key for first substring of length len */;
for( int i = 0; i < n-len+1; ++i )
{
// Update key incrementally for char at position 'i+len-1'.
posAndCount* hashed = hash[key];
hashed->position = i;
hashed->count += 1;
}
// Find max count in the hash, O(n).
// Update max
}
}
--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
| |
| gerard46 2005-02-10, 3:55 am |
| | dayzman wrote:
| I'm trying to find the longest and most repeated substring using a
| suffix tree. For example, with the string "20120120120120120201201201",
| I would like the result to be "201". I've been trying to look at the
| immediate children of the root and count their leaves. The result turns
| out to be "20", because I'm taking the child with the most leaves as my
| result. Am I doing something wrong? Is there a better solution for my
| problem?
Well, to me, the longest repeated substring in
20120120120120120201201201 is (note that you may have had a typo here):
201201201
If you had entered
201201201201201201201201201 instead, then the longest repeated substring is:
201201201201
BUT, it isn't the most repeated string.
-------------
What you are asking is:
OK, all you people, line up against the wall by increasing height and age.
Do you want the longest repeated substring, or
do you want the most repeated substring? ___________________Gerard S.
| |
| dayzman@hotmail.com 2005-02-10, 3:55 am |
| Hi,
The string "20120120120120120201201201" is correct. I purposely insert
"20" in it. I'm after the substring with the maximal len(substr) *
num_repeats. So, with "201201201", the value will only be
len('201201201") * 2 = 18. However, with the substring "201", the value
will be len("201") * 8 = 24.
I hope I have expressed myself better this time.
Cheers,
Michael
| |
| Bill Godfrey 2005-02-10, 8:55 am |
| dayzman@hotmail.com wrote:
> I hope I have expressed myself better this time.
Are there any other constraints? Are your required to do it in O()
anything?
What about an input of "abc" in "abcabcabcabc"? Is the answer "abc"
(3*4=12), "abcabcabcabc" (1*12=12) or is either answer correct?
Bill, deja vu string parser.
--
http://billpg.me.uk/
usenet(at)billpg(dot)me(dot)uk
| |
| Thad Smith 2005-02-10, 3:55 pm |
| dayzman@hotmail.com wrote:
> The string "20120120120120120201201201" is correct. I purposely insert
> "20" in it. I'm after the substring with the maximal len(substr) *
> num_repeats. So, with "201201201", the value will only be
> len('201201201") * 2 = 18. However, with the substring "201", the value
> will be len("201") * 8 = 24.
The number of repeats with "201201201" is 1, not 2. The number of
repeats of "201" is 7, not 8. If you want to maximize the substring
length times the number of substring occurrences, rather than substring
repetitions, then the full string will always have the maximum score of
len(fullstring)*1.
> I hope I have expressed myself better this time.
You should think about your criteria some more.
Thad
| |
| Alf P. Steinbach 2005-02-10, 3:55 pm |
| * Thad Smith:
> dayzman@hotmail.com wrote:
>
>
> The number of repeats with "201201201" is 1, not 2. The number of
> repeats of "201" is 7, not 8. If you want to maximize the substring
> length times the number of substring occurrences, rather than substring
> repetitions, then the full string will always have the maximum score of
> len(fullstring)*1.
*Applying axe to forehead* Ouch! Ouch!
Blindly assumed the question was meaningful.
Thank you.
>
> You should think about your criteria some more.
--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
| |
| Hans-Christoph Wirth 2005-02-10, 3:55 pm |
| Thad Smith <ThadSmith@acm.org> wrote:
>
> The number of repeats with "201201201" is 1, not 2. The number of
> repeats of "201" is 7, not 8. If you want to maximize the substring
> length times the number of substring occurrences, rather than substring
> repetitions, then the full string will always have the maximum score of
> len(fullstring)*1.
Unless you have something like fullstring = "aaaaaaaaaaaaaaaaaa"
where you have overlapping substrings and a score of O(len^2).
| |
| Alf P. Steinbach 2005-02-10, 3:55 pm |
| X-Trace: individual.net n4ZXv7+3mtVL6GLPsaLpeA14U1No5ZSDuG7hdoYv
4x5FV+hNGr
X-Newsreader: Forte Free Agent 1.21/32.243
Xref: number1.nntp.dca.giganews.com comp.compression:63463 comp.programming:204743 comp.theory:38881
* Hans-Christoph Wirth:
> Thad Smith <ThadSmith@acm.org> wrote:
>
> Unless you have something like fullstring = "aaaaaaaaaaaaaaaaaa"
> where you have overlapping substrings and a score of O(len^2).
*Applying that axe to the forehead, again!*
Thanks.
Am I dumb? Or tired? Or just plain old?
--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
| |
| GregoryD 2005-02-10, 3:55 pm |
| On Wed, 09 Feb 2005 18:07:19 -0800, dayzman wrote:
> Hi,
>
> I'm trying to find the longest and most repeated substring using a
> suffix tree. For example, with the string "20120120120120120201201201",
> I would like the result to be "201". I've been trying to look at the
> immediate children of the root and count their leaves. The result turns
> out to be "20", because I'm taking the child with the most leaves as my
> result. Am I doing something wrong? Is there a better solution for my
> problem?
One of the problems a lot of programmers face is that they don't define
their rules well enough.
Ask yourself these questions:
What is my definition of a substring?
Does that definition differ from the rules of the language I'm using?
Here's some questions and answers to different problems. See where I'm
going with this.
Q1: Which substring(s) repeat(s) the most?
A: "2", "0", "20" depending on how the language defines the minimum length
of the substring.
Now...
Q2: What repeating substring is the longest?
A: "201201201"
But if you combine the two:
Q3: Given the substrings with the most repetitions, which substring is the
longest?
A: "20"
What you're trying to find given your statement is the substring that
repeats the most given a length >= 3, but that isn't what you asked the
computer to do. You asked it Q3.
If you give it: "20120120120120155201201201", what's the output? That
should be 201.
Seems to me that your algorithm is answering the question you gave it, but
you didn't form your input correctly. It's just hard to determine from
what you said.
GregoryD
| |
| Daniel Lidström 2005-02-10, 8:55 pm |
| On Wed, 09 Feb 2005 19:20:17 -0800, dayzman wrote:
> Let me clarify one thing first, I'm trying to find the substring that
> has the maximal len(str) * num_repeats among all other substrings.
>
> Any help will be much appreciated.
I think this will do it:
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
using namespace std;
int main()
{
string text;
cout << "Enter string: "; cin >> text;
int bestscore = 0;
vector<string> best_substrings;
//! search for all substrings
for( int start_pos=0; start_pos<text.size()-1; start_pos++ ) {
for( int length=1; length<=text.size()-start_pos; length++ ) {
// create substring
string substring(text, start_pos, length);
//! count occurrences of substring
int occurrences = 0;
int search_position = start_pos;
int found_position = text.find(substring, search_position);
while( found_position<text.size() ) {
occurrences ++;
search_position = found_position+1;
found_position = text.find(substring, search_position);
}
//! save best scored substrings
int score = occurrences*length;
if( score>bestscore ) {
best_substrings.clear();
best_substrings.push_back(substring);
bestscore = score;
}
else if( score==bestscore )
best_substrings.push_back(substring);
}
}
cout << "Best score: " << bestscore << ". Best substring(s): ";
copy(best_substrings.begin(), best_substrings.end(),
ostream_iterator<string>(cout, " "));
cout << endl;
return 0;
}
Any comments?
--
Daniel
| |
|
| Hi, dayz...
I was answering your question several w s ago on a different thread.
I suggest using Chrochemore algorithm and you can solve your problem in
O(n*log(n)) time complexity and O(n^3) space complexity.
But if you try to use this idea to compress data you optimization
criterion(frequency*length) is not useful. You should try a more
complicated criterion. Consider W a word over an alphabet V and H(W)
it's entropy. Consider F a factor (substring) of W, and W' the string
obtained
by replacing all occurrences of F in W by a new symbol #.
The score S(F) = |W| * H(W) - [ |W'| * H(W') + |F| ] ; if S(F) is
positive, at least theoretically, F can be used to rewrite W in a more
compact manner.
If you find this short comment not very useful, you can find a list
with articles about Offline dictionary compression at: http://
www.mathamthism.3x.ro
Regards,
Popai
| |
|
| Hi, dayz...
I was answering your question several w s ago on a different thread.
I suggest using Chrochemore algorithm and you can solve your problem in
O(n*log(n)) time complexity and O(n^3) space complexity.
But if you try to use this idea to compress data you optimization
criterion(frequency*length) is not useful. You should try a more
complicated criterion. Consider W a word over an alphabet V and H(W)
it's entropy. Consider F a factor (substring) of W, and W' the string
obtained
by replacing all occurrences of F in W by a new symbol #.
The score S(F) = |W| * H(W) - [ |W'| * H(W') + |F| ] ; if S(F) is
positive, at least theoretically, F can be used to rewrite W in a more
compact manner.
If you find this short comment not very useful, you can find a list
with articles about Offline dictionary compression at: http://
www.mathemathism.3x.ro
Regards,
Popai
| |
| Risto Lankinen 2005-02-11, 8:55 am |
|
<dayzman@hotmail.com> wrote in message
news:1108001239.083520.160300@o13g2000cwo.googlegroups.com...
> Hi,
>
> I'm trying to find the longest and most repeated substring using a
> suffix tree. For example, with the string "20120120120120120201201201",
> I would like the result to be "201". I've been trying to look at the
> immediate children of the root and count their leaves. The result turns
> out to be "20", because I'm taking the child with the most leaves as my
> result. Am I doing something wrong? Is there a better solution for my
> problem?
The two criteria ("longest" and "most repeated") are incompatible.
The "20" is indeed the "most repeated" substring, but "201" is not
the longest one ("201201201" is).
Try to formulate a different criteria which takes both into account.
- Risto -
| |
| Phil Carmody 2005-02-11, 3:55 pm |
| "Popai" <ionut.popa@gmail.com> writes:
> Hi, dayz...
>
>
> I was answering your question several w s ago on a different thread.
> I suggest using Chrochemore algorithm and you can solve your problem in
>
> O(n*log(n)) time complexity and O(n^3) space complexity.
You can't touch O(n^3) space in less than O(n^3) time.
If your data is sparse, then you can trivially trade space for time.
Associative memory, for example.
Phil
--
If a religion is defined to be a system of ideas that contains unprovable
statements, then Godel taught us that mathematics is not only a religion, it
is the only religion that can prove itself to be one. -- John Barrow
|
|
|
|
|