Home > Archive > PERL Miscellaneous > April 2005 > Matching mixed up words
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Matching mixed up words
|
|
| Michael T. Davis 2005-04-12, 8:59 pm |
| Say I want to match "gremlin" or the letters that compose the word
"gremlin", but in any order. Note that once "g" is consumed, the set of
available letters no longer includes "g". (Also, "g" isn't necessarily
going to be the first letter.) I would anticipate that a proper solution
for a word of <N> letters would approach a complexity (or "big O") of N!
(read "N factorial"). Is there a solution which could be implemented as
a single match, or would this require some extra code around a match?
Thanks,
Mike
--
| Systems Specialist: CBE,MSE
Michael T. Davis | Departmental Networking/Computing
http://www.ecr6.ohio-state.edu/~davism/ | The Ohio State University
| 197 Watts, (614) 292-6928
| |
| A. Sinan Unur 2005-04-12, 8:59 pm |
| DAVISM@ecr6.ohio-state.edu (Michael T. Davis) wrote in
news:d3h7bu$hbt$1@charm.magnus.acs.ohio-state.edu:
> Say I want to match "gremlin" or the letters that compose the
> word "gremlin", but in any order. Note that once "g" is consumed,
> the set of available letters no longer includes "g". (Also, "g" isn't
> necessarily going to be the first letter.) I would anticipate that a
> proper solution for a word of <N> letters would approach a complexity
> (or "big O") of N! (read "N factorial").
You are too pessimistic :)
> Is there a solution which could be implemented as a single match,
> or would this require some extra code around a match?
I don't see any mention of regexes in your post. I am not sure if that
is what you are after. There is a simple solution to this that falls
directly from your explanation of the problem:
use strict;
use warnings;
use Data::Dumper;
sub check {
my ($orig, $target) = @_;
my %c;
use integer;
my @l = split //, $orig;
++$c{$_} for @l;
@l = split //, $target;
for (@l) {
if(exists($c{$_}) and $c{$_}) {
--$c{$_};
}
}
@l = grep { $_ > 0 } values %c;
scalar @l ? 0 : 1;
}
my %check = (
sinan => [ 'nasin', 'nasina', 'lasin' ],
gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
);
for my $k (keys %check) {
for my $t (@{ $check{$k} }) {
print "$k matches $t?: ";
if(check($k, $t)) {
print "Yes\n";
} else {
print "No\n";
}
}
}
__END__
I am sure someone will show a regex solution that I have overlooked.
By the way, your signature is not formatted properly:
> --
The proper signature marker is two dashes followed by a space and a
newline. Please do use that.
Sinan
--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(reverse each component and remove .invalid for email address)
comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/c...guidelines.html
| |
| thundergnat 2005-04-12, 8:59 pm |
| Michael T. Davis wrote:
> Say I want to match "gremlin" or the letters that compose the word
> "gremlin", but in any order. Note that once "g" is consumed, the set of
> available letters no longer includes "g". (Also, "g" isn't necessarily
> going to be the first letter.) I would anticipate that a proper solution
> for a word of <N> letters would approach a complexity (or "big O") of N!
> (read "N factorial"). Is there a solution which could be implemented as
> a single match, or would this require some extra code around a match?
>
I'm sure it could be done more efficiently but it was an interesting
little diversion. I wandered a little from the OPs spec since I am
ignoring spaces, punctuation and case, I guess.
use warnings;
use strict;
my $phrase = 'George W. Bush';
my %letters;
for (split//, $phrase){
$letters{lc($_)}++ if /[a-zA-Z]/;
}
while (<DATA> ){
chomp (my $test_phrase = $_);
my $no_match;
my %testhash = %letters;
for (split//, $test_phrase){
if (/[a-zA-Z]/){
if (--$testhash{lc($_)} < 0){
$no_match++;
last;
}
}
}
for (values %testhash){
last if $no_match;
if ($_ < 0){
$no_match++;
}
}
print "Phrase \"$test_phrase\" ".($no_match ?
'does not match' : 'matches')." $phrase.\n";
}
__DATA__
NOT A MATCH
SHRUB EGG WOE
BUG GORE HEWS
GOB SEWER HUG
WEB USER GOGH
RUBES EGG WHO
BUG GREW HOSE
WHOSE BUGGER
BEG WORSE UGH
A BOGUS ENTRY
| |
| xhoster@gmail.com 2005-04-12, 8:59 pm |
| DAVISM@ecr6.ohio-state.edu (Michael T. Davis) wrote:
> Say I want to match "gremlin" or the letters that compose the
> word "gremlin", but in any order. Note that once "g" is consumed, the
> set of available letters no longer includes "g". (Also, "g" isn't
> necessarily going to be the first letter.) I would anticipate that a
> proper solution for a word of <N> letters would approach a complexity (or
> "big O") of N! (read "N factorial"). Is there a solution which could be
> implemented as a single match, or would this require some extra code
> around a match?
canon("gremlin") eq canon($foo) or die;
sub canon {
join "", sort split //, $_[0];
};
Xho
--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
| |
| A. Sinan Unur 2005-04-12, 8:59 pm |
| xhoster@gmail.com wrote in news:20050412170728.908$gZ@newsreader.com:
> DAVISM@ecr6.ohio-state.edu (Michael T. Davis) wrote:
....
[color=darkred]
> canon("gremlin") eq canon($foo) or die;
>
> sub canon {
> join "", sort split //, $_[0];
> };
That's what I call the power of a clear mind :)
Simple and elegant.
Sinan
--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(reverse each component and remove .invalid for email address)
comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/c...guidelines.html
| |
| Tassilo v. Parseval 2005-04-12, 8:59 pm |
| Also sprach A. Sinan Unur:
> DAVISM@ecr6.ohio-state.edu (Michael T. Davis) wrote in
> news:d3h7bu$hbt$1@charm.magnus.acs.ohio-state.edu:
>
>
> You are too pessimistic :)
>
>
> I don't see any mention of regexes in your post. I am not sure if that
> is what you are after. There is a simple solution to this that falls
> directly from your explanation of the problem:
>
> use strict;
> use warnings;
>
> use Data::Dumper;
>
> sub check {
> my ($orig, $target) = @_;
>
> my %c;
>
> use integer;
> my @l = split //, $orig;
> ++$c{$_} for @l;
>
> @l = split //, $target;
> for (@l) {
> if(exists($c{$_}) and $c{$_}) {
> --$c{$_};
> }
> }
>
> @l = grep { $_ > 0 } values %c;
> scalar @l ? 0 : 1;
> }
>
> my %check = (
> sinan => [ 'nasin', 'nasina', 'lasin' ],
> gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
> );
>
> for my $k (keys %check) {
> for my $t (@{ $check{$k} }) {
> print "$k matches $t?: ";
> if(check($k, $t)) {
> print "Yes\n";
> } else {
> print "No\n";
> }
> }
> }
A faster solution appears to involve sort(): split both strings, sort
them and compare for equality. According to a benchmark:
use strict;
use Benchmark qw/cmpthese/;
sub check {
my ($orig, $target) = @_;
my %c;
use integer;
my @l = split //, $orig;
++$c{$_} for @l;
@l = split //, $target;
for (@l) {
if(exists($c{$_}) and $c{$_}) {
--$c{$_};
}
}
@l = grep { $_ > 0 } values %c;
scalar @l ? 0 : 1;
}
sub check_sort {
my ($orig, $target) = @_;
my $o = join '', sort split //, $orig;
my $t = join '', sort split //, $target;
return $o eq $t;
}
my %check = (
sinan => [ 'nasin', 'nasina', 'lasin', ],
gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
);
cmpthese(-2, {
histo => sub {
for my $k (keys %check) {
for my $t (@{ $check{$k} }) {
check($k, $t);
}
}
},
sort => sub {
for my $k (keys %check) {
for my $t (@{ $check{$k} }) {
check_sort($k, $t);
}
}
},
});
__END__
Rate histo sort
histo 2280/s -- -43%
sort 3992/s 75% --
This might however be due to a denser implementation of check_sort()
avoiding temporary variables etc.
Also, check_sort() is more correct as it wont falsely report 'sinan' and
'nasina' as matching, which check() does. ;-) I'd write check() thusly:
sub check {
my ($orig, $target) = @_;
my %c;
++$c{$_} for split //, $orig;
--$c{$_} for split //, $target;
return ! grep $_, values %c;
}
This is still slower by roughly 25% than using sort. The 'use integer'
appears to have no effect on the benchmark.
Tassilo
--
use bigint;
$n=7142335034377028016139702633033737113
9054411854220053437565440;
$m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($m+=8)<=200);
| |
| A. Sinan Unur 2005-04-13, 3:59 am |
| "Tassilo v. Parseval" <tassilo.von.parseval@rwth-aachen.de> wrote in
news:slrnd5ohfe.178.tassilo.von.parseval@localhost.localdomain:
> sub check_sort {
> my ($orig, $target) = @_;
> my $o = join '', sort split //, $orig;
> my $t = join '', sort split //, $target;
> return $o eq $t;
> }
....
> Also, check_sort() is more correct as it wont falsely report 'sinan'
> and 'nasina' as matching, which check() does. ;-)
And to think that I actually look at the output, and somehow did not
notice my error. Thank you for catching that.
Sinan
--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(reverse each component and remove .invalid for email address)
comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/c...guidelines.html
| |
| Michael T. Davis 2005-04-13, 3:59 am |
| Just to be clear, I'm looking for a regex-based mechanism that will
work within the confines of "m/.../". I would imagine it's going to need to
rely on the "(${code})" construct.
BTW, my signature includes a trailing space at the end of the first
line, but the gateway I'm using apparently strips it off. I have alerted
them to the mistake.
Regards,
Mike
--
| Systems Specialist: CBE,MSE
Michael T. Davis | Departmental Networking/Computing
http://www.ecr6.ohio-state.edu/~davism/ | The Ohio State University
| 197 Watts, (614) 292-6928
| |
| Tassilo v. Parseval 2005-04-13, 3:59 am |
| Also sprach Michael T. Davis:
> Just to be clear, I'm looking for a regex-based mechanism that will
> work within the confines of "m/.../". I would imagine it's going to need to
> rely on the "(${code})" construct.
Most likely even (??{CODE}). However, any of my attempts so far ended up
in a segmentation fault or 'panic: '. I knew that some of these extended
patterns are flagged as experimental but I didn't expect them to be that
fragile. It's tricky enough coming up with a pure regex solution but
here you'll also need to find one that wont crash perl. So I wouldn't
bother.
Tassilo
--
use bigint;
$n=7142335034377028016139702633033737113
9054411854220053437565440;
$m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($m+=8)<=200);
| |
| Anno Siegel 2005-04-13, 8:56 am |
| A. Sinan Unur <1usa@llenroc.ude.invalid> wrote in comp.lang.perl.misc:
> DAVISM@ecr6.ohio-state.edu (Michael T. Davis) wrote in
> news:d3h7bu$hbt$1@charm.magnus.acs.ohio-state.edu:
[how to test for anagrams]
> use strict;
> use warnings;
>
> use Data::Dumper;
>
> sub check {
> my ($orig, $target) = @_;
>
> my %c;
>
> use integer;
> my @l = split //, $orig;
> ++$c{$_} for @l;
>
> @l = split //, $target;
> for (@l) {
> if(exists($c{$_}) and $c{$_}) {
> --$c{$_};
> }
> }
>
> @l = grep { $_ > 0 } values %c;
> scalar @l ? 0 : 1;
> }
>
> my %check = (
> sinan => [ 'nasin', 'nasina', 'lasin' ],
> gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
> );
>
> for my $k (keys %check) {
> for my $t (@{ $check{$k} }) {
> print "$k matches $t?: ";
> if(check($k, $t)) {
> print "Yes\n";
> } else {
> print "No\n";
> }
> }
> }
>
>
> __END__
>
>
> I am sure someone will show a regex solution that I have overlooked.
A regex solution seems unlikely. It would require jumping back and
forth in a string while keeping track of what was matched where.
Regexes aren't very good at that.
Using a hash for counting is just fine. It is basically a well known
data structure that implements what has been called "bags". Bags are
like sets, but each element (a hash key) can be contained multiple times
(the hash value). Containment and equality of bags are defined in the
obvious way. Then, to check if two strings are anagrams, create the
corresponding bags and test for equality. Code:
my %check = (
sinan => [ 'nasin', 'nasina', 'lasin' ],
gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
);
for my $k (keys %check) {
my $bk = Bag->embag( $k);
for my $t (@{ $check{$k} }) {
print "$k matches $t?: ";
print $bk eq Bag->embag( $t) ? "Yes\n" : "No\n";
}
}
########################################
#################################
package Bag;
sub embag { # create a bag of letters from a string
my $class = shift;
my %bag;
$bag{ $_} ++ for split //, shift;
bless \ %bag, $class;
}
sub contained {
my ( $b1, $b2) = @_;
$b2->{ $_} and $b1->{ $_} > $b2->{ $_} and return 0 for keys %$b1;
1;
}
use overload(
le => 'contained',
eq => sub { $_[ 0] le $_[ 1] and $_[ 1] le $_[ 0] },
);
__END__
Anno
| |
| Anno Siegel 2005-04-13, 8:56 am |
| Tassilo v. Parseval <tassilo.von.parseval@rwth-aachen.de> wrote in comp.lang.perl.misc:
> Also sprach A. Sinan Unur:
>
[anagram detection by counting letters]
[color=darkred]
> A faster solution appears to involve sort(): split both strings, sort
> them and compare for equality. According to a benchmark:
[benchmark snipped]
Interesting, since counting is linear and sorting is n*log n. Presumably,
with huge words, counting would win in the end, but there probably
never was a language (not even German) with words long enough to bring
out the difference.
Anno
| |
| Tassilo v. Parseval 2005-04-13, 8:56 am |
| Also sprach Anno Siegel:
> Tassilo v. Parseval <tassilo.von.parseval@rwth-aachen.de> wrote in comp.lang.perl.misc:
>
> [anagram detection by counting letters]
>
>
> [benchmark snipped]
>
> Interesting, since counting is linear and sorting is n*log n. Presumably,
> with huge words, counting would win in the end, but there probably
> never was a language (not even German) with words long enough to bring
> out the difference.
Altering the benchmark a little so that we can change the length of the
words more easily:
use Benchmark qw/cmpthese/;
sub check {
my ($orig, $target) = @_;
my %c;
++$c{$_} for split //, $orig;
--$c{$_} for split //, $target;
return ! grep $_, values %c;
}
sub check_sort {
my ($orig, $target) = @_;
my $o = join '', sort split //, $orig;
my $t = join '', sort split //, $target;
return $o eq $t;
}
my $len = shift;
my $key = join '', map { ['a'..'z']->[rand 26] } 1 .. $len;
cmpthese(-2, {
histo => sub {
check($key, scalar reverse $key);
},
sort => sub {
check_sort($key, scalar reverse $key);
},
});
$len = 20:
Rate histo sort
histo 8029/s -- -19%
sort 9962/s 24% --
$len = 50:
Rate histo sort
histo 3600/s -- -10%
sort 4015/s 12% --
$len = 100:
Rate histo sort
histo 1912/s -- -3%
sort 1981/s 4% --
$len = 200:
Rate sort histo
sort 972/s -- -5%
histo 1018/s 5% --
Aha! So the words need to be unrealistically long in order for the
linear method to win. Which says quite something about the efficiency of
perl's sort implementations. Of course, check() could be made to return
earlier, for instance when a negative value shows up in the second
for-loop. The same is true for the final grep().
Still, for real-world words I suspect using sort() is still a very
efficient (both coding- and runtime-wise) solution.
Tassilo
--
use bigint;
$n=7142335034377028016139702633033737113
9054411854220053437565440;
$m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($m+=8)<=200);
| |
| Anno Siegel 2005-04-13, 3:58 pm |
| Tassilo v. Parseval <tassilo.von.parseval@rwth-aachen.de> wrote in comp.lang.perl.misc:
> Also sprach Anno Siegel:
> comp.lang.perl.misc:
>
> Altering the benchmark a little so that we can change the length of the
> words more easily:
[shortened]
> $len = 100:
>
> Rate histo sort
> histo 1912/s -- -3%
> sort 1981/s 4% --
>
> $len = 200:
>
> Rate sort histo
> sort 972/s -- -5%
> histo 1018/s 5% --
>
> Aha! So the words need to be unrealistically long in order for the
> linear method to win.
I wouldn't have been amazed to find the crossover length even higher,
at 1000 or so.
> Which says quite something about the efficiency of
> perl's sort implementations. Of course, check() could be made to return
> earlier, for instance when a negative value shows up in the second
> for-loop. The same is true for the final grep().
List::Util::first is the grep replacement for that. How much it
saves depends heavily on the distribution of the strings. If
strings vary wildly, it can save a lot, if most comparisons are
for almost-anagrams it won't save so much.
> Still, for real-world words I suspect using sort() is still a very
> efficient (both coding- and runtime-wise) solution.
In one implementation I used byte vectors for letter counting, a la
embag {
my $bag = '';
++ vec( $bag, ord $_, 8) for split //, shift;
$bag;
}
which is easily Inline-able. Equality of counts is 'eq', like with
sorting. The count vectors can be compacted using another level of
indirection ( $charno[ ord $_] instead of ord $_), which is still
easily Inlined. Sorting was no option for the application, so I never
benchmarked against it, but I'd expect the Inlined code to be in the
same ballpark, even for short strings.
Anno
| |
| Ilmari Karonen 2005-04-13, 8:57 pm |
| Anno Siegel <anno4000@lublin.zrz.tu-berlin.de> kirjoitti 13.04.2005:
> A. Sinan Unur <1usa@llenroc.ude.invalid> wrote in comp.lang.perl.misc:
>
> [how to test for anagrams]
>
>
> A regex solution seems unlikely. It would require jumping back and
> forth in a string while keeping track of what was matched where.
> Regexes aren't very good at that.
Nonetheless, here's a regex solution:
sub anagram_re {
my $word = shift;
return "" if $word eq "";
my (@re, %seen);
foreach my $i (0 .. length($word)-1) {
my $temp = $word;
my $ch = substr($temp, $i, 1, "");
next if $seen{$ch}++;
push @re, quotemeta($ch) . anagram_re($temp);
}
return @re > 1 ? "(?:".join("|", @re).")" : $re[0];
}
Give it a word, and it will return a regex to match any anagram of it.
For example, here's the regex for "food" (sans "?:" modifiers):
(f(o(od|do)|doo)|o(f(od|do)|o(fd|df)|d(f
o|of))|d(foo|o(fo|of)))
The corresponding regex for "gremlin" is 33218 characters long with
the "?:" modifiers, or 25978 without them.
--
Ilmari Karonen
To reply by e-mail, please replace ".invalid" with ".net" in address.
| |
| Anno Siegel 2005-04-13, 8:57 pm |
| Ilmari Karonen <usenet@vyznev.invalid> wrote in comp.lang.perl.misc:
> Anno Siegel <anno4000@lublin.zrz.tu-berlin.de> kirjoitti 13.04.2005:
>
> Nonetheless, here's a regex solution:
>
> sub anagram_re {
> my $word = shift;
> return "" if $word eq "";
> my (@re, %seen);
> foreach my $i (0 .. length($word)-1) {
> my $temp = $word;
> my $ch = substr($temp, $i, 1, "");
> next if $seen{$ch}++;
> push @re, quotemeta($ch) . anagram_re($temp);
> }
> return @re > 1 ? "(?:".join("|", @re).")" : $re[0];
> }
>
> Give it a word, and it will return a regex to match any anagram of it.
> For example, here's the regex for "food" (sans "?:" modifiers):
>
> (f(o(od|do)|doo)|o(f(od|do)|o(fd|df)|d(f
o|of))|d(foo|o(fo|of)))
Nice hack...
> The corresponding regex for "gremlin" is 33218 characters long with
> the "?:" modifiers, or 25978 without them.
....and mostly useless. I like it :)
Anno
| |
| Ilmari Karonen 2005-04-14, 8:57 pm |
| Anno Siegel <anno4000@lublin.zrz.tu-berlin.de> kirjoitti 13.04.2005:
> Ilmari Karonen <usenet@vyznev.invalid> wrote in comp.lang.perl.misc:
>
> Nice hack...
>
>
> ...and mostly useless. I like it :)
It does, however, have one advantage -- it's fast. Really fast. Over
an order of magnitude faster than any other solution in this thread so
far, in fact.
Of course, that's only if you ignore the time to build and compile the
regex. And it only works for fairly short words anyway.
But if you want a solution that both runs _and_ starts fast, here's
something adapted from an earlier thread titled "perl scramble":
#!/usr/bin/perl -w
use strict;
my $word = shift;
my $canon = join "", sort split //, "$word\n";
my $code = q{
while (<> ) {
print if length == length $canon
and !tr/LETTERS//c
and $canon eq join "", sort split //;
}
};
$code =~ s/LETTERS/\Q$canon/;
eval $code; die if $@;
This assumes input comes from a file (or stdin), but it can be easily
modified to, say, grep an array.
--
Ilmari Karonen
To reply by e-mail, please replace ".invalid" with ".net" in address.
| |
| Anno Siegel 2005-04-14, 8:57 pm |
| Ilmari Karonen <usenet@vyznev.invalid> wrote in comp.lang.perl.misc:
> Anno Siegel <anno4000@lublin.zrz.tu-berlin.de> kirjoitti 13.04.2005:
[...]
[color=darkred]
> But if you want a solution that both runs _and_ starts fast, here's
> something adapted from an earlier thread titled "perl scramble":
>
> #!/usr/bin/perl -w
> use strict;
>
> my $word = shift;
> my $canon = join "", sort split //, "$word\n";
>
> my $code = q{
> while (<> ) {
> print if length == length $canon
> and !tr/LETTERS//c
> and $canon eq join "", sort split //;
> }
> };
> $code =~ s/LETTERS/\Q$canon/;
> eval $code; die if $@;
>
> This assumes input comes from a file (or stdin), but it can be easily
> modified to, say, grep an array.
Ah, that's basically the "sort-solution", but the length and tr/// tests
speed it up. Sorting only happens when a word is entirely made of the
same letters, but in different numbers (with the same total). That helps
a lot in typical situations when most candidates are not anagrams. Nifty.
Anno
| |
| Abigail 2005-04-14, 8:57 pm |
| Tassilo v. Parseval (tassilo.von.parseval@rwth-aachen.de) wrote on
MMMMCCXLIII September MCMXCIII in <URL:news:slrnd5pcas.pf.tassilo.von.parseval@localhost.localdomain>:
|| Also sprach Michael T. Davis:
||
|| > Just to be clear, I'm looking for a regex-based mechanism that will
|| > work within the confines of "m/.../". I would imagine it's going to need to
|| > rely on the "(${code})" construct.
||
|| Most likely even (??{CODE}). However, any of my attempts so far ended up
|| in a segmentation fault or 'panic: '. I knew that some of these extended
|| patterns are flagged as experimental but I didn't expect them to be that
|| fragile. It's tricky enough coming up with a pure regex solution but
|| here you'll also need to find one that wont crash perl. So I wouldn't
|| bother.
#!/usr/bin/perl
use strict;
use warnings;
no warnings qw /syntax/;
my $word = "gremlin";
my $ana = "nlmregi";
my (%h);
print $word =~
/^(?{%h = ()})
(?{$h {substr $ana => $_, 1} ++ for 0 .. length ($ana) - 1})
(?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
$(?(?{grep {$_} values %h}})(?!)|)/x ? "match\n" : "no match\n";
__END__
HTH. HAND.
Abigail
--
$_ = "\nrekcaH lreP rehtona tsuJ"; my $chop; $chop = sub {print chop; $chop};
$chop -> () -> () -> () -> () -> () -> () -> () -> () -> () -> () -> () -> ()
-> () -> () -> () -> () -> () -> () -> () -> () -> () -> () -> () -> () -> ()
| |
| Tassilo v. Parseval 2005-04-15, 3:58 am |
| Also sprach Abigail:
> Tassilo v. Parseval (tassilo.von.parseval@rwth-aachen.de) wrote on
> MMMMCCXLIII September MCMXCIII in <URL:news:slrnd5pcas.pf.tassilo.von.parseval@localhost.localdomain>:
>|| Also sprach Michael T. Davis:
>||
>|| > Just to be clear, I'm looking for a regex-based mechanism that will
>|| > work within the confines of "m/.../". I would imagine it's going to need to
>|| > rely on the "(${code})" construct.
>||
>|| Most likely even (??{CODE}). However, any of my attempts so far ended up
>|| in a segmentation fault or 'panic: '. I knew that some of these extended
>|| patterns are flagged as experimental but I didn't expect them to be that
>|| fragile. It's tricky enough coming up with a pure regex solution but
>|| here you'll also need to find one that wont crash perl. So I wouldn't
>|| bother.
>
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
> no warnings qw /syntax/;
>
> my $word = "gremlin";
> my $ana = "nlmregi";
>
> my (%h);
> print $word =~
> /^(?{%h = ()})
> (?{$h {substr $ana => $_, 1} ++ for 0 .. length ($ana) - 1})
> (?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
> $(?(?{grep {$_} values %h}})(?!)|)/x ? "match\n" : "no match\n";
Hmmh, this doesn't compile:
Sequence (?{...}) not terminated or not {}-balanced in regex; marked by <-- HERE in m/^(?{%h = ()})
(?{$h {substr $ana => $_, 1} ++ for 0 .. length ($ana) - 1})
(?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
$(?(?{ <-- HERE grep {$_} values %h}})(?!)|)/ at - line 13.
I can get rid of these by strategically inserting a few spaces here and
there, but then it eventually complains about an "Unknown switch
condition".
Tassilo
--
use bigint;
$n=7142335034377028016139702633033737113
9054411854220053437565440;
$m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($m+=8)<=200);
| |
| Abigail 2005-04-15, 3:58 am |
| Tassilo v. Parseval (tassilo.von.parseval@rwth-aachen.de) wrote on
MMMMCCXLIV September MCMXCIII in <URL:news:slrnd5tv3o.27g.tassilo.von.parseval@localhost.localdomain>:
,, Also sprach Abigail:
,,
,, > [ Snip ]
,,
,, Hmmh, this doesn't compile:
,,
,, Sequence (?{...}) not terminated or not {}-balanced in regex; marked by <-- HERE in m/^(?{%h = ()})
,, (?{$h {substr $ana => $_, 1} ++ for 0 .. length ($ana) - 1})
,, (?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
,, $(?(?{ <-- HERE grep {$_} values %h}})(?!)|)/ at - line 13.
Cut-and-paste error. Retry:
#!/usr/bin/perl
use strict;
use warnings;
no warnings qw /syntax/;
my $word = "gremlin";
my $ana = "nlmregi";
my (%h);
print $word =~
/^(?{%h = ()})
(?{$h {substr $ana => $_, 1} ++ for 0 .. length ($ana) - 1})
(?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
$(?(?{grep {$_} values %h})(?!)|)/x
? "match\n" : "no match\n";
__END__
Abigail
--
map{${+chr}=chr}map{$_=>$_^ord$"}$=+$]..3*$=/2;
print "$J$u$s$t $a$n$o$t$h$e$r $P$e$r$l $H$a$c$k$e$r\n";
| |
| Fabian Pilkowski 2005-04-15, 3:58 am |
| * Tassilo v. Parseval schrieb:
> Also sprach Abigail:
^^[color=darkred]
>
> Hmmh, this doesn't compile:
Delete one of those marked curly parentheses and it'll work fine.
regards,
fabian
| |
| Tassilo v. Parseval 2005-04-15, 3:58 am |
| Also sprach Abigail:
> #!/usr/bin/perl
>
> use strict;
> use warnings;
> no warnings qw /syntax/;
>
> my $word = "gremlin";
> my $ana = "nlmregi";
>
> my (%h);
> print $word =~
> /^(?{%h = ()})
> (?{$h {substr $ana => $_, 1} ++ for 0 .. length ($ana) - 1})
> (?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
> $(?(?{grep {$_} values %h})(?!)|)/x
> ? "match\n" : "no match\n";
Indeed, this is much better. Interestingly enough, it stops working when
using split:
use strict;
use warnings;
no warnings qw /syntax/;
my $word = "gremlin";
my $ana = "nlmregi";
my (%h);
print $word =~ m#
^(?{%h = ()})
(?{$h {$_} ++ for split //, $ana})
(?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
$(?(?{grep {$_} values %h})(?!)|)#x
? "match\n" : "no match\n";
As far as I see it, this code should be functionally equivalent to
yours. Probably these extended patterns don't work too well when another
pattern match happens inside.
Tassilo
--
use bigint;
$n=7142335034377028016139702633033737113
9054411854220053437565440;
$m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($m+=8)<=200);
| |
| Abigail 2005-04-15, 8:57 am |
| Tassilo v. Parseval (tassilo.von.parseval@rwth-aachen.de) wrote on
MMMMCCXLV September MCMXCIII in <URL:news:slrnd5ul67.pq.tassilo.von.parseval@localhost.localdomain>:
()
() Indeed, this is much better. Interestingly enough, it stops working when
() using split:
()
() use strict;
() use warnings;
() no warnings qw /syntax/;
()
() my $word = "gremlin";
() my $ana = "nlmregi";
()
() my (%h);
() print $word =~ m#
() ^(?{%h = ()})
() (?{$h {$_} ++ for split //, $ana})
() (?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
() $(?(?{grep {$_} values %h})(?!)|)#x
() ? "match\n" : "no match\n";
()
() As far as I see it, this code should be functionally equivalent to
() yours. Probably these extended patterns don't work too well when another
() pattern match happens inside.
The regex engine is not re-entrant. So you can't use split inside
a regex.
(And # is a nasty delimiter to use in combination with /x).
Abigail
--
perl -wle 'print "Prime" if (0 x shift) !~ m 0^\0?$|^(\0\0+?)\1+$0'
| |
| Tassilo v. Parseval 2005-04-15, 8:57 am |
| Also sprach Abigail:
> Tassilo v. Parseval (tassilo.von.parseval@rwth-aachen.de) wrote on
> MMMMCCXLV September MCMXCIII in <URL:news:slrnd5ul67.pq.tassilo.von.parseval@localhost.localdomain>:
> ()
> () Indeed, this is much better. Interestingly enough, it stops working when
> () using split:
> ()
> () use strict;
> () use warnings;
> () no warnings qw /syntax/;
> ()
> () my $word = "gremlin";
> () my $ana = "nlmregi";
> ()
> () my (%h);
> () print $word =~ m#
> () ^(?{%h = ()})
> () (?{$h {$_} ++ for split //, $ana})
> () (?: (.) (?(?{$h {$1} --> 0})|(?!)) )*
> () $(?(?{grep {$_} values %h})(?!)|)#x
> () ? "match\n" : "no match\n";
> ()
> () As far as I see it, this code should be functionally equivalent to
> () yours. Probably these extended patterns don't work too well when another
> () pattern match happens inside.
>
>
> The regex engine is not re-entrant. So you can't use split inside
> a regex.
I guess this the case. Yet, I wonder why perl lets me do it. It could
easily barf when it encounters anything that requires re-entrantness
in (?{}) and (??{}).
> (And # is a nasty delimiter to use in combination with /x).
Next to /, it's my default-delimiter. As I virtually never use /x, I
seldom question its use. :-)
Tassilo
--
use bigint;
$n=7142335034377028016139702633033737113
9054411854220053437565440;
$m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($m+=8)<=200);
| |
| Michael T. Davis 2005-04-21, 3:58 pm |
| I really appreciate all the work you folks have put into answering
this for me. I think I'm close to implementing something, but as you may
have guessed, it's time for me to complicate things, again. (;-)
The whole system I'm developing works like this... I have some string
obtained from an environment variable. I also have a list of potential
pattern matches in a text file. My code reads the file and attempts to match
each pattern against the env. string until it hits the end-of-file or a match
occurs. Each line from the data file is the argument to a case-insensitive
regex match (i.e. "m//i", or -- for the pattern "<pattern>" read from the
file -- "m/(<pattern> )/i"), so the code looks something like this:
[...]
# $string is value obtained from environment
while ( $pattern = <> ) {
$pattern = '(' . $pattern . ')';
last if ( $match ) = $string =~ /$pattern/i;
[...] }
Since $pattern is basically just what's being read in from the data file, I'm
looking for a way to handle anagram matching within it. Here's something
along these lines (though it's not correct):
\sg((?i)[a-z]+?)s?\W(?{anagram($+,'host','houl','remlin')})
(The Perl code might include "use re 'eval'" [I know...nasty], and will include
"sub anagram {...}".) I'm having trouble understanding how I might trigger a
match for the whole regex iff (if and only if) the result returned by anagram()
indicates a match. (If there's a way to avoid "use re 'eval'", that would be
helpful, too.)
Thanks,
Mike
--
| Systems Specialist: CBE,MSE
Michael T. Davis | Departmental Networking/Computing
http://www.ecr6.ohio-state.edu/~davism/ | The Ohio State University
| 197 Watts, (614) 292-6928
| |
| Fabian Pilkowski 2005-04-21, 8:57 pm |
| * Michael T. Davis schrieb:
> I really appreciate all the work you folks have put into answering
> this for me. I think I'm close to implementing something, but as you may
> have guessed, it's time for me to complicate things, again. (;-)
>
> The whole system I'm developing works like this... I have some string
> obtained from an environment variable. I also have a list of potential
> pattern matches in a text file. My code reads the file and attempts to match
> each pattern against the env. string until it hits the end-of-file or a match
> occurs. Each line from the data file is the argument to a case-insensitive
> regex match (i.e. "m//i", or -- for the pattern "<pattern>" read from the
> file -- "m/(<pattern> )/i"), so the code looks something like this:
>
> [...]
> # $string is value obtained from environment
> while ( $pattern = <> ) {
> $pattern = '(' . $pattern . ')';
> last if ( $match ) = $string =~ /$pattern/i;
> [...] }
>
> Since $pattern is basically just what's being read in from the data file, I'm
> looking for a way to handle anagram matching within it. Here's something
> along these lines (though it's not correct):
>
> \sg((?i)[a-z]+?)s?\W(?{anagram($+,'host','houl','remlin')})
>
> (The Perl code might include "use re 'eval'" [I know...nasty], and will include
> "sub anagram {...}".) I'm having trouble understanding how I might trigger a
> match for the whole regex iff (if and only if) the result returned by anagram()
> indicates a match.
I think, Abigail has shown such a trick in this thread. Scout about :
(?(?{grep {$_} values %h})(?!)|)
The inner (?{code}) is what you want to eval, surrounded by a construct
called
(?(condition)yes-pattern|no-pattern)
in `perldoc perlre`. Furthermore, »(?!)« never matchs, while the empty
pattern matchs always.
> (If there's a way to avoid "use re 'eval'", that would be
> helpful, too.)
You're reading the patterns from a file, so you have them in vars. And
one has to use »use re 'eval'« if you have (?{code}) blocks and variable
interpolation in one regex. But, have you read `perldoc re` already?
Especially:
For the purpose of this pragma, interpolation of precompiled regular
expressions (i.e., the result of qr//) is not considered variable
interpolation. Thus:
/foo${pat}bar/
is allowed if $pat is a precompiled regular expression, even if $pat
contains (?{ ... }) assertions.
Perhaps this precompilation could help you.
regards,
fabian
| |
| Michael T. Davis 2005-04-22, 3:57 am |
|
In article <3cqi4sF6n2237U1@individual.net>, Fabian Pilkowski
<pilkowsk@informatik.uni-marburg.de> writes:
>* Michael T. Davis schrieb:
>
> match
> match
> I'm
> include
> a
> anagram()
>
>I think, Abigail has shown such a trick in this thread. Scout about :
>
> (?(?{grep {$_} values %h})(?!)|)
>
>The inner (?{code}) is what you want to eval, surrounded by a construct
>called
>
> (?(condition)yes-pattern|no-pattern)
>
>in `perldoc perlre`. Furthermore, »(?!)« never matchs, while the empty
>pattern matchs always.
OK, so I need to modify my test pattern as follows:
\sg((?i)[a-z]+?)s?\W(?(?{anagram($+,'host','houl','remlin')}).*$|(?!))
I read this as...
Match a (plural) anagram of "ghost", "ghoul" or "gremlin" with a leading
white-space character and a trailing non-word character. If the anagram
matches, consume the rest of the line; otherwise, let the match fail.
Here's my anagram code:
use re 'eval';
sub anagram
{
my ( $result, $t, $target );
$result = '';
$target = shift ( @_ );
$t = join "", sort { lc ( $a ) cmp lc ( $b ) } split //, $target;
foreach $word ( @_ )
{
my $w;
next if length ( $word ) != length ( $target );
$w = join "", sort { lc ( $a ) cmp lc ( $b ) } split //, $word;
$result = $w eq $t;
last if $result
}
if ( $result )
{
return $target
}
else
{
return $result
}
}
Through a diagnosic print statement, I can confirm that my test string of
" ermlgin " would match, but the match test that calls anagram via re-eval
isn't matching:
( $match ) = $string =~ /($pattern)/i
($string = " ermlgin " and $pattern is the pattern, above.) What am I
missing?
>
>
>You're reading the patterns from a file, so you have them in vars. And
>one has to use »use re 'eval'« if you have (?{code}) blocks and variable
>interpolation in one regex. But, have you read `perldoc re` already?
>Especially:
>
> For the purpose of this pragma, interpolation of precompiled regular
> expressions (i.e., the result of qr//) is not considered variable
> interpolation. Thus:
>
> /foo${pat}bar/
>
> is allowed if $pat is a precompiled regular expression, even if $pat
> contains (?{ ... }) assertions.
>
>Perhaps this precompilation could help you.
Yeah, I'll look into the potential of integrating this, once I can get
the matching to work.
>
>regards,
>fabian
Thanks,
Mike
--
| Systems Specialist: CBE,MSE
Michael T. Davis | Departmental Networking/Computing
http://www.ecr6.ohio-state.edu/~davism/ | The Ohio State University
| 197 Watts, (614) 292-6928
| |
| Fabian Pilkowski 2005-04-22, 8:56 pm |
| * Michael T. Davis schrieb:
> In article <3cqi4sF6n2237U1@individual.net>, Fabian Pilkowski
> <pilkowsk@informatik.uni-marburg.de> writes:
>
> OK, so I need to modify my test pattern as follows:
>
> \sg((?i)[a-z]+?)s?\W(?(?{anagram($+,'host','houl','remlin')}).*$|(?!))
>
> I read this as...
>
> Match a (plural) anagram of "ghost", "ghoul" or "gremlin" with a leading
> white-space character and a trailing non-word character. If the anagram
> matches, consume the rest of the line; otherwise, let the match fail.
I think, you misread a little bit ;-) I read:
Match one white-space followed by "g". Thereafter a few alphabetic
chars with a trailing optional "s". A non-word char must follow. I
agree with: If the anagram matches, consume the rest of the line;
otherwise, let the match fail.
Now, have a look at your test string of " ermlgin " you mentioned below.
It does start with »\s« but that is *not* followed by "g". Therefore the
whole regex cannot match this string. Even if you omit the »\s« at the
beginning, $+ would contain "in" (the part behind "g"). I think you want
do something like
\s([a-zA-Z]+)s?\W(?(? {anagram($+,'ghost','ghoul','gremlin')})
.*$|(?!))
Btw, »[a-zA-Z]+« is shorter than »(?i)[a-z]+«, and the non-greedy »?«
isn't needed due to the »\W« you enforced thereafter.
>
> Here's my anagram code:
>
> use re 'eval';
> sub anagram
> {
> my ( $result, $t, $target );
>
> $result = '';
> $target = shift ( @_ );
> $t = join "", sort { lc ( $a ) cmp lc ( $b ) } split //, $target;
> foreach $word ( @_ )
> {
> my $w;
> next if length ( $word ) != length ( $target );
> $w = join "", sort { lc ( $a ) cmp lc ( $b ) } split //, $word;
> $result = $w eq $t;
> last if $result
> }
> if ( $result )
> {
> return $target
> }
> else
> {
> return $result
> }
> }
>
> Through a diagnosic print statement, I can confirm that my test string of
> " ermlgin " would match, but the match test that calls anagram via re-eval
> isn't matching:
>
> ( $match ) = $string =~ /($pattern)/i
>
> ($string = " ermlgin " and $pattern is the pattern, above.) What am I
> missing?
You haven't read Abigail's postings carefully enough. You cannot use any
other regex in the re-eval, so split() isn't allowed inside of anagram()
you are calling there. Have you tried to print() a debug message inside
of anagram()?. I had, but nothing is printed out -- but why? Ok, the sub
anagram() is only called if there are no regexes inside, I guess.
From that place you have to rewrite anagram(). I'd do it this way:
sub anagram {
my $result = '';
my $target = shift ( @_ );
my $t = join "", sort { lc ( $a ) cmp lc ( $b ) }
map { substr $target, $_, 1 } 0 .. length($target)-1;
foreach my $word ( @_ ) {
next if length ( $word ) != length ( $target );
my $w = join "", sort { lc ( $a ) cmp lc ( $b ) }
map { substr $word, $_, 1 } 0 .. length($word)-1;
$result = $w eq $t;
last if $result
}
return $result ? $target : $result;
}
I haven't checked if this sub is doing what you want -- I just replaced
the split() calls and summarized the return statement. It seems there
are more things you could improve. I hope I have understood what you
want -- hence, I tested it this way:
my $string = " start ermlgin ";
my $pattern = qr/\s([a-zA-Z]+)s?\W(?(? {anagram($+,'ghost','ghoul','gremlin')})
.*$|(?!))/;
my( $match ) = $string =~ m/($pattern)/i;
print "[$match]";
For me, this prints out
[ ermlign ]
what is what you want, I thought. Btw, there is no »use re 'eval'« in my
test script. I hope this could help you.
regards,
fabian
| |
| Michael T. Davis 2005-04-23, 3:57 pm |
|
In article <3ctartF6p4jr0U1@individual.net>, Fabian Pilkowski
<pilkowsk@informatik.uni-marburg.de> writes:
>* Michael T. Davis schrieb:
>
>
>I think, you misread a little bit ;-) I read:
>
> Match one white-space followed by "g". Thereafter a few alphabetic
> chars with a trailing optional "s". A non-word char must follow. I
> agree with: If the anagram matches, consume the rest of the line;
> otherwise, let the match fail.
>
>Now, have a look at your test string of " ermlgin " you mentioned below.
>It does start with »\s« but that is *not* followed by "g". Therefore the
>whole regex cannot match this string. Even if you omit the »\s« at the
>beginning, $+ would contain "in" (the part behind "g"). I think you want
>do something like
>
> \s([a-zA-Z]+)s?\W(?(? {anagram($+,'ghost','ghoul','gremlin')})
.*$|(?!))
I made the mistake of over-generalizing the test string. Thanks for
pointing out my obvious idiocy. (;-)
>
>Btw, »[a-zA-Z]+« is shorter than »(?i)[a-z]+«, and the non-greedy »?«
>isn't needed due to the »\W« you enforced thereafter.
Point taken.
>
>
>You haven't read Abigail's postings carefully enough. You cannot use any
>other regex in the re-eval, so split() isn't allowed inside of anagram()
>you are calling there. Have you tried to print() a debug message inside
>of anagram()?. I had, but nothing is printed out -- but why? Ok, the sub
>anagram() is only called if there are no regexes inside, I guess.
>
>From that place you have to rewrite anagram(). I'd do it this way:
>
>
> sub anagram {
> my $result = '';
> my $target = shift ( @_ );
> my $t = join "", sort { lc ( $a ) cmp lc ( $b ) }
> map { substr $target, $_, 1 } 0 .. length($target)-1;
> foreach my $word ( @_ ) {
> next if length ( $word ) != length ( $target );
> my $w = join "", sort { lc ( $a ) cmp lc ( $b ) }
> map { substr $word, $_, 1 } 0 .. length($word)-1;
> $result = $w eq $t;
> last if $result
> }
> return $result ? $target : $result;
> }
>
>
>I haven't checked if this sub is doing what you want -- I just replaced
>the split() calls and summarized the return statement. It seems there
>are more things you could improve. I hope I have understood what you
>want -- hence, I tested it this way:
>
>
> my $string = " start ermlgin ";
> my $pattern =
> qr/\s([a-zA-Z]+)s?\W(?(? {anagram($+,'ghost','ghoul','gremlin')})
.*$|(?!))/;
> my( $match ) = $string =~ m/($pattern)/i;
> print "[$match]";
>
>
>For me, this prints out
>
> [ ermlign ]
>
>what is what you want, I thought. Btw, there is no »use re 'eval'« in my
>test script. I hope this could help you.
Yes, I believe you've hit the nail on the head.
>
>regards,
>fabian
Much obliged,
Mike
--
| Systems Specialist: CBE,MSE
Michael T. Davis | Departmental Networking/Computing
http://www.ecr6.ohio-state.edu/~davism/ | The Ohio State University
| 197 Watts, (614) 292-6928
|
|
|
|
|