Home > Archive > PERL Beginners > May 2006 > reg exp speed?
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
|
|
| Alan Campbell 2006-05-19, 6:58 pm |
| hello folks,
I'm slurping in a large file and seeing a nice speedup versus line by line processing...but I'm losing it in my (likely poorly constructed!) reg-expression match
I do: -
#
# look for potentially problematic code of the following form: -
# STW b0, *SP--[3]
# The reg exp tries to match: -
# - anything up until 'ST' (so that we match STH, STW, STDW etc) followed by
# - 1+ non-whitespace chars followed by
# - 0+ whitespace chars followed by
# - 0+ non-whitespace chars followed by
# the string 'B15--' followed by
# anything up until an odd single-digit number followed by
# the ']' character
# Matches all occurrences
#
my @match_sp = $all_lines =~ /.*ST\S+\s*\S*B15--.*[^02468]]/mg;
...and then I foreach on @match_sp to show all occurrences found...
Any speedup tips most welcome. Would also appreciate a brief explanation of why this reg ex is slow (so I dont screw it up next time!)
Many thanks, Alan
---------------------------------
New Yahoo! Messenger with Voice. Call regular phones from your PC and save big.
| |
| Timothy Johnson 2006-05-19, 6:58 pm |
|
One thing you could do is precompile the regex and save it in a variable
ala
my $regex =3D qr/.*ST\S+\s*\S*B15--.*[^02468]]/o;
my @match_sp =3D $all_lines =3D~ /$regex/mg;
Note the 'o' option on the qr//, which tells perl to compile the regex
only once.
I'm not sure if it applies in this case or not, but it's something to
check.
-----Original Message-----
From: Alan Campbell [mailto:alancam73@yahoo.com]=20
Sent: Friday, May 19, 2006 3:13 PM
To: beginners@perl.org
Subject: reg exp speed?
hello folks,
=20
I'm slurping in a large file and seeing a nice speedup versus line by
line processing...but I'm losing it in my (likely poorly constructed!)
reg-expression match
=20
<snip explanation>
my @match_sp =3D $all_lines =3D~ /.*ST\S+\s*\S*B15--.*[^02468]]/mg;
=20
<snip>
| |
| David Romano 2006-05-19, 6:58 pm |
| Hi Alan,
On 5/19/06, Alan Campbell wrote:
> hello folks,
>
> I'm slurping in a large file and seeing a nice speedup versus line by l=
ine processing...but I'm losing it in my (likely poorly constructed!) reg-e=
xpression match
>
> I do: -
> #
> # look for potentially problematic code of the following form: -
> # STW b0, *SP--[3]
> # The reg exp tries to match: -
> # - anything up until 'ST' (so that we match STH, STW, STDW etc) follo=
wed by
> # - 1+ non-whitespace chars followed by
> # - 0+ whitespace chars followed by
> # - 0+ non-whitespace chars followed by
> # the string 'B15--' followed by
> # anything up until an odd single-digit number followed by
> # the ']' character
> # Matches all occurrences
> #
> my @match_sp =3D $all_lines =3D~ /.*ST\S+\s*\S*B15--.*[^02468]]/mg;
>
> ...and then I foreach on @match_sp to show all occurrences found...
>
> Any speedup tips most welcome. Would also appreciate a brief explanatio=
n of why this reg ex is slow (so I dont screw it up next time!)
I think using character classes and nonbacktracking groups can speed
up the .* stuff. Maybe something like the following will help?
/(?:(?>[^S\n]*)S?)*ST\S+\s*\S*B15--(?:(?>[^02468\n]*)[02468]?)*[^02468]\]/m=
g
David
| |
|
| Alan Campbell wrote:
> I'm slurping in a large file and seeing a nice speedup versus line by line processing
I don't know for sure if this strategy will work but I do see a couple
problems with your regex.
> my @match_sp = $all_lines =~ /.*ST\S+\s*\S*B15--.*[^02468]]/mg;
(1) Your m modifier is useless since you don't have a ^ or $ anchor.
Without the anchors, your regex probably does a lot of unnecessary
backtracking on every logical line of your data.
(2) "Odd single-digit" can be matched simply with [13579], while
[^24680] would match letters, puntuations, etc.
| |
| Dr.Ruud 2006-05-19, 9:57 pm |
| Alan Campbell schreef:
> I'm slurping in a large file and seeing a nice speedup
> versus line by line processing...
I would go back to line-by-line again if the regexp should only match
within one line.
> # look for potentially problematic code of the following form: -
> # STW b0, *SP--[3]
> # The reg exp tries to match: -
> # - anything up until 'ST' (so that we match STH, STW, STDW etc)
> followed by # - 1+ non-whitespace chars followed by
> # - 0+ whitespace chars followed by
> # - 0+ non-whitespace chars followed by
> # the string 'B15--' followed by
> # anything up until an odd single-digit number followed by
> # the ']' character
> # Matches all occurrences
> #
> my @match_sp = $all_lines =~ /.*ST\S+\s*\S*B15--.*[^02468]]/mg;
The start isn't specific, make that "^\ *ST"
The next, \S+ could also be [A-Z]+.
The next, \s*, I would make "\ *".
The next , \S*, is ok. You remain in the same line.
The ".*" also remains in the same line, so I don't understand the
/m-modifier.
Is this meant as multi-line, or should the B15 be on the same line as
the STW?
Why is the b0 in lowercase, and the B15 in uppercase?
Is the "]" supposed to be in the charset? I assume no. Otherwise use
"[^]02468]".
while ( <> )
{
/ ^ \ * ST[A-Z]+ \ * \S*,\S* B15-- .* [^02468] \] /xi
and push @match_sp, $_;
}
You could also run a grep on the large file first:
$ grep -i "ST.*B15--" infile > tmpfile
and then run your Perl program with tmpfile.
--
Affijn, Ruud
"Gewoon is een tijger."
| |
| Xicheng Jia 2006-05-20, 3:59 am |
| Alan Campbell wrote:
> hello folks,
>
> I'm slurping in a large file and seeing a nice speedup versus line by line processing...but I'm losing it in my (likely poorly constructed!) reg-expression match
>
> I do: -
> #
> # look for potentially problematic code of the following form: -
> # STW b0, *SP--[3]
> # The reg exp tries to match: -
> # - anything up until 'ST' (so that we match STH, STW, STDW etc) followed by
.*?ST
(for your case, use non-greedy form might be better than the greedy
one)
> # - 1+ non-whitespace chars followed by
> # - 0+ whitespace chars followed by
> # - 0+ non-whitespace chars followed by
\S+\s*\S*
> # the string 'B15--' followed by
> # anything up until an odd single-digit number followed by
> # the ']' character
B15--[^13579]*.\]
you might also want to exclude the newline in the above character-set,
B15--[^13579\n]*.\]
> # Matches all occurrences
> #
> my @match_sp = $all_lines =~ /.*ST\S+\s*\S*B15--.*[^02468]]/mg;
my @match_sp = $all_lines =~ /.*?ST\S+\s*\S*B15--[^13579\n]*.\]/g
(untested)
Xicheng
> ...and then I foreach on @match_sp to show all occurrences found...
>
> Any speedup tips most welcome. Would also appreciate a brief explanation of why this reg ex is slow (so I dont screw it up next time!)
>
> Many thanks, Alan
>
>
>
> ---------------------------------
> New Yahoo! Messenger with Voice. Call regular phones from your PC and save big.
> --0-2010111674-1148076799=:94237--
| |
| John W. Krahn 2006-05-20, 3:59 am |
| Alan Campbell wrote:
> hello folks,
Hello,
> I'm slurping in a large file and seeing a nice speedup versus line by line
> processing...but I'm losing it in my (likely poorly constructed!)
> reg-expression match
>
> I do: -
> #
> # look for potentially problematic code of the following form: -
> # STW b0, *SP--[3]
> # The reg exp tries to match: -
> # - anything up until 'ST' (so that we match STH, STW, STDW etc) followed by
> # - 1+ non-whitespace chars followed by
> # - 0+ whitespace chars followed by
> # - 0+ non-whitespace chars followed by
> # the string 'B15--' followed by
> # anything up until an odd single-digit number followed by
> # the ']' character
> # Matches all occurrences
> #
> my @match_sp = $all_lines =~ /.*ST\S+\s*\S*B15--.*[^02468]]/mg;
>
> ...and then I foreach on @match_sp to show all occurrences found...
>
> Any speedup tips most welcome. Would also appreciate a brief explanation of
> why this reg ex is slow (so I dont screw it up next time!)
Your pattern starts with '.*' and the '*' modifier is greedy so it has to
match as many non-newline characters as possible and then it back-tracks to
match 'ST'. You should use the non-greedy quantifier '*?' so it won't
back-track. Your character class [^02468] does indeed match odd numbers, as
well as every other character not '0', '2', '4', '6' or '8'. You should
probably use the character class [13579] to match _only_ odd numbers. The /m
option is not required because you are not using either '^' or '$' to anchor
the pattern.
my @match_sp = $all_lines =~ /.*?ST\S+\s*\S*B15--.*?\[[13579]]/g;
John
--
use Perl;
program
fulfillment
| |
| Alan Campbell 2006-05-21, 6:59 pm |
| hello folks,
Thanks for all the advice. As several of you suggested, the winning ticket turned out to be flipping to line-by-line regex from an array-slurped input i.e.
# look for potentially problematic dissassembly of the following form: -
# 8004b980 003c34f4 STW.D2T1 A0,*B15--[1]
my @dis_lines = <>;
foreach my $ln (@dis_lines) {
if ($ln =~ m/.*ST[A-Z].*B15--.*[13579]]/) {
print $ln;
}
}
I think I got carried away by last problem I had which *required* a multi-line match & scalar slurp without line-by-line was faster.
Thanks again for all the advice. Much appreciated. FYI the script now runs in 1sec on a 9Mb file, as compared to 3min 30s previously!
cheers, Alan
"John W. Krahn" <krahnj@telus.net> wrote:
Alan Campbell wrote:
> hello folks,
Hello,
> I'm slurping in a large file and seeing a nice speedup versus line by line
> processing...but I'm losing it in my (likely poorly constructed!)
> reg-expression match
>
> I do: -
> #
> # look for potentially problematic code of the following form: -
> # STW b0, *SP--[3]
> # The reg exp tries to match: -
> # - anything up until 'ST' (so that we match STH, STW, STDW etc) followed by
> # - 1+ non-whitespace chars followed by
> # - 0+ whitespace chars followed by
> # - 0+ non-whitespace chars followed by
> # the string 'B15--' followed by
> # anything up until an odd single-digit number followed by
> # the ']' character
> # Matches all occurrences
> #
> my @match_sp = $all_lines =~ /.*ST\S+\s*\S*B15--.*[^02468]]/mg;
>
> ...and then I foreach on @match_sp to show all occurrences found...
>
> Any speedup tips most welcome. Would also appreciate a brief explanation of
> why this reg ex is slow (so I dont screw it up next time!)
Your pattern starts with '.*' and the '*' modifier is greedy so it has to
match as many non-newline characters as possible and then it back-tracks to
match 'ST'. You should use the non-greedy quantifier '*?' so it won't
back-track. Your character class [^02468] does indeed match odd numbers, as
well as every other character not '0', '2', '4', '6' or '8'. You should
probably use the character class [13579] to match _only_ odd numbers. The /m
option is not required because you are not using either '^' or '$' to anchor
the pattern.
my @match_sp = $all_lines =~ /.*?ST\S+\s*\S*B15--.*?\[[13579]]/g;
John
--
use Perl;
program
fulfillment
--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
---------------------------------
How low will we go? Check out Yahoo! Messenger’s low PC-to-Phone call rates.
| |
| Dr.Ruud 2006-05-21, 9:58 pm |
| Alan Campbell schreef:
Please don't top-post.
> # look for potentially problematic dissassembly of the following
> form: - # 8004b980 003c34f4 STW.D2T1 A0,*B15--[1]
> my @dis_lines = <>;
> foreach my $ln (@dis_lines) {
> if ($ln =~ m/.*ST[A-Z].*B15--.*[13579]]/) {
> print $ln;
The start of your regexp is still not as specific as is possible. And
you should add non-greediness, as suggested.
Alternatives:
m/ ST[A-Z].*?,\*B15--\[[13579]]/
m/^[0-9a-f]+ +[0-9a-f]+ +ST[A-Z].*?,\*B15--\[[13579]]/
m/^(?:[[:xdigit:]]+ +){2}ST[A-Z].*?,\*B15--\[[13579]]/
--
Affijn, Ruud
"Gewoon is een tijger."
| |
| John W. Krahn 2006-05-21, 9:58 pm |
| Alan Campbell wrote:
> hello folks,
Hello,
> Thanks for all the advice. As several of you suggested, the winning ticket
> turned out to be flipping to line-by-line regex from an array-slurped
> input i.e.
>
> # look for potentially problematic dissassembly of the following form: -
> # 8004b980 003c34f4 STW.D2T1 A0,*B15--[1]
> my @dis_lines = <>;
> foreach my $ln (@dis_lines) {
> if ($ln =~ m/.*ST[A-Z].*B15--.*[13579]]/) {
> print $ln;
> }
> }
>
> I think I got carried away by last problem I had which *required* a multi-
> line match & scalar slurp without line-by-line was faster.
Because you now have one line in $ln you can remove '.*' at the beginning of
the pattern and you should probably use non-greedy quantifiers for the others:
if ( $ln =~ /ST[A-Z].*?B15--.*?[13579]]/ ) {
John
--
use Perl;
program
fulfillment
|
|
|
|
|