For Programmers: Free Programming Magazines  


Home > Archive > PERL Beginners > May 2006 > reg exp speed?









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author reg exp speed?
Alan Campbell

2006-05-19, 6:58 pm

hello folks,

I'm slurping in a large file and seeing a nice speedup versus line by line processing...but I'm losing it in my (likely poorly constructed!) reg-expression match

I do: -
#
# look for potentially problematic code of the following form: -
# STW b0, *SP--[3]
# The reg exp tries to match: -
# - anything up until 'ST' (so that we match STH, STW, STDW etc) followed by
# - 1+ non-whitespace chars followed by
# - 0+ whitespace chars followed by
# - 0+ non-whitespace chars followed by
# the string 'B15--' followed by
# anything up until an odd single-digit number followed by
# the ']' character
# Matches all occurrences
#
my @match_sp = $all_lines =~ /.*ST\S+\s*\S*B15--.*[^02468]]/mg;

...and then I foreach on @match_sp to show all occurrences found...

Any speedup tips most welcome. Would also appreciate a brief explanation of why this reg ex is slow (so I dont screw it up next time!)

Many thanks, Alan



---------------------------------
New Yahoo! Messenger with Voice. Call regular phones from your PC and save big.
Timothy Johnson

2006-05-19, 6:58 pm


One thing you could do is precompile the regex and save it in a variable
ala

my $regex =3D qr/.*ST\S+\s*\S*B15--.*[^02468]]/o;
my @match_sp =3D $all_lines =3D~ /$regex/mg;

Note the 'o' option on the qr//, which tells perl to compile the regex
only once.

I'm not sure if it applies in this case or not, but it's something to
check.



-----Original Message-----
From: Alan Campbell [mailto:alancam73@yahoo.com]=20
Sent: Friday, May 19, 2006 3:13 PM
To: beginners@perl.org
Subject: reg exp speed?

hello folks,
=20
I'm slurping in a large file and seeing a nice speedup versus line by
line processing...but I'm losing it in my (likely poorly constructed!)
reg-expression match
=20
<snip explanation>

my @match_sp =3D $all_lines =3D~ /.*ST\S+\s*\S*B15--.*[^02468]]/mg;
=20
<snip>

David Romano

2006-05-19, 6:58 pm

Hi Alan,
On 5/19/06, Alan Campbell wrote:
> hello folks,
>
> I'm slurping in a large file and seeing a nice speedup versus line by l=

ine processing...but I'm losing it in my (likely poorly constructed!) reg-e=
xpression match
>
> I do: -
> #
> # look for potentially problematic code of the following form: -
> # STW b0, *SP--[3]
> # The reg exp tries to match: -
> # - anything up until 'ST' (so that we match STH, STW, STDW etc) follo=

wed by
> # - 1+ non-whitespace chars followed by
> # - 0+ whitespace chars followed by
> # - 0+ non-whitespace chars followed by
> # the string 'B15--' followed by
> # anything up until an odd single-digit number followed by
> # the ']' character
> # Matches all occurrences
> #
> my @match_sp =3D $all_lines =3D~ /.*ST\S+\s*\S*B15--.*[^02468]]/mg;
>
> ...and then I foreach on @match_sp to show all occurrences found...
>
> Any speedup tips most welcome. Would also appreciate a brief explanatio=

n of why this reg ex is slow (so I dont screw it up next time!)
I think using character classes and nonbacktracking groups can speed
up the .* stuff. Maybe something like the following will help?
/(?:(?>[^S\n]*)S?)*ST\S+\s*\S*B15--(?:(?>[^02468\n]*)[02468]?)*[^02468]\]/m=
g

David
MSG

2006-05-19, 6:58 pm

Alan Campbell wrote:
> I'm slurping in a large file and seeing a nice speedup versus line by line processing

I don't know for sure if this strategy will work but I do see a couple
problems with your regex.

> my @match_sp = $all_lines =~ /.*ST\S+\s*\S*B15--.*[^02468]]/mg;

(1) Your m modifier is useless since you don't have a ^ or $ anchor.
Without the anchors, your regex probably does a lot of unnecessary
backtracking on every logical line of your data.
(2) "Odd single-digit" can be matched simply with [13579], while
[^24680] would match letters, puntuations, etc.

Dr.Ruud

2006-05-19, 9:57 pm

Alan Campbell schreef:

> I'm slurping in a large file and seeing a nice speedup
> versus line by line processing...


I would go back to line-by-line again if the regexp should only match
within one line.


> # look for potentially problematic code of the following form: -
> # STW b0, *SP--[3]
> # The reg exp tries to match: -
> # - anything up until 'ST' (so that we match STH, STW, STDW etc)
> followed by # - 1+ non-whitespace chars followed by
> # - 0+ whitespace chars followed by
> # - 0+ non-whitespace chars followed by
> # the string 'B15--' followed by
> # anything up until an odd single-digit number followed by
> # the ']' character
> # Matches all occurrences
> #
> my @match_sp = $all_lines =~ /.*ST\S+\s*\S*B15--.*[^02468]]/mg;


The start isn't specific, make that "^\ *ST"
The next, \S+ could also be [A-Z]+.
The next, \s*, I would make "\ *".
The next , \S*, is ok. You remain in the same line.
The ".*" also remains in the same line, so I don't understand the
/m-modifier.

Is this meant as multi-line, or should the B15 be on the same line as
the STW?
Why is the b0 in lowercase, and the B15 in uppercase?
Is the "]" supposed to be in the charset? I assume no. Otherwise use
"[^]02468]".

while ( <> )
{
/ ^ \ * ST[A-Z]+ \ * \S*,\S* B15-- .* [^02468] \] /xi
and push @match_sp, $_;
}


You could also run a grep on the large file first:

$ grep -i "ST.*B15--" infile > tmpfile

and then run your Perl program with tmpfile.

--
Affijn, Ruud

"Gewoon is een tijger."


Xicheng Jia

2006-05-20, 3:59 am

Alan Campbell wrote:
> hello folks,
>
> I'm slurping in a large file and seeing a nice speedup versus line by line processing...but I'm losing it in my (likely poorly constructed!) reg-expression match
>
> I do: -
> #
> # look for potentially problematic code of the following form: -
> # STW b0, *SP--[3]
> # The reg exp tries to match: -


> # - anything up until 'ST' (so that we match STH, STW, STDW etc) followed by


.*?ST

(for your case, use non-greedy form might be better than the greedy
one)

> # - 1+ non-whitespace chars followed by
> # - 0+ whitespace chars followed by
> # - 0+ non-whitespace chars followed by


\S+\s*\S*

> # the string 'B15--' followed by
> # anything up until an odd single-digit number followed by
> # the ']' character


B15--[^13579]*.\]

you might also want to exclude the newline in the above character-set,

B15--[^13579\n]*.\]

> # Matches all occurrences
> #
> my @match_sp = $all_lines =~ /.*ST\S+\s*\S*B15--.*[^02468]]/mg;


my @match_sp = $all_lines =~ /.*?ST\S+\s*\S*B15--[^13579\n]*.\]/g

(untested)

Xicheng

> ...and then I foreach on @match_sp to show all occurrences found...
>
> Any speedup tips most welcome. Would also appreciate a brief explanation of why this reg ex is slow (so I dont screw it up next time!)
>
> Many thanks, Alan
>
>
>
> ---------------------------------
> New Yahoo! Messenger with Voice. Call regular phones from your PC and save big.
> --0-2010111674-1148076799=:94237--


John W. Krahn

2006-05-20, 3:59 am

Alan Campbell wrote:
> hello folks,


Hello,

> I'm slurping in a large file and seeing a nice speedup versus line by line
> processing...but I'm losing it in my (likely poorly constructed!)
> reg-expression match
>
> I do: -
> #
> # look for potentially problematic code of the following form: -
> # STW b0, *SP--[3]
> # The reg exp tries to match: -
> # - anything up until 'ST' (so that we match STH, STW, STDW etc) followed by
> # - 1+ non-whitespace chars followed by
> # - 0+ whitespace chars followed by
> # - 0+ non-whitespace chars followed by
> # the string 'B15--' followed by
> # anything up until an odd single-digit number followed by
> # the ']' character
> # Matches all occurrences
> #
> my @match_sp = $all_lines =~ /.*ST\S+\s*\S*B15--.*[^02468]]/mg;
>
> ...and then I foreach on @match_sp to show all occurrences found...
>
> Any speedup tips most welcome. Would also appreciate a brief explanation of
> why this reg ex is slow (so I dont screw it up next time!)


Your pattern starts with '.*' and the '*' modifier is greedy so it has to
match as many non-newline characters as possible and then it back-tracks to
match 'ST'. You should use the non-greedy quantifier '*?' so it won't
back-track. Your character class [^02468] does indeed match odd numbers, as
well as every other character not '0', '2', '4', '6' or '8'. You should
probably use the character class [13579] to match _only_ odd numbers. The /m
option is not required because you are not using either '^' or '$' to anchor
the pattern.

my @match_sp = $all_lines =~ /.*?ST\S+\s*\S*B15--.*?\[[13579]]/g;



John
--
use Perl;
program
fulfillment
Alan Campbell

2006-05-21, 6:59 pm

hello folks,

Thanks for all the advice. As several of you suggested, the winning ticket turned out to be flipping to line-by-line regex from an array-slurped input i.e.

# look for potentially problematic dissassembly of the following form: -
# 8004b980 003c34f4 STW.D2T1 A0,*B15--[1]
my @dis_lines = <>;
foreach my $ln (@dis_lines) {
if ($ln =~ m/.*ST[A-Z].*B15--.*[13579]]/) {
print $ln;
}
}

I think I got carried away by last problem I had which *required* a multi-line match & scalar slurp without line-by-line was faster.

Thanks again for all the advice. Much appreciated. FYI the script now runs in 1sec on a 9Mb file, as compared to 3min 30s previously!

cheers, Alan

"John W. Krahn" <krahnj@telus.net> wrote:
Alan Campbell wrote:
> hello folks,


Hello,

> I'm slurping in a large file and seeing a nice speedup versus line by line
> processing...but I'm losing it in my (likely poorly constructed!)
> reg-expression match
>
> I do: -
> #
> # look for potentially problematic code of the following form: -
> # STW b0, *SP--[3]
> # The reg exp tries to match: -
> # - anything up until 'ST' (so that we match STH, STW, STDW etc) followed by
> # - 1+ non-whitespace chars followed by
> # - 0+ whitespace chars followed by
> # - 0+ non-whitespace chars followed by
> # the string 'B15--' followed by
> # anything up until an odd single-digit number followed by
> # the ']' character
> # Matches all occurrences
> #
> my @match_sp = $all_lines =~ /.*ST\S+\s*\S*B15--.*[^02468]]/mg;
>
> ...and then I foreach on @match_sp to show all occurrences found...
>
> Any speedup tips most welcome. Would also appreciate a brief explanation of
> why this reg ex is slow (so I dont screw it up next time!)


Your pattern starts with '.*' and the '*' modifier is greedy so it has to
match as many non-newline characters as possible and then it back-tracks to
match 'ST'. You should use the non-greedy quantifier '*?' so it won't
back-track. Your character class [^02468] does indeed match odd numbers, as
well as every other character not '0', '2', '4', '6' or '8'. You should
probably use the character class [13579] to match _only_ odd numbers. The /m
option is not required because you are not using either '^' or '$' to anchor
the pattern.

my @match_sp = $all_lines =~ /.*?ST\S+\s*\S*B15--.*?\[[13579]]/g;



John
--
use Perl;
program
fulfillment

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org






---------------------------------
How low will we go? Check out Yahoo! Messenger’s low PC-to-Phone call rates.
Dr.Ruud

2006-05-21, 9:58 pm

Alan Campbell schreef:


Please don't top-post.

> # look for potentially problematic dissassembly of the following
> form: - # 8004b980 003c34f4 STW.D2T1 A0,*B15--[1]
> my @dis_lines = <>;
> foreach my $ln (@dis_lines) {
> if ($ln =~ m/.*ST[A-Z].*B15--.*[13579]]/) {
> print $ln;


The start of your regexp is still not as specific as is possible. And
you should add non-greediness, as suggested.

Alternatives:

m/ ST[A-Z].*?,\*B15--\[[13579]]/

m/^[0-9a-f]+ +[0-9a-f]+ +ST[A-Z].*?,\*B15--\[[13579]]/

m/^(?:[[:xdigit:]]+ +){2}ST[A-Z].*?,\*B15--\[[13579]]/

--
Affijn, Ruud

"Gewoon is een tijger."


John W. Krahn

2006-05-21, 9:58 pm

Alan Campbell wrote:
> hello folks,


Hello,

> Thanks for all the advice. As several of you suggested, the winning ticket
> turned out to be flipping to line-by-line regex from an array-slurped
> input i.e.
>
> # look for potentially problematic dissassembly of the following form: -
> # 8004b980 003c34f4 STW.D2T1 A0,*B15--[1]
> my @dis_lines = <>;
> foreach my $ln (@dis_lines) {
> if ($ln =~ m/.*ST[A-Z].*B15--.*[13579]]/) {
> print $ln;
> }
> }
>
> I think I got carried away by last problem I had which *required* a multi-
> line match & scalar slurp without line-by-line was faster.


Because you now have one line in $ln you can remove '.*' at the beginning of
the pattern and you should probably use non-greedy quantifiers for the others:

if ( $ln =~ /ST[A-Z].*?B15--.*?[13579]]/ ) {


John
--
use Perl;
program
fulfillment
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com