Home > Archive > PERL Beginners > April 2005 > bulky regex










Author bulky regex
Peter Rabbitson

2005-04-27, 3:56 pm

Hello everyone,
This is the first time I was able to get a complex regex actually working as
I expect, but I wanted someone to comment on it, in case I am overlooking
something very major.
I am trying to throw away a certain string out of random text. The
occurrences might be as follows:

Free UPS Ground Shipping <rest of string>
<some of string> free ground shipping !! <rest of string>
<some of string> free UPS ground shipping!!!

and all variations of the above. Here is what I did:

$description =~ s/ #take away free ground shipping text

(?: #non-capturing block for | inclusion
(^) #start of string
| #or
(?<=\S) #lookbehind non-space character
)

\s* #maybe some spaces
free #word 'free'
\s+ #at least one space
(?:ups\s+)? #non-capturing 'ups' with at least one trailing space
ground #'ground'
\s+ #spaces
shipping #'shipping'
\s* #maybe some spaces
!* #maybe some exclamation marks
\s* #maybe some more spaces

(?: #non-capturing for | inclusion
($) #end of string
| #or
(?=\S\s?) #lookahead non-space character maybe followed by a space (I want to keep the space if I am cutting from inside a string)
)

//ixg; #replace with nothing

Seems to be working, but I am afraid it will bite me later. Appreciate any
comments. The reason I placed all the (?: ) is to speed it up at least a
bit, I remember reading somewhere that it matters.

Thanks

Peter
Peter Rabbitson

2005-04-27, 3:56 pm

On Wed, Apr 27, 2005 at 12:16:05PM -0500, Peter Rabbitson wrote:
> $description =~ s/ #take away free ground shipping text
>
> (?: #non-capturing block for | inclusion
> (^) #start of string
> | #or
> (?<=\S) #lookbehind non-space character
> )
>
> \s* #maybe some spaces
> free #word 'free'
> \s+ #at least one space
> (?:ups\s+)? #non-capturing 'ups' with at least one trailing space
> ground #'ground'
> \s+ #spaces
> shipping #'shipping'
> \s* #maybe some spaces
> !* #maybe some exclamation marks
> \s* #maybe some more spaces
>
> (?: #non-capturing for | inclusion
> ($) #end of string
> | #or
> (?=\S\s?) #lookahead non-space character maybe followed by a space (I want to keep the space if I am cutting from inside a string)
> )
>
> //ixg; #replace with nothing


Oops. (?=\S\s?) above should be (?=\s\S): if the match is not at the end of
the string, I am guaranteed at least a single space. Sorry about that.
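
For reference, here is a minimal runnable sketch of the pattern from the first post with the corrected lookahead substituted in. The helper name strip_shipping_v1 and the three sample strings are assumptions made up for illustration; only the pattern itself comes from the thread.

```perl
use strict;
use warnings;

# strip_shipping_v1 is a hypothetical helper name used only for this sketch.
sub strip_shipping_v1 {
    my ($description) = @_;
    $description =~ s/
        (?: (^) | (?<=\S) )    # start of string, or right after a non-space
        \s* free \s+ (?:ups\s+)? ground \s+ shipping
        \s* !* \s*
        (?: ($) | (?=\s\S) )   # end of string, or a space followed by text
    //ixg;
    return $description;
}

# Sample inputs are invented for illustration.
for my $text (
    'Free UPS Ground Shipping on all orders',
    'Great deal free ground shipping !! order now',
    'Great deal free UPS ground shipping!!!',
) {
    printf "[%s]\n", strip_shipping_v1($text);
}
```

With these inputs the phrase is removed in all three positions. Note that when the phrase is at the very start, the space that followed it is left in place, so the result begins with a space.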
Jay Savage

2005-04-27, 3:56 pm

On 4/27/05, Peter Rabbitson <rabbit@rabbit.us> wrote:
> Hello everyone,
> This is the first time I was able to get a complex regex actually working as
> I expect, but I wanted someone to comment on it, in case I am overlooking
> something very major.
> I am trying to throw away a certain string out of random text. The
> occurrences might be as follows:
>
> Free UPS Ground Shipping <rest of string>
> <some of string> free ground shipping !! <rest of string>
> <some of string> free UPS ground shipping!!!
>
> and all variations of the above. Here is what I did:
>
> $description =~ s/ #take away free ground shipping text
>
> (?: #non-capturing block for | inclusion
> (^) #start of string
> | #or
> (?<=\S) #lookbehind non-space character
> )
>
> \s* #maybe some spaces
> free #word 'free'
> \s+ #at least one space
> (?:ups\s+)? #non-capturing 'ups' with at least one trailing space
> ground #'ground'
> \s+ #spaces
> shipping #'shipping'
> \s* #maybe some spaces
> !* #maybe some exclamation marks
> \s* #maybe some more spaces
>
> (?: #non-capturing for | inclusion
> ($) #end of string
> | #or
> (?=\S\s?) #lookahead non-space character maybe followed by a space (I want to keep the space if I am cutting from inside a string)
> )
>
> //ixg; #replace with nothing
>
> Seems to be working, but I am afraid it will bite me later. Appreciate any
> comments. The reason I placed all the (?: ) is to speed it up at least a
> bit, I remember reading somewhere that it matters.
>
> Thanks
>
> Peter
>


Peter,

Don't make things so complicated. You want to find some words and
replace them with nothing. You don't care where in the string the
pattern appears. Therefore, you don't have to predict where in the
string it might possibly appear. As long as what you're looking for
is contiguous, it doesn't matter where it occurs in relation to what
you *aren't* looking for. You don't have to write a regex for the
entire line; that's what makes regex so powerful: it does the looking
for you. Also, ! is only special in a regex inside (?!...) and
(?<!...). It works bare here, but escaping it or putting it in a
class does no harm.

$description =~ s/free ups ground shipping[!]*//i; # or /ig if needed

should work just fine. Write your patterns to find what it is you're
looking for, not to find what it is you're *not* looking for.

HTH,

--jay
Peter Rabbitson

2005-04-27, 3:56 pm

On Wed, Apr 27, 2005 at 01:31:08PM -0400, Jay Savage wrote:
>
> Don't make things so complicated. You want to find some words and
> replace them with nothing. You don't care where in the string the
> pattern appears. Therefore, you don't have to predict where in the


The word 'ups' is not mandatory - it might be there, might not. Also the
number of spaces in between is not fixed. However, what you said about me not
caring where the string is - you are right. I dropped (?:(^)|(?<=\S)) from
the front, and it produces identical results. Thanks!
Jay Savage

2005-04-27, 3:57 pm

On 4/27/05, Peter Rabbitson <rabbit@rabbit.us> wrote:
> On Wed, Apr 27, 2005 at 01:31:08PM -0400, Jay Savage wrote:
>
> The word 'ups' is not mandatory - it might be there, might not. Also the
> number of spaces in between is not fixed. However, what you said about me not
> caring where the string is - you are right. I dropped (?:(^)|(?<=\S)) from
> the front, and it produces identical results. Thanks!
>


You can drop the stuff from the end, too. If 'ups' is optional and
the spacing is variable, then of course handle that with *

$description =~
s/\s*free\s+(?:ups)*\s*ground\s+shipping\s*[!]*\s*//i; # or /ig if needed

Rearrange to suit. But the important thing here is to go for what you
need to replace, and not what you don't.

HTH,

--jay
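
A minimal runnable sketch of the simplified, unanchored substitution, assuming 'ups' is optional and the spacing varies. It uses (?:ups\s+)? in place of the (?:ups)*\s* form, which is a swapped-in variant, not the exact pattern from the post. The helper name strip_shipping_simple and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# strip_shipping_simple is a hypothetical helper name used only for this sketch.
sub strip_shipping_simple {
    my ($description) = @_;
    # No anchors or lookarounds: match the phrase wherever it occurs,
    # together with any whitespace and exclamation marks around it.
    $description =~ s/\s*free\s+(?:ups\s+)?ground\s+shipping\s*!*\s*//ig;
    return $description;
}

# Sample inputs are invented for illustration.
for my $text (
    'Free UPS Ground Shipping on all orders',
    'Great deal free ground shipping !! order now',
    'Great deal free UPS ground shipping!!!',
) {
    printf "[%s]\n", strip_shipping_simple($text);
}
```

Because the pattern consumes the whitespace on both sides, a phrase removed from the middle of a string leaves the neighbouring words joined ("dealorder" in the second sample).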
Peter Rabbitson

2005-04-27, 8:56 pm

On Wed, Apr 27, 2005 at 01:50:57PM -0400, Jay Savage wrote:
> You can drop the stuff from the end, too. If 'ups' is optional and
> the spacing is variable, then of course handle that with *
>
> $description =~
> s/\s*free\s+(?:ups)*\s*ground\s+shipping\s*[!]*\s*//i; # or /ig if needed
>
> Rearrange to suit. But the important thing here is to go for what you
> need to replace, and not what you don't.



Mmm... as I wrote in the comments in the very first e-mail:
> blablabla...I want to keep the space if I am cutting from inside a string

Is there a way to do this without the lookahead? Yes, I can replace with / /,
but then I am introducing more spaces than there were if I am at the
beginning or at the end...? Excuse my curiosity :)
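
One possible way to do this without a lookahead, offered as a sketch and not something proposed in the thread: replace each occurrence with a single space, then trim whatever is left at either end. The helper name strip_shipping_keep_space is hypothetical, and this assumes it is acceptable to also lose any leading or trailing whitespace the original string had.

```perl
use strict;
use warnings;

# strip_shipping_keep_space is a hypothetical helper name used only for this sketch.
sub strip_shipping_keep_space {
    my ($description) = @_;
    # Step 1: swap the phrase and its surrounding whitespace for one space,
    # so words on either side of a mid-string match stay separated.
    $description =~ s/\s*free\s+(?:ups\s+)?ground\s+shipping\s*!*\s*/ /ig;
    # Step 2: trim the stray space left when the phrase was at the start or end.
    # Assumption: pre-existing leading or trailing whitespace may be dropped too.
    $description =~ s/^\s+|\s+$//g;
    return $description;
}

# Sample inputs are invented for illustration.
for my $text (
    'Free UPS Ground Shipping on all orders',
    'Great deal free ground shipping !! order now',
    'Great deal free UPS ground shipping!!!',
) {
    printf "[%s]\n", strip_shipping_keep_space($text);
}
```

The trade-off is a second pass over the string in exchange for a pattern with no lookarounds.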