Author How would I create a Regular Expression to check
Nathan

2008-01-03, 7:06 pm

How would I create a Regular Expression to check Street address for
any of the below items:

If the first character is a P ...
p.o. box
po box
po. box
p.o box
post office box
POB
POX
PODRAWER
POSTOFFICE
PO BX
POBOX
P/O

If the first character is a B ...
BX
BOX
Buzon -- (Means 'Box' in Spanish)

If the first character is an A ...
Apartado -- (is 'PO Box' in Spanish)
Aptdo -- (is POB abbreviated in Spanish)



Thanks,
Nathan
Christian Winter

2008-01-03, 7:06 pm

Nathan wrote:
> How would I create a Regular Expression to check Street address for
> any of the below items:
>
> If the first character is a P ...
> p.o. box
> po box
> po. box
> p.o box
> post office box
> POB
> POX
> PODRAWER
> POSTOFFICE
> PO BX
> POBOX
> P/O
>
> If the first character is a B ...
> BX
> BOX
> Buzon -- (Means 'Box' in Spanish)
>
> If the first character is a A ...
> Apartado -- (is 'PO Box in Spanish)
> Aptdo -- (is POB abbreviated in Spanish)


The short answer: you can't. At least not one single, reasonably
short regex that can cover it in one go. I'd simply iterate
over all the possibilities and compare each one to the street address,
like:

my @potokens = ("p.o. box", "po box", "po. box", "P/O", "etc.");
my $streetaddr = "po box 12345";

# \Q...\E quotes regex metacharacters such as "." and "/" in each token
if( grep { $streetaddr =~ /^\Q$_\E/ } @potokens )
{
    print "Is a PO address!" . $/;
}

# or completely without regex:
foreach( @potokens )
{
    if( substr( $streetaddr, 0, length($_) ) eq $_ )
    {
        print "Is a PO address!" . $/;
        last; # stop at the first matching token
    }
}

-Chris
Ted Zlatanov

2008-01-03, 7:06 pm

On Thu, 03 Jan 2008 17:22:00 +0100 Christian Winter <thepoet_nospam@arcor.de> wrote:

CW> Nathan wrote:

CW> The short answer: you can't. At least not one single, reasonably
CW> short regex that can cover it in one go. I'd simply iterate
CW> over all the possibilities and compare each one to the street address,
CW> like:

An alternate approach is to use Parse::RecDescent. It's really good in
my experience for parsing this kind of disparate input, and will
organize it for you (so you can tell that the street address was in
Spanish, for example).

Ted
jjcassidy@gmail.com

2008-01-03, 7:06 pm

On Jan 3, 10:17 am, Nathan <nathan.stanf...@gmail.com> wrote:
> How would I create a Regular Expression to check Street address for
> any of the below items:
>
> If the first character is a P ...
> p.o. box
> po box
> po. box
> p.o box
> post office box
> POB
> POX
> PODRAWER
> POSTOFFICE
> PO BX
> POBOX
> P/O
>
> If the first character is a B ...
> BX
> BOX
> Buzon -- (Means 'Box' in Spanish)
>
> If the first character is a A ...
> Apartado -- (is 'PO Box in Spanish)
> Aptdo -- (is POB abbreviated in Spanish)
>
> Thanks,
> Nathan


It feels like I'm doing your homework, but here:

(Ap(?>artado|tdo)|B(?>O?X|uzon)|p(?>\.?o\.?|ost office) box|P(?>\/O|O(?:B|X|DRAWER|STOFFICE|[ ]BX|BOX)))

It's just simple decomposition.
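
The same decomposition can be laid out with the /x and /i modifiers so that each alternative sits on its own line. This is only a sketch: the name $po_box_re is hypothetical, the leading-whitespace allowance and the grouping of alternatives are assumptions, and it deliberately matches case-insensitively, which the pattern above does not.

```perl
use strict;
use warnings;

# Hypothetical pattern name; the alternatives come from the list in the
# original post. /x allows whitespace and comments, /i ignores case.
my $po_box_re = qr{
    ^ \s*                                 # start of string, optional blanks
    (?:
        p \.? \s* o \.? \s* box           # p.o. box, po box, po. box, p.o box, pobox
      | post \s* office (?: \s* box )?    # post office box, postoffice
      | po \s* (?: b | x | drawer )       # pob, po bx, pox, podrawer
      | p / o                             # p/o
      | b (?: o? x | uzon )               # bx, box, buzon
      | ap (?: artado | tdo )             # apartado, aptdo
    )
}xi;

for my $addr ('P.O. Box 12345', 'Aptdo 77', '12 Main Street') {
    my $verdict = $addr =~ $po_box_re ? 'looks like a PO box' : 'no match';
    print "$addr: $verdict\n";
}
```

Because this is a pure prefix test, an address such as "Boxwood Lane" would also match; whether that matters depends on the data being checked.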
Nathan

2008-01-07, 7:10 pm

You did not do my homework but thanks... I will try yours as well...

Here is what I came up with but I like yours better I might try yours
instead of mine....

^([Pp]([Oo][Ss][Tt])?[.\s]*[Oo]([Ff][Ff][Ii][Cc][Ee])?[.\s]*[Bb][Oo][Xx])|[Pp][Oo]([Bb]|[Xx]|[Dd][Rr][Aa][Ww][Ee][Rr]|[Ss][Tt][Oo][Ff][Ff][Ii][Cc][Ee]|[ ][Bb][Xx]|[Bb][Oo][Xx])|[Pp][/][Oo]|[Bb]([Xx]|[Oo][Xx]|[Uu][Zz][Oo][Nn])|[Aa]([Pp][Aa][Rr][Tt][Aa][Dd][Oo]|[Pp][Tt][Dd][Oo])



On Jan 3, 5:26 pm, jjcass...@gmail.com wrote:
> On Jan 3, 10:17 am, Nathan <nathan.stanf...@gmail.com> wrote:
>
> It feels like I'm doing your homework, but here:
>
> (Ap(?>artado|tdo)|B(?>O?X|uzon)|p(?>\.?o\.?|ost office) box|P(?>\/O|O(?:B|X|DRAWER|STOFFICE|[ ]BX|BOX)))
>
> It's just simple decomposition.


J. Gleixner

2008-01-07, 7:10 pm

Nathan wrote:
> You did not do my homework but thanks... I will try yours as well...
>
> Here is what I came up with but I like yours better I might try yours
> instead of mine....
>
> ^([Pp]([Oo][Ss][Tt])?[.\s]*[Oo]([Ff][Ff][Ii][Cc][Ee])?[.\s]*[Bb][Oo][Xx])|[Pp][Oo]([Bb]|[Xx]|[Dd][Rr][Aa][Ww][Ee][Rr]|[Ss][Tt][Oo][Ff][Ff][Ii][Cc][Ee]|[ ][Bb][Xx]|[Bb][Oo][Xx])|[Pp][/][Oo]|[Bb]([Xx]|[Oo][Xx]|[Uu][Zz][Oo][Nn])|[Aa]([Pp][Aa][Rr][Tt][Aa][Dd][Oo]|[Pp][Tt][Dd][Oo])


Ever hear of case-insensitive pattern matching?

perldoc perlop

Search for "m/PATTERN/cgimosx".
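
As a small illustration of that hint, the following sketch compares one hand-spelled alternative with its /i equivalent. The variable names and the sample addresses are made up for the example; only the "pobox" token comes from the thread.

```perl
use strict;
use warnings;

# Hypothetical names: one pattern spells out both cases by hand,
# the other relies on the /i modifier.
my $spelled_out = qr/^[Pp][Oo][Bb][Oo][Xx]/;
my $with_flag   = qr/^pobox/i;

for my $addr ('POBOX 7', 'PoBox 7', 'pobox 7', 'Main St 7') {
    my $by_hand = $addr =~ $spelled_out ? 1 : 0;
    my $by_flag = $addr =~ $with_flag   ? 1 : 0;
    print "$addr: spelled-out=$by_hand, /i=$by_flag\n";
}
```

Both patterns should agree on every input, so the /i form can replace the bracketed one without changing what matches.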
Uri Guttman

2008-01-07, 7:10 pm

>>>>> "JG" == J Gleixner <glex_no-spam@qwest-spam-no.invalid> writes:

JG> Nathan wrote:

JG> Ever hear of case-insensitive pattern matching?

JG> perldoc perlop

beyond that, note the [.\s] which is just . with the /s modifier. and it
has * after it which may not be correct (or just slower than +). [/] is
noisy and will break it unless alternate delimiters are used. beyond
that it is impossible to read (and /i will help there). and the way the
words are jammed together makes no sense or is impossible to parse out
visually. altogether a most horrible regex. i will copy it for training
purposes. i don't expect its author to claim this is proprietary code
just out of embarrassment. :)

uri

--
Uri Guttman ------ uri@stemsystems.com -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
Peter J. Holzer

2008-01-08, 4:20 am

On 2008-01-07 22:01, Uri Guttman <uri@stemsystems.com> wrote:
>
> JG> Nathan wrote:
>
> JG> Ever hear of case-insensitive pattern matching?
>
> JG> perldoc perlop
>
> beyond that, note the [.\s] which is just . with the /s modifier.


What? A "." in a character class matches only a ".", but a \s still
matches any whitespace character, so [.\s] matches a "." or a whitespace
character. A /s modifier won't change its meaning.
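
A quick way to check this is the sketch below. The variable names are invented for the example, and \A and \z are used so that each test string must be exactly one character long.

```perl
use strict;
use warnings;

# Hypothetical names. Inside a character class "." is a literal dot.
my $class        = qr/\A[.\s]\z/;    # a dot or one whitespace character
my $class_s      = qr/\A[.\s]\z/s;   # /s makes no difference here
my $bare_dot     = qr/\A.\z/;        # any character except newline
my $bare_dot_s   = qr/\A.\z/s;       # any character, newline included

my %samples = (dot => '.', space => ' ', letter => 'x', newline => "\n");
for my $name (sort keys %samples) {
    my $s = $samples{$name};
    printf "%-8s class=%d class/s=%d dot=%d dot/s=%d\n", $name,
        ($s =~ $class ? 1 : 0), ($s =~ $class_s ? 1 : 0),
        ($s =~ $bare_dot ? 1 : 0), ($s =~ $bare_dot_s ? 1 : 0);
}
```

The two character-class columns stay identical for every sample, while the bare "." only picks up the newline once /s is added.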

hp
Jürgen Exner

2008-01-08, 8:06 am

[Please do not top-post, trying to correct]
Nathan <nathan.stanford@gmail.com> wrote:
>Here is what I came up with but I like yours better I might try yours
>instead of mine....
>
>^([Pp]([Oo][Ss][Tt])?[.\s]*[Oo]([Ff][Ff][Ii][Cc][Ee])?[.\s]*[Bb][Oo][Xx])|[Pp][Oo]([Bb]|[Xx]|[Dd][Rr][Aa][Ww][Ee][Rr]|[Ss][Tt][Oo][Ff][Ff][Ii][Cc][Ee]|[ ][Bb][Xx]|[Bb][Oo][Xx])|[Pp][/][Oo]|[Bb]([Xx]|[Oo][Xx]|[Uu][Zz][Oo][Nn])|[Aa]([Pp][Aa][Rr][Tt][Aa][Dd][Oo]|[Pp][Tt][Dd][Oo])


Sorry, but that's a great example of what not to do. Absolutely
unmaintainable. Within 4 weeks you will have no idea what that RE does and
how to modify it if you need to add another term.

IMO regular expressions are the wrong tool for the job. Far better would be
to put those terms in a hash (as keys), then extract the street name from
your address, and simply check if this street name exists() in the hash.
Or put the terms in an array and just loop through them.

Maybe that's not as smart as an RE approach, but it's much more intelligent.
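
One possible reading of the hash suggestion is sketched below. How the leading term is extracted from the address is not specified in the thread, so the normalisation here (lower-casing, dropping periods, comparing the first one to three words) is an assumption, and the names %po_term and is_po_address are hypothetical.

```perl
use strict;
use warnings;

# Terms from the original post, stored lower-case and without periods.
my %po_term = map { $_ => 1 } (
    'po box', 'post office box', 'pob', 'pox', 'podrawer', 'postoffice',
    'po bx', 'pobox', 'p/o', 'bx', 'box', 'buzon', 'apartado', 'aptdo',
);

# Returns 1 if the address starts with one of the known terms, else 0.
sub is_po_address {
    my ($addr) = @_;
    my $norm = lc $addr;
    $norm =~ s/\.//g;            # "p.o. box" becomes "po box"
    $norm =~ s/^\s+|\s+$//g;     # trim leading and trailing blanks
    my @words = split /\s+/, $norm;

    # Try the longest candidate first: three words, then two, then one.
    for my $n (reverse 1 .. 3) {
        next if $n > @words;
        return 1 if exists $po_term{ join ' ', @words[0 .. $n - 1] };
    }
    return 0;
}

for my $addr ('P.O. Box 123', 'Post Office Box 9', 'Boxwood Lane 4') {
    print "$addr: ", (is_po_address($addr) ? 'PO address' : 'street address'), "\n";
}
```

Adding another term means adding one string to the list. Because whole words are compared, "Boxwood Lane" is not flagged, but a term glued directly to a number, such as "POB123", is not recognised either.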

jue
Ted Zlatanov

2008-01-08, 7:10 pm

On Mon, 7 Jan 2008 13:28:00 -0800 (PST) Nathan <nathan.stanford@gmail.com> wrote:

N> You did not do my homework but thanks... I will try yours as well...
N> Here is what I came up with but I like yours better I might try yours
N> instead of mine....

N> ^([Pp]([Oo][Ss][Tt])?[.\s]*[Oo]([Ff][Ff][Ii][Cc][Ee])?[.\s]*[Bb][Oo][Xx])|[Pp][Oo]([Bb]|[Xx]|[Dd][Rr][Aa][Ww][Ee][Rr]|[Ss][Tt][Oo][Ff][Ff][Ii][Cc][Ee]|[ ][Bb][Xx]|[Bb][Oo][Xx])|[Pp][/][Oo]|[Bb]([Xx]|[Oo][Xx]|[Uu][Zz][Oo][Nn])|[Aa]([Pp][Aa][Rr][Tt][Aa][Dd][Oo]|[Pp][Tt][Dd][Oo])

Good god, doesn't this bother you even a little bit? You should at
least submit it to the Daily WTF.

Ted
David Combs

2008-01-31, 8:41 am

In article <86lk76vily.fsf@lifelogs.com>,
Ted Zlatanov <tzz@lifelogs.com> wrote:
>On Thu, 03 Jan 2008 17:22:00 +0100 Christian Winter <thepoet_nospam@arcor.de> wrote:
>
>CW> Nathan wrote:
>
>CW> The short answer: you can't. At least not one single, reasonably
>CW> short regex that can cover it in one go. I'd simply iterate
>CW> over all the possibilities and compare each one to the street address,
>CW> like:
>
>An alternate approach is to use Parse::RecDescent. It's really good in
>my experience for parsing this kind of disparate input, and will
>organize it for you (so you can tell that the street address was in
>Spanish, for example).
>
>Ted


A late response/request. *If* you find doing that pretty easy and
quick to do, *please* show us how you'd do it.

I've read the doc on it, and come away with neither facility nor understanding
for actually being able to use it in a real problem.

THANKS MUCH (from all of us?)

David


David Combs

2008-01-31, 8:41 am

In article <47829ea8$0$3575$815e3792@news.qwest.net>,
J. Gleixner <glex_no-spam@qwest-spam-no.invalid> wrote:
>Nathan wrote:
>
>Ever hear of case-insensitive pattern matching?


Without first going to perlop, I ask: even in *character classes*?!
>
>perldoc perlop
>
>Search for "m/PATTERN/cgimosx".


david

Gunnar Hjalmarsson

2008-01-31, 8:41 am

David Combs wrote:
> In article <47829ea8$0$3575$815e3792@news.qwest.net>,
> J. Gleixner <glex_no-spam@qwest-spam-no.invalid> wrote:
>
> Without first going to perlop, I ask: even in *character classes*?!


You should have tried it instead of asking hundreds of people.

C:\home>type test.pl
$_ = 'abc';
print "Yes\n" if /[A-Z]/i;

C:\home>test.pl
Yes

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
Ted Zlatanov

2008-01-31, 7:21 pm

On Thu, 31 Jan 2008 13:46:10 +0000 (UTC) dkcombs@panix.com (David Combs) wrote:

DC> In article <86lk76vily.fsf@lifelogs.com>,
DC> Ted Zlatanov <tzz@lifelogs.com> wrote:
CW> Nathan wrote:
CW> The short answer: you can't. At least not one single, reasonably
CW> short regex that can cover it in one go. I'd simply iterate
CW> over all the possibilities and compare each one to the street address,
CW> like:

DC> A late response/request. *If* you find doing that pretty easy and
DC> quick to do, *please* show us how you'd do it.

DC> I've read the doc on it, and come away with neither facility nor understanding
DC> for actually being able to use it in a real problem.

I wrote a tutorial on P::RD a while ago, and it should still be valid.
IBM dW seems to be down right this moment, use the Google cache if you
have to. I don't mention auto_tree, which is really handy if you want
to process the data yourself.

http://www.ibm.com/developerworks/l...perl-speak.html

Here's another good one (and many others will come up in a web search):

http://www.perl.com/pub/a/2001/06/13/recdecent.html

Are you asking specifically for the mailing address example originally
posted to be implemented in P::RD, or do you need more information on
how to use P::RD for your own applications? I can certainly give a
P::RD grammar for the full list of address rules, but it's tedious work
to implement every rule the OP wanted and I don't want to spend hours of
my time doing it just to prove it's easy.

Thanks
Ted

Copyright 2008 codecomments.com