
regex search - suggestions?
Sara

2004-07-24, 3:56 am

Hi All,
I have a string (a paragraph) without newlines, with organization
names and their abbreviations in brackets like...

$tmp = "... was proposed by World Health Organisation (WHO) in ...";

I have the following code segment:

$tmp =~ s/\)/\)\n<brk>/g; # because we have . in regex and
                          # there is no \n in $tmp
my ($abbr, $org) = ("", "");
my %orgs = ();
foreach my $line (split (/\n/, $tmp)) {
    if ($line =~ /\b([A-Z])(\w+[ forand]*) ([A-Z])(.*?) \((\1\3[A-Z]*)\)/) {
        $abbr = $5; $org = "$1$2 $3$4";
        $orgs{$abbr} = $org;
    }
}
I added [ forand]* in regex to include 'for', 'of', 'and' that might
appear after the first word.
Can anyone help me to improve the accuracy of this search, especially
the [ forand]* part.
Thanks in advance.

Ilmari Karonen

2004-07-28, 9:00 pm

On 2004-07-24, Sara <sa_ravenone@yahoo.com> wrote:
> Hi All,
> I have a string (a paragraph) without newlines, with organization
> names and their abbreviations in brackets like...
>
> $tmp = "... was proposed by World Health Organisation (WHO) in ...";


...and you want to extract the organization names and abbreviations?

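# The capturing group makes split return the abbreviations too, so @tmp
# alternates text, abbreviation, text, ...; -1 keeps a trailing empty field.
my @tmp = split /\s*\(([A-Z]+)\)/, $tmp, -1;
pop @tmp;    # drop the text after the last abbreviation

my %orgs;
while (my ($str, $abbr) = splice(@tmp, 0, 2)) {
    # WHO becomes W[a-z\W]*H[a-z\W]*O[a-z\W]*; the \[ makes sure $1[ is
    # not taken as an array subscript in the replacement.
    (my $re = $abbr) =~ s/(.)/$1\[a-z\\W]*/g;
    $str =~ /.*($re)$/s or warn "Can't expand $abbr!\n" and next;
    $orgs{$abbr} = $1;
}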
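
To make the two steps of the loop above easier to follow, here is a small sketch that prints nothing but the final pair and keeps the intermediate values in named variables. It is only an illustration on the single example sentence from the question; building the pattern with join/map instead of s/// is a swapped-in technique chosen so that no string interpolation is involved, and the variable names @parts, $re and $name are made up for this sketch.

```perl
use strict;
use warnings;

my $tmp = "... was proposed by World Health Organisation (WHO) in ...";

# A capturing group in the split pattern makes split return the captured
# text as well, so text and abbreviation alternate in the result:
# (text before, "WHO", text after).
my @parts = split /\s*\(([A-Z]+)\)/, $tmp;

# One initial at a time, each followed by any run of lowercase letters
# and non-word characters (spaces, punctuation).
my $abbr = $parts[1];
my $re   = join '', map { $_ . '[a-z\W]*' } split //, $abbr;

# The leading greedy .* pushes the match as far right as possible, and
# the trailing $ forces it to end right before the bracket.
my ($name) = $parts[0] =~ /.*($re)$/s;

print "$abbr => $name\n";
```

Here $re ends up as W[a-z\W]*H[a-z\W]*O[a-z\W]*, which is why a digit or an extra capital letter inside the name is enough to make the expansion fail.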


> Can anyone help me to improve the accuracy of this search, especially


If you could provide more sample data, I could do some more thorough
testing. My code works for your example case, and probably for a good
many others. Some cases where it fails, for various reasons, include:

World Wide Web Consortium (W3C)
PlayStation 2 (PS2)
Church of Scientology (CoS)
Skip if Equal (SEQ)
Decrement and Jump if Not Zero (DJN)
Deutscher Jugendbund für Naturbeobachtung (DJN)
GNU's Not Unix (GNU)

Most of those can be fixed, although idiosyncratic abbreviations like
W3C are probably not worth the effort.

--
Ilmari Karonen
If replying by e-mail, please replace ".invalid" with ".net" in address.

Tad McClellan

2004-07-28, 9:00 pm

Sara <sa_ravenone@yahoo.com> wrote:

> I added [ forand]* in regex to include 'for', 'of', 'and' that might
> appear after the first word.



That will match exactly the same strings as:

[adfnor ]*

It would match:

aaaaaa
afafafaf

etc.

A character class matches a _character_, not a string.


> Can anyone help me to improve the accuracy of this search, especially
> the [ forand]* part.



(for|of|and)
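
The difference between the character class and the alternation can be checked with a short sketch like the one below. The helper names class_matches and word_matches and the test strings are invented for illustration; the patterns are anchored with ^ and $ only so that each string is tested as a whole.

```perl
use strict;
use warnings;

# True if $s consists only of characters from the class, which is all
# that [ forand]* asks for.
sub class_matches {
    my ($s) = @_;
    return $s =~ /^[ forand]*$/ ? 1 : 0;
}

# True only if $s is exactly one of the three words.
sub word_matches {
    my ($s) = @_;
    return $s =~ /^(?:for|of|and)$/ ? 1 : 0;
}

for my $s ('for', 'of', 'and', 'aaaaaa', 'afafafaf', 'darn food') {
    printf "%-10s class: %d  alternation: %d\n",
        $s, class_matches($s), word_matches($s);
}
```

The class accepts all six strings, while the alternation accepts only the first three.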


--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas

Sara

2004-07-28, 9:00 pm

Ilmari Karonen wrote in message
>...and you want to extract the organization names and abbreviations?

Yes, forgot to mention that :-o

>If you could provide more sample data, I could do some more thorough
>testing. My code works for your example case, and probably quite many

I have got organization names like ...
European Process Safety Centre (EPSC)
Association of British Chemical Manufacturers (ABCM)
Safety and Reliability Directorate (SRD)
# The next one was not found by your code
Health and Safety at Work etc. Act 1974 (HSWA)
Advisory Committee on Major Hazards (ACMH)
Center for Chemical Process Safety (CCPS)
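
One way to see why that entry is missed is to run the same expansion idea over the names above. The sketch below is only an illustration: expand() and its $filler argument are hypothetical names, not code from the thread, and allowing digits between the initials is an assumption about what a fix might look like rather than a recommendation.

```perl
use strict;
use warnings;

# Hypothetical helper: try to expand $abbr at the end of $text.
# $filler is the character class allowed between the initials.
sub expand {
    my ($text, $abbr, $filler) = @_;
    my $re = join '', map { quotemeta($_) . $filler . '*' } split //, $abbr;
    return $text =~ /.*($re)$/s ? $1 : undef;
}

my @samples = (
    'European Process Safety Centre (EPSC)',
    'Association of British Chemical Manufacturers (ABCM)',
    'Safety and Reliability Directorate (SRD)',
    'Health and Safety at Work etc. Act 1974 (HSWA)',
    'Advisory Committee on Major Hazards (ACMH)',
    'Center for Chemical Process Safety (CCPS)',
);

for my $sample (@samples) {
    my ($text, $abbr) = $sample =~ /^(.*?)\s*\(([A-Z]+)\)$/ or next;
    my $original = expand($text, $abbr, '[a-z\W]');     # class used in the thread
    my $digits   = expand($text, $abbr, '[a-z\d\W]');   # assumption: also allow digits
    printf "%-5s original: %-3s with digits: %s\n",
        $abbr,
        defined $original ? 'ok' : '-',
        defined $digits   ? 'ok' : '-';
}
```

With the original class, the "1974" after "Act" cannot be consumed, so the pattern never reaches the end of the string and HSWA is the only entry that fails. Allowing digits fixes that case, at the price of a looser pattern that may accept more text than intended elsewhere.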

>Most of those can be fixed, although idiosyncratic abbreviations like
>W3C are probably not worth the effort.

I agree, I don't want to put in the work for those either.


Tad McClellan wrote in message
> That will match exactly the same strings as:
> [adfnor ]*
>
>
> (for|of|and)


That was almost exactly what I tried first:
$line =~ /\b([A-Z])(\w+)( for| of| and)? ([A-Z])(.*?) \((\1\4[A-Z]*)\)/;
$abbr = $6; $org = "$1$2$3 $4$5";
$orgs{$abbr} = $org;

since 'for','of','and' don't get included in abbreviations, but won't
it produce 'Use of uninitialized value in ...' for those which don't
have 'for','of','and'? Is that ignorable?
Thanks,
Sara
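
On the warning question, a small sketch may help. When an optional group such as ( for| of| and)? does not take part in a match, its capture variable is undef, and interpolating it under "use warnings" does produce the "Use of uninitialized value" warning, although the resulting string is still the expected one. The variant with the ? moved inside the capture is an assumption about one possible way around it, not something proposed in the thread, and the sub names are made up.

```perl
use strict;
use warnings;

my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };   # collect warnings instead of printing

# The pattern from the post: the optional group is skipped when there is
# no connector, so $3 is undef in that case.
sub org_optional_group {
    my ($line) = @_;
    return unless $line =~ /\b([A-Z])(\w+)( for| of| and)? ([A-Z])(.*?) \((\1\4[A-Z]*)\)/;
    return ($6, "$1$2$3 $4$5");
}

# Variant: the ? sits inside the capture, so the group always takes part
# in the match and $3 is the empty string when no connector is present.
sub org_empty_capture {
    my ($line) = @_;
    return unless $line =~ /\b([A-Z])(\w+)((?: for| of| and)?) ([A-Z])(.*?) \((\1\4[A-Z]*)\)/;
    return ($6, "$1$2$3 $4$5");
}

my $line = "... was proposed by World Health Organisation (WHO) in ...";

my ($abbr1, $org1) = org_optional_group($line);
my $first = @warnings;                          # warnings raised by the first sub

my ($abbr2, $org2) = org_empty_capture($line);
my $second = @warnings - $first;                # warnings raised by the second sub

print "$abbr1 => $org1 (warnings: $first)\n";
print "$abbr2 => $org2 (warnings: $second)\n";
```

Both subs return the same name for this line; only the first one triggers the warning. Checking "defined $3" before building the string would be another way to keep the original pattern.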