regex search - suggestions?
Hi All,
I have a string (a paragraph) without newlines, with organization
names and their abbreviations in brackets like...
$tmp = "... was proposed by World Health Organisation (WHO) in ...";
I have the following code segment:
$tmp =~ s/\)/\)\n<brk>/g; # because we have . in regex and
# there is no \n in $tmp
my ($abbr, $org) = ("", "");
my (%orgs) = ();
foreach my $line (split (/\n/, $tmp)) {
    if ($line =~ /\b([A-Z])(\w+[ forand]*) ([A-Z])(.*?) \((\1\3[A-Z]*)\)/) {
        $abbr = $5; $org = "$1$2 $3$4";
        $orgs{$abbr} = $org;
    }
}
I added [ forand]* in regex to include 'for', 'of', 'and' that might
appear after the first word.
Can anyone help me to improve the accuracy of this search, especially
the [ forand]* part.
Thanks in advance.
Ilmari Karonen 2004-07-28, 9:00 pm
On 2004-07-24, Sara <sa_ravenone@yahoo.com> wrote:
> Hi All,
> I have a string (a paragraph) without newlines, with organization
> names and their abbreviations in brackets like...
>
> $tmp = "... was proposed by World Health Organisation (WHO) in ...";
...and you want to extract the organization names and abbreviations?
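my @tmp = split /\s*\(([A-Z]+)\)/, $tmp;
pop @tmp;   # drop the trailing text after the last (ABBR)
my %orgs;
while (my ($str, $abbr) = splice(@tmp, 0, 2)) {
    # ${1} rather than $1, so the following [ is not taken as a subscript
    (my $re = $abbr) =~ s/(.)/${1}[a-z\\W]*/g;
    $str =~ /.*($re)$/s or warn "Can't expand $abbr!\n" and next;
    $orgs{$abbr} = $1;
}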
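To make the substitution step concrete, here is a small sketch of the pattern that gets generated for WHO and of how the greedy .* pushes the match to the end of the preceding text. The helper names abbr_to_pattern and expand_abbr are made up for this illustration and are not part of the post above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "WHO" into 'W[a-z\W]*H[a-z\W]*O[a-z\W]*'.
sub abbr_to_pattern {
    my ($abbr) = @_;
    # Each capital may be followed by lowercase letters and non-word characters.
    return join '', map { $_ . '[a-z\W]*' } split //, $abbr;
}

# Hypothetical helper: find the tail of $text that spells out $abbr.
sub expand_abbr {
    my ($text, $abbr) = @_;
    my $re = abbr_to_pattern($abbr);
    # Greedy .* moves the start of the capture as far right as possible.
    return $text =~ /.*($re)$/s ? $1 : undef;
}

print abbr_to_pattern('WHO'), "\n";
print expand_abbr('was proposed by World Health Organisation', 'WHO'), "\n";
```

The second call should yield "World Health Organisation", since the only capital W in the text starts that phrase.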
> Can anyone help me to improve the accuracy of this search, especially
If you could provide more sample data, I could do some more thorough
testing. My code works for your example case, and probably quite many
others. Some cases where it fails for various reasons include:
World Wide Web Consortium (W3C)
PlayStation 2 (PS2)
Church of Scientology (CoS)
Skip if Equal (SEQ)
Decrement and Jump if Not Zero (DJN)
Deutscher Jugendbund für Naturbeobachtung (DJN)
GNU's Not Unix (GNU)
Most of those can be fixed, although idiosyncratic abbreviations like
W3C are probably not worth the effort.
--
Ilmari Karonen
If replying by e-mail, please replace ".invalid" with ".net" in address.
Tad McClellan 2004-07-28, 9:00 pm
Sara <sa_ravenone@yahoo.com> wrote:
> I added [ forand]* in regex to include 'for', 'of', 'and' that might
> appear after the first word.
That will match exactly the same strings as:
[adfnor ]*
It would match:
aaaaaa
afafafaf
etc.
A character class matches a _character_, not a string.
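A quick way to see the point is to run a few strings through both classes. This is only an illustrative sketch; the helper name only_class_chars and the sample strings are assumptions, not something from the original post.

```perl
use strict;
use warnings;

# Both classes contain the same set of characters: space, a, d, f, n, o, r.
my $original = qr/^[ forand]*$/;
my $sorted   = qr/^[adfnor ]*$/;

# Hypothetical helper: 1 if $str consists only of characters from the class.
sub only_class_chars {
    my ($str, $re) = @_;
    return $str =~ $re ? 1 : 0;
}

for my $s ('for', 'of', 'and', 'aaaaaa', 'afafafaf', 'with') {
    printf "%-10s %d %d\n", $s,
        only_class_chars($s, $original),
        only_class_chars($s, $sorted);
}
```

Every sample except 'with' is accepted by both patterns, including the nonsense strings, because the class only constrains individual characters and not their order.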
> Can anyone help me to improve the accuracy of this search, especially
> the [ forand]* part.
(for|of|and)
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
Ilmari Karonen wrote in message
>...and you want to extract the organization names and abbreviations?
Yes, forgot to mention that :-o
>If you could provide more sample data, I could do some more thorough
>testing. My code works for your example case, and probably quite many
I have got organization names like ...
European Process Safety Centre (EPSC)
Association of British Chemical Manufacturers (ABCM)
Safety and Reliability Directorate (SRD)
# The next one was not found by your code
Health and Safety at Work etc. Act 1974 (HSWA)
Advisory Committee on Major Hazards (ACMH)
Center for Chemical Process Safety (CCPS)
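The entry flagged in the list above contains the digits 1974, which the filler class [a-z\W] from the earlier reply cannot consume, and that is the likely reason it was missed. The sketch below compares that filler with a looser one, [^A-Z], which is an assumption of this example rather than anything proposed in the thread. The helper name expand_with and the leading word "the" in the sample text are also made up.

```perl
use strict;
use warnings;

# Hypothetical helper: expand $abbr from the tail of $text, using $filler
# as the character class allowed between the abbreviation's letters.
sub expand_with {
    my ($text, $abbr, $filler) = @_;
    my $re = join '', map { $_ . $filler . '*' } split //, $abbr;
    return $text =~ /.*($re)$/s ? $1 : undef;
}

my $text = 'the Health and Safety at Work etc. Act 1974';

# Filler from the earlier reply: lowercase letters and non-word characters.
my $strict = expand_with($text, 'HSWA', '[a-z\W]');
# Assumed alternative: anything except an uppercase ASCII letter.
my $loose  = expand_with($text, 'HSWA', '[^A-Z]');

printf "%s\n", defined $strict ? $strict : '(no match)';
printf "%s\n", defined $loose  ? $loose  : '(no match)';
```

The looser class also lets digits and punctuation into every other expansion, so it may pick up more than intended on real data.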
>Most of those can be fixed, although idiosyncratic abbreviations like
>W3C are probably not worth the effort.
I agree, I don't want to put work into it either.
Tad McClellan wrote in message
> That will match exactly the same strings as:
> [adfnor ]*
>
>
> (for|of|and)
That was almost exactly what I tried first:
$line =~ /\b([A-Z])(\w+)( for| of| and)? ([A-Z])(.*?) \((\1\4[A-Z]*)\)/;
$abbr = $6; $org = "$1$2$3 $4$5";
$orgs{$abbr} = $org;
since 'for','of','and' don't get included in abbreviations, but won't
it produce 'Use of uninitialized value in ...' for those which don't
have 'for','of','and'? Is that ignorable?
Thanks,
Sara
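On the warning asked about above: under "use warnings", an optional capture group that takes no part in the match leaves its variable undefined, so interpolating $3 does warn. One way around it, sketched here as an assumption rather than as an answer given in the thread, is to keep the capture group mandatory and make only its contents optional, so that $3 is an empty string instead of undef. The helper name parse_org is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (abbreviation, organisation), or an empty list.
sub parse_org {
    my ($line) = @_;
    # ((?: for| of| and)?) always participates, so $3 is '' when no
    # connecting word is present, and interpolating it does not warn.
    if ($line =~ /\b([A-Z])(\w+)((?: for| of| and)?) ([A-Z])(.*?) \((\1\4[A-Z]*)\)/) {
        return ($6, "$1$2$3 $4$5");
    }
    return;
}

for my $line ('European Process Safety Centre (EPSC)',
              'Association of British Chemical Manufacturers (ABCM)') {
    my ($abbr, $org) = parse_org($line);
    print "$abbr => $org\n";
}
```

An alternative with the same effect is to keep the original group and substitute a default when building the string, but the nested non-capturing group keeps the fix inside the pattern.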