Home > Archive > PERL Beginners > March 2007 > Regex problem with accented characters
Regex problem with accented characters
| Beginner 2007-03-27, 8:01 am |
| Hi,
I am trying to extract the ISO code and country name from a 3-column
table (taken from en.wikipedia.org) and have noticed a problem with
accented characters such as Ô.
Below is my script and a sample of the data I am using. When I run
the script, the code beginning CI for Côte d'Ivoire returns the string
"CI\tC", whereas I had hoped for "CI\tCôte d'Ivoire".
Does anyone know why \w+ does not match all of Côte d'Ivoire and how I can get
around it in future?
TIA,
Dp.
==== extract.pl ========
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'iso-alpha2.txt';
open(FH,$file) or die "Can't open $file: $!\n";
while (<FH> ) {
chomp;
next if ($_ !~ /^\w{2}\s+/);
my ($code,$name) = ($_ =~
/ ^(\w{2})\s+(\w+\s\w+\s\w+s\w+|\w+\s\w+\s
\w+|\w+\s\w+|\w+)/);
print "$code\t$name\n";
}
===============
======== sample data ========
....snip
BY Belarus Previously named "Byelorussian S.S.R."
BZ Belize
CA Canada
CC Cocos (Keeling) Islands
CD Congo, the Democratic Republic of the Previously named "Zaire"
ZR
CF Central African Republic
CG Congo
CH Switzerland Code taken from "Confoederatio Helvetica", its
official Latin name
CI Côte d'Ivoire
CK Cook Islands
CL Chile
CM Cameroon
===========
| |
| Alexei A. Frounze 2007-03-27, 8:01 am |
| Beginner wrote:
> Hi,
>
> I am trying to extract the iso code and country name from a 3 column
> table (taken from en.wikipedia.org) and have noticed a problem with
> accented characters such as Ô.
>
> Below is my script and a sample of the data I am using. When I run
> the script the code beginning CI for Côte d'Ivoire returns the string
>
> "CI\tC", whereas I had hoped for "CI\tCôte d'Ivoire"
>
> Does anyone know why \w+ does not match all of Côte d'Ivoire and how I can get
> around it in future?
From perlintro (see the perl documentation):
\w a word character (a-z, A-Z, 0-9, _)
You could create your own set of allowed characters using unicode. For
Spanish I use these:
"á" eq "\x{E1}"
"é" eq "\x{E9}"
"í" eq "\x{ED}"
"ó" eq "\x{F3}"
"ú" eq "\x{FA}"
"ü" eq "\x{FC}"
"ñ" eq "\x{F1}"
Of course, there are also upper-case letters with diacritics. You can put all
of those characters, together with the unaccented letters, into a string (say,
$Letters="a\x{E1}bcde\x{E9}...") and match against it using something like
/[$Letters]+/
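As a rough sketch of that idea (the exact letter set is an assumption, and first_word is a hypothetical helper, not part of the original script), the string can be interpolated into a character class like this:

```perl
use strict;
use warnings;

# Hypothetical letter set: ASCII letters plus the Latin-1 code points listed
# above and \x{F4} (o-circumflex). Extend it to suit the data.
my $Letters = "A-Za-z\x{E1}\x{E9}\x{ED}\x{F3}\x{FA}\x{FC}\x{F1}\x{F4}";

# Return the first run of "letters" that follows a two-character code.
sub first_word {
    my ($line) = @_;
    my ($word) = $line =~ /^\w{2}\s+([$Letters]+)/;
    return $word;
}

my $word = first_word("CI C\x{F4}te d'Ivoire");
print length($word), "\n";    # prints 4 (the four characters of "Côte")
```

This assumes the input is ISO-8859-1, so each accented letter is a single character with a code point below 256. The match still stops at the space, because the set holds letters only.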
Get yourself a copy of the Unicode standard too from
http://www.unicode.org/. See the charts to find the characters you're
interested in.
HTH,
Alex
| |
| Mumia W. 2007-03-27, 8:01 am |
| On 03/27/2007 03:34 AM, Beginner wrote:
> Hi,
>
> I am trying to extract the iso code and country name from a 3 column
> table (taken from en.wikipedia.org) and have noticed a problem with
> accented characters such as Ô.
>
> Below is my script and a sample of the data I am using. When I run
> the script the code beginning CI for Côte d'Ivoire returns the string
>
> "CI\tC", whereas I had hoped for "CI\tCôte d'Ivoire"
>
> Does anyone know why \w+ does not match all of Côte d'Ivoire and how I can get
> around it in future?
It's partly the encoding. Put «use encoding "iso-8859-1";» at the top of
your program, and there will be a little improvement. However, that only
gets you as far as "Côte d"; I doubt there is any encoding where
apostrophe is in \w.
It's probably best to create an expression that contains all of the
characters you may want. That would include accented characters and the
apostrophe in this case.
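One way this might look, as a sketch only: the character ranges below assume ISO-8859-1 input, and $name_char and parse_line are illustrative names rather than anything from the original script.

```perl
use strict;
use warnings;

# Assumed character set: ASCII letters, the Latin-1 accented letters
# (skipping the multiplication and division signs at D7 and F7), and
# the apostrophe.
my $name_char = qr/[A-Za-z\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{FF}']/;

# Split a line into its two-letter code and a name made of
# whitespace-separated words built from $name_char.
sub parse_line {
    my ($line) = @_;
    my ($code, $name) = $line =~ /^([A-Z]{2})\s+($name_char+(?:\s$name_char+)*)/;
    return ($code, $name);
}

my ($code, $name) = parse_line("CI C\x{F4}te d'Ivoire");
print "$code: ", length($name), " characters\n";    # prints "CI: 13 characters"
```

Names that contain other punctuation, such as "Cocos (Keeling) Islands", would need those characters added to the set as well.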
Also, I advise you to use a programmer's editor that supports syntax
highlighting. My VIM shows me that you missed the backslash that is
supposed to be on the fourth "\s" in your regular expression.
| |
| Rob Dixon 2007-03-27, 8:01 am |
| Beginner wrote:
> Hi,
>
> I am trying to extract the iso code and country name from a 3 column
> table (taken from en.wikipedia.org) and have noticed a problem with
> accented characters such as Ô.
>
> Below is my script and a sample of the data I am using. When I run
> the script the code beginning CI for Côte d'Ivoire returns the string
>
> "CI\tC", whereas I had hoped for "CI\tCôte d'Ivoire"
>
> Does anyone know why \w+ does not match all of Côte d'Ivoire and how I can get
> around it in future?
Ordinarily the range of characters mapped by \w is limited to [0-9A-Za-z_].
However, if you put 'use locale' at the start of your program this will be
extended to include the accented alpha characters as well (see perldoc
perllocale).
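A different route to the same effect, which does not depend on the system's locale settings, is to decode the input and match under Unicode rules. This is a sketch of that alternative technique rather than of 'use locale' itself; it assumes the data is ISO-8859-1 and that Perl 5.14 or later is available for the /u modifier, and parse_latin1_line is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode ISO-8859-1 bytes to a character string, then match with /u so
# that \w follows Unicode rules. The /u modifier needs Perl 5.14 or later.
sub parse_latin1_line {
    my ($bytes) = @_;
    my $line = decode('iso-8859-1', $bytes);
    my ($code, $name) = $line =~ /^(\w{2})\s+(\w+(?:\s\w+)*)/u;
    return ($code, $name);
}

my ($code, $name) = parse_latin1_line("CI C\xF4te d'Ivoire");
print "$code: ", length($name), " characters\n";    # prints "CI: 6 characters"
```

The name comes back as the six characters "Côte d": the accented letter now matches \w, but the apostrophe still ends the match.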
However, this will still not solve your problem, as the apostrophe in
"Côte d'Ivoire" will still not match \w and you will end up with
"CI\tCôte d". I suggest you change your regex to simply match any
character at all up to the end of the line, like this:
while (<FH>) {
    chomp;
    next unless /^(\w\w)\s+(.+?)\s*$/;
    my ($code, $name) = ($1, $2);
    print "$code\t$name\n";
}
which will give the result you desire.
But you still have the problem that the line for Zaire has no text and
will not match the regex anyway!
Hope this helps.
Rob
| |
| Rob Dixon 2007-03-27, 10:01 pm |
| Beginner wrote:
>
> / ^(\w{2})\s+(\w+\s\w+\s\w+s\w+|\w+\s\w+\s
\w+|\w+\s\w+|\w+)/);
It's worth noting that this could be written:
/^(\w{2})\s+(\w+(?:\s\w+)*)/);
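A small sketch of that pattern applied to a few of the sample lines (code_and_name is a hypothetical wrapper added for illustration). Unlike the original alternation, which allows at most four words, this form accepts any number of words:

```perl
use strict;
use warnings;

# Return the two-character code and the run of \w words that follows it.
sub code_and_name {
    my ($line) = @_;
    return $line =~ /^(\w{2})\s+(\w+(?:\s\w+)*)/;
}

for my $line ('CA Canada', 'CF Central African Republic', 'CC Cocos (Keeling) Islands') {
    my ($code, $name) = code_and_name($line);
    print "$code\t$name\n";
}
```

The last line yields only "Cocos", since the parenthesis is not matched by \w.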
Rob