replace chars (Perl Beginners mailing list archive, December 2007)
|
|
| Octavian Rasnita 2007-12-26, 7:59 am |
| Hi,
I want to replace some special characters with their corresponding Western
European chars, for example ă with a, â with a, ş with s, ţ with t, î with i
and so on.
Could you please recommend a module that can do this?
I don't know what I need to search on CPAN for and I don't want to do the
replacement manually with tr// because I don't know all the special chars
that might appear.
Thank you and have a happy new year!
Octavian
| |
| Tom Phoenix 2007-12-26, 7:01 pm |
| On Dec 26, 2007 3:05 AM, Octavian Rasnita <orasnita@gmail.com> wrote:
> I want to replace some special characters with their corresponding Western
> European chars, for example ă with a, â with a, ş with s, ţ with t, î with i
> and so on.
>
> Could you please recommend a module that can do this?
You might be able to do what you want with Encode.
http://perldoc.perl.org/Encode.html
Hope this helps!
--Tom Phoenix
| |
| Gunnar Hjalmarsson 2007-12-26, 7:01 pm |
| Tom Phoenix wrote:
> On Dec 26, 2007 3:05 AM, Octavian Rasnita <orasnita@gmail.com> wrote:
I thought that all those characters were included in the Western
European character set ISO-8859-1, and if so, your requirement makes no
sense. Do you possibly mean corresponding ASCII characters?
>
> You might be able to do what you want with Encode.
>
> http://perldoc.perl.org/Encode.html
Might he? How?
The Swedish alphabet contains three non-ascii characters: å, ä and ö. To
my knowledge, there is no official encoding scheme that converts them to
a, a and o respectively. That's natural, since 'å' is a completely
different character than 'a' etc.
Sometimes, the special Swedish characters are converted in an English
context, and based on how they are pronounced, like this:
å -> ou
ä -> ae
ö -> oe
I believe the OP will need to identify all the characters he would like
to see converted, and code the conversion rules himself using the tr///
or s/// operator.
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
| |
| Octavian Rasnita 2007-12-26, 7:01 pm |
| Yes, I think there might not be any standard transformation algorithm for
doing this, and the programs that do it apply their own transformations.
So finally I've decided to try finding all the possible chars with tildes,
acute or grave accents, umlauts, etc, and replace using tr//.
I hope I won't have any issues, because the chars are UTF-8.
Thanks.
Octavian
| |
| Tom Phoenix 2007-12-26, 7:01 pm |
| On Dec 26, 2007 9:33 AM, Gunnar Hjalmarsson <noreply@gunnar.cc> wrote:
> Tom Phoenix wrote:
>
> Might he? How?
If what he wants is within the abilities of that module, of course. It
may be that you understand better than I do what the OP wants, and
what Encode can and cannot do, of course.
Cheers!
--Tom Phoenix
Stonehenge Perl Training
| |
| Gunnar Hjalmarsson 2007-12-26, 7:01 pm |
| [ Please only quote what's necessary to give context. ]
[ Please don't top-post. ]
Octavian Rasnita wrote:
> Gunnar Hjalmarsson wrote:
>
> Yes, I think there might not be any standard transformation algorithm for
> doing this, and the programs that do it apply their own transformations.
> So finally I've decided to try finding all the possible chars with
> tildes, acute or grave accents, umlauts, etc, and replace using tr//.
>
> I hope I won't have any issues, because the chars are UTF-8.
Well, then you'll probably need to identify the utf8 octet sequences
that correspond to the special characters you want to see transformed.
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
| |
| Chas. Owens 2007-12-26, 7:01 pm |
| On Dec 26, 2007 2:59 PM, Gunnar Hjalmarsson <noreply@gunnar.cc> wrote:
> [ Please only quote what's necessary to give context. ]
> [ Please don't top-post. ]
>
> Octavian Rasnita wrote:
>
> Well, then you'll probably need to identify the utf8 octet sequences
> that correspond to the special characters you want to see transformed.
snip
Perl strings are in UTF-8*, but if you want to specify a character
without using it directly (so the Perl file can still be treated as
ASCII) you use the UNICODE representation instead:
my $a_with_macron = "\x{0101}"; #UTF-8 encoding is C4 81
So, knowing the UTF-8 sequences is fairly useless.
* Well, for sufficiently recent versions of Perl.
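A minimal sketch of the distinction drawn above, assuming a Perl with the core Encode module: the string holds one character, and the two octets only appear once it is encoded for output. The variable `$octets` is introduced here for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, written as a code point so this file stays pure ASCII.
my $a_with_macron = "\x{0101}";

# Encoding turns the character into the octets that go to a file or socket.
my $octets = encode('UTF-8', $a_with_macron);

printf "characters: %d\n", length($a_with_macron);    # counts characters: 1
printf "octets:     %d\n", length($octets);           # counts bytes: 2
printf "hex:        %s\n",
    join(' ', map { sprintf('%02X', ord($_)) } split(//, $octets));    # C4 81
```

So tr/// and s/// can be written against `\x{0101}` directly, as long as the data they run on has been decoded into characters first.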
| |
| Octavian Rasnita 2007-12-26, 7:01 pm |
| From: "Chas. Owens" <chas.owens@gmail.com>
> snip
>
> Perl strings are in UTF-8*, but if you want to specify a character
> without using it directly (so the Perl file can still be treated as
> ASCII) you use the UNICODE representation instead:
>
> my $a_with_macron = "\x{0101}"; #UTF-8 encoding is C4 81
>
> So, knowing the UTF-8 sequences is fairly useless.
>
Ok, and if I want to use tr// to replace a set of UTF-8 chars, how can I do
it?
Can I simply use
tr/ăşţâîĂŞŢÂÎ/astaiASTAI/;
I am not sure I can because I've tried this, and something's not ok so I'll
need to check tomorrow.
I have also seen that length($string) returns the number of bytes of
$string, and not the number of chars (if the string contains UTF-8 chars).
How can I get the array of UTF-8 chars and the length of the string in
chars?
My script uses neither
use bytes;
nor
use utf8;
I've tried adding each of them, but... no change.
Thanks.
Octavian
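A hedged sketch of how tr/// with literal accented letters can work. It assumes two things that the question does not confirm: the script file itself is saved as UTF-8 (hence `use utf8`), and the data has been decoded from octets before tr/// runs. The helper name `strip_romanian` and the sample octets are made up for illustration, and both the cedilla and comma-below forms of s and t are listed because either may turn up in real text.

```perl
use strict;
use warnings;
use utf8;                       # the literals below are UTF-8 in the source file
use Encode qw(decode);

# Hypothetical helper: map a handful of Romanian letters to ASCII.
sub strip_romanian {
    my ($text) = @_;
    (my $plain = $text) =~ tr/ăâîşţșțĂÂÎŞŢȘȚ/aaiststAAISTST/;
    return $plain;
}

# Simulated input octets, as they might arrive from a file or database.
my $octets = "\xC5\x9Fi \xC3\xAEn";        # UTF-8 for two short words
my $string = decode('UTF-8', $octets);     # now a character string

print length($octets), " octets, ", length($string), " characters\n";
print strip_romanian($string), "\n";       # prints "si in"
```

Without the decode step, length() counts octets and tr/// compares characters against bytes, which would match the symptoms described above.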
| |
| Dr.Ruud 2007-12-26, 7:01 pm |
| "Octavian Rasnita" wrote:
> I have also seen that length($string) returns the number of bytes of
> $string, and not the number of chars (if the string contains UTF-8
> chars).
This tells me that you are taking input from an octet buffer that comes
from outside.
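use Encode ();

my $octets = <>;    # raw bytes read from outside
my $string;
eval {
    # FB_CROAK makes decode() die on malformed UTF-8 instead of substituting
    $string = Encode::decode("utf8", $octets, Encode::FB_CROAK);
    1;
} or do {
    # malformed input
};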
--
Affijn, Ruud
"Gewoon is een tijger."
| |
| Gunnar Hjalmarsson 2007-12-27, 7:01 pm |
| Chas. Owens wrote:
> On Dec 26, 2007 2:59 PM, Gunnar Hjalmarsson <noreply@gunnar.cc> wrote:
> snip
>
> Perl strings are in UTF-8*, but if you want to specify a character
> without using it directly (so the Perl file can still be treated as
> ASCII) you use the UNICODE representation instead:
>
> my $a_with_macron = "\x{0101}"; #UTF-8 encoding is C4 81
>
> So, knowing the UTF-8 sequences is fairly useless.
This is the approach I had in mind:
$ cat test.pl
#!/usr/bin/perl
use Encode;
$octets = <DATA>;
$chars = decode 'utf8', $octets;
%special = ( "\xc3\x96" => 'O', "\xc3\xa5" => 'a' );
($translated = $octets) =~ s/(\xc3\x96|\xc3\xa5)/$special{$1}/g;
printf '%-28s%s', 'Raw data (utf8 encoded): ', $octets;
printf '%-28s%s', 'Readable characters: ', $chars;
printf '%-28s%s', 'Translated characters: ', $translated;
__DATA__
Östen Mogård
$ ./test.pl
Raw data (utf8 encoded): Östen Mogård
Readable characters: Östen Mogård
Translated characters: Osten Mogard
However, I now realize that there ought to be smarter approaches...
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
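One possible smarter variant, sketched under the assumption that the input is valid UTF-8: decode first, then key the replacement table by character instead of by octet sequence. The helper name `translate_octets` is hypothetical, and the table holds only the two letters from the test data above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Replacement table keyed by character (code point), not by octet sequence.
my %special = (
    "\x{00D6}" => 'O',    # capital O with diaeresis
    "\x{00E5}" => 'a',    # small a with ring above
);

# Hypothetical helper: decode UTF-8 octets, then replace the listed characters.
sub translate_octets {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);
    # Look at every non-ASCII character; leave it alone if it is not in the table.
    $chars =~ s/(\P{ASCII})/exists $special{$1} ? $special{$1} : $1/ge;
    return $chars;
}

print translate_octets("\xC3\x96sten Mog\xC3\xA5rd"), "\n";    # prints "Osten Mogard"
```

Adding a letter then means adding one line to the table, with no need to look up its UTF-8 encoding.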
| |
| Octavian Rasnita 2007-12-27, 7:01 pm |
| From: "Dr.Ruud" <rvtol+news@isolution.nl>
> "Octavian Rasnita" wrote:
>
> This tells me that you are taking input from an octet buffer that comes
> from outside.
Yes, I am getting it from a SQLite database.
> my $octets = <>;
> my $string;
> eval {
>     $string = Encode::decode("utf8", $octets, Encode::FB_CROAK);
>     1;
> } or do {
>     # malformed input
> };
>
Ok, I can get the size of the string using this code, but please tell me how
to get the UTF-8 chars from this string.
After decoding the octets, if I do
my @chars = split //, $string;
then it also returns the octets separately and not the UTF-8 chars.
Thanks.
Octavian
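A small sketch of what split // gives before and after decoding, using a made-up octet string in place of the database value. If the real program still sees octets, one possible cause (an assumption, not something the thread confirms) is that split is applied to the undecoded variable, or that the characters are printed without encoding and only look wrong on the terminal. DBD::SQLite also has a connection attribute for returning decoded strings (`sqlite_unicode` in current releases); check the driver documentation for the installed version.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they might come back from the database: a UTF-8 encoded name.
my $octets = "Mog\xC3\xA5rd";

my @raw   = split //, $octets;                     # one element per octet
my @chars = split //, decode('UTF-8', $octets);    # one element per character

printf "%d octets, %d characters\n", scalar(@raw), scalar(@chars);

# Encode again on the way out so the terminal receives valid UTF-8.
print join(' ', map { encode('UTF-8', $_) } @chars), "\n";
```

The first line printed is "7 octets, 6 characters": the fourth element of `@chars` is a single character even though it occupies two octets in `@raw`.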
| |
|
| orasnita@gmail.com ("Octavian Rasnita") writes:
> I want to replace some special characters with their corresponding
> Western European chars, for example ă with a, â with a, ş with s, ţ
> with t, î with i and so on.
The module Text::Unidecode does exactly what you look for. The
conversion is not 100% for all possibilities but common characters are
converted ok. I've used it happily.
--
Radek
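A minimal usage sketch, assuming Text::Unidecode is installed from CPAN and that the input arrives as UTF-8 octets. The module is loaded at run time so the script still runs without it, and the helper name `to_ascii` is hypothetical. `unidecode()` expects a decoded character string, so the decode step comes first.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Text::Unidecode comes from CPAN, so load it at run time and degrade gracefully.
my $have_unidecode = eval { require Text::Unidecode; 1 };

# Hypothetical helper: decode UTF-8 octets and transliterate them to ASCII.
# Falls back to returning the decoded string if the module is missing.
sub to_ascii {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);
    return $have_unidecode ? Text::Unidecode::unidecode($chars) : $chars;
}

if ($have_unidecode) {
    print to_ascii("\xC3\x96sten Mog\xC3\xA5rd"), "\n";
} else {
    print "Text::Unidecode is not installed\n";
}
```

The transliteration table is the module's own, so the result for any given letter is whatever Text::Unidecode chooses, which may differ from local conventions.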
| |
| Octavian Rasnita 2007-12-27, 7:01 pm |
| From: "rahed" <raherh@gmail.com>
> orasnita@gmail.com ("Octavian Rasnita") writes:
>
>
> The module Text::Unidecode does exactly what you look for. The
> conversion is not 100% for all possibilities but common characters are
> converted ok. I've used it happily.
This is exactly what I was searching for.
Thank you very much!
Octavian
| |
| Octavian Rasnita 2007-12-27, 7:01 pm |
| From: "Gunnar Hjalmarsson" <noreply@gunnar.cc>
> This is the approach I had in mind:
>
> $ cat test.pl
> #!/usr/bin/perl
> use Encode;
>
> $octets = <DATA>;
>
> $chars = decode 'utf8', $octets;
>
> %special = ( "\xc3\x96" => 'O', "\xc3\xa5" => 'a' );
> ($translated = $octets) =~ s/(\xc3\x96|\xc3\xa5)/$special{$1}/g;
>
> printf '%-28s%s', 'Raw data (utf8 encoded): ', $octets;
> printf '%-28s%s', 'Readable characters: ', $chars;
> printf '%-28s%s', 'Translated characters: ', $translated;
>
I am thinking of doing something like:
$text =~
tr/ oiaeuüäéö¦EÁëÍÚÝÖíµcÉeçôËÄúÓßCýuonÜóáeEÔ
Oyg»r§a«NÇdRrNEškCuUUDsnAOnc/ oiaeuuaeo|EAeIUYOiueEicoEAuOBEyuooUoaeIO
Oyg>aSa<OCiAoNEskCuUUISnAOnc/;
...because it requires less code.
Octavian
| |
| Yitzle 2007-12-27, 7:01 pm |
| On Dec 27, 2007 11:46 AM, Octavian Rasnita <orasnita@gmail.com> wrote:
> I am thinking to do something like:
<SNIP>
> ...because it requires less code.
>
More legible code is usually far more valuable than shorter code.
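One possibly more legible alternative to a long tr/// list, sketched with the core Unicode::Normalize module. This is a different technique from anything proposed in the thread: decompose each letter into base letter plus combining marks, then delete the marks. The helper name `strip_marks` is hypothetical, and the input is assumed to be a decoded character string.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: split each letter into base letter plus combining marks,
# then delete the marks. Takes and returns a decoded character string.
sub strip_marks {
    my ($chars) = @_;
    (my $plain = NFD($chars)) =~ s/\p{Mn}+//g;
    return $plain;
}

print strip_marks("\x{00D6}sten Mog\x{00E5}rd"), "\n";    # prints "Osten Mogard"
```

Letters that have no decomposition, such as ø or ß, pass through unchanged, so this covers accents, umlauts and cedillas but not every special character.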
|