Home > Archive > PERL Beginners > June 2007 > bug in perl or in my head ;-)
bug in perl or in my head ;-)
| Martin Barth 2007-06-19, 3:59 am |
| Hi there,
have a look at:
<snip>
% cat datei
eine test datei
die "u "a "o
% file datei
datei: ASCII text
% cp datei datei.bk
% perl -wpi -e 'use encoding "utf8"; s/"a/ä/' datei
% file datei
datei: ISO-8859 text
% perl -wp -e 'use encoding "utf8"; s/"a/ä/' datei.bk > datei.neu
% file datei.neu
datei.neu: UTF-8 Unicode text
</snip>
I'm a bit confused. Both files should be utf8??
( my xterm is utf8 )
Regards
Martin
| |
| Tom Phoenix 2007-06-19, 3:59 am |
| On 6/18/07, Martin Barth <martin@senfdax.de> wrote:
> I'm a bit confused. Both files should be utf8??
Probably. It's worth a bug report, at least.
Cheers!
--Tom Phoenix
Stonehenge Perl Training
| |
| Martin Barth 2007-06-19, 7:58 am |
| > Probably. It's worth a bug report, at least.
I sent it.
| |
| Martin Barth 2007-06-19, 6:59 pm |
| Hi jay,
> You haven't told us what Perl thinks the encoding of the first file
> is.
how can I do that?
> file is a system command that makes use of number of different
> approaches to determine file type including, on some systems, I think
> it even makes use of metadata. Actually examining the data in the file
> is time-consuming, and therefore a method of last resort, employed
> only when some other context doesn't match. It also returns the first
> match, not all matches.
You're right, but my input file contains only 7-bit ASCII data. So
every file perl creates or modifies should be utf8. I am working on
Ubuntu, so everything should be utf8-ified. My xterm is utf8! That
means that the "ä" in s/// is utf8, too.
<snip>
> At the command line, you can use the -C switch to avoid confusion.
If I understand you correctly, the following code should always create a
utf8-encoded file, since my input file contains only 7-bit ASCII data and
STDIN, STDOUT and STDERR are switched to utf8.
% perl -C7 -wpi -e 'use encoding "utf8"; s/"o/ö/' datei
% file datei
datei: ISO-8859 text
% hexdump -C datei
00000000 65 69 6e 65 20 74 65 73 74 20 64 61 74 65 69 0a |eine test datei.|
00000010 64 69 65 20 22 75 20 f6 20 0a |die "u . .|
f6 = ö in latin1
c3 b6 = ö in utf8
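
The two byte sequences above can be reproduced with the core Encode module. This is only a minimal sketch: the helper name hex_bytes is made up for illustration, and the character is written as an escape so the result does not depend on how the script file itself is encoded.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the bytes of $char in $encoding as space-separated lowercase hex.
# (hex_bytes is a hypothetical helper name, not part of Encode.)
sub hex_bytes {
    my ($encoding, $char) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, encode($encoding, $char);
}

my $o_umlaut = "\x{f6}";    # U+00F6, the letter ö

# Show the on-disk bytes for the same character in both encodings.
printf "%-10s %s\n", $_, hex_bytes($_, $o_umlaut) for qw(iso-8859-1 UTF-8);
```

A single f6 byte in the hexdump therefore means the file was written as latin1, whatever the terminal is set to.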
Regards
Martin
| |
| Dr.Ruud 2007-06-20, 7:59 am |
Martin Barth wrote:
> [use encoding]
> If I understand you right, following code should always create a utf8
> encoded file.
No, "use encoding" is about the encoding of your script, not about file
IO.
<quote src="encoding">
encoding - allows you to write your script in non-ascii or non-utf8
</quote>
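
To illustrate that the bytes on disk are decided by the I/O layer on the filehandle, here is a rough sketch that writes the same string through two different layers and reads the raw bytes back. The sub name written_bytes and the use of a temporary file are assumptions made for the example, not something from the thread.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write $text to $path through the given PerlIO layer, then return the
# raw bytes found on disk as lowercase hex.
# (written_bytes is a hypothetical helper name.)
sub written_bytes {
    my ($path, $layer, $text) = @_;

    open my $out, ">$layer", $path or die "Cannot write $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";

    # Read back without any decoding so we see the actual bytes.
    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    my $bytes = do { local $/; <$in> };
    close $in;

    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

my (undef, $path) = tempfile(UNLINK => 1);    # scratch file, removed at exit
print written_bytes($path, ':encoding(UTF-8)',      "\x{f6}\n"), "\n";
print written_bytes($path, ':encoding(iso-8859-1)', "\x{f6}\n"), "\n";
```

The string is identical in both calls; only the layer named in open() changes what ends up in the file.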
> Since my inputfile does only contain 7bit ascii data.
> and STDIN STDOUT and STDERR is changed to utf8.
>
> % perl -C7 -wpi -e 'use encoding "utf8"; s/"o/ö/' datei
In that case, your -C7 could be -C4 or -CE, because STDIN and STDOUT are
already handled by the "encoding" pragma, see again `perldoc encoding`.
But you missed the 8+16 (i+o). See `perldoc perlrun`.
The C<use encoding "utf8"> could be done through -M. But you don't need
"encoding".
So better write it as
perl -Cio -wpi -e 's/"o/\x{f6}/' datei
(or -CIOEio, which is -C31)
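
The arithmetic behind those -C numbers can be sketched like this. The letter values are the ones listed in `perldoc perlrun` (check the copy that ships with your perl); the sub name c_flags_to_number is invented for the example.

```perl
use strict;
use warnings;

# Bit values of the -C option letters, as listed in perldoc perlrun.
my %C_FLAG = (
    I => 1,  O => 2,  E => 4,  S => 7,
    i => 8,  o => 16, D => 24,
    A => 32, L => 64, a => 256,
);

# Combine a string of -C letters into the equivalent number.
# (c_flags_to_number is a hypothetical helper name.)
sub c_flags_to_number {
    my ($letters) = @_;
    my $number = 0;
    for my $letter (split //, $letters) {
        die "Unknown -C letter '$letter'\n" unless exists $C_FLAG{$letter};
        $number |= $C_FLAG{$letter};    # flags are bits, so OR them together
    }
    return $number;
}

printf "-C%-6s = -C%d\n", $_, c_flags_to_number($_) for qw(S io IOEio);
```

So -C7 covers only STDIN, STDOUT and STDERR, while the files opened by -p and -i need the extra i and o bits (8 and 16).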
> % file datei
> datei: ISO-8859 text
Why not "ASCII text"? Are you sure there are no 8 bit values in there?
(Maybe you forgot to put the original file back, consider "-i.bak".)
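
One way to check for 8-bit values from Perl itself, as a rough sketch: the three labels and the sub name classify_bytes are made up for illustration, and this is not how file(1) is actually implemented.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Classify a byte string as pure 7-bit ASCII, valid UTF-8, or some
# other 8-bit encoding (for example latin1).
# (classify_bytes is a hypothetical helper name.)
sub classify_bytes {
    my ($bytes) = @_;
    return 'ASCII' unless $bytes =~ /[^\x00-\x7f]/;

    # decode() dies on malformed input when FB_CROAK is set;
    # LEAVE_SRC keeps it from modifying $bytes.
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 'UTF-8' : 'other 8-bit';
}

print classify_bytes(qq{die "u "o\n}),     "\n";    # only 7-bit characters
print classify_bytes("die \"u \xc3\xb6\n"), "\n";   # ö as two UTF-8 bytes
print classify_bytes("die \"u \xf6\n"),     "\n";   # ö as one latin1 byte
```

Reading the file with a `<:raw` layer and passing its contents to such a check tells you whether a lone f6 has crept in.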
> % hexdump -C datei
> 00000000 65 69 6e 65 20 74 65 73 74 20 64 61 74 65 69 0a
> |eine test datei.|
> 00000010 64 69 65 20 22 75 20 f6 20 0a
> |die "u . .|
>
> f6 = ö in latin1
> c3 b6 = ö in utf8
$ file datei
datei: ASCII text
$ hexdump -e '"%07_ad" 16/1 " %02X" "\n"' \
-e '" " 16/1 " %-2_p" "\n\n"' datei
0000000 65 69 6E 20 74 65 73 74 20 64 61 74 65 69 0A 64
e i n t e s t d a t e i . d
0000016 69 65 20 22 75 20 22 6F 0A
i e " u " o .
$ perl -C31 -i.bak -wpe 's/"o/\x{f6}/g' datei
$ hexdump -e '"%07_ad" 16/1 " %02X" "\n"' \
-e '" " 16/1 " %-2_p" "\n\n"' datei
0000000 65 69 6E 20 74 65 73 74 20 64 61 74 65 69 0A 64
e i n t e s t d a t e i . d
0000016 69 65 20 22 75 20 C3 B6 0A
i e " u . . .
--
Affijn, Ruud
"Gewoon is een tijger."
| |
| Martin Barth 2007-06-20, 7:59 am |
| Thank you
I learnt a lot!
Martin
| |