











 

Author Re: Assigning another filehandle to STDOUT, using binmode.
Adam Funk

2007-06-22, 7:04 pm

On 2007-06-21, Dr.Ruud wrote:

> Adam Funk schreef:
>
>
> What is annoying about them? They just mean that you need to fix your
> program.


OK, let me try a different set of questions: is using binmode the
correct way to fix the error that causes those warnings?


As I said, I'm running the program in a UTF-8 environment but getting
thousands (I think) of identical warnings about "Wide characters"
which actually refer to correct UTF-8 characters that Perl has read
from input data files without a hiccup.

Why is it unreasonable that I find this annoying?
or
What am I doing that constitutes an error?
Mumia W.

2007-06-22, 10:04 pm

On 06/22/2007 06:18 PM, John W. Krahn wrote:
> Mumia W. wrote:
>
> You mean like:
>
> open FH, '<:raw', 'filename';
>
> ??
>
>
> John


Oh yeah.

;-)

Peter J. Holzer

2007-06-23, 10:09 pm

On 2007-06-22 18:18, Adam Funk <a24061@ducksburg.com> wrote:
> On 2007-06-21, Dr.Ruud wrote:
>
> OK, let me try a different set of questions: is using binmode the
> correct way to fix the error that causes those warnings?


It is "a" correct way, not "the" correct way. There are other ways: The
-C option (and its cousin, the PERL_UNICODE environment variable),
specifying perl I/O layers for open, etc.

I generally prefer

open($fh, '<:utf8', $filename);

to

open($fh, '<', $filename);
binmode $fh, ':utf8';

because it is shorter and cleaner. So I use binmode only on STDIN,
STDOUT and (rarely) STDERR, and then I might use -C instead.

I used to use the PERL_UNICODE environment variable, but that bit me
almost as often as it helped, so I don't do that any more.

> As I said, I'm running the program in a UTF-8 environment but getting
> thousands (I think) of identical warnings about "Wide characters"
> which actually refer to correct UTF-8 characters that Perl has read
> from input data files without a hiccup.
>
> Why is it unreasonable that I find this annoying?
> or
> What am I doing that constitutes an error?


You are producing complete garbage. Consider this:

------------------------------------------------------------------------
1 #!/usr/bin/perl
2
3 use warnings;
4 use strict;
5 use utf8;
6
7 my $s1 = "Rübezahl\n";
8 my $s2 = "€ 200,--\n";
9
10 print $s1;
11 print $s2;
------------------------------------------------------------------------
hrunkner:~/tmp 21:55 193% ./foo | od -c
Wide character in print at ./foo line 11.
0000000 R 374 b e z a h l \n 342 202 254 2 0 0
0000020 , - - \n
0000024
hrunkner:~/tmp 21:55 194%

As you can see you get the warning only when printing $s2, but *not*
when printing $s1. The "ü" in $s1 has a code of less than 256, so it can
be printed as a single byte, and is. The € cannot be printed as a single
byte, so it is encoded as UTF-8 and a warning is printed.

The end result is that the output is a mixture of encodings. The first
line is ISO-8859-1, the second is UTF-8. It is impossible to read this
mess again. (And perl really cannot help this - in line 10 it doesn't
know that it will be asked to print a euro sign in line 11, it doesn't
even know it is printing text - it might print an image).

Now if we add a -CO to the shebang line, the output is:

hrunkner:~/tmp 22:04 198% ./foo | od -c
0000000 R 303 274 b e z a h l \n 342 202 254 2 0
0000020 0 , - - \n
0000025

And we now have both lines encoded in UTF-8.
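
The same result can be had from inside the program with an I/O layer
instead of -CO. What follows is only a minimal sketch: it prints to an
in-memory filehandle as a stand-in for STDOUT so that the resulting bytes
can be inspected, and the helper name bytes_written is made up for
illustration.

```perl
use strict;
use warnings;

# Print the given strings to an in-memory filehandle opened with the
# given I/O layer and return the bytes that would have reached a file.
# (Hypothetical helper, not part of any module.)
sub bytes_written {
    my ($layer, @strings) = @_;
    my $buf = '';
    open(my $fh, ">$layer", \$buf)
        or die "cannot open in-memory handle: $!";
    print {$fh} @strings;
    close($fh) or die "close failed: $!";
    return $buf;
}

my $s1 = "R\x{FC}bezahl\n";      # "ü" is U+00FC, below 256
my $s2 = "\x{20AC} 200,--\n";    # "€" is U+20AC, above 255

# With an explicit encoding layer both lines come out as UTF-8.
my $out = bytes_written(':encoding(UTF-8)', $s1, $s2);
print join(' ', map { sprintf '%02X', ord } split //, $out), "\n";
```

In a real script the equivalent for standard output would be
binmode(STDOUT, ':encoding(UTF-8)') before the first print.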

hp


--
_ | Peter J. Holzer | I know I'd be respectful of a pirate
|_|_) | Symin WSR | with an emu on his shoulder.
| | | hjp@hjp.at |
__/ | http://www.hjp.at/ | -- Sam in "Freefall"
Adam Funk

2007-06-25, 7:10 pm

On 2007-06-22, Mumia W. wrote:

>
> You probably are assuming that open() configures your filehandles with
> binmode() for you. This isn't true.
>
> If you open a file, and it needs a special encoding,


By "special" you mean "anything other than ASCII", right?

> you need to call
> binmode(). If you close and re-open STDOUT, you need to call binmode()
> on it (if it needs encoding). If you close and re-open STDOUT when it's
> aliased as OUTPUT, you still need to set up the encoding.
>
> When you need an encoding, it's your responsibility to use binmode() to
> set it on each file handle. The only exception I'm aware of is when the
> "encoding" module is used. But that only sets up STDIN and STDOUT, and
> it only sets them once. Even if the encoding pragma is used, if STDOUT
> is closed and re-opened, binmode() must be called on it again.
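
The point about re-opening can be checked with PerlIO::get_layers. This
is a rough sketch that uses a scratch file from File::Temp rather than
STDOUT itself; has_utf8_layer is a made-up helper name.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Report whether a handle currently carries the :utf8 layer.
# (Hypothetical helper for illustration.)
sub has_utf8_layer {
    my ($fh) = @_;
    return scalar grep { $_ eq 'utf8' } PerlIO::get_layers($fh);
}

my (undef, $path) = tempfile(UNLINK => 1);    # scratch file

open(my $out, '>', $path) or die "cannot open $path: $!";
binmode($out, ':utf8') or die "binmode failed: $!";
my $before = has_utf8_layer($out);

# Re-opening the same handle starts again from the default layers.
open($out, '>', $path) or die "cannot re-open $path: $!";
my $after_reopen = has_utf8_layer($out);

# So the encoding has to be set up once more.
binmode($out, ':utf8') or die "binmode failed: $!";
my $after_binmode = has_utf8_layer($out);
close($out) or die "close failed: $!";

printf "before: %d, after re-open: %d, after binmode: %d\n",
    $before, $after_reopen, $after_binmode;
```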


OK, thanks.
Adam Funk

2007-06-25, 7:10 pm

On 2007-06-22, John W. Krahn wrote:

> Mumia W. wrote:


>
> You mean like:
>
> open FH, '<:raw', 'filename';
>
> ??


But to be fair to Mumia, the "simpler" form of open() doesn't do that,
and I was expressing surprise that open() didn't assume the
environment locale to be applicable.


Is there any difference between

open(OUTPUT, '>:utf8', $output_filename);

and

open(OUTPUT, ">" . $output_filename);
binmode (OUTPUT, ":utf8");

or should I just use whichever one I find more aesthetic?
Adam Funk

2007-06-25, 7:10 pm

On 2007-06-23, Peter J. Holzer wrote:

> It is "a" correct way, not "the" correct way. There are other ways: The
> -C option (and its cousin, the PERL_UNICODE environment variable),
> specifying perl I/O layers for open, etc.
>
> I generally prefer
>
> open($fh, '<:utf8', $filename);
>
> to
>
> open($fh, '<', $filename);
> binmode $fh, ':utf8';
>
> because it is shorter and cleaner. So I use binmode only on STDIN,
> STDOUT and (rarely) STDERR, and then I might use -C instead.


As far as I can tell, I'm not getting errors or warnings reading the
input files (but I'm not doing it directly with my own code --- I'm
using XML::Twig's parsefile($input_filename) method; the input files
are XML with Cyrillic UTF-8 PCDATA) --- does Perl by default take the
environment into consideration, or assume UTF-8, for input but not
output?
Peter J. Holzer

2007-06-25, 7:10 pm

On 2007-06-25 10:13, Adam Funk <a24061@ducksburg.com> wrote:
> On 2007-06-22, Mumia W. wrote:
>
> By "special" you mean "anything other than ASCII", right?


"Anything other than what happens to be the default in your perl
implementation" actually. That might be EBCDIC :-).

It might be a good idea to always specify the intended encoding.

If you want to get the current charset/encoding from the locale, you can
use I18N::Langinfo:


use I18N::Langinfo qw(langinfo CODESET);
my $charset = langinfo(CODESET);

[...]

open(my $fh, "<:encoding($charset)", $filename);
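
A slightly more defensive variant might check that Encode knows the
charset name before building the layer string. This is only a sketch:
the helper layer_for_charset and the fallback to UTF-8 are assumptions
made for illustration, and langinfo(CODESET) is not available on every
platform.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use I18N::Langinfo qw(langinfo CODESET);

# Build an I/O layer string for a charset name. Falls back to UTF-8
# when Encode does not recognise the name (the fallback is an assumption,
# pick whatever default suits your data).
sub layer_for_charset {
    my ($charset) = @_;
    my $known = defined $charset
             && length $charset
             && find_encoding($charset);
    return ':encoding(' . ($known ? $charset : 'UTF-8') . ')';
}

my $layer = layer_for_charset(langinfo(CODESET));
print "layer for this locale: $layer\n";

# Usage, with $filename supplied by the caller:
#   open(my $fh, "<$layer", $filename) or die "cannot open $filename: $!";
```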

hp


Peter J. Holzer

2007-06-25, 7:10 pm

On 2007-06-25 10:28, Adam Funk <a24061@ducksburg.com> wrote:
> As far as I can tell, I'm not getting errors or warnings reading the
> input files (but I'm not doing it directly with my own code --- I'm
> using XML::Twig's parsefile($input_filename) method; the input files
> are XML with Cyrillic UTF-8 PCDATA) --- does Perl by default take the
> environment into consideration,


No. By default it assumes (on Unix) binary input. You are reading and
writing a stream of bytes, not a stream of characters.

> or assume UTF-8, for input but not output?


No. The XML parser gets the encoding from the XML file. If the XML file
doesn't explicitly specify an encoding, it must be UTF-8. This is
completely independent of the locale. XML files are supposed to be
portable and must not be interpreted differently depending on the
locale.

hp


Peter J. Holzer

2007-06-25, 7:10 pm

On 2007-06-25 10:18, Adam Funk <a24061@ducksburg.com> wrote:
> But to be fair to Mumia, the "simpler" form of open() doesn't do that,
> and I was expressing surprise that open() didn't assume the
> environment locale to be applicable.


open cannot know whether the file it opens is supposed to be a text file
or a binary file. Since perl previously treated all files as binary on
Unix, it made sense to keep that as the default. Changing the default
would have broken lots of old scripts.

> Is there any difference between
>
> open(OUTPUT, '>:utf8', $output_filename);
>
> and
>
> open(OUTPUT, ">" . $output_filename);
> binmode (OUTPUT, ":utf8");
>
> or should I just use whichever one I find more aesthetic?


AFAIK they are equivalent.
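
One way to convince yourself is to write the same string both ways and
compare the resulting bytes. A minimal sketch, assuming scratch files
from File::Temp; slurp_bytes is a made-up helper name.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a whole file back as raw bytes. (Hypothetical helper.)
sub slurp_bytes {
    my ($path) = @_;
    open(my $in, '<:raw', $path) or die "cannot read $path: $!";
    local $/;                       # slurp mode
    my $bytes = <$in>;
    close($in) or die "close failed: $!";
    return $bytes;
}

my $text = "\x{411}\x{20AC}\n";     # Cyrillic "Б", euro sign, newline

# Form 1: layer given directly to open.
my (undef, $file_a) = tempfile(UNLINK => 1);
open(my $out_a, '>:utf8', $file_a) or die "cannot open $file_a: $!";
print {$out_a} $text;
close($out_a) or die "close failed: $!";

# Form 2: plain open followed by binmode.
my (undef, $file_b) = tempfile(UNLINK => 1);
open(my $out_b, '>', $file_b) or die "cannot open $file_b: $!";
binmode($out_b, ':utf8') or die "binmode failed: $!";
print {$out_b} $text;
close($out_b) or die "close failed: $!";

my $same = slurp_bytes($file_a) eq slurp_bytes($file_b);
print $same ? "identical bytes\n" : "different bytes\n";
```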

hp


Adam Funk

2007-06-25, 7:10 pm

On 2007-06-25, Peter J. Holzer wrote:

> On 2007-06-25 10:28, Adam Funk <a24061@ducksburg.com> wrote:
>
> No. By default it assumes (on Unix) binary input. You are reading and
> writing a stream of bytes, not a stream of characters.
>
>
> No. The XML parser gets the encoding from the XML file. If the XML file
> doesn't explicitly specify an encoding, it must be UTF-8. This is
> completely independent of the locale. XML files are supposed to be
> portable and must not be interpreted differently depending on the
> locale.


Oh of course! I got so caught up in this business of setting
encodings that I forgot about the encoding specified explicitly in the
XML file.
Adam Funk

2007-06-25, 7:10 pm

On 2007-06-25, Peter J. Holzer wrote:

> On 2007-06-25 10:18, Adam Funk <a24061@ducksburg.com> wrote:
>
> open cannot know whether the file it opens is supposed to be a text file
> or a binary file. Since perl previously treated all files as binary on
> Unix, it made sense to keep that as the default. Changing the default
> would have broken lots of old scripts.


It's starting to make sense now.


>
> AFAIK they are equivalent.


Thanks.
Adam Funk

2007-06-25, 7:10 pm

On 2007-06-25, Peter J. Holzer wrote:

>
> "Anything other than what happens to be the default in your perl
> implementation" actually. That might be EBCDIC :-).


I've got enough trouble already, thanks. ;-)
Peter J. Holzer

2007-06-26, 10:03 pm

On 2007-06-26 10:32, Adam Funk <a24061@ducksburg.com> wrote:
> On 2007-06-25, Peter J. Holzer wrote:
>
>
> I think I get it. String literals and variables just contain strings
> of bytes,


No. Perl strings do not consist of bytes. Since there is no official
name for the thingies a perl string is made of, I'll just call them
"thingies".

On the most abstract level, about the only thing we know about these
thingies is that they are numbered: You get the number of the first
thingy in a string with ord() and you can create a string containing
only a single thingy with a specific number with chr(). The numbers
range from 0 .. 2**32-1.

What these thingies *mean* depends on your program. They might be
characters, they might be bytes of a graphics file, they might be
indexes, ... Perl mostly doesn't care.

Perl has two ways of storing strings: If all the thingies have numbers
below 256, the string can be stored as one thingy per byte. If this is
not the case, the thingies are encoded in UTF-8. Theoretically you
shouldn't know or care how perl stores a string.

In reality, Perl does assign some meaning to the type of the string. If
a string is utf8-encoded, Perl assumes that the thingies are really
Unicode code points, so "\x{FC}" matches /\w/ if it happens to be a
utf8-encoded string, but doesn't if it's a byte-encoded string (I'm
ignoring locales for now). For this reason the utf8-encoded strings are
often called "character strings" and the byte-encoded strings are called
"byte strings".
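
The two storage formats can be observed with the utf8::upgrade and
utf8::is_utf8 functions. The following is just an illustrative sketch;
the variable names are arbitrary, and what the last two lines print
depends on the perl version and on pragmas such as
"use feature 'unicode_strings'".

```perl
use strict;
use warnings;

my $byte_form = chr(0xFC);       # one thingy, number 252
my $utf8_form = $byte_form;
utf8::upgrade($utf8_form);       # same thingy, now stored as UTF-8

# The strings compare equal; only the internal storage differs.
my $same_string  = $byte_form eq $utf8_form;
my $byte_is_utf8 = utf8::is_utf8($byte_form);
my $utf8_is_utf8 = utf8::is_utf8($utf8_form);

print "equal: ", ($same_string ? "yes" : "no"), "\n";
print "byte form flagged utf8: ", ($byte_is_utf8 ? "yes" : "no"), "\n";
print "utf8 form flagged utf8: ", ($utf8_is_utf8 ? "yes" : "no"), "\n";

# Where the storage format leaks through: matching against \w.
print "byte form matches \\w: ", ($byte_form =~ /\w/ ? "yes" : "no"), "\n";
print "utf8 form matches \\w: ", ($utf8_form =~ /\w/ ? "yes" : "no"), "\n";
```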

Since files consist of bytes, you can always only read bytes from a file
and write bytes to it. So when you read a file and want to treat it as a
series of characters instead of bytes, you have to "decode" it, and when
you have a character string which you want to write to a file, you have
to "encode" it. You can do that with the subs from the "Encode" module
or with I/O layers, and Modules written to deal with specific file
formats (like XML) do that automatically.
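
As a small sketch of the two approaches, the following decodes the same
bytes once with Encode::decode and once by reading them through an
:encoding layer. An in-memory filehandle stands in for a real file here,
which is an assumption made to keep the example self-contained.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xD0\x91\n";    # "Б" plus newline, as stored in a UTF-8 file

# Approach 1: decode explicitly with the Encode module.
my $via_encode = decode('UTF-8', $bytes);

# Approach 2: let an I/O layer decode while reading.
open(my $in, '<:encoding(UTF-8)', \$bytes)
    or die "cannot open in-memory file: $!";
my $via_layer = <$in>;
close($in) or die "close failed: $!";

# Going the other way: encode the characters back into bytes.
my $round_trip = encode('UTF-8', $via_encode);

printf "bytes: %d, characters: %d\n", length($bytes), length($via_encode);
```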


> Now I'm surprised that the following dippy little tag-stripping
> program, which is XML-unaware and has no settings whatever relating to
> encoding, works.
>
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> my ($file, $line, $i);
>
> while (@ARGV) {
>     $file = shift(@ARGV);
>     open(F, "<", $file);
>     $i = 0;
>     while ($line = <F> ) {
>         $i++;
>         chomp($line);
>         $line =~ s!<[^>]+>!*!g;
>         print($file . " > " . $line . "\n");
>         last if ($i > 11);
>     }
>     close(F);
> }
>
>
> When I run this over my UTF-8 XML files, I get correct-looking, mixed
> Cyrillic and Roman output, with no warnings --- why?


Because UTF-8 is designed in such a way that this should work :-).

Your program reads and writes the files as a series of bytes. If your
file contains a Cyrillic character, for example "Б", it will read and
write two bytes (0xD0 0x91) instead. Since that happens both on input
and on output, it doesn't matter. If you treat the individual bytes of a
multibyte character as characters, then your program will break. For
example, if you want to insert a blank before each character and put a

$line =~ s!(.)! $1!g;

in your program it won't work because it converts the byte sequence
0xD0 0x91 into the byte sequence 0x20 0xD0 0x20 0x91, which is not a
proper UTF-8 sequence. You must properly decode your input and encode
your output if you want to do this (or deal with the encoding in your
code).
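
To make the difference concrete, here is a rough sketch that applies
that substitution once to the raw bytes and once to the decoded
characters. The helper is_valid_utf8 is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# True if the octets decode cleanly as UTF-8. (Hypothetical helper.)
sub is_valid_utf8 {
    my ($octets) = @_;
    return eval { decode('UTF-8', $octets, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

my $input = "\xD0\x91";    # the two UTF-8 bytes of "Б"

# Byte-wise: a blank lands between the two bytes of one character.
(my $bytewise = $input) =~ s!(.)! $1!g;

# Character-wise: decode first, substitute, then encode again.
my $chars = decode('UTF-8', $input);
$chars =~ s!(.)! $1!g;
my $charwise = encode('UTF-8', $chars);

printf "byte-wise result valid UTF-8: %s\n",
    is_valid_utf8($bytewise) ? "yes" : "no";
printf "character-wise result valid UTF-8: %s\n",
    is_valid_utf8($charwise) ? "yes" : "no";
```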

hp

Dr.Ruud

2007-06-26, 10:03 pm

Adam Funk schreef:

> I think I get it. String literals and variables just contain strings
> of bytes, and encoding is a consideration only for input and output
> --- or is that only for output?


A Perl text string contains characters. See perlunitut:
http://search.cpan.org/perldoc?perlunitut

--
Affijn, Ruud

"Gewoon is een tijger."
Adam Funk

2007-06-29, 8:03 am

On 2007-06-26, Peter J. Holzer wrote:

>
> Because UTF-8 is designed in such a way that this should work :-).
>
> Your program reads and writes the files as a series of bytes. If your
> file contains a Cyrillic character, for example "Б", it will read and
> write two bytes (0xD0 0x91) instead. Since that happens both on input
> and on output, it doesn't matter. If you treat the individual bytes of a
> multibyte character as characters, then your program will break. For
> example, if you want to insert a blank before each character and put a
>
> $line =~ s!(.)! $1!g;
>
> in your program it won't work because it converts the byte sequence
> 0xD0 0x91 into the byte sequence 0x20 0xD0 0x20 0x91, which is not a
> proper UTF-8 sequence. You must properly decode your input and encode
> your output if you want to do this (or deal with the encoding in your
> code).


I think I'm getting this. Thanks!
Adam Funk

2007-06-29, 8:03 am

On 2007-06-26, Dr.Ruud wrote:

> Adam Funk schreef:
>
>
> A Perl text string contains characters. See perlunitut:
> http://search.cpan.org/perldoc?perlunitut


I think I'm finally figuring this out. Thanks.
Peter J. Holzer

2007-06-29, 7:03 pm

On 2007-06-26 18:06, Dr.Ruud <rvtol+news@isolution.nl> wrote:
> Adam Funk schreef:
>
> A Perl text string contains characters. See perlunitut:
> http://search.cpan.org/perldoc?perlunitut


True, but not the answer to Adam's question. Not every perl string is a
perl text string. Strings can be used to store non-textual information.

hp


Dr.Ruud

2007-06-30, 8:02 am

Peter J. Holzer schreef:
> Dr.Ruud:
>
> True, but not the answer to Adam's question. Not every perl string is
> a perl text string. Strings can be used to store non-textual
> information.


You should read more carefully, I wrote "A Perl *text* string". The
concept is further defined in perlunitut.
Together it is a complete answer to Adam's question.


Peter J. Holzer

2007-06-30, 8:02 am

On 2007-06-30 00:08, Dr.Ruud <rvtol+news@isolution.nl> wrote:
> Peter J. Holzer schreef:
>
>
> You should read more carefully, I wrote "A Perl *text* string".


I did read this. That's why I wrote "Not every perl string is a perl
*text* string" (emphasis added). Adam asked about "String literals and
variables". While some point can be made that string literals are
supposed to always contain text strings, that certainly isn't true about
variables.

> The concept is further defined in perlunitut.


Perlunitut is good reading. If you had just recommended that Adam should
read this, I wouldn't have objected. But your first sentence was IMHO
missing the point and possibly misleading.

hp


Dr.Ruud

2007-06-30, 8:02 am

Peter J. Holzer schreef:
> Dr.Ruud:
>
> I did read this. That's why I wrote "Not every perl string is a perl
> *text* string" (emphasis added).


This is getting ridiculous. I wrote "Perl text string", and you reacted
to something you call "every perl string", which I didn't write. (Adam
is dealing with Perl text strings, or he should be.)

I was not talking about "every perl string", I was specifically
isolating the "Perl text string"-type-of-Perl-string, by explicitly
referring to it as "Perl text string", in an introduction to (so related
to) perlunitut. There was, contrary to what you read into it, nothing
incomplete about it.
Yes it assumes that you actually read perlunitut, which is easy to read
and understand, but why would I have ordered "See perlunitut" otherwise?
Should I maybe have written "Read and follow perlunitut" instead of
"See perlunitut" for you to get the picture?

See also `perldoc Encode`, it defines all strings in Perl as sequences
of characters (and binary strings as just a subset of Perl strings),
which is different from how perlunitut projects it.

