How to NOT use utf8.
| pkaluski 2005-02-18, 8:56 pm |
| Hi,
I am under the impression that my perl scripts crash due to some
incompatibilities with utf8.
I was looking in the documentation and I found out (correct me if I am
wrong) that perl 5.8.0 and higher use utf8 by default. I have also found
several hints on how to make Unicode and non-Unicode strings work together.
My question is: Is it possible (on Windows) to tell perl not to use
Unicode at all? Not just in the current lexical scope or the current module.
Just not to use Unicode at all. Only 8-bit character strings.
Is it possible?
--
Piotr Kaluski
"It is the commitment of the individuals to excellence,
their mastery of the tools of their crafts, and their
ability to work together that makes the product, not rules."
("Testing Computer Software" by Cem Kaner, Jack Falk, Hung Quoc Nguyen)
| |
| Anno Siegel 2005-02-19, 8:57 pm |
| pkaluski <pkaluski@piotrkaluski.com> wrote in comp.lang.perl.misc:
> Hi,
> I am under the impression that my perl scripts crash due to some
> incompatibilities with utf8.
> I was looking in the documentation and I found out (correct me if I am
> wrong) that perl 5.8.0 and higher use utf8 by default. I have also found
> several hints on how to make unicode and non-unicode strings work together.
> My question is: Is it possible (on Windows) to tell perl not to use
> unicode at all. Not in the current lexical scope, in the current module.
> Just not to use unicode at all. Only 8-bit character strings.
>
> Is it possible?
You can find the answer in "perldoc perlrun". See also perlunicode and
perluniintro.
Anno
| |
| Alan J. Flavell 2005-02-19, 8:57 pm |
| On Fri, 18 Feb 2005, pkaluski wrote:
> I am under the impression that my perl scripts crash due to some
> incompatibilities with utf8.
Maybe the best move would be to discuss the problems and try to move
forward, rather than asking how to move backwards...
> I was looking in the documentation and I found out (correct me if I
> am wrong) that perl 5.8.0 and higher use utf8 by default.
A somewhat over-simplified assessment of the new features, I must say.
> I have also found several hints on how to make unicode and
> non-unicode strings work together.
They already do (work together) by and large, if you follow the
documentation and guidelines.
> My question is: Is it possible (on Windows)
Where actually does Windows come into this, specifically?
We've had some discussions here in which Windows's native
utf-16 storage format has caused problems. But, for the most part,
Perl aims to be platform-neutral, so maybe - if we discussed the
details clearly enough - the Windows-specific issues could be factored
out and the *real* problem (whatever it might be) could be solved.
> to tell perl not to use unicode at all. Not in the current lexical
> scope, in the current module. Just not to use unicode at all. Only
> 8-bit character strings.
>
> Is it possible?
Sure it's "possible", though only to those who read the documentation.
But my question would be: is this the optimal approach? Since you
don't seem to know what Perl offers, and we don't know what problem
you are perceiving, it seems to me to be too soon to jump to the
conclusion that you can't use what it is that Perl is offering you.
| |
| pkaluski 2005-02-20, 3:56 pm |
| Anno Siegel wrote:
> pkaluski <pkaluski@piotrkaluski.com> wrote in comp.lang.perl.misc:
>
>
>
> You can find the answer in "perldoc perlrun". See also perlunicode and
> perluniintro.
>
> Anno
Oops! I was so excited looking for an answer that I forgot to check in
perlrun :). Thanks!
| |
| pkaluski 2005-02-21, 3:59 pm |
| Alan J. Flavell wrote:
> Sure it's "possible", though only to those who read the documentation.
Alan, don't you think you are being kind of impolite in writing the text above?
You can find virtually everything in the documentation. So you can
basically find a reason to criticize people for most of their posts
"because it is in the documentation". But some issues are not that
obvious and it takes time to find an answer to simple questions. At the
same time, it takes 30 seconds to write an answer once you know it.
I do not consider the utf8 topic an obvious one.
"use bytes" was described as working only in the current lexical scope.
So I asked a simple question, which I did not find clearly answered in the
documentation: "Is it possible to switch UTF8 totally off in perl on
Windows?" Was this question so difficult to understand?
| |
| phaylon 2005-02-21, 8:58 pm |
| pkaluski wrote:
> You can find virtually everything in the documentation.
Yep. I tried out google'ing for «turn off utf8 +in perl» and, seeing
that perlrun is listed (after the utf8 docs), I would look in them. Maybe
your research wasn't the best.
--
http://www.dunkelheit.at/
bellum omnium pater.
| |
| Villy Kruse 2005-02-22, 8:57 am |
| On Sun, 20 Feb 2005 18:10:20 +0100,
pkaluski <pkaluski@piotrkaluski.com> wrote:
> Oops! I was so excited looking for an answer that I forgot to check in
> perlrun :). Thanks!
>
There were some changes in 5.8.1 that might be interesting in this context.
Check the -C option.
Villy
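
A minimal sketch of how to inspect the -C setting mentioned above, assuming perl 5.8.2 or later, where perlvar documents the read-only variable ${^UNICODE}. The helper name unicode_switch_state is made up for illustration. Per perlrun, before 5.8.1 the -C switch was a Win32-only option for wide system calls, and from 5.8.1 on, -C0 (or PERL_UNICODE=0) explicitly disables the Unicode I/O features it controls. Note that this affects I/O layers only; it does not change how perl represents strings internally.

```perl
use strict;
use warnings;

# ${^UNICODE} reflects the numeric -C / PERL_UNICODE setting chosen at startup.
# 0 means none of the Unicode I/O features controlled by -C were requested.
# unicode_switch_state is a hypothetical helper, not part of any module.
sub unicode_switch_state {
    my $flags = ${^UNICODE} || 0;
    return $flags == 0 ? "off" : sprintf("on (flags=%d)", $flags);
}

print "-C / PERL_UNICODE: ", unicode_switch_state(), "\n";
```

Running the same script as "perl -C0 script.pl" and "perl -CSD script.pl" should show the difference between the two settings.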
| |
| Alan J. Flavell 2005-02-22, 8:58 pm |
| On Mon, 21 Feb 2005, pkaluski wrote:
> Alan J. Flavell wrote:
>
> Alan, don't you think you are kind of impolite writing the text
> above?
You could be right: it's hard to tell. I don't know you - I only
responded on the basis of the amount of information you had provided,
and the fact that you seemed to have pre-decided the solution before
even making it clear to us what the problem was.
Usenet tends to be like that.
> I do not consider utf8 topic an obvious one.
Neither do I, which is why I'd hoped for more detail so that the
real problem could be understood...
> "use bytes" was described as working only in the current lexical
> scope.
Well, after all, if you're calling a module which is designed to use
the unicode facilities of Perl 5.8+, then I hardly think that module
is likely to be amused when it finds that you've disabled its ability
to do what it was designed to do.
> So I asked a simple question,
Actually no, you asked what is really a very complicated question -
especially considering that it was almost entirely lacking any context
in terms of problem domain, circumstances, external modules called,
etcetera etcetera etcetera.
If you're processing text, then you *need* to know what encoding has
been used. If you're processing binary data, then you shouldn't be
treating it as text. That's been my attitude since, well, around 1965
I suppose it was, when I first grasped the difference, although I'd
been doing it - in a sense - without realising the point, since I met
my first computer in 1958.
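
The text-versus-binary distinction above can be sketched with the core Encode module. This is only an illustration, assuming the input is known to be UTF-8; the sample string is invented. Bytes are decoded once at the program boundary, handled as characters inside, and encoded again on the way out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes as they might arrive from a file or socket: "cafe" with an
# accented e, encoded as UTF-8 (the accented letter takes two bytes).
my $bytes = "caf\xC3\xA9";

# Decode at the boundary, naming the encoding explicitly.
my $text = decode('UTF-8', $bytes);

# length() counts bytes for the raw data and characters for the decoded text.
printf "bytes: %d, characters: %d\n", length($bytes), length($text);

# Encode again when the text leaves the program.
my $out = encode('UTF-8', $text);
print "round trip ok\n" if $out eq $bytes;
```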
> "Is it possible to switch UTF8 totally off in perl on Windows?"
I asked you before why you thought that Windows was somehow relevant
to this question, but you still have not supplied any answer to that.
> Was this question so difficult to understand?
Yes, it was, IMHO. That's why I asked you several supplementary
questions, to help in understanding the problem in its context - but
which you have chosen - it seems - to ignore.
good luck
| |
| pkaluski 2005-02-26, 3:57 am |
| Alan J. Flavell wrote:
>
> (...) I'd hoped for more detail so that the
> real problem could be understood...
>
> (...) you asked what is really a very complicated question -
> especially considering that it was almost entirely lacking any context
> in terms of problem domain, circumstances, external modules called,
> etcetera etcetera etcetera.
>
> If you're processing text, then you *need* to know what encoding has
> been used. If you're processing binary data, then you shouldn't be
> treating it as text. That's been my attitude since, well, around 1965
> I suppose it was, when I first grasped the difference, although I'd
> been doing it - in a sense - without realising the point, since I met
> my first computer in 1958.
>
>
> (...) I asked you several supplementary
> questions, to help in understanding the problem in its context - but
> which you have chosen - it seems - to ignore.
>
> good luck
OK. I can now provide you with some details.
I did not include details in my first post because the problem initially
occurred in a big script which I couldn't post: it was too big and used
too many modules. I had some indications that my problems were due to Unicode. So
my thought was: "OK, the easiest way would be to make perl work as if there were
no such thing as Unicode". And that was my question: is it possible to make
perl totally Unicode-unaware? Since my script is supposed to run under Windows, I
added the Windows part to my question in case there is something system-specific.
Now I can provide you with some details, since I managed to isolate the problem
and recreate it in a smaller script.
The problem was that Carp::cluck was crashing my script. Crashing in a nasty,
uncontrolled way, so Windows was killing it. What was more interesting, it
happened only when running my script under the debugger (which is also
scary: if something fails under the debugger and works without it, that could be an
indication that something is terribly screwed up).
When I tried to pin down the problem, I found that one of the regular expressions
in the Carp::format_arg function, called by cluck, jumps to another chunk of code. See
below (I've attached a call stack):
DB<2>Carp::caller_info(C:/Perl/lib/Carp/Heavy.pm:62):
62: $arg =~ s/([[:cntrl:]]|[[:^ascii:]])/sprintf("\\x{%x}",ord($1))/eg;
DB<2> s
utf8::SWASHNEW(C:/Perl/lib/utf8_heavy.pl:21):
21: my ($class, $type, $list, $minbits, $none) = @_;
DB<3> T
$ = utf8::SWASHNEW('utf8', '', '# comment^J+utf8::IsCntrl^J', 1, 0) called from file `C:/Perl/lib/Carp/Heavy.pm' line 62
@ = Carp::format_arg('After value1') called from file `C:/Perl/lib/Carp/Heavy.pm' line 31
@ = Carp::caller_info(3) called from file `C:/Perl/lib/Carp/Heavy.pm' line 142
@ = Carp::ret_backtrace(2, 'After value1') called from file `C:/Perl/lib/Carp/Heavy.pm' line 125
@ = Carp::longmess_heavy('After value1') called from file `C:/Perl/lib/Carp.pm' line 235
@ = Carp::longmess('After value1') called from file `C:/Perl/lib/Carp.pm' line 272
.. = Carp::cluck('After value1') called from file `test2.pl' line 11
DB<12>
See? Stepping into the substitution operator takes me into the utf8 module. And when stepping
further I was getting messages about malformed UTF-8.
BTW, the comment in the Carp::format_arg function says:
(Carp/Heavy.pm)
59 # The following handling of "control chars" is direct from
60 # the original code - I think it is broken on Unicode though.
61 # Suggestions?
62 $arg =~ s/([[:cntrl:]]|[[:^ascii:]])/sprintf("\\x{%x}",ord($1))/eg;
So the author suggests that there may be problems with Unicode, and he seems
to be right.
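
To make clear what that substitution is meant to do, here is a small sketch that applies the same expression to two sample strings. The wrapper name escape_arg and the inputs are made up for illustration; only the regular expression is taken from Carp/Heavy.pm line 62 as quoted above. It replaces each control or non-ASCII character with a \x{...} escape. The debugger trace suggests that in this perl, matching those character classes against a UTF-8 flagged string is what pulls in utf8::SWASHNEW.

```perl
use strict;
use warnings;

# escape_arg is a hypothetical wrapper around the substitution quoted from
# Carp/Heavy.pm line 62; it works on a copy and returns the escaped string.
sub escape_arg {
    my ($arg) = @_;
    $arg =~ s/([[:cntrl:]]|[[:^ascii:]])/sprintf("\\x{%x}", ord($1))/eg;
    return $arg;
}

print escape_arg("tab\there"), "\n";   # the tab is a control character
print escape_arg("caf\x{e9}"), "\n";   # 0xE9 is outside the ASCII range
```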
The code snippet below makes perl crash (at least for me)
--- CODE STARTS ---
use strict;
use XML::Simple;
use Carp qw( cluck );
cluck "Before";
my $str = XMLin( "input.xml" );
my $msg = "After " . $str->{ 'tag1' }->{ 'attr1' };
cluck $msg;
--- CODE ENDS ---
The input.xml file is simple:
--- INPUT.XML STARTS ----
<opt>
<tag1 attr1="value1"/>
</opt>
--- INPUT.XML ENDS ----
In order to see the crash, you have to run perl under the debugger. Like this:
##########################
M:\temp\unicode>perl -d test2.pl
Loading DB routines from perl5db.pl version 1.28
Editor support available.
Enter h or `h h' for help, or `perldoc perldebug' for more help.
main::(test2.pl:6): cluck "Before";
DB<1> c
Before at test2.pl line 6
at test2.pl line 6
M:\temp\unicode>
###############################
It didn't make it to the end. It crashed.
If I remove the UTF-8 flag from $msg, it works:
--- CODE STARTS ---
use strict;
use XML::Simple;
use Carp qw( cluck );
cluck "Before";
my $str = XMLin( "input.xml" );
my $msg = "After " . $str->{ 'tag1' }->{ 'attr1' };
require Encode;
Encode::_utf8_off( $msg );
cluck $msg;
--- CODE ENDS ---
Of course I have tried all of this with PERLIO=:bytes.
After these experiments I think I can make my first question clearer (I hope):
Can you make perl totally unaware of such a thing as Unicode?
And I believe that the answer is: you can't. Perl has Unicode support in its
guts. The only things you can manipulate are:
* You can make perl treat Unicode as bytes during reading and writing (by
PERLIO and some pragmas)
* You can reset the UTF-8 flag in a string.
But if you are about to write something bigger, using many modules, then Alan is
right: it is more efficient to adjust your code to Unicode instead of avoiding it.
In order to avoid it you would have to control each string produced by any
module and downgrade it to bytes. This approach is infeasible even for medium-sized
projects.
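
The per-string downgrade described above can be sketched with the built-in utf8:: utility functions, which are available without "use utf8". This is only an illustration: utf8::upgrade stands in here for a module that hands back a UTF-8 flagged string, and the variable names are invented. Unlike Encode::_utf8_off, which just clears the flag, utf8::downgrade converts the internal representation, so it stays correct for characters in the 0x80..0xFF range.

```perl
use strict;
use warnings;

my $plain   = "After value1";
my $flagged = "After value1";
utf8::upgrade($flagged);    # stand-in for a string returned by an XML parser

# Same characters, different internal representation.
print "flag set:   ", (utf8::is_utf8($flagged) ? "yes" : "no"), "\n";
print "same value: ", ($plain eq $flagged ? "yes" : "no"), "\n";

# utf8::downgrade converts in place. With a true second argument it returns
# false instead of dying when the string holds characters above 0xFF.
my $ok = utf8::downgrade($flagged, 1);
print "downgraded: ", ($ok ? "yes" : "no"), "\n";
```

Doing this for every string that crosses a module boundary is exactly the bookkeeping the post calls infeasible for larger programs.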
In the scripts above XML::Simple returns Unicode strings (even if Unicode is not
needed and PERLIO=:bytes is set).
Is my reasoning correct?
And what is wrong with this regular expression, used indirectly by cluck, that it
makes perl crash?
| |
| Alan J. Flavell 2005-02-26, 3:57 am |
| On Fri, 25 Feb 2005, pkaluski wrote:
[..]
> (Carp/Heavy.pm)
> 59 # The following handling of "control chars" is direct from
> 60 # the original code - I think it is broken on Unicode though.
> 61 # Suggestions?
> 62 $arg =~ s/([[:cntrl:]]|[[:^ascii:]])/sprintf("\\x{%x}",ord($1))/eg;
>
> So the author suggests that there may be problems with Unicode,
Point noted; and I see this same code and comment in the version that
I'm using on Windows; but...
> use strict;
> use XML::Simple;
> use Carp qw( cluck );
>
> cluck "Before";
>
> my $str = XMLin( "input.xml" );
> my $msg = "After " . $str->{ 'tag1' }->{ 'attr1' };
> cluck $msg;
> --- CODE ENDS ---
>
> The input.xml file is simple:
>
> --- INPUT.XML STARTS ----
> <opt>
> <tag1 attr1="value1"/>
> </opt>
> --- INPUT.XML ENDS ----
>
> In order to see the crash, you have to run perl under the debugger. Like
> this:
>
> M:\temp\unicode>perl -d test2.pl
Sorry, I can't reproduce this behaviour in ActivePerl 5.8.1
on Win2K.
I tried saving the data in DOS format as well as in unix format, just
in case this was a relevant issue; but neither of them caused a
problem:
main::(notutf8.pl:7): cluck "Before";
DB<1> c
Before at notutf8.pl line 7
After value1 at notutf8.pl line 11
Debugged program terminated. Use q to quit or R to restart,
I'm not saying that you haven't got a point; just that I can't
yet reproduce the problem that you're reporting. Any thoughts on
relevant differences?