
Author unicode conversion
Nospam

2006-03-20, 7:56 am

I am trying to convert some unicode to their equivalent characters, however
it is not printing out the character

#! perl\bin\perl
use strict;
use warnings;
use utf8;
use Text::Unidecode;

use Data::Dumper;
use WWW::Mechanize;


my $w = print
unidecode(" \x{25163}\x{34920}\x{36275}\x{29699}\x{20026}\x{33258}\x{30001}");
my $u = print
unidecode(" \x{25171}\x{24320}\x{24744}\x{30340}\x{30005}\x{35805}");
my $m = print
unidecode(" \x{20570}\x{37329}\x{38065}\x{35835}\x{20070}\x{30005}\x{23376}\x{37038}\x{20214}");
my $e = print
unidecode(" \x{22238}\x{26469}\x{22312}\x{101}\x{98}\x{97}\x{121}\x{32}\x{22312}\x{24748}\x{28014}\x{20197}\x{21518}");
my $mi = print
unidecode(" \x{20415}\x{23452}\x{30340}\x{109}\x{112}\x{51}\x{115}");
my $q = print
unidecode(" \x{20572}\x{27490}\x{25277}\x{28895}\x{22312}\x{19968}\x{20010}\x{26143}\x{26399}");
my $c = print
unidecode(" \x{25913}\x{36827}\x{24744}\x{30340}\x{39640}\x{23572}\x{22827}\x{29699}\x{22312}\x{20108}\x{20010}\x{26143}\x{26399}");
my $p = print
unidecode(" \x{20196}\x{20154}\x{24778}\x{35766}\x{30340}\x{112}\x{97}\x{121}\x{112}\x{97}\x{108}\x{32}\x{28431}\x{27934}\x{91}\x{47}");
my $t = print unidecode("\x{24863}\x{20852}\x{36259}");
my $s = print unidecode(" \x{30475}\x{19968}\x{30475}\x{22914}\x{19979}");

open (FILE2, "samp.txt");
use constant START => "<FILE2>";


while ()
{my $mech = WWW::Mechanize->new();

my $START = <FILE2>;
$mech->get( $START );



$mech->field('chinese',"$s
chinese $w
chinese $u
chinese $m
chinese $e
chinese $mi
chinese $q
chinese $c
chinese $p");


close(FILE2);
}


robic0

2006-03-20, 7:56 am

On Mon, 20 Mar 2006 05:09:00 GMT, "Nospam" <nospam@home.com> wrote:

>I am trying to convert some unicode to their equivalent characters, however
>it is not printing out the character
>

Won't look through your code.
What do you mean by the phrase:
"trying to convert some unicode to their equivalent characters" ?

Characters are bitmaps, only the "codes" remain the same.
So are you trying to match up bitmaps that look the same?
Or are you actually trying to display code page bitmapped characters?

The last question is the reason I'm not looking at your code.
Brad Baxter

2006-03-20, 6:58 pm

Nospam wrote:
> I am trying to convert some unicode to their equivalent characters, however
> it is not printing out the character
>
> my $w = print
> unidecode(" \x{25163}\x{34920}\x{36275}\x{29699}\x{20026}\x{33258}\x{30001}")
> ;
>


You're setting $w to the return code from print, which is probably 1.
In the process, you're printing the return value from unidecode.

#!/usr/local/bin/perl
use warnings;
use strict;

use Text::Unidecode;

print
unidecode(" \x{25163}\x{34920}\x{36275}\x{29699}\x{20026}\x{33258}\x{30001}");
__END__
[?] [?] [?] [?] [?] [?] [?]


What exactly did you expect that to print?
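
To make the point about print's return value concrete, here is a minimal sketch (the variable names $ret and $text are only for illustration, and the sample string is arbitrary):

```perl
use strict;
use warnings;

# print returns a true value (1) on success; the text itself goes to STDOUT.
my $ret = print "Shou Biao\n";

# To keep the text around, assign the string and print it separately.
my $text = "Shou Biao";

print "print returned: $ret\n";
print "variable holds: $text\n";
```

So interpolating a variable that was assigned from print will normally give you "1", not the string you printed.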

--
Brad

Bart Van der Donck

2006-03-20, 6:58 pm

Nospam wrote:

> I am trying to convert some unicode to their equivalent characters, however
> it is not printing out the character


Many programs only have partial Unicode support.

Example:

perl -we 'use Text::Unidecode; print unidecode("\x{00068}\x{00069}")'
hi

perl -we 'use Text::Unidecode; print unidecode("\x{25163}\x{34920}")'
[?] [?]

It is not a perl problem (if perl version 5.8+), but a problem of the
terminal. The shell probably expects ISO-8859-1 or so.

The program that displays the characters (terminal, editor,
browser,...) must have Unicode support (that is as complete as
possible).

http://www.unicode.org
http://www.unicode.org/charts/
http://www.ayni.com/perldoc/perl5.8...rluniintro.html
http://groups.google.com/group/perl.unicode

--
Bart

Peter J. Holzer

2006-03-20, 6:58 pm

Bart Van der Donck wrote:
> Nospam wrote:
>
[...]
> perl -we 'use Text::Unidecode; print unidecode("\x{25163}\x{34920}")'
> [?] [?]
>
> It is not a perl problem (if perl version 5.8+), but a problem of the
> terminal. The shell probably expects ISO-8859-1 or so.
>
> The program that displays the characters (terminal, editor,
> browser,...) must have Unicode support (that is as complete as
> possible).


Nope. Text::Unidecode transliterates into ASCII. So unless the terminal
in question can't display ASCII, that shouldn't be a problem.

More likely Text::Unidecode doesn't know about those characters.
\x{25163} and \x{34920} are outside of the BMP. I notice all the codes
given by the OP contain only the digits 0-9, so maybe he meant decimal
25163 and 34920, not hexadecimal? Then these would have to be written as
"\x{624B}\x{8868}", which Text::Unidecode transliterates to "Shou Biao "
(which may or may not be correct - I don't know Chinese).
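
A quick way to check the decimal-versus-hex reading, as a sketch using only core Perl (the variable names are made up for this example):

```perl
use strict;
use warnings;

my @decimal = (25163, 34920);

# Convert the decimal code points to the hex form that \x{} expects.
my @hex = map { sprintf "%04X", $_ } @decimal;

# chr() takes the decimal number directly; \x{} takes hex digits.
my $from_chr    = join '', map { chr } @decimal;
my $from_escape = "\x{624B}\x{8868}";

print "hex: @hex\n";
print "same string: ", ($from_chr eq $from_escape ? "yes" : "no"), "\n";

# Read as hex, "25163" is a code point above U+FFFF, i.e. outside the BMP.
printf "\\x{25163} is U+%05X, above U+FFFF: %s\n",
    0x25163, (0x25163 > 0xFFFF ? "yes" : "no");
```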

hp

--
_ | Peter J. Holzer | Löschung von at.usenet.schmankerl?
|_|_) | Symin WSR/LUGA |
| | | hjp@hjp.at | Diskussion derzeit in at.usenet.gruppen
__/ | http://www.hjp.at/ |

2006-03-20, 6:58 pm

Bart Van der Donck <bart@nijlen.com> wrote:

: perl -we 'use Text::Unidecode; print unidecode("\x{25163}\x{34920}")'
: [?] [?]

: It is not a perl problem (if perl version 5.8+), but a problem of the
: terminal. The shell probably expects ISO-8859-1 or so.

: The program that displays the characters (terminal, editor,
: browser,...) must have Unicode support (that is as complete as
: possible).

In principle, yes. However in real life, I also have less-than-perfect
experiences. The original poster's way of saying \x{code point} is
guaranteed to be least frustrating when Chinese data are to be preserved
over IO transitions.

In theory, on an utf8 terminal with locale set to an utf8-enabled status,
perl should print strings containing, e.g., Chinese (CJK) data, directly
to STDOUT without any problem.

Paradoxically, starting my script with the flag -CS, like in the shebang line

#!/usr/bin/perl -CS

breaks utf8 output of Chinese characters to an otherwise perfectly utf8-
transparent console, see my XML::Simple and utf8 woe posting of last w
and try yourself. So the opposite of what the perlrun manpage promises
happens.

I find the off-and-on auto-predictive and authoritarian style in which
Perl seems to treat utf8 data not really transparent and a source of
awful headaches.

In addition, Perl's utf8 support occasionally slows down things significantly;
my latest experience is with bulk quantities of utf8 data (latin, CJK material,
_tons_ of characters with accents and diacritics in one soup).

When I try to segment such a string with approx. 400kB of data into an array
using split(), and my regex contains a single utf8 character then the whole
thing gets terribly slow when being done in utf8 mode. Actually, split()-ing
becomes so slow that I can't use the script for production purposes any more.
If, on the contrary, I treat my 400kB long string as a series of octets, ignoring
character semantics, and let my regex in split() search for two adjacent
octets of a given type, then the whole thing is lightning fast, as usual,
and as expected.

So I think, either Perl's control features of which data are utf8 and which
are not, need a significant overhaul, or Perl's utf8 processing capabilities
need streamlining.

One of the main points of potential conflict is certainly the way in which
regex automata are built, and notably how to define atoms. For me, it would
be fine if a complex Perl script could do all its data processing, IO trans-
fers etc. in pure octet semantics unless instructed otherwise. If I really
really need it I still could say 'the next operation must treat my data by
character semantics'; this would be great, and would help to alleviate many
inconsistencies that make Perl scripts more vulnerable and less portable
which are supposed to run in heterogeneous environments. I frequently
encountered the problem that Perl without any instruction treated my utf8
data correctly on a, e.g. Linux box, including console and file output,
but goofed in WinXP unless additional binmode() instructions were given;
to make things worse, utf8-clean stuff developed on XP failed miserably on
Linux.

I am a linguist, and as such I've always found Perl's natural language
analogies intriguing, and the typical elliptical Perl coding style has
been perfectly intuitive to me; Perl's sometimes awkward behaviour when
confronted with utf8 data has, for the first time ever when dealing with
Perl, raised my eyebrows.

Maybe I am missing some fundamental points here; if so, please let me know.

Oliver.
--
Dr. Oliver Corff e-mail: corff@zedat.fu-berlin.de

2006-03-21, 3:56 am

Nospam <nospam@home.com> wrote:
: I am trying to convert some unicode to their equivalent characters, however
: it is not printing out the character

: my $w = print
: unidecode(" \x{25163}\x{34920}\x{36275}\x{29699}\x{20026}\x{33258}\x{30001}")

You don't need unidecode here. Simply saying chr() instead of \x{} does the
trick. chr() takes decimal numbers as such, while the \x{} notation insists on
being fed hex data.

Your assignments work if you define your strings as:

my $w=chr(25163).chr(34920).......;

print $w will tell you something about wristwatches, soccer and freedom then.
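
Spelled out in full, that could look like the sketch below. The binmode call and the @codes array are additions for this example, not part of the original script; the layer is there so the print does not trigger a "Wide character" warning.

```perl
use strict;
use warnings;

# Send characters to STDOUT as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');

# The decimal code points from the first string in the original post.
my @codes = (25163, 34920, 36275, 29699, 20026, 33258, 30001);

# Build the string first, then print it; do not assign print's return value.
my $w = join '', map { chr } @codes;

print "$w\n";
print length($w), " characters\n";
```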

If you really meant a statement like $w="print chr(65)" to print a letter "A"
by saying

$w;

then this doesn't work either; you really wanted to say:

eval $w;

Oliver.

--
Dr. Oliver Corff e-mail: corff@zedat.fu-berlin.de
robic0

2006-03-21, 10:02 pm

On 20 Mar 2006 21:41:09 GMT, <corff@zedat.fu-berlin.de> wrote:

>Bart Van der Donck <bart@nijlen.com> wrote:
>
>: perl -we 'use Text::Unidecode; print unidecode("\x{25163}\x{34920}")'
>: [?] [?]
>
>: It is not a perl problem (if perl version 5.8+), but a problem of the
>: terminal. The shell probably expects ISO-8859-1 or so.
>
>: The program that displays the characters (terminal, editor,
>: browser,...) must have Unicode support (that is as complete as
>: possible).
>
>In principle, yes. However in real life, I also have less-than-perfect
>experiences. The original poster's way of saying \x{code point} is
>guaranteed to be least frustrating when Chinese data are to be preserved
>over IO transitions.
>
>In theory, on an utf8 terminal with locale set to an utf8-enabled status,
>perl should print strings containing, e.g., Chinese (CJK) data, directly
>to STDOUT without any problem.
>
>Paradoxically, starting my script with the flag -CS, like in the shebang line
>
>#!/usr/bin/perl -CS
>
>breaks utf8 output of Chinese characters to an otherwise perfectly utf8-
>transparent console, see my XML::Simple and utf8 woe posting of last w
>and try yourself. So the opposite of what the perlrun manpage promises
>happens.
>
>I find the off-and-on auto-predictive and authoritarian style in which
>Perl seems to treat utf8 data not really transparent and a source of
>awful headaches.
>
>In addition, Perl's utf8 support occasionally slows down things significantly;
>my latest experience is with bulk quantities of utf8 data (latin, CJK material,
>_tons_ of characters with accents and diacritics in one soup).
>
>When I try to segment such a string with approx. 400kB of data into an array
>using split(), and my regex contains a single utf8 character then the whole
>thing gets terribly slow when being done in utf8 mode. Actually, split()-ing
>becomes so slow that I can't use the script for production purposes any more.
>If, on the contrary, I treat my 400kB long string as a series of octets, ignoring
>character semantics, and let my regex in split() search for two adjacent
>octets of a given type, then the whole thing is lightning fast, as usual,
>and as expected.
>
>So I think, either Perl's control features of which data are utf8 and which
>are not, need a significant overhaul, or Perl's utf8 processing capabilities
>need streamlining.
>

Not a critique of what you are saying, because I have some grey areas, but
an octet is a what? 8-bit binary, and 2 of them side by side is what,
Unicode? What's a multi-byte character then?

>One of the main points of potential conflict is certainly the way in which
>regex automata are built, and notably how to define atoms. For me, it would
>be fine if a complex Perl script could do all its data processing, IO trans-
>fers etc. in pure octet semantics unless instructed otherwise. If I really
>really need it I still could say 'the next operation must treat my data by
>character semantics'; this would be great, and would help to alleviate many
>inconsistencies that make Perl scripts more vulnerable and less portable
>which are supposed to run in heterogeneous environments. I frequently
>encountered the problem that Perl without any instruction treated my utf8
>data correctly on a, e.g. Linux box, including console and file output,
>but goofed in WinXP unless additional binmode() instructions were given;
>to make things worse, utf8-clean stuff developed on XP failed miserably on
>Linux.
>
>I am a linguist, and as such I've always found Perl's natural language
>analogies intriguing, and the typical elliptical Perl coding style has
>been perfectly intuitive to me; Perl's sometimes awkward behaviour when
>confronted with utf8 data has, for the first time ever when dealing with
>Perl, raised my eyebrows.
>
>Maybe I am missing some fundamental points here; if so, please let me know.
>
>Oliver.


What does an "elliptical Perl coding style" mean? Is there anything in what
you say that really means jack shit? You seem to be all over the place
and only you seem to know where you're going.

The OP used a module to do some Unicode translations. Could that be
his problem? Could it be that language is not compatible in character
translation, and *that* could be the problem, ie: translation?
I only say this because you state you are a linguist Doctor (PhD?).

As far as regular expressions go, I think Unicode is represented.
It may be slow, as you say, not sure.
I wrote a pure Perl XML 1.1 parser using regexps. It incorporates Unicode and
fully parses (handlers et al.) roughly 1 megabyte a second, depending on CPU.

Guess I'll show off here, but it's not intended.
Actually I'm trying to sell the code. It's very robust.

sub InitVars
{
%Dflth = (
'hstart' => \&dflt_start,
'hend' => \&dflt_end,
'hchar' => \&dflt_char,
'hcdata' => \&dflt_cdata,
'hcomment' => \&dflt_comment,
'hmeta' => \&dflt_meta,
'hattlist' => \&dflt_attlist,
'hentity' => \&dflt_entity,
'hdoctype' => \&dflt_doctype,
'helement' => \&dflt_element,
'hxmldecl' => \&dflt_xmldecl,
'hproc' => \&dflt_proc,
);

@UC_Nstart = (
"\\x{C0}-\\x{D6}",
"\\x{D8}-\\x{F6}",
"\\x{F8}-\\x{2FF}",
"\\x{370}-\\x{37D}",
"\\x{37F}-\\x{1FFF}",
"\\x{200C}-\\x{200D}",
"\\x{2070}-\\x{218F}",
"\\x{2C00}-\\x{2FEF}",
"\\x{3001}-\\x{D7FF}",
"\\x{F900}-\\x{FDCF}",
"\\x{FDF0}-\\x{FFFD}",
"\\x{10000}-\\x{EFFFF}",
);
@UC_Nchar = (
"\\x{B7}",
"\\x{0300}-\\x{036F}",
"\\x{203F}-\\x{2040}",
);
$Nstrt = "[A-Za-z_:".join ('',@UC_Nstart)."]";
$Nchar = "[-\\w:\\.".join ('',@UC_Nchar).join ('',@UC_Nstart)."]";
$Name = "(?:$Nstrt$Nchar*?)";
#die "$Name\n";

$RxParse =
qr/(?:<(?:(?:(\/*)($Name)\s*(\/*))|(?:META(.*?))|(?:($Name)((?:\s+$Name\s*=\s*["'][^<]*['"])+)\s*(\/*))|(?:\?(.*?)\?)|(?:!(?:(?:DOCTYPE(.*?))|(?:\[CDATA\[(.*?)\]\])|(?:--(.*?[^-])--)|(?:ATTLIST(.*?))|(?:ENTITY(.*?)))))> )|(.+?)/s;
# ( <( ( 1 12 2 3 3)|( 4 4)|( 5 56( ) 6 7 7)|( 8 8 )|( !( ( 9 9)|( 0 0 )|( 1 1 )|( 2 2)|( 3 3))))> )|4 4


Nospam

2006-03-22, 7:02 pm


<corff@zedat.fu-berlin.de> wrote in message
news:489p9eFj1mbiU1@uni-berlin.de...
> Nospam <nospam@home.com> wrote:
> : I am trying to convert some unicode to their equivalent characters, however
> : it is not printing out the character
>
> : my $w = print
> : unidecode(" \x{25163}\x{34920}\x{36275}\x{29699}\x{20026}\x{33258}\x{30001}")
>
> You don't unidecode here. Simply say chr() instead of \x{} does the trick.
> chr() takes decimal numbers as such, while the \x{} notation insists in
> being fed with hex data.
>
> Your assignments work if you define your strings as:
>
> my $w=chr(25163).chr(34920).......;
>
> print $w will tell you something about wristwatches, soccer and freedom then.
>
> If you really meant a statement like $w="print chr(65)" to print a letter "A"
> by saying
>
> $w;
>
> then this doesn't work either; you really wanted to say:
>
> eval $w;
>

This may sound like a silly question, but I am not well versed with eval.
If I were to use the character represented by $w within quotes, would
something like this work:

my $w=chr(25163).chr(34920).......;

eval $w;

print "In chinese $w";


2006-03-22, 7:02 pm

Nospam <nospam@home.com> wrote:

: This may sound like a silly question, but I am not well versed with eval,
: if I was to use the character represented with $w within quotes, would
: something like this work:

: my $w=chr(25163).chr(34920).......;

: eval $w;

: print "In chinese $w";

No. You'd say
my $w=chr(25163.chr(...)....;
print $w;
# or:
print "Chinese: $w\n";

Using eval is not meaningful here. Look again at the former statement:
You said - my $w="print chr(25163)...". Now, if you print $w, like
print $w;
it will show - what? Right, it will show: "print chr(25163)..." (without
the quotation marks, of course).
That's certainly not what you want; you want the result of that statement,
not the statement itself. Hence you have to _eval_uate the statement,
meaning: treat the contents of this variable not as passive data to be
printed or otherwise manipulated, treat this data as piece of _code_.
The code will be executed if you say:
eval $w;
which would be equivalent to saying
print chr(25163);

The eval operator can be quite useful if you have code which should be
input at runtime; think of your mini shell interpreter realized in Perl,
rather than in bash, e.g.
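
The difference between the code text and its result can be seen in a small sketch like this one ($code and $result are names invented for the example):

```perl
use strict;
use warnings;

# A string that happens to contain Perl code; it is just data until evaluated.
my $code = 'chr(65) . chr(66)';

# String eval compiles and runs the text and returns the value of its last
# expression; $@ is set if the code fails to compile or dies.
my $result = eval $code;
die "eval failed: $@" if $@;

print "code text: $code\n";
print "evaluated: $result\n";
```

For plain data such as a string built from chr() calls, no eval is involved at all; it only matters when the variable holds code.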

Oliver.

--
Dr. Oliver Corff e-mail: corff@zedat.fu-berlin.de

2006-03-23, 8:02 am

corff@zedat.fu-berlin.de wrote:

: No. You'd say
: my $w=chr(25163.chr(...)....;
Should read:
: my $w=chr(25163).chr(...)....;

Please accept my apologies for the typo.

Oliver.
--
Dr. Oliver Corff e-mail: corff@zedat.fu-berlin.de
Donald King

2006-03-23, 7:02 pm

corff@zedat.fu-berlin.de wrote:
[...]
>
> Paradoxically, starting my script with the flag -CS, like in the
> shebang line
>
> #!/usr/bin/perl -CS
>
> breaks utf8 output of Chinese characters to an otherwise perfectly
> utf8- transparent console, see my XML::Simple and utf8 woe posting of
> last w and try yourself. So the opposite of what the perlrun
> manpage promises happens.
>


Works fine for me with this example script:

#!/usr/bin/perl -CS
use strict;
use warnings;
use Text::Unidecode;

my $str = "\x{624B}\x{8868}";
print "$str\n";
print unidecode($str), "\n";

# As-is, prints:
# 手表
# Shou Biao
# Without the -CS, prints the following:
# Wide character in print at unitest.pl line 7.
# 手表
# Shou Biao

As I explained in the other thread, what's probably happening is that,
without -CS, your data is being read in by Perl as octets, then printed
out as octets; however, under -CS your data is still read as octets
(since it's not one of the STDFOO handles that's affected by -CS) yet
printed to a UTF8-aware filehandle (which assumes that your octets are
actually ISO-8859-1).
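
Here is a rough illustration of that effect using the core Encode module. It is only a sketch of the mechanism, not a reproduction of the script in question, and the variable names are invented:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars  = "\x{624B}\x{8868}";        # 2 characters
my $octets = encode('UTF-8', $chars);   # the same text as 6 UTF-8 octets

# If those octets are mistaken for ISO-8859-1 characters and encoded again
# (which is what a UTF-8 output layer does to undecoded input), every high
# octet becomes two octets.
my $double = encode('UTF-8', $octets);

# Decoding the input first restores the original characters.
my $decoded = decode('UTF-8', $octets);

printf "chars: %d, octets: %d, double-encoded: %d\n",
    length $chars, length $octets, length $double;
print "round trip ok\n" if $decoded eq $chars;
```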

> I find the off-and-on auto-predictive and authoritarian style in
> which Perl seems to treat utf8 data not really transparent and a
> source of awful headaches.
>


Most of that's for backwards compatibility with pre-Unicode versions of
Perl. In Perl 5.6, you used the "use utf8" and "use bytes" pragmata to
treat *all* strings as chars or octets in a given block. Having each
string remember is a blessing compared to that.

> In addition, Perl's utf8 support occasionally slows down things
> significantly; my latest experience is with bulk quantities of utf8
> data (latin, CJK material, _tons_ of characters with accents and
> diacritics in one soup).
>
> When I try to segment such a string with approx. 400kB of data into
> an array using split(), and my regex contains a single utf8 character
> then the whole thing gets terribly slow when being done in utf8
> mode. Actually, split()-ing becomes so slow that I can't use the
> script for production purposes any more. If, on the contrary, I treat my
> 400kB long string as series of octets, ignoring character semantics,
> and let my regex in split() search for two adjacent octets of a
> given type, then the whole thing is lightning fast, as usual, and as
> expected.
>


What version of Perl are you using? I'm using Perl 5.8.8 on Debian
testing, and I don't see the slowdown you're having. I wrote a simple
benchmark that generates a string of over 1 million Unicode characters
(from the U+2400 block, so they're 3 octets each) and does various
string ops on it, such as m//g, s///g, and split. Using utf8::encode()
to create the equivalent UTF-8 byte string, I compared char-vs-byte
performance and it was within a few percent (with character-oriented ops
just a hair slower than byte-oriented).

chronos@isis:~/temp$ ./unicode-benchmark.pl -c
Using characters
Creation: 1.687 seconds
length = 1250000
␙␋␃␎␣␂␝␐␐␣␗␜␏
␣␒␀␚␔␣...
Match One: 0.000 seconds
Match All: 0.131 seconds
Split: 0.264 seconds
s///g: 0.054 seconds

chronos@isis:~/temp$ ./unicode-benchmark.pl -b
Using bytes
Creation: 1.675 seconds
length = 3750000
E2 90 99 E2 90 8B E2 90 83 E2 90 8E E2 90 A3 E2 90 82 E2 90...
Match One: 0.000 seconds
Match All: 0.121 seconds
Split: 0.246 seconds
s///g: 0.042 seconds

The benchmark program is available at
<http://chronos-tachyon.net/~chronos...de-benchmark.pl>.
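
For anyone who wants to try a comparison of this kind without the linked program, something along these lines should do. This is a much smaller sketch and not the benchmark above; the string size, the separator character and all names are assumptions made for the example.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

my $n = 200_000;

# Random characters from the U+2400 block, three octets each in UTF-8.
my $chars = join '', map { chr(0x2400 + int rand 0x24) } 1 .. $n;

# The same data as a UTF-8 byte string.
my $bytes = $chars;
utf8::encode($bytes);

# Separator in character form and in byte form.
my $sep_char  = "\x{2423}";
my $sep_bytes = $sep_char;
utf8::encode($sep_bytes);

for my $case (['characters', $chars, $sep_char], ['bytes', $bytes, $sep_bytes]) {
    my ($label, $str, $sep) = @$case;
    my $t0    = time();
    my @parts = split /$sep/, $str;
    printf "%-10s length=%d pieces=%d split=%.3f s\n",
        $label, length $str, scalar @parts, time() - $t0;
}
```

Timings will vary with Perl version and machine, so treat any single run with caution.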

> So I think, either Perl's control features of which data are utf8 and
> which are not, need a significant overhaul, or Perl's utf8
> processing capabilities need streamlining.
>


If you're not using Perl 5.8, the biggest selling point of the whole 5.8
series is that UTF-8 support has been overhauled and streamlined
compared to 5.6. Also, there have been many bugfixes and Unicode
optimizations since the early 5.8's, so if you're not using it already,
you might try your problem code on the newest 5.8 release.

> One of the main points of potential conflict is certainly the way in
> which regex automata are built, and notably how to define atoms. For
> me, it would be fine if a complex Perl script could do all its data
> processing, IO trans- fers etc. in pure octet semantics unless
> instructed otherwise.


That's basically how things worked in 5.6, except that instead of giving
you an option, all regexps had octet semantics, period. Octets by
default drove people nuts, hence 5.8.

The "use bytes" pragma almost but not quite does what you ask for;
unfortunately, it doesn't affect regexps. Other than calling
utf8::encode() and utf8::decode() liberally by hand, I don't think Perl
is currently capable of what you ask.
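
Doing it by hand might look like the following sketch, which encodes a copy of the string, runs an octet-level regex on it, and decodes it again. The sample text and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "caf\x{E9} \x{624B}\x{8868}";   # 7 characters

# Work on a copy as UTF-8 octets.
my $octets = $text;
utf8::encode($octets);

# Octet-level regex: count the octets that have the high bit set.
my $high = () = $octets =~ /[\x80-\xFF]/g;

# Go back to characters when character semantics are needed again.
my $back = $octets;
utf8::decode($back) or die "not valid UTF-8";

printf "characters: %d, octets: %d, high octets: %d\n",
    length $text, length $octets, $high;
print "round trip ok\n" if $back eq $text;
```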

[...]
> I frequently encountered the problem that Perl without any
> instruction treated my utf8 data correctly on a, e.g. Linux box,
> including console and file output, but goofed in WinXP unless
> additional binmode() instructions were given; to make things worse,
> utf8-clean stuff developed on XP failed miserably on Linux.


Strange. I never had much trouble with "binmode(HANDLE, ':utf8');" on
either OS, which is the official way of doing any UTF-8 I/O in modern
Perl. (However, that was Cygwin Perl, not ActivePerl. I don't think
I've ever tried Unicode under ActivePerl, so YMMV.)

Note that there are a lot of situations (esp. under Unix) where a buggy
Perl program can still end up spitting out valid UTF-8. Unless the I/O
handles involved have been marked as :utf8, 8-bit octet strings are
output literally, Unicode strings are output as UTF-8, and all input is
treated as octets. That can result in some very strange and corrupt
output if the program mixes octets with Unicode -- especially since "octet"
is a synonym for "ISO-8859-1" as far as Perl is concerned -- but
depending on the program and circumstances, just because the output is
valid UTF-8 doesn't mean the program's working correctly.

Hope I've helped more than .

--
Donald King, a.k.a. Chronos Tachyon
http://chronos-tachyon.net/

2006-03-24, 7:05 pm

Donald King <dlking@cpan.org> wrote:

[lots of thoughtful insight, and a benchmark as well]

: Hope I've helped more than .

No, not at all. I'll take your suggestions and collect all my utf8
experiences into a little memo, with minimal examples, making my
point, and I'll include my original program trying to split 400kB
worth of data which went so slow when being told to use utf8.
I'll also include some observations from different platforms (FC3,
XP) so as to make my observations a little bit less random.

Please give me a few hours before I can do that, though.

Best regards and many thanks,

Oliver.

--
Dr. Oliver Corff e-mail: corff@zedat.fu-berlin.de
Donald King

2006-03-25, 9:59 pm

Nospam wrote:
> "Donald King" <dlking@cpan.org> wrote in message
> news:FjBUf.847$wC1.824@dukeread01...
>
>
>
> Each time I run your example with the -CS flag I still receive
> "Wide character in print" I am using activeperl 5.8 could this be a factor?
>
>


As I just mentioned in the "XML::Simple and utf8 woes" thread, Perl
doesn't pick up the -C flag from the shebang line when you run "perl
script.pl" (the way it does for -w/-T). Since under ActivePerl the
shebang line goes 100% unused by Windows, -CS in your shebang line does
*nothing*.

The following code has the same effect as the -CS flag:

BEGIN {
binmode($_, ':utf8') foreach(*STDIN, *STDOUT, *STDERR);
}
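
To see whether a handle actually ended up with a UTF-8 layer, a check like this may help. It is a minimal sketch that uses PerlIO::get_layers and the stricter :encoding(UTF-8) layer instead of :utf8, which is a choice made for this example:

```perl
use strict;
use warnings;

# Put an encoding layer on STDOUT explicitly, independent of any -C flag.
binmode(STDOUT, ':encoding(UTF-8)');

# List the layers currently active on the handle.
my @layers = PerlIO::get_layers(STDOUT);
print "STDOUT layers: @layers\n";

# With the layer in place this prints without a "Wide character" warning.
print "\x{624B}\x{8868}\n";
```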

--
Donald King, a.k.a. Chronos Tachyon
http://chronos-tachyon.net/

2006-03-26, 3:59 am

Donald King <dlking@cpan.org> wrote:

: As I just mentioned in the "XML::Simple and utf8 woes" thread, Perl
: doesn't pick up the -C flag from the shebang line when you run "perl
: script.pl" (the way it does for -w/-T). Since under ActivePerl the
: shebang line goes 100% unused by Windows, -CS in your shebang line does
: *nothing*.

Hi Donald,

I am _not_ using ActivePerl on WinXP, I am running Perl on a Fedora C3 box,
and always say ./myscript.pl rather than perl myscript.pl.

: The following code has the same effect as the -CS flag:

: BEGIN {
: binmode($_, ':utf8') foreach(*STDIN, *STDOUT, *STDERR);
: }

I'll try, nonetheless.

Oliver.
--
Dr. Oliver Corff e-mail: corff@zedat.fu-berlin.de
Brad Baxter

2006-03-27, 7:00 pm

Peter J. Holzer wrote:
....
>
> More likely Text::Unidecode doesn't know about those characters.
> \x{25163} and \x{34920} are outside of the BMP. I notice all the codes
> given by the OP contain only the digits 0-9, so maybe he meant decimal
> 25163 and 34920, not hexadecimal? Then these would have to be written as
> "\x{624B}\x{8868}", which Text::Unidecode transliterates to "Shou Biao "
> (which may or may not be correct - I don't know Chinese).


Good observation.

#!/usr/local/bin/perl
use warnings;
use strict;

use Text::Unidecode;

print
unidecode(chr(25163).chr(34920).chr(36275).chr(29699).chr(20026).chr(33258).chr(30001));
__END__
Shou Biao Zu Qiu Wei Zi You


Regards,

--
Brad
