Code Comments
perl 5.8.5

Does a string hold any extra information in addition to its pure characters? I managed to create two strings that are equal according to the 'eq' operator and have equal ord values for all characters, but give different results when fed to the very same subroutine. It seems one of the two strings does not fully know that it is actually Unicode. (length gives the correct result; wrong lengths are usually a first hint that a string does not feel like Unicode.)

I didn't manage to produce a really short example; the whole script is 46 lines and includes CGI.

I read a text (one char per line) from a CGI field (UTF-8) and print out the sorted text. The sorting is supposed to follow the German locale, so I use the locale pragma (which is way faster than Unicode::Collate).

However, the sort order of my output is wrong. I manually included a possible input as a reference in the script, and for that the output is correct. If I enter the reference string in my text field, the output is still wrong, although the two strings are exactly the same according to 'eq' and the hex dump. If I do a chr(ord($_)) on all chars, the result is OK again.

So obviously I am missing something very important about Unicode here. Some extra information is stored somewhere, but I don't know about it.

The example is online at http://www.customers.goldfisch.at/c...unicodetest9.pl

If you enter (one per line, don't forget the last newline after p)

ä
b
ö
a
o
p

in the form, you'll produce the same input as the reference string, but will see different results.

Where am I stuck?
thnx,
peter

---------------------------------------------------------

#!/usr/local/bin/perl -w
use strict;

# step1: prepare for german locales
use POSIX qw(locale_h);
use locale;
setlocale(LC_COLLATE, "de_AT");

# step2: prepare for unicode
binmode(STDOUT, ":utf8");
binmode(STDIN, ":utf8");

# step3: prepare for CGI
use CGI;
my $query = new CGI;
my $charset = 'UTF-8';
$CGI::XHTML = 0;
print $query->header(-charset=>$charset), $query->start_html(-title=>'Unicodetest');
print "cgi-version = ", $CGI::VERSION, "<br><br>\n";

# set reference-string
my $sr = ("\x{00e4}\n\x{0062}\n\x{00f6}\n\x{0061}\n\x{006f}\n\x{0070}\n");

# stepA : get unicode and print it
print "<h4>your input</h4>";
my $si = $query->param('unicode');
$si =~ s/\r//g;
#my $sin='';foreach(0..length($si)-1) {$sin.=chr(ord(substr($si,$_,1)))};$si=$sin;
print_and_sort($si);

# stepB : get reference and print it
print "<h4>reference</h4>";
print "(input and reference are considered equal)<br>" if $si eq $sr;
$sr =~ s/\r//g;
print_and_sort($sr);

# stepC : print text-field and finish CGI
print '<br><br>enter your unicode-testtext here : ',
    $query->start_multipart_form,
    $query->textarea(-name=>'unicode', -rows=>10, -columns=>100), "\n<br>\n",
    $query->submit(-name=>'submit', -value=>'proceed'), "\n",
    $query->endform, "\n";
print $query->end_html;

# sub : get a string, print its ord, split it by its linebreaks and then
# sort the data and print it out
sub print_and_sort {
    my $s = shift;
    print "hexdump : ";
    foreach my $i (0..length($s)-1) {
        print sprintf("%04x", ord(substr($s,$i,1))) . " ";
    }
    print "<br>\n";
    print "<br>sorted:<br>\n";
    my @data = split(/\n/, $s);
    foreach (sort(@data)) {
        print $_;
        print " (length=", length($_), ")";
        print " ";
        foreach my $j (0..length($_)-1) {
            print sprintf("%04x", ord(substr($_,$j,1))) . " ";
        }
        print "<br>\n";
    }
}

--
http://www2.goldfisch.at/know_list
http://leblogsportif.sportnation.at
On Mon, 27 Sep 2004, peter pilsl wrote:

> perl 5.8.5

haven't got that far yet...

> Does a string hold any extra information in addition to its pure characters?
> I managed to create two strings that are equal according to the 'eq' operator
> and have equal ord values for all characters, but give different results when
> fed to the very same subroutine.

This sounds like an FAQ to me.

> It seems one of the two strings does not fully know
> that it is actually Unicode.

At least in the versions of Perl that I've been familiar with, Perl will not upgrade an iso-8859-1 string to Unicode unless it finds some reason to do so. This can result in identical strings appearing to not match.

I don't have the references to hand, but I'm sure it's either an FAQ or in the Unicode tutorials.

Hope this helps a bit.
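The point about Perl not upgrading iso-8859-1 strings can be illustrated with a minimal sketch. It only uses the built-in utf8::upgrade, utf8::downgrade and utf8::is_utf8 functions; the variable names are made up for the illustration, and the sketch shows the two internal representations only, not the locale sorting from the original script.

```perl
use strict;
use warnings;

# The same text twice: "ä", newline, "b".
my $native = "\x{e4}\nb";
utf8::downgrade($native);       # make sure it is stored one byte per character

my $upgraded = $native;
utf8::upgrade($upgraded);       # same characters, now stored internally as UTF-8

# Character-level comparisons cannot tell the two apart ...
my $same_text   = ($native eq $upgraded) ? 1 : 0;
my $same_length = (length($native) == length($upgraded)) ? 1 : 0;
my $same_ords   = (join(',', map { ord } split //, $native)
               eq  join(',', map { ord } split //, $upgraded)) ? 1 : 0;

# ... but the internal UTF8 flag differs.
my $flag_native   = utf8::is_utf8($native)   ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

print "eq=$same_text length=$same_length ords=$same_ords\n";
print "flag(native)=$flag_native flag(upgraded)=$flag_upgraded\n";
```

Code that looks at the internal bytes rather than at the characters may then treat the two strings differently, even though 'eq', length and ord all agree.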
peter pilsl wrote:

> perl 5.8.5
>
> Does a string hold any extra information in addition to its pure characters?
> I managed to create two strings that are equal according to the 'eq' operator
> and have equal ord values for all characters, but give different results when
> fed to the very same subroutine. It seems one of the two strings does
> not fully know that it is actually Unicode. (length gives the correct
> result; wrong lengths are usually a first hint that a string does not
> feel like Unicode.)

I haven't read your code, but you can start with:

perldoc perluniintro
perldoc perlunicode
perldoc Encode

And yes, there are two types of strings in Perl 5.8+: one is_utf8(), the other not.

--- Shawn
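As a rough sketch of how the two kinds of string can be told apart and brought into one form: the helper name to_native below is hypothetical, and the input is only simulated with utf8::upgrade rather than read from CGI. Whether this also corrects the locale-based sort order depends on the character set of the locale in use, which is an assumption not checked here.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the string stored one byte per
# character when every character is below 0x100; otherwise return the
# copy unchanged.
sub to_native {
    my ($s) = @_;
    utf8::downgrade($s, 1);     # second argument: do not die on wide characters
    return $s;
}

my $input = "\x{f6}\na\n";
utf8::upgrade($input);          # simulate a string that arrived with the UTF8 flag set

my $fixed = to_native($input);
my $wide  = to_native("\x{263a}");   # U+263A cannot be downgraded

my $flag_before = utf8::is_utf8($input) ? 1 : 0;
my $flag_after  = utf8::is_utf8($fixed) ? 1 : 0;
my $flag_wide   = utf8::is_utf8($wide)  ? 1 : 0;
my $still_equal = ($fixed eq $input)    ? 1 : 0;

print "before=$flag_before after=$flag_after wide=$flag_wide equal=$still_equal\n";
```

For characters below 0x100 this has much the same effect on the representation as the chr(ord(...)) loop that is commented out in the original script, without rebuilding the string character by character.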