Code Comments

unicode: equal strings give different results?
perl 5.8.5

Does a string hold any extra information in addition to its pure characters?
I managed to create two strings that are equal according to the 'eq' operator
and have equal ord values for all characters, but give different results when
fed to the very same subroutine. It seems one of the two strings does
not fully know that it is actually Unicode. (length gives the correct
result; wrong lengths are usually a first hint that a string is not
being treated as Unicode.)
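
To make the question concrete, here is a minimal sketch (not taken from the script below) of two strings that agree under 'eq' and in every ord value while differing in Perl's internal UTF-8 flag. The variable names are made up for illustration, and utf8::upgrade is used only as a stand-in for whatever flagged the real input.

```perl
use strict;
use warnings;

my $bytes = "\x{e4}b";      # code points below 0x100: stored one byte each
my $wide  = $bytes;
utf8::upgrade($wide);       # same characters, now stored as UTF-8 internally

# Compare the two strings the way the post does: with eq and by ord values.
my $same_text = ($bytes eq $wide) ? 1 : 0;
my $same_ords = (join(',', map { ord($_) } split //, $bytes)
              eq join(',', map { ord($_) } split //, $wide)) ? 1 : 0;

# The internal flag is the one thing that differs.
my $flag_bytes = utf8::is_utf8($bytes) ? 1 : 0;
my $flag_wide  = utf8::is_utf8($wide)  ? 1 : 0;

print "eq: $same_text, ords equal: $same_ords, ",
      "flags: $flag_bytes vs $flag_wide\n";
```

So the flag is one piece of "extra information" that neither 'eq' nor a hex dump of ord values reveals.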

I didn't manage to produce a really short example; the whole script is
46 lines and uses CGI.

I read a text (one char per line) from a CGI field (UTF-8) and print out
the sorted text. The sorting is supposed to follow German locale rules,
so I use the locale pragma (which is way faster than Unicode::Collate).
However, the sort order of my output is wrong. I included a possible
input in the script as a hard-coded reference string, and for that the
output is correct. If I enter the reference string in my text field the
output is still wrong, even though the two strings are exactly the same
according to 'eq' and the hex dump.
If I do a chr(ord($_)) on all chars the result is OK again.
So obviously I am missing something very important about Unicode here.
Some extra information is stored somewhere, but I don't know about it.
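
One plausible reading of the chr(ord(...)) observation, offered as an assumption rather than a diagnosis confirmed in this thread: rebuilding the string character by character yields a plain byte string, because chr() of a value below 256 does not set the internal UTF-8 flag, and the locale-based sort may then compare the bytes it expects. The sketch below shows that the loop has the same effect as utf8::downgrade. The helper name downgraded_copy is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return a byte-string copy of $s when every character
# fits in one byte; otherwise return an unchanged copy.
sub downgraded_copy {
    my ($s) = @_;
    my $copy = $s;
    utf8::downgrade($copy, 1);   # second argument: do not die on failure
    return $copy;
}

# Stand-in for the form input: Latin-1 range characters, flagged as UTF-8.
my $flagged = "\x{e4}\n\x{62}\n";
utf8::upgrade($flagged);

# The chr(ord(...)) rebuild from the post, one character at a time.
my $by_loop = join '', map { chr(ord($_)) } split //, $flagged;
my $by_call = downgraded_copy($flagged);

printf "input   flagged: %s\n", utf8::is_utf8($flagged) ? "yes" : "no";
printf "by loop flagged: %s\n", utf8::is_utf8($by_loop) ? "yes" : "no";
printf "by call flagged: %s\n", utf8::is_utf8($by_call) ? "yes" : "no";
printf "all three eq:    %s\n",
    ($flagged eq $by_loop && $by_loop eq $by_call) ? "yes" : "no";
```

Whether this is what fixes the sort under de_AT depends on the locale's own encoding, which the post does not state.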


the example is online under

http://www.customers.goldfisch.at/c...unicodetest9.pl

If you enter (one line each, don't forget the last newline after p)
ä
b
ö
a
o
p
in the form, you'll produce the same input as the reference string, but
will see different results.

Where am I stuck?

thnx,
peter

---------------------------------------------------------
#!/usr/local/bin/perl -w
use strict;

# step1: prepare for german locales
use POSIX qw(locale_h);
use locale;
setlocale(LC_COLLATE, "de_AT");

# step2: prepare for unicode
binmode(STDOUT,":utf8");
binmode(STDIN,":utf8");

# step3: prepare for CGI
use CGI;
my $query = new CGI;
my $charset = 'UTF-8';
$CGI::XHTML= 0;
print $query->header(-charset=>$charset),
      $query->start_html(-title=>'Unicodetest');
print "cgi-version = ",$CGI::VERSION,"<br><br>\n";


# set reference-string
my $sr="\x{00e4}\n\x{0062}\n\x{00f6}\n\x{0061}\n\x{006f}\n\x{0070}\n";

# stepA : get unicode and print it
print "<h4>your input</h4>";
my $si=$query->param('unicode');
$si=~s/\r//g;
# uncommenting the next line (rebuild every char via chr(ord())) makes the sort order correct
#my $sin='';foreach(0..length($si)-1){$sin.=chr(ord(substr($si,$_,1)))};$si=$sin;
print_and_sort($si);


# stepB : get reference and print it
print "<h4>reference</h4>";
print "(input and reference are considered equal)<br>" if $si eq $sr;
$sr=~s/\r//g;
print_and_sort($sr);


# stepC : print text-field and finish CGI
print '<br><br>enter your unicode-testtext here : ',
$query->start_multipart_form,
$query->textarea(-name=>'unicode',-rows=>10,-columns=>100),
"\n<br>\n",
$query->submit(-name=>'submit',-value=>'proceed'),"\n",
$query->endform,"\n";
print $query->end_html;

# sub : take a string, print its ord values, split it at its linebreaks,
# then sort the lines and print them out
sub print_and_sort {
    my $s=shift;
    print "hexdump : ";
    foreach my $i (0..length($s)-1) {
        print sprintf ("%04x",ord(substr($s,$i,1)))." ";
    }
    print "<br>\n";
    print "<br>sorted:<br>\n";
    my @data=split(/\n/,$s);
    foreach (sort(@data)) {
        print $_;
        print "  (length=",length($_),")";
        print "  ";
        foreach my $j (0..length($_)-1) {
            print sprintf ("%04x",ord(substr($_,$j,1)))." ";
        }
        print "<br>\n";
    }
}




--
http://www2.goldfisch.at/know_list
http://leblogsportif.sportnation.at

peter pilsl
09-28-04 02:01 AM


Re: unicode: equal strings give different results?
On Mon, 27 Sep 2004, peter pilsl wrote:

> perl 5.8.5

haven't got that far yet...

> Does a string hold any extra information additional to its pure characters?
> I managed to create two strings that are equal to the 'eq'-operator and have
> equal ord-values of all characters, but gives different results if feeded to
> the very same subroutine.

This sounds like an FAQ to me.

> It seems one of the two strings does not know fully
> that its actually unicode.

At least in the versions of Perl that I've been familiar with, Perl
will not upgrade an iso-8859-1 string to Unicode unless it finds some
reason to do so.  This can result in identical strings appearing to
not match.  I don't have the references to hand, but I'm sure it's
either an FAQ or in the unicode tutorials.
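
As a rough illustration of that point (a sketch, not something from the FAQ or the tutorials): a Latin-1 range string stays a byte string until something forces Perl to widen it, such as concatenation with a character above 0xFF. All variable names are invented for the example.

```perl
use strict;
use warnings;

my $latin1 = "\x{e4}";                 # one character, held as a single byte
my $before = utf8::is_utf8($latin1) ? 1 : 0;

# U+263A does not fit in a byte, so the result has to be stored as UTF-8.
my $mixed = $latin1 . "\x{263a}";
my $after = utf8::is_utf8($mixed) ? 1 : 0;

# Cutting the first character back out keeps the upgraded storage.
my $first      = substr($mixed, 0, 1);
my $first_flag = utf8::is_utf8($first) ? 1 : 0;
my $still_eq   = ($first eq $latin1) ? 1 : 0;

print "before=$before after=$after first_flag=$first_flag eq=$still_eq\n";
```

Here $first and $latin1 hold the same character and compare equal, yet only one of them carries the flag. Code that looks at the underlying bytes can therefore treat them differently.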

Hope this helps a bit.

Alan J. Flavell
09-28-04 02:01 AM


Re: unicode: equal strings give different results?
peter pilsl wrote:
>
> perl 5.8.5
>
> Does a string hold any extra information additional to its pure characters?
> I managed to create two strings that are equal to the 'eq'-operator and
> have equal ord-values of all characters, but gives different results if
> feeded to the very same subroutine. It seems one of the two strings does
> not know fully that its actually unicode. (length gives the correct
> result. wrong lengths are usually a first hint that the string does not
> feel as unicode)
>

I haven't read your code but you can start with:

perldoc perluniintro
perldoc perlunicode
perldoc Encode

And yes, there are two types of strings in Perl 5.8+, one is_utf8(), the
other not.
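
A small sketch of the two kinds of string, assuming the core Encode module is available: the same two octets seen once as raw bytes and once decoded into a single character. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

my $octets = "\xc3\xa4";                 # UTF-8 encoding of U+00E4, as raw bytes
my $chars  = decode('UTF-8', $octets);   # decoded into one character

printf "octets: length=%d flagged=%s\n",
    length($octets), is_utf8($octets) ? "yes" : "no";
printf "chars:  length=%d flagged=%s ord=%04x\n",
    length($chars), is_utf8($chars) ? "yes" : "no", ord($chars);
```

The differing lengths are the "wrong length" hint from the original question: two for the undecoded octets, one for the decoded character.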

--- Shawn

Shawn Corey
09-28-04 02:01 AM

