Filter::HTTPD, UTF-8 and cursing
Philip Gwyn

2008-03-22, 4:53 am

Hello all, I have a very very strange problem.

Perl seems to be reencoding UTF-8 data.

Versions: Perl 5.8.5, POE 0.9999, CentOS 4.4.

I added a bunch of cookie crumbs to trace the problem. This is what I'm seeing.


My reverse HTTP proxy outputs the following :

http: content=[["set", "START-USER_", "value", "\xC2\xA9t\xC2\xA9"]]
http: chars=42
http: Content-Length=42

POE::Filter::HTTPD->put() converts it to :

Filter::HTTPD: content=[["set", "START-USER_", "value", "\xC2\xA9t\xC2\xA9"]]
Filter::HTTPD: chars=42
Filter::HTTPD: Content-Length=42
Filter::HTTPD: [-1]='HTTP/1.1 200 (OK) (OK)
Date: Sat, 22 Mar 2008 05:11:06 GMT
Server: POE HTTPD Component/0.10-PG (5.008005)
Content-Length: 42
Content-Type: application/json
Set-Cookie: BID=2439; path=/
X-POE-XUL: 0.04

[["set", "START-USER_", "value", "\xC2\xA9t\xC2\xA9"]]'
Filter::HTTPD: length=252

POE::Driver::SysRW->put() then sees the following :

Driver::SysRW: put='HTTP/1.1 200 (OK) (OK)
Date: Sat, 22 Mar 2008 05:11:06 GMT
Server: POE HTTPD Component/0.10-PG (5.008005)
Content-Length: 42
Content-Type: application/json
Set-Cookie: BID=2439; path=/
X-POE-XUL: 0.04

[["set", "START-USER_", "value", "\xC3\x83\xC2\xA9t\xC3\x83\xC2\xA9"]]'
Driver::SysRW: chars=256


WOAH! "\xC2\xA9t\xC2\xA9" became "\xC3\x83\xC2\xA9t\xC3\x83\xC2\xA9" ! And
the length changed. I HATE YOU MILKMAN ENCODING-COCK-UP!

(Note that when I write \xC2 in this email, I mean the binary octet C2 in
the data.)

The workaround is for me to use JSON::XS->ascii. But this still boggles me.
Does anyone understand UTF-8 encoding? Or have any pointers?

-Philip
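
To illustrate the workaround mentioned above: the `ascii` option makes the encoder escape every non-ASCII character as \uXXXX, so the body is pure ASCII and cannot be damaged by a later re-encoding. This is only a sketch under assumptions: it uses JSON::PP (shipped with recent Perls) as a stand-in for JSON::XS, which documents the same `ascii` option, and the value is assumed to be a character string holding two copyright signs.

```perl
use strict;
use warnings;
use JSON::PP;

# Assumed input: a character string, copyright sign + "t" + copyright sign.
my $value = "\x{A9}t\x{A9}";

# ascii() escapes every character outside ASCII as \uXXXX in the output.
my $json = JSON::PP->new->ascii->encode(
    [ [ "set", "START-USER_", "value", $value ] ]
);

print "$json\n";

# The output is pure ASCII, so its character count and octet count agree,
# which keeps Content-Length correct whatever happens to the string later.
print "length: ", length($json), "\n";
```

The trade-off is a slightly larger body, since each escaped character takes six octets instead of two.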

Scott Wiersdorf

2008-03-22, 7:35 pm

On Sat, Mar 22, 2008 at 01:29:03AM -0500, Philip Gwyn wrote:
>
> POE::Driver::SysRW->put() then sees the following :


Just an idea: I notice POE::Driver::SysRW::put() uses 'use bytes':

<snip>
# Need to check lengths in octets, not characters.
use bytes;

foreach (grep { length } @$chunks) {
    $self->[TOTAL_OCTETS_LEFT] += length;
    push @{$self->[OUTPUT_QUEUE]}, $_;
}
</snip>

Maybe once it finishes calculating the bytes, we need to put in 'no
bytes' to turn byte-semantics off again:

<snip>
use bytes;

# Need to check lengths in octets, not characters.
foreach (grep { length } @$chunks) {
    $self->[TOTAL_OCTETS_LEFT] += length;
}

no bytes;

## push it all at once, but w/o byte semantics
push @{$self->[OUTPUT_QUEUE]}, @$chunks;
</snip>

I saw this problem a few years ago in a (non-POE) webmail
application that was doing much the same thing. I'm just guessing
here (i.e., this is untested code and may break things!).

HTH,

Scott
--
Scott Wiersdorf
<scott@perlcode.org>
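
For background on the suggestion above: `use bytes` is lexically scoped, so it stops applying at the end of the enclosing block even without `no bytes`, and it changes how `length` counts rather than the stored data. A minimal sketch of that behaviour follows; the variable names are illustrative and not taken from POE.

```perl
use strict;
use warnings;

# One character, U+00E9, forced into Perl's internal UTF-8 form,
# where it occupies two octets.
my $text = "\xE9";
utf8::upgrade($text);

my $char_len = length $text;      # character semantics

my $octet_len;
{
    use bytes;                    # in effect only inside this block
    $octet_len = length $text;    # counts the internal octets
}

my $after_block = length $text;   # character semantics again

print "$char_len $octet_len $after_block\n";   # prints "1 2 1"
```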
hsw

2008-03-23, 5:21 am

On Mar 22, 9:29 am, li...@artware.qc.ca (Philip Gwyn) wrote:
> http: content=[["set", "START-USER_", "value", "\xC2\xA9t\xC2\xA9"]]


$s = "\xC2\xA9t\xC2\xA9" - this is not UTF-8 chars, it's octets.
Encode::decode_utf8($s) eq "\x{a9}t\x{a9}" - this is chars (with the utf8
flag on).

When you join octets (with the high bit set) and Unicode chars, the octets
are treated as Latin-1 and converted to Unicode chars:
"\xC3" => "\x{C3}" ( == "\xc3\x83" in octets).

Solution: decode the data to native Unicode before processing it.
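
The upgrade described above can be reproduced in a few lines. This is a sketch under assumptions: the octets are the UTF-8 encoding of "é", chosen only for illustration, and `hex_dump` is a hypothetical helper, not part of POE or Encode.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical helper: render each octet of a byte string as two hex digits.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# UTF-8 octets for U+00E9; Perl sees two separate 8-bit characters here.
my $octets = "\xC3\xA9";

# A character string: a code point above 0xFF turns the utf8 flag on.
my $chars = "\x{263A}";

# Concatenation upgrades $octets as if each octet were a Latin-1 character,
# so 0xC3 becomes U+00C3 and 0xA9 becomes U+00A9.
my $joined = $octets . $chars;

# Encoding those two characters for output shows the double encoding.
my $broken = encode_utf8(substr($joined, 0, 2));
print "joined as octets: ", hex_dump($broken), "\n";   # C3 83 C2 A9

# Decoding first yields one real character, which encodes back unchanged.
my $decoded = decode_utf8($octets) . $chars;
my $fixed   = encode_utf8(substr($decoded, 0, 1));
print "decoded first:    ", hex_dump($fixed), "\n";    # C3 A9
```

Decoding on input and encoding once on output keeps every string inside the program in the same form, so a join like this never mixes the two.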
