Filter::HTTPD, UTF-8 and cursing
| Philip Gwyn 2008-03-22, 4:53 am |
Hello all, I have a very, very strange problem.
Perl seems to be re-encoding UTF-8 data.
Versions: Perl 5.8.5, POE 0.9999, CentOS 4.4.
I added a bunch of debugging output to trace the problem. This is what I'm seeing.
My reverse HTTP proxy outputs the following:
http: content=[["set", "START-USER_", "value", "\xC2\xA9t\xC2\xA9"]]
http: chars=42
http: Content-Length=42
POE::Filter::HTTPD->put() converts it to:
Filter::HTTPD: content=[["set", "START-USER_", "value", "\xC2\xA9t\xC2\xA9"]]
Filter::HTTPD: chars=42
Filter::HTTPD: Content-Length=42
Filter::HTTPD: [-1]='HTTP/1.1 200 (OK) (OK)
Date: Sat, 22 Mar 2008 05:11:06 GMT
Server: POE HTTPD Component/0.10-PG (5.008005)
Content-Length: 42
Content-Type: application/json
Set-Cookie: BID=2439; path=/
X-POE-XUL: 0.04
[["set", "START-USER_", "value", "\xC2\xA9t\xC2\xA9"]]'
Filter::HTTPD: length=252
POE::Driver::SysRW->put() then sees the following:
Driver::SysRW: put='HTTP/1.1 200 (OK) (OK)
Date: Sat, 22 Mar 2008 05:11:06 GMT
Server: POE HTTPD Component/0.10-PG (5.008005)
Content-Length: 42
Content-Type: application/json
Set-Cookie: BID=2439; path=/
X-POE-XUL: 0.04
[["set", "START-USER_", "value", "\xC3\x83\xC2\xA9t\xC3\x83\xC2\xA9"]]'
Driver::SysRW: chars=256
WOAH! "\xC2\xA9t\xC2\xA9" became "\xC3\x83\xC2\xA9t\xC3\x83\xC2\xA9"! And
the length changed from 252 to 256. I HATE YOU MILKMAN ENCODING-COCK-UP!
(Note that when I write \xC2 in this email, I'm seeing the binary octet C2 in
the data.)
The workaround is for me to use JSON::XS->ascii. But this still boggles me.
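A minimal sketch of that workaround, for illustration only. It assumes the value is held as decoded characters, and it uses the core JSON::PP module as a stand-in for JSON::XS (both provide an ascii option); the variable names are hypothetical.

```perl
use strict;
use warnings;
use JSON::PP;    # assumption: stand-in for JSON::XS, same ascii() option

# Hypothetical payload, with the value held as characters (U+00A9, "t", U+00A9).
my $data = [ [ "set", "START-USER_", "value", "\x{A9}t\x{A9}" ] ];

# ascii() escapes every non-ASCII character as \uXXXX, so the encoded
# text contains only 7-bit octets and has nothing left to re-encode.
my $json = JSON::PP->new->ascii->encode($data);

print "$json\n";
```

Because the output is pure ASCII, its length is the same whether it is counted in characters or in octets, which is presumably why the workaround sidesteps the problem.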
Anyone understand UTF-8 encoding? Or have any pointers?
-Philip
| Scott Wiersdorf 2008-03-22, 7:35 pm |
On Sat, Mar 22, 2008 at 01:29:03AM -0500, Philip Gwyn wrote:
>
> POE::Driver::SysRW->put() then sees the following :
Just an idea: I notice POE::Driver::SysRW::put() uses 'use bytes':
<snip>
# Need to check lengths in octets, not characters.
use bytes;
foreach (grep { length } @$chunks) {
  $self->[TOTAL_OCTETS_LEFT] += length;
  push @{$self->[OUTPUT_QUEUE]}, $_;
}
</snip>
Maybe once it finishes calculating the bytes, we need to put in 'no
bytes' to turn byte-semantics off again:
<snip>
use bytes;
# Need to check lengths in octets, not characters.
foreach (grep { length } @$chunks) {
  $self->[TOTAL_OCTETS_LEFT] += length;
}
no bytes;
## push it all at once, but w/o byte semantics
push @{$self->[OUTPUT_QUEUE]}, @$chunks;
</snip>
I saw this problem a few years ago in a (non-POE) webmail
application that was doing much the same thing. I'm just guessing
here (i.e., this is untested code and may break things!).
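As a rough illustration of what 'use bytes' does to length, here is a small sketch. The string and variable names are made up for the example and are not taken from POE.

```perl
use strict;
use warnings;

# Three characters: U+00A9, "t", U+00A9.
my $text = "\x{A9}t\x{A9}";
utf8::upgrade($text);    # force Perl's internal UTF-8 representation

# Character semantics: one unit per character.
my $chars = length $text;

# 'use bytes' is lexically scoped; inside this block length counts
# the octets of the internal representation instead.
my $octets = do { use bytes; length $text };

print "characters=$chars octets=$octets\n";    # characters=3 octets=5
```

The pragma changes how the string is measured within its scope; it does not alter the string's contents.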
HTH,
Scott
--
Scott Wiersdorf
<scott@perlcode.org>
On Mar 22, 9:29 am, li...@artware.qc.ca (Philip Gwyn) wrote:
> http: content=[["set", "START-USER_", "value", "\xC2\xA9t\xC2\xA9"]]
$s = "\xC2\xA9t\xC2\xA9" is not UTF-8 characters; it's octets.
Encode::decode_utf8($s) eq "\x{a9}t\x{a9}" gives characters (with the utf8
flag on).
When you join octets (with the 8th bit set) and Unicode characters, the
octets are converted to Unicode characters:
"\xC3" => "\x{C3}" ( == "\xc3\x83" in octets).
Solution: decode the data to native Unicode characters before processing it.
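The effect described above can be reproduced with a short sketch. It is only an illustration: the header string is hypothetical, and forcing the utf8 flag with utf8::upgrade is an assumption about how a flagged string might have entered the response.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# UTF-8 octets for U+00A9, "t", U+00A9 (5 bytes, utf8 flag off).
my $body = "\xC2\xA9t\xC2\xA9";

# Hypothetical header: pure ASCII, but stored with the utf8 flag on.
my $header = "Content-Type: application/json\n\n";
utf8::upgrade($header);

# Joining treats every octet of $body as one Latin-1 character.
my $joined = $header . $body;

# Writing those characters out as UTF-8 encodes the octets a second time.
my $broken = encode_utf8($joined);

# Fix: decode the body to characters first, then encode once on output.
my $fixed = encode_utf8($header . decode_utf8($body));

printf "broken: %d octets, fixed: %d octets\n", length($broken), length($fixed);
```

With this input each "\xC2\xA9" pair becomes "\xC3\x82\xC2\xA9" when doubled; a leading "\xC3\x83" is what an original "\xC3" octet would produce. Either way, each two-octet sequence grows to four octets, which matches the four extra octets seen in the report.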