CGI.pm: encoding problems
Ben Bullock

2006-06-17, 8:04 am

I have a problem with inputting UTF-8 via a text window using CGI.pm. This
problem concerns UTF-8, so apologies for posting something with Chinese
characters in it.

The following code is a minimal working example of the problem with a lot of
extraneous material removed. It needs to be run under a web server to see
the problem. When the text is submitted using the form, the default text of
Chinese characters (the numbers from one to four) is munged into
gibberish, and the test of the input, which checks whether the
input consists of valid Chinese numerals, fails:

Input text:

一二三四

Output of program:

Input 一二三四 was not a valid number

Thank you very much for any assistance, suggestions or advice about this
problem.

#!/usr/bin/perl
use warnings;
use strict;
use CGI;
use utf8;
binmode (STDOUT, ":utf8");
my $query = CGI->new();
$query->charset('UTF-8');
print $query->header();
my $kanji;
if ($query->param('kanji')) {
    my $inputnumber = $query->param('kanji');
    if ($inputnumber =~ /^([一二三四五六七八九十]+)$/) {
        $kanji = $1;
    } else {
        print "<p>Input $inputnumber was not a valid number</p>";
        $kanji = "";
    }
} else {
    $kanji = "一二三四";
}
print $query->start_form(-method => 'POST',-action => $query->url());
print $query->textarea(-name => 'kanji',
                       -default => $kanji);
print $query->submit();
print $query->endform();
print "<table><tr>\n<th>Value</th><td>",
    $kanji, "</td></tr>\n", "</table>\n</form>\n<p>\n";
print $query->end_html();

Dr.Ruud

2006-06-17, 8:04 am

Ben Bullock wrote:

> use warnings;
> use strict;
> use CGI;
> use utf8;
> binmode (STDOUT, ":utf8");


Try to replace those 5 lines with these (reordered) 4:

use strict;
use warnings;
use encoding 'utf8' ;
use CGI;

This would also set the PerlIO layer of STDIN to ':utf8'.

See perldoc encoding.
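
What a decoding layer changes can be checked without a web server. The following is only a sketch: it uses an in-memory filehandle as a stand-in for STDIN (an assumption; a real CGI request body arrives on STDIN itself), and the sample bytes are built with the core Encode module rather than taken from a real form post.

```perl
use strict;
use warnings;
use utf8;                       # the string literal below is written in UTF-8
use Encode qw(encode);

# Twelve UTF-8 bytes standing in for a posted form value (hypothetical data).
my $bytes = encode('UTF-8', "一二三四");

# Read the bytes back with no layer: one "character" per byte.
open(my $raw_fh, '<', \$bytes) or die "open failed: $!";
my $raw = <$raw_fh>;
close($raw_fh);

# Read the same bytes through a decoding layer: one character per kanji.
open(my $utf8_fh, '<:encoding(UTF-8)', \$bytes) or die "open failed: $!";
my $decoded = <$utf8_fh>;
close($utf8_fh);

printf "without layer: %d, with layer: %d\n", length($raw), length($decoded);
```

If the two lengths differ, the layer is doing the decoding; if a script sees the larger number for form data, the input was read as raw bytes.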

--
Affijn, Ruud

"Gewoon is een tijger."


Mumia W.

2006-06-17, 8:04 am

Ben Bullock wrote:
> [...]


I made a few changes to your program. I don't know exactly what the
problem is, but I hope that this sheds some light on it:

#!/usr/bin/perl
use warnings;
use strict;
use CGI;
use utf8;
use Encode (); # changed
binmode (STDOUT, ":utf8");
my $query = CGI->new();
$query->charset('UTF-8');
print $query->header('-cache-control' => 'no-cache'); # changed

my $kanji;
if ($query->param('kanji')) {
    my $inputnumber = $query->param('kanji');

    print <<EOF;
<p> Interesting decodings of
"$inputnumber" <br>
UTF-8: @{[ Encode::decode('utf8', $inputnumber) ]} <br>
</p>
<hr>

EOF

    # Add this to decode the number:
    $inputnumber = Encode::decode('utf8', $inputnumber);

    if ($inputnumber =~ /^([一二三四五六七八九十]+)$/) {
        $kanji = $1;
    } else {
        print "<p>Input $inputnumber was not a valid number</p>";
        $kanji = "";
    }
} else {
    $kanji = "一二三四";
}

print <<EOF;
<p> The value of \$kanji is: $kanji
</p>

EOF

print $query->start_form(
    -method => 'POST',
    -action => $query->url()
);
print $query->textarea(-name => 'kanji',
                       -default => $kanji);

print <<EOF;
<textarea name=alternate>
DATA = $kanji
</textarea>
EOF

print $query->submit();
print $query->endform();
print "<table><tr>\n<th>Value</th><td>",
    $kanji, "</td></tr>\n", "</table>\n</form>\n<p>\n";
print $query->end_html();

Mumia W.

2006-06-17, 8:04 am

Dr.Ruud wrote:
> Ben Bullock wrote:
>
>
> Try to replace those 5 lines with these (reordered) 4:
>
> use strict;
> use warnings;
> use encoding 'utf8' ;
> use CGI;
>
> This would also set the PerlIO layer of STDIN to ':utf8'.
>
> See perldoc encoding.
>


I still get the problem when running Ben's program. The problem is that
using the CGI module to initialize the textarea works the first time and
not the second; however, bypassing CGI.pm and writing the textarea
directly using print seems to work consistently.

The bug might be logic related, but it's more likely CGI.pm-related.

There is a "hint" that the CGI.pm on my Sarge system is not UTF-8 ready.
This appears at the top of every page of output:
<?xml version="1.0" encoding="iso-8859-1"?>

This happens even when the HTTP header says utf8.

Ben Bullock

2006-06-17, 8:04 am

Thanks to Dr. Ruud and Mumia W. for their replies. Thanks to Dr. Ruud I was
able to get this working, but I also noticed a couple of interesting
phenomena in debugging this program. As Mumia W. says the text in the box is
done incorrectly. Also, if I use my own "<input" box the input is mangled,
and if I use the "straight" function calls of CGI.pm rather than the
object-oriented ones, things stop working again, so it does look rather like
there is something wrong inside CGI.pm. If anyone is interested, let me know
and I'll post example code.

Thanks again.

Mumia W.

2006-06-17, 8:04 am

Ben Bullock wrote:
> Thanks to Dr. Ruud and Mumia W. for their replies. Thanks to Dr. Ruud I
> was able to get this working, but I also noticed a couple of interesting
> phenomena in debugging this program. As Mumia W. says the text in the
> box is done incorrectly. Also, if I use my own "<input" box the input is
> mangled, and if I use the "straight" function calls of CGI.pm rather
> than the object-oriented ones, things stop working again, so it does
> look rather like there is something wrong inside CGI.pm. If anyone is
> interested, let me know and I'll post example code.
>
> Thanks again.
>


How were you able to get it working? Re-ordering the prologue and using
utf8 didn't work for me.

Mumia W.

2006-06-17, 8:04 am

Ben Bullock wrote:
> Thanks to Dr. Ruud and Mumia W. for their replies. Thanks to Dr. Ruud I
> was able to get this working, but I also noticed a couple of interesting
> phenomena in debugging this program. As Mumia W. says the text in the
> box is done incorrectly. Also, if I use my own "<input" box the input is
> mangled, and if I use the "straight" function calls of CGI.pm rather
> than the object-oriented ones, things stop working again, so it does
> look rather like there is something wrong inside CGI.pm. If anyone is
> interested, let me know and I'll post example code.
>
> Thanks again.
>


It's not a bug; it's a feature ;)

For whatever reason, on my system, CGI.pm always interprets the STDIN
data in raw mode, regardless of the script encoding, so form elements
have to be explicitly decoded.
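
The effect of that explicit decoding can be seen outside a web server. This is a minimal sketch, assuming the form value arrives as raw UTF-8 bytes; the variable names and the sample value are made up for illustration, and the character class covers the numerals one through ten.

```perl
use strict;
use warnings;
use utf8;                       # the literals below are written in UTF-8
use Encode qw(decode encode);

binmode(STDOUT, ':encoding(UTF-8)');

# Stand-in for what param('kanji') returns when the data is left as raw bytes.
my $raw = encode('UTF-8', "一二三四");

# Character-level pattern for the numerals one through ten.
my $numeral = qr/^[一二三四五六七八九十]+$/;

# The undecoded bytes are compared as if each byte were a character.
my $raw_ok = $raw =~ $numeral ? 1 : 0;

# After decoding, each kanji is a single character again.
my $chars      = decode('UTF-8', $raw);
my $decoded_ok = $chars =~ $numeral ? 1 : 0;

print "raw bytes match: $raw_ok\n";
print "decoded characters match: $decoded_ok\n";
```

Only the decoded string matches, which fits the "was not a valid number" message seen when the bytes are tested directly.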

And CGI.pm has a nifty feature that allows the programmer to
automatically create forms with the same values that were in the posted
data.
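
That "sticky" behavior can be reproduced from the command line. The sketch below assumes CGI.pm is installed (it is no longer bundled with Perl as of 5.22) and builds the query object from a literal query string; the values 'submitted' and 'fresh' are placeholders.

```perl
use strict;
use warnings;
use CGI ();

# Build a query object from a literal query string instead of a live request.
my $q = CGI->new('kanji=submitted');

# The submitted value wins over -default (the "sticky" behavior).
my $sticky = $q->textarea(-name => 'kanji', -default => 'fresh');

# -override asks CGI.pm to use -default regardless of the submitted value.
my $forced = $q->textarea(-name    => 'kanji',
                          -default => 'fresh',
                          -override => 1);

print "$sticky\n$forced\n";
```

Passing -override is an alternative to deleting the parameter first; either way the field is filled from the value the script chose rather than from the posted data.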

These two behaviors combine to create the problems you had. The
workarounds are to explicitly decode the form elements and to delete the
old form element before creating another one with the same name.

This program should demonstrate the issue and workarounds:

#!/usr/bin/perl
# kanji-2.cgi
use strict;
use warnings;
use encoding 'utf8';
use CGI ();
use CGI::Carp 'fatalsToBrowser';

$\ = "\n";

# Invoke this script without a query string to
# get the default (broken) behavior.
#
# Invoke this script with a query string of 'recode'
# to get the 'kanji' form element recoded into
# utf8. Example:
#
# http://server.com/kanji-2.cgi?recode
#
# Or, if you want the old textarea data deleted
# upon successive invocations of the form, add
# a query string of 'delete' like so:
#
# http://server.com/kanji-2.cgi?delete
my $RECODE_QUERY = 0;
my $DELETE_QUERY = 0;
$RECODE_QUERY = 1 if $ENV{QUERY_STRING} =~ m/recode/;
$DELETE_QUERY = 1 if $ENV{QUERY_STRING} =~ m/delete/;

my $kanji;
my $text;
my $query = new CGI;

print $query->header(
    -type => 'text/html',
    -charset => 'utf8',
);

print $query->start_html(
    -title => 'Kanji Test',
    -head => CGI::meta ({-http_equiv => 'Content-Type',
                         -content => 'text/html; charset=utf8' ,
                        }),
    ),
    $query->h1('Kanji Test');

print <<EOF;
<p> Let's see if it's possible to send
and receive kanji numeric characters.
</p>
EOF

if (! defined $query->param('kanji')) {

    $kanji = "一二三四";

} else {

    $kanji = $query->param('kanji');
    $kanji = Encode::decode('utf8', $kanji);
    my $old_kanji = $query->param('kanji');

    if ($RECODE_QUERY) {
        $query->param('kanji', $kanji);
    }

    if ($DELETE_QUERY) {
        $query->delete('kanji');
    }

    ($text = <<EOF) =~ s/^\s*//mg;
<pre> The data received was:
ORIGINAL: $old_kanji
DECODED: $kanji
</pre>
EOF

    print $text;
}

my $qs = '' eq $ENV{QUERY_STRING} ? '' :
    "?$ENV{QUERY_STRING}" ;

print $query->start_form(
    -method => 'POST',
    -action => $query->url() . $qs );

print $query->textarea(
    -name => 'kanji',
    -default => $kanji,
);

print $query->submit();

print $query->end_form();

print $query->end_html;

harryfmudd [AT] comcast [DOT] net

2006-06-17, 8:04 am

Mumia W. wrote:
> Ben Bullock wrote:
>
>
> It's not a bug; it's a feature ;)
>
> For whatever reason, on my system, CGI.pm always interprets the STDIN
> data in raw mode, regardless of the script encoding, so form elements
> have to be explicitly decoded.
>
> And CGI.pm has a nifty feature that allows the programmer to
> automatically create forms with the same values that were in the posted
> data.
>
> These two behaviors combine to create the problems you had. The
> workarounds are to explicitly decode the form elements and to delete the
> old form element before creating another one with the same name.
>
> This program should demonstrate the issue and workarounds:


Interesting. I found that the following program blew up on the
Encode::decode, but that $old_kanji appeared to display correctly.
Also, the 'kanji' element displayed correctly even if I did not specify
a query string. Do we have a version problem? I'm running:

Perl 5.8.6
CGI.pm 3.20
OS: Darwin 7.9.0 (a.k.a. Mac OS X)
Server: apache 1.3.33
Browser: Firefox 1.5.0.4 (though I doubt this has anything to do with it).

>

#!/usr/local/bin/perl
> # kanji-2.cgi
> use strict;
> use warnings;
> use encoding 'utf8';
> use CGI ();
> use CGI::Carp 'fatalsToBrowser';
>
> $\ = "\n";
>
> # Invoke this script without a query string to
> # get the default (broken) behavior.
> #
> # Invoke this script with a query string of 'recode'
> # to get the 'kanji' form element recoded into
> # utf8. Example:
> #
> # http://server.com/kanji-2.cgi?recode
> #
> # Or, if you want the old textarea data deleted
> # upon successive invocations of the form, add
> # a query string of 'delete' like so:
> #
> # http://server.com/kanji-2.cgi?delete
> my $RECODE_QUERY = 0;
> my $DELETE_QUERY = 0;
> $RECODE_QUERY = 1 if $ENV{QUERY_STRING} =~ m/recode/;
> $DELETE_QUERY = 1 if $ENV{QUERY_STRING} =~ m/delete/;
>
> my $kanji;
> my $text;
> my $query = new CGI;
>
> print $query->header(
> -type => 'text/html',
> -charset => 'utf8',
> );
>

# I found I got redundant meta headers with the original
# script, so:
> print $query->start_html(
> -title => 'Kanji Test',

## -head => CGI::meta ({-http_equiv => 'Content-Type',
## -content => 'text/html; charset=utf8' ,
## }),
> ),
> $query->h1('Kanji Test');
>
> print <<EOF;
> <p> Let's see if it's possible to send
> and receive kanji numeric characters.
> </p>
> EOF
>
> if (! defined $query->param('kanji')) {
>
> $kanji = "一二三四";
>
> } else {
>
> $kanji = $query->param('kanji');

eval {$kanji = Encode::decode('utf8', $kanji)};
$@ and $kanji = $@;
> my $old_kanji = $query->param('kanji');
>
> if ($RECODE_QUERY) {
> $query->param('kanji', $kanji);
> }
>
> if ($DELETE_QUERY) {
> $query->delete('kanji');
> }
>
> ($text = <<EOF) =~ s/^\s*//mg;
> <pre> The data received was:
> ORIGINAL: $old_kanji
> DECODED: $kanji
> </pre>
> EOF
>
>
> print $text;
> }
>
> my $qs = '' eq $ENV{QUERY_STRING} ? '' :
> "?$ENV{QUERY_STRING}" ;
>
> print $query->start_form(
> -method => 'POST',
> -action => $query->url() . $qs );
>
> print $query->textarea(
> -name => 'kanji',
> -default => $kanji,
> );
>
> print $query->submit();
>
> print $query->end_form();
>
>
> print $query->end_html;
>


Tom Wyant

Mumia W.

2006-06-17, 8:04 am

harryfmudd [AT] comcast [DOT] net wrote:
> Mumia W. wrote:
>
> Interesting. I found that the following program blew up on the
> Encode::decode, but that $old_kanji appeared to display correctly.
> Also, the 'kanji' element displayed correctly even if I did not specify
> a query string. Do we have a version problem? [...]


Quite likely. I have perl 5.8.4 and CGI.pm 3.04 (old). That's probably
why Dr. Ruud's advice of moving the "use" statements around didn't work
for me.

So it seems that re-decoding the data is a bad idea with newer versions
of the module. As you were, everybody.
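
One way a second decode can blow up is sketched below, using only the core Encode module. It assumes the value has already been turned into characters once (here by the first decode() call, standing in for whatever layer or module did it in the real script); the eval mirrors the guard used earlier in the thread.

```perl
use strict;
use warnings;
use utf8;                       # the string literal below is written in UTF-8
use Encode qw(decode encode);

# Hypothetical form value as raw bytes, then decoded once into characters.
my $bytes = encode('UTF-8', "一二三四");
my $chars = decode('UTF-8', $bytes);

# Decoding the already-decoded text again; trap the failure instead of dying.
my $again = eval { decode('UTF-8', $chars) };
my $error = $@;

print defined $again ? "second decode succeeded\n"
                     : "second decode failed: $error";
```

Decoding exactly once, at the point where the data enters the program, avoids having to guess whether a given string is still bytes.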

Ben Bullock

2006-06-17, 8:04 am

If anyone cares, the original program is on the web as follows:

http://www.sljfaq.org/cgi/numbers.cgi
http://www.sljfaq.org/cgi/kanjinumbers.cgi

The bottom one was the one with the problems.

Ordering the statements correctly solved the problem with the encoding, but
some problems remained.

Thanks for the help.

Mumia W.

2006-06-17, 8:04 am

Ben Bullock wrote:
> If anyone cares, the original program is on the web as follows:
> [...]
> http://www.sljfaq.org/cgi/kanjinumbers.cgi
>
> [...]


I'm not having any problems with it. Am I supposed to?


Ben Bullock

2006-06-17, 8:04 am

"Mumia W." <mumia.w.18.spam+nospam.usenet@earthlink.net> wrote in message
news:Pr%jg.13048$921.9261@newsread4.news.pas.earthlink.net...
> Ben Bullock wrote:
>
> I'm not having any problems with it. Am I supposed to?


No, not really. But one interesting problem occurs if you type in numbers
like this:

一ニ三四五xyz

then the xyz is preserved after you convert. If you go the other way round,

12345xyz

then the xyz disappears. The code is exactly the same going either way, so
you tell me why that should be.

Mumia W.

2006-06-17, 8:04 am

Ben Bullock wrote:
> "Mumia W." <mumia.w.18.spam+nospam.usenet@earthlink.net> wrote in
> message news:Pr%jg.13048$921.9261@newsread4.news.pas.earthlink.net...
>
> No, not really. But one interesting problem occurs if you type in
> numbers like this:
>
> 一ニ三四五xyz
>
> then the xyz is preserved after you convert. If you go the other way round,
>
> 12345xyz
>
> then the xyz disappears. The code is exactly the same going either way,
> so you tell me why that should be.


I don't know, but perhaps you can create your own character class that
matches only numbers from the various languages you're using.
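
A rough sketch of that idea follows. The class names and the is_number helper are made up for illustration, and the kanji set is limited to the numerals one through ten; a real converter would likely need more characters.

```perl
use strict;
use warnings;
use utf8;                       # the literals below are written in UTF-8

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical character classes; extend the kanji set as needed.
my $ascii_digit = qr/[0-9]/;
my $kanji_digit = qr/[一二三四五六七八九十]/;

# Accept a string only if it is digits of one kind from start to end.
sub is_number {
    my ($text) = @_;
    return $text =~ /\A(?:$ascii_digit+|$kanji_digit+)\z/ ? 1 : 0;
}

# The last sample contains katakana NI (U+30CB), a lookalike of the kanji for two.
for my $input ("12345", "一二三四五", "12345xyz", "一\x{30CB}三") {
    printf "%s: %s\n", $input, is_number($input) ? "valid" : "not valid";
}
```

Anchoring with \A and \z means trailing text such as xyz is rejected in both directions instead of being silently kept or dropped.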

