
Author Re: [PHP-DB] PHP + PostgreSQL: invalid byte sequence for encoding
aldnin

2007-07-21, 6:58 pm

> My guess is that your PHP is not setup to handle UTF8, and is really
> sending something else. UTF8 is the default client encoding because that
> is the encoding of the database. It does not mean that PHP has set the
> right one. Before running your test, try executing this: "SET
> client_encoding TO LATIN1;" and see if that fixes it.


I already did this and all encoding settings are right, but I figured out something more.

1) Using pg_query to fetch UTF8 data from the database works properly. Of course, when I output it directly I get something like "lacarrière", but when I use utf8_decode() on the UTF8 bytes I get it the right way: "lacarrière".

2) I found another PHP application that is able to insert UTF8 data properly: phpPgAdmin. It seems to use the ADODB layer for executing SQL statements.
Well, the fact that phpPgAdmin runs on the same machine and handles UTF8 data properly means that my PHP is configured well enough to handle UTF8.

3) When I add utf8_encode() in my DB class to the query string I send to the database, it works properly and the insert is fine, so that's a temporary solution for my first problem.

4) When I get data from the database I would usually have to do a utf8_decode() on EVERY string that is fetched. So my solution for now is to handle all strings coming from the database as they come, with UTF8 bytes, and to decode them only when I really need to for further use.
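
To make point 1 concrete, here is a minimal sketch of what happens at the byte level. It assumes the mbstring extension is loaded and uses mb_convert_encoding() in place of utf8_decode(), since newer PHP versions deprecate the utf8_* functions. The variable names are illustrative and do not come from the original code.

```php
<?php
// "è" is written as explicit bytes so the example does not depend on how
// this source file itself is saved.
$utf8   = "lacarri\xC3\xA8re"; // UTF-8: è is the two bytes C3 A8
$latin1 = "lacarri\xE8re";     // ISO-8859-1: è is the single byte E8

// strlen() counts bytes, not characters.
$utf8Bytes   = strlen($utf8);
$latin1Bytes = strlen($latin1);

// Converting UTF-8 to ISO-8859-1 is what utf8_decode() does for this string.
$decoded = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";    // hex dump of the UTF-8 bytes
echo bin2hex($decoded), "\n"; // hex dump after conversion to ISO-8859-1
```

A terminal or browser that interprets the two bytes C3 A8 as ISO-8859-1 shows them as two separate characters, which is most likely the garbled output described above.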

Problem:
--------
Just declaring the string 'lacarrière' 10 million times takes 5 seconds; doing a utf8_encode() on it each time takes 13 seconds. So it needs 2-3 times more resources to always call utf8_encode() on a string, even when the string does not include special characters. These resources are wasted whenever the strings don't need to be UTF8-encoded.
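
A measurement like this could be reproduced with a small timing loop. The sketch below is only an assumption about how such a test might look: time_loop() is a hypothetical helper, the iteration count is reduced, and mb_convert_encoding() stands in for utf8_encode(). Absolute numbers will differ between machines and PHP versions.

```php
<?php
// Hypothetical helper: run $fn $iterations times and return elapsed seconds.
function time_loop(callable $fn, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

$latin1 = "lacarri\xE8re"; // è as the single ISO-8859-1 byte E8
$n = 100000;               // far fewer than the 10 million runs quoted above

// Baseline: just assign the string.
$plain = time_loop(function () use ($latin1) {
    $s = $latin1;
}, $n);

// Same loop, but convert to UTF-8 on every iteration.
$encoded = time_loop(function () use ($latin1) {
    $s = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}, $n);

printf("assign: %.4fs, convert: %.4fs\n", $plain, $encoded);
```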

Workaround:
-----------
To avoid wasting resources you have to do a utf8_encode() only when you "guess" that there might be special characters. Have fun with that, but it's the only way I see to work properly with those special characters in combination with Postgres.
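
The "guess" does not have to be a guess: pure ASCII is byte-identical in ISO-8859-1 and UTF-8, so only strings containing a byte above 0x7F need converting. A minimal sketch, assuming the input really is ISO-8859-1 and that mbstring is available; latin1_to_utf8_if_needed() is a hypothetical name, and whether the check is actually cheaper than converting unconditionally would have to be measured.

```php
<?php
// Hypothetical helper: convert an ISO-8859-1 string to UTF-8 only when it
// contains at least one byte outside the ASCII range.
function latin1_to_utf8_if_needed(string $s): string
{
    // No byte in 0x80-0xFF means the string is pure ASCII: nothing to do.
    if (!preg_match('/[\x80-\xFF]/', $s)) {
        return $s;
    }
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

$unchanged = latin1_to_utf8_if_needed("plain text");
$converted = latin1_to_utf8_if_needed("lacarri\xE8re");
```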
aldnin

2007-07-21, 6:58 pm

> Please configure your email client so we don't receive 5 copies of your
> mail.


Just fixed that issue, no need to worry about it in the future.

> This indicates that PHP is not using UTF-8. That output is typical of
> UTF-8 output as Latin characters.


Well, maybe the output is not correct. When running the PHP script on the console (CLI) it outputs the content in the wrong charset, but that's not the problem: doing a utf8_decode() lets me output it in the right charset.

> Not true, it only indicates that phpPgAdmin is configured to handle
> UTF-8 correctly.


Well, I searched all the source code of phpPgAdmin for charsets and I found:

"echo "\t<meta http-equiv=\"Content-Type\" content=\"text/html; charset={$data->codemap[$dbEncoding]}\" />\r\n";"

So this means phpPgAdmin sets the output charset to the charset used by the database it is connected to. But that's still not the problem, because I also know how to fix the charset output in browsers.

> Once again indicating your data needs to be converted from some other
> character set.


It's already converted to be compatible with UTF8 when it is fetched from the other resources.

> I had similar problems getting PHP to work with UTF-8 and MySQL. Many
> of PHP's functions are not multibyte aware and assume a Latin character set.
> What, if any, output buffering are you using? What is your
> default_charset set to?


Well, I've set default_charset to UTF8; before, it was "" (empty). But the output on the console (CLI) and the problem are still the same after changing this to UTF8, so this is not the problem. I don't need proper output on the console without utf8_decode(); if I want proper output there I just do a decode, like I do when I want it to be output properly in the browser.

Maybe a clearer explanation of the problem:

I fetch something from the database, which looks like "lacarrière" when I output it in PHP (well, let's not go by PHP's output). Then I fetch something from another resource, which looks like "lacarrière". When I compare both strings in PHP it tells me that they are "not equal".

So I HAVE TO do either a utf8_encode() on the string from the other resource OR a utf8_decode() on the string from the database to compare them as "equal".

....and THIS means a lot more code in my classes.

Hint: The other resource is a socket connection (API) to another server.
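
One way to keep the extra code in a single place is to normalize every incoming string to UTF-8 at the boundary and compare only normalized values. The sketch below assumes that the socket delivers ISO-8859-1 and the database delivers UTF-8, which the thread suggests but does not confirm; to_utf8() is a hypothetical helper. Note the heuristic is not airtight: an ISO-8859-1 string whose bytes happen to form valid UTF-8 would be left unconverted.

```php
<?php
// Hypothetical helper: keep input that already validates as UTF-8,
// otherwise assume it is ISO-8859-1 and convert it.
function to_utf8(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

$dbString     = "lacarri\xC3\xA8re"; // as fetched from the UTF-8 database
$socketString = "lacarri\xE8re";     // assumed ISO-8859-1 from the other server

// Byte-wise comparison of the raw strings fails ...
$rawEqual = ($dbString === $socketString);

// ... while comparing the normalized strings succeeds.
$normalizedEqual = (to_utf8($dbString) === to_utf8($socketString));

var_dump($rawEqual, $normalizedEqual);
```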

The problem is quite simple I think: everything coming from the database is UTF8 byte encoded and needs to be UTF8-decoded before you can work with it properly.

default_charset seems to work only on the output buffer, so the solution for that problem could only be a mechanism that tells PHP to handle all strings as UTF8 byte encoded, which would mean a lot more resources being used for this process. I understand that this is not a solution.

So the only solutions could be:

a) Properly decode and encode the UTF8 stuff, and take care whether the content is UTF8 byte encoded and so needs to be decoded before it can be used properly with other strings

b) A mechanism to tell the pg functions in PHP to decode all data which is UTF8-encoded. The ADODB layer seems to do that properly, but the pg functions don't, as far as I can see.
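
For option b), PHP's pgsql extension exposes the server-side conversion as pg_set_client_encoding(), which is the function form of the "SET client_encoding" statement quoted at the top of this message. A minimal sketch, assuming $conn is an open connection from pg_connect(); use_client_encoding() is a hypothetical wrapper, and the thread does not establish whether this resolves the case described here.

```php
<?php
// Hypothetical wrapper: ask PostgreSQL to convert between the database
// encoding and the encoding the PHP script actually works with.
// $conn is assumed to be an open connection returned by pg_connect().
function use_client_encoding($conn, string $encoding): bool
{
    // pg_set_client_encoding() returns 0 on success and -1 on error.
    return pg_set_client_encoding($conn, $encoding) === 0;
}
```

Called as use_client_encoding($conn, 'LATIN1') right after connecting, this would make the server deliver query results as ISO-8859-1 and interpret incoming query text as ISO-8859-1; pg_client_encoding($conn) reports the value currently in effect.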

You can use this to reproduce it:

1. Create a table in Postgres, in a UTF8-initialized database, and insert something like "lacarrière" into it. Check that it's inserted correctly.

2. Check the normal output with psql. You should get either "lacarrière" or "lacarrière", so you can be sure it's inserted correctly.

3. Make a script which fetches the string from the database into $dbString.

4. Set a string $phpString = "lacarrière";

5. Compare both strings with "==" - you'll get "false"

Another hint:

Try to send "select 'lacarrière' as test;" with pg_query to any Postgres database and you'll get an error. If not... well, then I'm wrong and I've set up PHP wrongly for handling UTF8 stuff.

If you send "select '".utf8_encode('lacarrière')."' as test;" to your database, this should work.

Also, the $phpString mentioned above is NOT EQUAL to the result you would get from "select '".utf8_encode('lacarrière')."' as test;"; you would need to compare it to utf8_decode($dbString) for them to be EQUAL.
aldnin

2007-07-21, 6:58 pm

> You did not answer the most important question. What, if any, output
> buffering are you using? Are you using the mbstring module? If so, is
> it set to overload the old string functions?


Well, I checked for the multibyte string functions: mbstring is enabled, and it was configured with "=all" before compiling.

After performing the query with pg_query, fetching the result with pg_fetch_all and putting the UTF8 string into $dbString, I tried to detect the encoding with:

mb_detect_encoding($dbString)

It tells me:
ASCII

The content of $dbString is:
lacarrière

I overloaded the mbstring variables with:
mbstring.func_overload = 6
Setting it to "7" won't let me even echo something else.

mbstring.encoding_translation = On
mbstring.internal_encoding = UTF8

That's it, rest is default.

Is it possible for mbstring to overload the pg-functions I need?
aldnin

2007-07-21, 10:01 pm

> output_handler=mb_output_handler

This helped me to fix any output to the browser properly, so I don't need to do any utf8_decode() any more, thanks.

> Setting it to "7" won't let me even echo something else.


Right, it's strange, but true... :-(

> mbstring.detect_order = UTF-8,eucjp-win,sjis-win


That solved the problem of mb_detect_encoding() returning ASCII; now it says "UTF-8", BUT only when running the script on the console. In the browser it still tells me ASCII. Well, not important.

But the comparison test is still "not equal", so the utf8_decode() is still needed when data comes from the database. It's the same result in the browser and on the console (even though it shows UTF-8 as detected).
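
The difference between console and browser may come from different ini settings being in effect. A sketch that sidesteps mbstring.detect_order by passing the candidate list explicitly and enabling strict mode is shown below. The sample bytes are an assumption about what the database returns, and detection in general remains a heuristic, which is why the validation calls are included as well.

```php
<?php
$dbString = "lacarri\xC3\xA8re"; // assumed UTF-8 bytes from the database

// Pass the candidates explicitly instead of relying on
// mbstring.detect_order, and use strict mode (third argument).
$detected = mb_detect_encoding($dbString, ['ASCII', 'UTF-8'], true);

// Validating against one specific encoding is a firmer check than detection.
$isValidUtf8 = mb_check_encoding($dbString, 'UTF-8');
$isPureAscii = mb_check_encoding($dbString, 'ASCII');

var_dump($detected, $isValidUtf8, $isPureAscii);
```

Detecting UTF-8 does not change the bytes, so the comparison with an ISO-8859-1 string still fails until one side is converted.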

> The other thing to be wary of, is output to the console. Some OSes do
> not support unicode in the console. So unless you're certain yours does,
> I wouldn't use it as a test.


I know, that's why I use the comparison test ;-)

Niel wrote:
> Hi
>
> You still haven't answered whether you're using any output handler, and
> if so which one. I use
>
> output_handler=mb_output_handler
>
>
> Very strange, the only additional function overloaded is mail() and that
> shouldn't stop you using echo.
>
> As well as setting the internal encoding and enabling it with
> mbstring.encoding_translation = On
> mbstring.internal_encoding = UTF-8
>
> I would also use:
> mbstring.language = English
> ; or German in your case
> mbstring.detect_order = UTF-8,eucjp-win,sjis-win
> mbstring.http_input = UTF-8,SJIS,EUC-JP
> mbstring.http_output = UTF-8
>
> No, and it shouldn't be needed. Those functions should be UTF-8 enabled
> in order to communicate with the database and supply the correct data
>
> You're still referring to 'UTF8' which as I pointed out isn't the
> official name of the encoding system. I have no idea if PHP will
> recognise it, but to be safe I suggest you use the official 'UTF-8'
> (hyphen between letters and number) in case it's causing problems.
> The other thing to be wary of, is output to the console. Some OSes do
> not support unicode in the console. So unless you're certain yours does,
> I wouldn't use it as a test.
>
> --
> Niel Archer


Copyright 2008 codecomments.com