Code Comments
Hello,
There is a unicode string, and I want to change it to an ansi string, but it raises an exception. Could you help me?
## I want to change s1 to s2.
s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
s2 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
What do you mean by "ansi string"?
Here is a superficially not-unreasonable answer to your more specific
question:
# >>> s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
# >>> s2 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
# >>> s3 = s1.encode('latin1')
# >>> s2 == s3
# True
But what are you really trying to achieve? Where does your Unicode data
come from? What ranges of characters do you expect it to contain? You
need to crunch it into an 8-bit representation because ... what?
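A note on why the encode('latin1') check above comes out True: Latin-1 maps each code point from U+0000 to U+00FF onto the single byte with the same value, so a unicode string whose characters all fall in that range encodes to bytes with exactly those values. The sketch below is written for Python 3 (an assumption; the thread itself uses Python 2), where the u'...' string is a plain str and the 8-bit string is a bytes literal.

```python
# Python 3 sketch: the thread's Python 2 u'...' literal becomes a str,
# and the 8-bit string becomes a bytes literal.
s1 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
s2 = b'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '

# Latin-1 encodes every code point below 256 as the byte with the same value.
s3 = s1.encode('latin1')
same_bytes = (s3 == s2)

# Equivalent view: the code points of s1 are exactly the byte values of s2.
same_values = [ord(ch) for ch in s1] == list(s2)
```

This only says the two strings carry the same numbers; it says nothing about whether those numbers were ever meant to be Latin-1 characters.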
Mr. John Machin,
This question comes from the following code. I use PyXML to build a DOM tree.
from xml.dom.ext.reader import HtmlLib
doc = HtmlLib.FromHtmlUrl('http://stock.business.sohu.com/q/nbcg.php?code=600028')
title_elem = doc.documentElement.getElementsByTagName("TITLE")[0]
title_string = title_elem.firstChild.data
print title_string
# title_string is unicode, but it is not "latin1" text, so I want to change it.
zdwang@xinces.com wrote:
> Mr. John Machin,
>
> This question comes from the following code. I use PyXML to build a DOM tree.
>
> from xml.dom.ext.reader import HtmlLib
> doc = HtmlLib.FromHtmlUrl('http://stock.business.sohu.com/q/nbcg.php?code=600028')
> title_elem = doc.documentElement.getElementsByTagName("TITLE")[0]
> title_string = title_elem.firstChild.data
> print title_string
>
> # title_string is unicode, but it is not "latin1" text, so I want to change it.
Errr, but the title of the page is written in Chinese and it is not
supposed to be crammed into latin1 encoding. What are you trying to do
with the string after you squeezed Chinese into latin1?
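To illustrate the point about Chinese not fitting into latin1: genuine Chinese characters have code points far above U+00FF, so encoding them as Latin-1 fails, while an encoding that covers them (gb2312 or UTF-8) works. A minimal Python 3 sketch (an assumption; the thread uses Python 2), using the four characters U+4E2D U+56FD U+77F3 U+5316 as sample text:

```python
# Four Chinese characters, written as escapes so the source stays ASCII.
title = '\u4e2d\u56fd\u77f3\u5316'

# Latin-1 only covers code points 0..255, so this encode must fail.
try:
    title.encode('latin1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Encodings that cover Chinese succeed: two bytes per character in gb2312,
# three bytes per character in UTF-8 for these code points.
gb_bytes = title.encode('gb2312')
utf8_bytes = title.encode('utf-8')
```

A failure of this kind is one plausible source of the exception mentioned in the first message, though the thread does not show the traceback.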
Errrrrrrr, it gets worse: not only is the title written in Chinese, it
is encoded as gb2312 -- here is the repr() of the first few chunks:
"<html>\n<head>\n <title>\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) : \xc4\xda\xb2\xbf\xc8\xcb\xd4\xb1\xb3\xd6\xb9\xc9 - \xcb\xd1\xba\xfc\xb9\xc9\xc6\xb1</title>\n<meta http-equiv='Content-Type' content='text/html; charset=gb2312'>\n"
and here is what you get after that_guff.decode('gb2312')
u"<html>\n<head>\n <title>\u4e2d\u56fd\u77f3\u5316(600028) : \u5185\u90e8\u4eba\u5458\u6301\u80a1 - \u641c\u72d0\u80a1\u7968</title>\n<meta http-equiv='Content-Type' content='text/html; charset=gb2312'>\n"
The first 2 characters of the title are recognisable both visually on
the browser title and in the unicode as "zhong guo" i.e. China.
BUT the OP's first message is interpreting that gb2312-encoded stuff as
Unicode:
s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
*SOMEBODY* is seriously deluded, and it ain't me, and it ain't Serge :-)
... and yes Peter, info travels faster also from China than it does
from Armenia :-())
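Putting the diagnosis together, one plausible way out is sketched below in Python 3 (an assumption; the thread uses Python 2 and PyXML). It shows two things: decoding the raw page bytes with the charset the page itself declares, and repairing a string that was already mis-decoded as if it were Latin-1 by reversing that step. The helper name decode_with_declared_charset and its regular expression are hypothetical, not part of any library, and the sketch ignores charsets declared in HTTP headers.

```python
import re

def decode_with_declared_charset(raw, default='latin1'):
    """Decode HTML bytes using the charset named in a meta tag, if any.

    Hypothetical helper for illustration only.
    """
    match = re.search(b"charset=['\"]?([A-Za-z0-9_-]+)", raw)
    encoding = match.group(1).decode('ascii') if match else default
    return raw.decode(encoding)

# A shortened version of the page bytes quoted in the thread.
raw = (b"<html>\n<head>\n <title>\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028)</title>\n"
       b"<meta http-equiv='Content-Type' content='text/html; charset=gb2312'>\n")
text = decode_with_declared_charset(raw)

# Repairing s1: the gb2312 bytes were turned into code points one by one
# (as if they were Latin-1), so encode back to bytes and decode properly.
s1 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
fixed = s1.encode('latin1').decode('gb2312')
```

Decoding at the point where the bytes are read is usually the cleaner option; the Latin-1 round trip is only a repair for data that has already been mis-decoded.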