UTF-8 encoding xml documents using msxml4 sp2
| Dear list,
I have an application which exports an XML file and writes it to the filesystem.
This works fine (until my doubts!). We are using VB6 SP5 with MSXML 4.
I am setting a UTF-8 character encoding in the header of my XML file, and
building the document with the createElement methods, setting node texts, etc.
Surprisingly (because I have not set an encoding property anywhere, except in
the header!), when I say mynode.text = "ü" and persist the file to the
filesystem (using the DOM save method), it is written as 2 bytes
(decimal 195 188) in the resultant file. I believe this to be correct
UTF-8/UTF-16 (i.e. Unicode).
Am I right in understanding that VB6 uses UTF-16 internally, and therefore
the MSXML2.DOMDocument object's save method will also write the XML file as
UTF-16?
In addition, because the content (Latin-European based characters) is UTF-16,
is it also valid UTF-8 in this case?
I would be very grateful if someone could help.
| |
| Tony Proctor 2004-10-08, 8:55 am |
| VB uses Unicode internally for all text, Ben. Hence, your "ü" character will
be held in memory as a 16-bit character with the Unicode value 252 (U+00FC).
However, the .Save method on the DOM writes it out to a file according to the
document's charset encoding, i.e. "utf-8" in your case, and you get the
correct pair of bytes: 195, 188.
Tony Proctor
| |
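The point made in the reply above, that one character value (252) produces different bytes depending on the encoding chosen when it is written out, can be checked with a few lines of code. This is only a minimal sketch in Python, used purely for illustration; the thread itself concerns VB6 and MSXML, and the variable names are assumptions, not anything taken from the original posts.

```python
# "ü" is the single Unicode code point U+00FC (decimal 252).
ch = "\u00fc"
code_point = ord(ch)

# The same code point serialised under three different encodings.
utf8_bytes = list(ch.encode("utf-8"))        # two bytes
utf16_bytes = list(ch.encode("utf-16-le"))   # one 16-bit unit, little-endian, no BOM
latin1_bytes = list(ch.encode("latin-1"))    # one byte

print(code_point)    # 252
print(utf8_bytes)    # [195, 188]
print(utf16_bytes)   # [252, 0]
print(latin1_bytes)  # [252]
```

The in-memory value never changes; only the byte sequence chosen at save time does, which is why a file declared as "utf-8" contains 195, 188 rather than 252.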
|
| Thanks, Tony. I guess that must be the reason.
Is UTF-16 a superset of UTF-8? What is the relationship between
the Unicode value 252 and the actual decimal byte values "195, 188"?
Regards,
Ben
| |
| Tony Proctor 2004-10-08, 3:55 pm |
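| UTF-16 is not strictly the same thing as Unicode, Ben.
Unicode is a character set that started out as 16-bit; since version 2.0 its
code points run from U+0000 to U+10FFFF, and it is kept in step with
ISO 10646, which was originally designed to allow 31-bit character codes.
UTF-8/UTF-16 are "Transformation Formats" that encode characters from
either of these standards as a series of 8-bit byte codes or 16-bit codes
respectively, for transmission and storage. Neither is a superset of the
other: they are two different encodings of the same character codes. There
is a lot of semantic subtlety in these terms and acronyms. :-)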
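As a rough illustration of the "16-bit codes" side of this, the sketch below lists the UTF-16 code units for two characters. It is written in Python purely for convenience (the thread is about VB6), and the helper name utf16_code_units is hypothetical. A character such as "ü" fits in one 16-bit unit, while a character beyond U+FFFF needs two units (a surrogate pair).

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))

# U+00FC fits in a single 16-bit unit.
print(utf16_code_units("\u00fc"))                         # [252]

# U+1D11E (musical G clef) lies beyond U+FFFF and needs a surrogate pair.
print([hex(u) for u in utf16_code_units("\U0001D11E")])   # ['0xd834', '0xdd1e']
```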
There are algorithms published in the Unicode/ISO standards for the
generation of UTF-8 sequences for arbitrary character codes.
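For the specific question of how 252 becomes 195, 188, here is a minimal sketch of the standard UTF-8 bit layout, written in Python for illustration only. The function name utf8_encode is hypothetical; real code would normally use the platform's own encoder. For code points from 128 to 2047 the 11 significant bits are split into 5 + 6 and placed in the patterns 110xxxxx 10xxxxxx. With 252 = binary 00011 111100, that gives 11000011 (195) and 10111100 (188).

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as a list of UTF-8 byte values."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:                       # 1 byte:  0xxxxxxx
        return [code_point]
    if code_point < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (code_point >> 6),
                0x80 | (code_point & 0x3F)]
    if code_point < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (code_point >> 12),
                0x80 | ((code_point >> 6) & 0x3F),
                0x80 | (code_point & 0x3F)]
    return [0xF0 | (code_point >> 18),          # 4 bytes: 11110xxx then three 10xxxxxx
            0x80 | ((code_point >> 12) & 0x3F),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F)]

print(utf8_encode(252))  # [195, 188]
```

Plain ASCII characters (below 128) come out as a single unchanged byte, which is why ASCII-only text is byte-identical in UTF-8; anything above that, including "ü", is not.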
Have a look at the following for some background:
http://en.wikipedia.org/wiki/UCS-2
http://en.wikipedia.org/wiki/UTF-16
http://www.terena.nl/library/multil...code/utf16.html
http://www.faqs.org/rfcs/rfc2781.html
Tony Proctor