











 

Author XML Processing instruction for UTF-8
ama

2006-01-24, 7:07 pm

Hi,

I need to create a UTF-8 XML document on the fly, and
I used the following code:

[ PS: m_pDoc is an IXMLDOMDocument ]

BOOL CXmlHelper::CreateHeader()
{
    if( ! m_pDoc )
        return FALSE;

    CComBSTR target("xml");
    CComBSTR data(" version='1.0' encoding='UTF-8' ");
    CComPtr<IXMLDOMProcessingInstruction> ins;

    if( m_pDoc->createProcessingInstruction(target, data, &ins) != S_OK )
        return FALSE;

    if( m_pDoc->appendChild((MSXML::IXMLDOMNode*)ins, NULL) != S_OK )
        return FALSE;

    return TRUE;
}

After a call to CreateHeader(), I add a node and
it works OK. But the output file is not UTF-8, it's plain ANSI.
Why is this? Thanks.




Jochen Kalmbach [MVP]

2006-01-24, 7:07 pm

Hi ama!

> CComBSTR data(" version='1.0' encoding='UTF-8' ");


A simple question:
Have you tried:
CComBSTR data("version='1.0' encoding='utf-8'");
or
CComBSTR data("version=\"1.0\" encoding=\"utf-8\"");

I think utf must be in small letters... but I am not sure...

--
Greetings
Jochen

My blog about Win32 and .NET
http://blog.kalmbachnet.de/
ama

2006-01-24, 9:58 pm

>> CComBSTR data(" version='1.0' encoding='UTF-8' ");
>
> A simple question:
> Have you tried:
> CComBSTR data("version='1.0' encoding='utf-8'");
> or
> CComBSTR data("version=\"1.0\" encoding=\"utf-8\"");
>
> I think utf must be in small letters... but I am not sure...
>
> --
> Greetings
> Jochen


Thanks, unfortunately it makes no difference.

It raises a question though: how would MSXML know
that I'm attempting to generate a UTF-8 document unless
the data chunk is parsed? And if so, why not document
the exact syntax it expects? Why not use an enum instead
of a string like that? It seems very strange.

thanks again



Alex Blekhman

2006-01-25, 7:58 am

ama wrote:
> Hi,
>
> I need to create a UTF-8 XML document on the fly, and
> I used the following code:
>
> [ PS: m_pDoc is an IXMLDOMDocument ]
>
> BOOL CXmlHelper::CreateHeader()
> {
> if( ! m_pDoc )
> return FALSE;
>
> CComBSTR target("xml");
> CComBSTR data(" version='1.0' encoding='UTF-8' ");
> CComPtr<IXMLDOMProcessingInstruction> ins;
>
> if( m_pDoc->createProcessingInstruction(target, data,
> &ins) != S_OK ) return FALSE;
>
> if( m_pDoc->appendChild((MSXML::IXMLDOMNode*)ins, NULL)
> != S_OK ) return FALSE;
>
> return TRUE;
> }
>
> After a call to CreateHeader(), I add a node and
> it works OK. But the output file is not UTF-8, it's plain
> ANSI. Why is this? Thanks.


How can you tell that it is not UTF-8? If your document
contains only ASCII characters, then it will be
indistinguishable from ANSI, since the first 128 characters
are encoded identically in UTF-8 and in the ANSI code pages.
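
To make the point above concrete, here is a minimal sketch of a UTF-8 encoder. The function name EncodeUtf8 is hypothetical (it is not part of MSXML or any library mentioned in this thread), and surrogate and range checks are left out for brevity. It shows that an ASCII character such as 'A' stays a single identical byte, while 'ü' (U+00FC) becomes the two bytes C3 BC in UTF-8, whereas in the Windows-1252 ANSI code page it is the single byte FC.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Assumes cp <= 0x10FFFF and is not a surrogate; no validation is done.
std::string EncodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // ASCII range: one byte, same value as in ANSI code pages.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two-byte sequence: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four-byte sequence: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

So a practical way to tell the two encodings apart is to put a non-ASCII character such as 'ü' into the document and look at the saved bytes in a hex viewer.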


Jochen Kalmbach [MVP]

2006-01-25, 7:58 am

Hi Alex!

> How can you tell that it is not UTF-8?


Most UTF-8 documents contain a BOM:
http://www.unicode.org/faq/utf_bom.html

--
Greetings
Jochen

My blog about Win32 and .NET
http://blog.kalmbachnet.de/
Alex Blekhman

2006-01-25, 7:23 pm

Jochen Kalmbach [MVP] wrote:
> Hi Alex!
>
>
> Most UTF-8 documents contain a BOM:
> http://www.unicode.org/faq/utf_bom.html


Yes, I know that a BOM can be used to determine the
serialization encoding. However, MSXML will not save a BOM
for UTF-8, as you probably already know. Actually, a BOM is
not required by the XML specification, and UTF-8 is always
assumed unless a BOM is present or the XML declaration
specifies otherwise.

Strictly speaking, "BOM" is a misnomer for a UTF-8 stream,
since the "byte ordering" concept is inapplicable to UTF-8
(unlike UTF-16/32), and the BOM is really used as a magic
number. This is noted in the above-mentioned FAQ, too.
(http://www.unicode.org/faq/utf_bom.html#3)

So, under Windows using MSXML, one will get BOM-less XML
files by default.
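
As a rough illustration of the "magic number" idea, the sketch below inspects the first bytes of a buffer for the well-known BOM signatures (EF BB BF for UTF-8, FF FE for UTF-16 little endian, FE FF for UTF-16 big endian). The names Bom and DetectBom are hypothetical, the buffer is assumed to have been read from the file beforehand, and UTF-32 signatures are deliberately ignored to keep it short.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Look at the leading bytes of a buffer and report which BOM, if any,
// it starts with. UTF-32 BOMs are not handled here.
Bom DetectBom(const std::string& bytes)
{
    // Read a byte as unsigned so comparisons against 0xEF etc. work.
    const auto at = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };

    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return Bom::Utf8;
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return Bom::Utf16LE;
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;
}
```

For a UTF-8 file saved by MSXML as described above, such a check would be expected to report no BOM, which does not mean the file is not UTF-8.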


Jochen Kalmbach [MVP]

2006-01-25, 7:23 pm

Hi ama!
> it works OK. But the output file is not UTF-8, it's plain ANSI.
> Why is this ? thanks.


The following works perfectly for me...
You just need to remember that the file is *not* saved with a BOM for
UTF-8 (as Alex already pointed out).
The BOM is only written if you use UTF-16 (I don't know why, but this is
a fact).

<code>

#include <windows.h>
#include <tchar.h>
#include <atlbase.h>
#include <msxml2.h>
#pragma comment (lib, "msxml2.lib")

// Note: most HRESULTs are stored in hr but not checked, to keep the sample short.
int _tmain()
{
    CoInitialize(NULL);

    CComPtr<IXMLDOMDocument2> pDom;
    CLSID clsid = CLSID_DOMDocument;
    // Resolve the version-specific ProgID to a CLSID.
    if (FAILED(CLSIDFromProgID(L"Msxml2.DOMDocument.3.0", &clsid)))
    {
        CoUninitialize();
        return 1;
    }

    if ( SUCCEEDED (CoCreateInstance(clsid, NULL, CLSCTX_INPROC_SERVER,
        IID_IXMLDOMDocument2, (void**)(&pDom))))
    {
        HRESULT hr;
        CComPtr<IXMLDOMNode> pNewNode;

        CComPtr<IXMLDOMProcessingInstruction> pProcInstr;
        // You can use:
        // - UTF-16BE (UTF-16 Big Endian) => no BOM
        // - UTF-16 (UTF-16 Little Endian) => BOM!
        // - UTF-8 => no BOM
        hr = pDom->createProcessingInstruction(CComBSTR("xml"),
            CComBSTR("version='1.0' encoding='UTF-16'"), &pProcInstr);
        //hr = pDom->createProcessingInstruction(CComBSTR("xml"),
        //    CComBSTR("version='1.0' encoding='UTF-8'"), &pProcInstr);

        // Insert the XML declaration as the first node of the document.
        CComVariant vNullVal;
        vNullVal.vt = VT_NULL;
        hr = pDom->insertBefore(pProcInstr, vNullVal, &pNewNode);

        // Create and append the root element.
        CComPtr<IXMLDOMNode> pRootNode;
        CComVariant varNodeType((short)NODE_ELEMENT);
        hr = pDom->createNode(varNodeType, CComBSTR("ROOT"), CComBSTR(""),
            &pRootNode);

        pNewNode = NULL;
        hr = pDom->appendChild(pRootNode, &pNewNode);

        // Add an attribute with a non-ASCII character ('ü', written as an
        // escape so it does not depend on the source file's code page).
        CComPtr<IXMLDOMAttribute> pAttr;
        pDom->createAttribute(CComBSTR("test"), &pAttr);
        pAttr->put_text(CComBSTR(L"1234567890\u00FC"));
        CComPtr<IXMLDOMNamedNodeMap> pAttrs;
        pRootNode->get_attributes(&pAttrs);
        pNewNode = NULL;
        // setNamedItem attaches the attribute; attributes are not child
        // nodes, so no appendChild call is needed for pAttr.
        pAttrs->setNamedItem(pAttr, &pNewNode);

        hr = pDom->save(CComVariant("c:\\test.xml"));

        // Release the document before COM is uninitialized.
        pDom = NULL;
    }

    CoUninitialize();
    return 0;
}

</code>

--
Greetings
Jochen

My blog about Win32 and .NET
http://blog.kalmbachnet.de/
ama

2006-01-25, 7:23 pm

> The following works perfectly for me...
> You just need to remember that the file is *not* saved with a BOM for UTF-8
> (as Alex already pointed out).
> The BOM is only written if you use UTF-16 (I don't know why, but this is
> a fact).
>
> <code>
> <sniped> </code>
>
> --
> Greetings
> Jochen


Hello and thank you all.

I was looking for the BOM, my bad.
So by default MSXML generates BOM-less UTF-8?

I'm using the Flash player as well, so I'll have to look
into this because it uses strict rules for its parser.

Thanks again.
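
If a consumer really does turn out to require a UTF-8 BOM (whether the Flash player does is not established in this thread, so treat that as an assumption), one option is to post-process the bytes that MSXML saved. The sketch below uses a hypothetical helper named EnsureUtf8Bom that prepends EF BB BF only when it is not already present; reading and writing the file is left to the caller.

```cpp
#include <string>

// Hypothetical helper: return the buffer with a UTF-8 BOM at the front,
// adding one only if the buffer does not already start with it.
std::string EnsureUtf8Bom(const std::string& bytes)
{
    static const char kBom[] = "\xEF\xBB\xBF";

    // compare() on the first three bytes; a shorter buffer never matches.
    if (bytes.compare(0, 3, kBom) == 0)
        return bytes;

    return std::string(kBom) + bytes;
}
```

The function is idempotent, so it can be applied safely to a file that may or may not already carry the signature.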





Copyright 2008 codecomments.com