wide character output using iostream library
| chupeev alexander 2006-02-15, 7:04 pm |
| Hello,
The Visual Studio .NET 2003 documentation states that if I create a
std::ofstream object in binary mode, it will output Unicode
characters without translation to MBCS. The following sample
demonstrates that the output is always a multibyte sequence,
regardless of whether the object operates in binary mode or not.
For comparison, I can achieve the desired effect with the std::wcout
object by putting stdout in binary mode.
/////////////////////////////////////////////////////////////////////
// widechar.cpp
// Working with wide-character output by means of iostream library
// compile as
// cl /MDd /Zi /GR /GX /nologo /D USE_CHANGE_MODE /D USE_BINARY_MODE widechar.cpp
// use as
// widechar.exe > wcout.txt
// note:
// wcout.txt is in UCS-16 format
// widechar.cpp.txt expected in UCS-16 format, but output occurs
// in plain ASCII, contrary to VC documentation
//
//
#include <tchar.h>
#include <fstream>
#include <iostream>
#define STRINGIZE(lex) #lex
#define STRINGIZE2(lex) STRINGIZE(lex)
#define WIDECHAR_INLINE(text) L##text
#define WIDECHAR_INLINE2(text) WIDECHAR_INLINE(text)
#define WIDECHAR_CR L"\x000D"
#define WIDECHAR_LF L"\x000A"
#define WIDECHAR_UCS16 L"\xFEFF"
#define WIDECHAR_ENDL WIDECHAR_CR WIDECHAR_LF
#ifdef USE_CHANGE_MODE
#include <stdio.h>
#include <fcntl.h>
#include <io.h>
#include <locale.h>
#endif
#pragma setlocale("rus")
#define FIELD_X1 1386000
int main()
{
const wchar_t text[] = \
L"Когда я ем, я глух и нем" \
L", хитёр и быстр" \
L", силён и ловок..." \
L"\r\n\t... и дьявольски умён";
#ifdef USE_CHANGE_MODE
int result = _setmode( _fileno( stdout ), _O_BINARY );
if (result == -1)
{
perror( "Cannot set mode" );
}
else
{
fprintf(stderr, "'stdout' successfully changed to binary mode\n" );
}
std::ios::sync_with_stdio();
#endif
std::wcout << WIDECHAR_UCS16 << text << WIDECHAR_ENDL;
long long x = 1000000000000LL;
std::wcout << x << WIDECHAR_ENDL;
std::locale sys("Russian_Russia.866");
#ifdef USE_BINARY_MODE
std::wofstream out(__FILE__ ".txt", std::ios_base::binary);
#else
std::wofstream out(__FILE__ ".txt");
#endif
out.imbue(sys);
// out << WIDECHAR_UCS16;
out << WIDECHAR_INLINE2(STRINGIZE2(FIELD_X1)) WIDECHAR_ENDL;
out << text << WIDECHAR_ENDL;
if (!out)
{
perror("wide stream output error");
}
return 0;
}
| |
| Ulrich Eckhardt 2006-02-16, 3:59 am |
| chupeev alexander wrote:
> Visual Studio .NET 2003 Documentation states that if I create
> std::ofstream object in binary mode, it will output Unicode
> characters without translation to MBCS.
You need to look up what Unicode means. There is no such thing as a "Unicode
character". I guess you mean UTF-16 or UCS2.
> The following sample
> demonstrates that output will be always in multibyte sequence
> regardless whether object operates in binary mode or not.
> For comparison, I can achieve desired effect with std::wcout
> object by putting stdout in binary mode.
>
> /////////////////////////////////////////////////////////////////////
> // widechar.cpp
> // Working with wide-character output by means of iostream library
> // compile as
> // cl /MDd /Zi /GR /GX /nologo /D USE_CHANGE_MODE /D USE_BINARY_MODE widechar.cpp
> // use as
> // widechar.exe > wcout.txt
> // note:
> // wcout.txt is in UCS-16 format
UTF-16 or UCS2?
> #define STRINGIZE(lex) #lex
> #define STRINGIZE2(lex) STRINGIZE(lex)
> #define WIDECHAR_INLINE(text) L##text
> #define WIDECHAR_INLINE2(text) WIDECHAR_INLINE(text)
> #define WIDECHAR_CR L"\x000D"
> #define WIDECHAR_LF L"\x000A"
You have heard of C++ constants, right?
> #define WIDECHAR_UCS16 L"\xFEFF"
Again, UCS16 doesn't exist. This thing is, by the way, called a byte order
mark (BOM).
> const wchar_t text[] = \
> L"Когда я ем, я глух и нем" \
> L", хитёр и быстр" \
> L", силён и ловок..." \
> L"\r\n\t... и дьявольски умён";
The backslashes are useless, string literal concatenation works across any
kind of whitespace.
> std::ios::sync_with_stdio();
No need to activate this, it is the default - what do you expect this to do?
> #ifdef USE_BINARY_MODE
> std::wofstream out(__FILE__ ".txt", std::ios_base::binary);
> #else
> std::wofstream out(__FILE__ ".txt");
> #endif
Okay, according to C++, the 'binary' flag suppresses conversion of
line endings (CR/CRLF/LF). Nothing less, nothing more.
> out.imbue(sys);
> // out << WIDECHAR_UCS16;
> out << WIDECHAR_INLINE2(STRINGIZE2(FIELD_X1)) WIDECHAR_ENDL;
> out << text << WIDECHAR_ENDL;
You might want to flush here, otherwise the check below is not necessarily
meaningful.
> if (!out)
> {
> perror("wide stream output error");
> }
Now, the problem is not in the way the binary flag works, but rather that
your expectations are wrong. The locale contains a so-called facet that
controls conversion between the internally used character set and the bytes
written to disk. You use the "Russian_Russia.866" locale, and I guess that
the 866 is the code page. IOW, the locale contains a facet that outputs text
in that code page. What you probably want is a different locale, e.g.
"Russian_Russia.UTF-8" (danger: I don't know whether this locale really
exists), to get a text format that is fully Unicode capable.
Lastly, the fact that redirecting things to a file works as you want it to
might be due to the additional layer imposed by the shell you run this
with. I don't understand how the binary mode on stdout would change that,
though, but I'm not too well versed in C file IO anyway.
Uli
| |
| Tom Widmer [VC++ MVP] 2006-02-16, 8:01 am |
| chupeev alexander wrote:
> Hello,
>
> Visual Studio .NET 2003 Documentation states that if I create
> std::ofstream object in binary mode, it will output Unicode
> characters without translation to MBCS.
Where does it say this? It isn't true AFAIK.
> The following sample
> demonstrates that output will be always in multibyte sequence
> regardless whether object operates in binary mode or not.
> For comparison, I can achieve desired effect with std::wcout
> object by putting stdout in binary mode.
std::basic_filebuf<wchar_t> uses the codecvt facet to work out how to
output wchar_t characters as bytes. The default facet simply throws away
the high order byte, so each wchar_t is output as a single char - rarely
what is desired. If you want another encoding, such as UTF-8, UTF-16LE
or whatever, you need to imbue a locale with the appropriate codecvt
facet on the stream. Unfortunately (and I think this is terrible),
VC2003 doesn't ship with any such facets, and you either have to write
them yourself (which isn't trivial) or you have to buy Dinkumware's
CoreX library: http://www.dinkumware.com/libDCorX.html.
Someone else might know of a free UTF-16 facet though. Someone at Boost
(www.boost.org) was working on a codecvt library, but I don't know what
happened to it.
Tom
| |
| chupeev alexander 2006-02-16, 7:05 pm |
| Hi,
Ulrich, Tom, thank you very much for such valuable replies. I now understand
that I was wrong about how the iostream library handles wide-character
streams. Today I solved the problem with a do-nothing facet (code below)
borrowed from CodeProject, which makes it possible to output in UTF-16
encoding. I looked through the Boost source code tonight and found the same
thing written in a much better way. I will try it tomorrow.
--------------------------------------------------------------------------------------
#include <iosfwd>
#include <locale>
typedef std::codecvt < wchar_t, char, mbstate_t > wchar_null_codecvt_base;
class wchar_null_codecvt: public wchar_null_codecvt_base
{
public:
typedef wchar_t _E;
typedef char _To;
typedef mbstate_t _St;
explicit wchar_null_codecvt( size_t _R=0 ): wchar_null_codecvt_base( _R )
{
}
// Conversion hooks: report noconv so that the library does no translation.
virtual result do_in( _St& _State , const _To* _F1 , const _To* _L1 ,
const _To*& _Mid1 , _E* _F2 , _E* _L2 , _E*& _Mid2 ) const
{
return noconv ;
}
// The output buffer parameters must be _To* (external type) to override the base.
virtual result do_out( _St& _State , const _E* _F1 , const _E* _L1 ,
const _E*& _Mid1 , _To* _F2 , _To* _L2 , _To*& _Mid2 ) const
{
return noconv ;
}
virtual result do_unshift( _St& _State , _To* _F2 , _To* _L2 , _To*& _Mid2 ) const
{
return noconv ;
}
{
return noconv ;
}
virtual int do_length( _St& _State , const _To* _F1 , \
const _To* _L1 , size_t _N2 ) const _THROW0()
{
return (_N2 < (size_t)(_L1 - _F1)) ? _N2 : _L1 - _F1 ;
}
virtual bool do_always_noconv() const _THROW0()
{
return true ;
}
virtual int do_max_length() const _THROW0()
{
return 2 ;
}
virtual int do_encoding() const _THROW0()
{
return 2 ;
}
};
| |
| Tom Widmer [VC++ MVP] 2006-02-17, 7:58 am |
| chupeev alexander wrote:
> Hi,
>
> Ulrich, Tom, thank you very much for such valuable replies. I understood
> that I was wrong regarding way of iostream library handle wide-character
> streams. Today I solved the problem with do-nothing facet (code below)
> borrowed from codeproject that will make it possible output in UTF-16
> encoding.
Ok, the code outputs in the native wchar_t encoding, which on Windows is
UTF-16LE, which is probably enough for many uses. I hadn't realised that
noconv led to characters being byte-copied to the file. Checking the
standard, this appears to be an extension in the Dinkumware libraries -
noconv is only supposed to be returned when the internal type equals the
external type (in this case, the internal type is wchar_t but the
external type is char). Given this extension, I think the library is
missing an optimization - I think it could enable buffering when
do_always_noconv() returns true.
> I looked through boost source code tonight and found the same
> stuff written in much better way. I will try it next day.
Where did you find it?
Thanks,
Tom
| |
| chupeev alexander 2006-02-17, 7:02 pm |
|
"Tom Widmer [VC++ MVP]" <tom_usenet@hotmail.com> wrote in message
> chupeev alexander wrote:
>
> Where did you find it?
>
I mean the boost::archive::codecvt_null<wchar_t> class from the serialization
library, located in the files libs\serialization\src\codecvt_null.cpp and
boost\archive\codecvt_null.hpp.