"string map" with binary data

Wolfram Roesler

2004-07-28, 9:08 pm

Hello,

I tried the following to translate special characters in a
binary string:

set Str2 [string map \
{\x84 \{ \x94 | \x81 \} \x8E [ \x99 \\ \x9A ] \xE1 ~} \
$Str1]

I was surprised to find that translation worked only for some
of the characters, namely \x81, \x8E, and \xE1. The others (\x84,
\x94 etc.) didn't get translated and ended up in Str2 unchanged.

I suppose that this is some kind of character set/encoding issue,
isn't it? If so, is there a way to make "string map" work with binary
data?

Thanks for your help
W. Rösler

Benjamin Riefenstahl

2004-07-28, 9:08 pm

Hi Wolfram,


Wolfram Roesler <wr@spam.la> writes:
> I tried the following to translate special characters in a
> binary string:
>
> set Str2 [string map \
> {\x84 \{ \x94 | \x81 \} \x8E [ \x99 \\ \x9A ] \xE1 ~} \
> $Str1]
>
> I was surprised to find that translation worked only for some of the
> characters, namely \x81, \x8E, and \xE1. The others (\x84, \x94
> etc.) didn't get translated and ended up in Str2 unchanged.


Works fine here:

% info patchlevel
8.4.5
% set Str1 "\x84\x94\x81\x8E\x99\x9A\xE1"
„”ÂŽ™šá
% set Str2 [string map \
{\x84 \{ \x94 | \x81 \} \x8E [ \x99 \\ \x9A ] \xE1 ~} \
$Str1]

{|}[\]~
%

> I suppose that this is some kind of character set/encoding issue,
> isn't it? If so, is there a way to make "string map" work with binary
> data?


The problem is probably not [string map], but how you read and save
the data. Can you provide a complete program that shows the problem?

Background: In Tcl, "binary data" is just a string consisting of
pseudo-characters (bytes) with code points anywhere in the range 0 to
255. In contrast, a "normal" text string can have characters with any
Unicode code point (actually UTF-16). That covers 0x0000 to 0xFFFF,
except that a number of code points are unassigned or reserved by
Unicode, so they don't occur in regular text.
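
To make the distinction concrete, here is a minimal Tcl sketch that lists the code point of each character in a byte-style string. The variable names are illustrative only and do not come from the original posts.

```tcl
# A "binary" string: every character has a code point in 0..255.
set bytes "\x84\x94\x81"

# Collect the numeric code point of each character.
set codes {}
foreach ch [split $bytes ""] {
    scan $ch %c code
    lappend codes $code
}
puts $codes   ;# prints: 132 148 129
```

As long as every code point stays below 256, the string can be treated as bytes; once a character such as \u201E appears, it is text and no longer maps one-to-one to bytes.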


benny

Wolfram Roesler

2004-07-28, 9:08 pm

Hi,

> Works fine here:
>
> % info patchlevel
> 8.4.5
> % set Str1 "\x84\x94\x81\x8E\x99\x9A\xE1"
> „”ÂŽ™šá
> % set Str2 [string map \
> {\x84 \{ \x94 | \x81 \} \x8E [ \x99 \\ \x9A ] \xE1 ~} \
> $Str1]
>
> {|}[\]~


This works for me too, except that "info patchlevel" says
8.3.3 and "set Str1" says "??üÄ??ß".

> The problem is probably not [string map], but how you read and save
> the data. Can you provide a complete program that shows the problem?


The Tcl program is invoked by a C++ program like this:

int Eval(list<string> const &Lst)
{
    Tcl_Obj **const objv = new Tcl_Obj*[Lst.size()];
    list<string>::const_iterator Ptr;
    int i;
    for(i=0,Ptr=Lst.begin();Ptr!=Lst.end();++Ptr,++i)
    {
        string const UTF = WinToUTF(*Ptr);
        objv[i] = Tcl_NewStringObj(UTF.c_str(),UTF.size());
    }

    int const Ret = Tcl_EvalObjv(interp,Lst.size(),objv,0);

    delete[] objv;
    return Ret;
}

One of the strings in Lst is the binary string I want to use
"string map" on. WinToUTF is:

string WinToUTF(string const &Str)
{
    Tcl_DString DString;
    Tcl_DStringInit(&DString);
    Tcl_ExternalToUtfDString(NULL,Str.data(),Str.size(),&DString);
    string const Ret(Tcl_DStringValue(&DString),Tcl_DStringLength(&DString));
    Tcl_DStringFree(&DString);
    return Ret;
}

I suppose that this is where the translation takes place that's
disturbing my "string map": After all, I'm converting my binary
string into UTF-8. However, surprisingly, even though "string map"
doesn't work, the following does:

set Str2 ""
for {set i 0} {$i<[string length $Str1]} {incr i} {
    set Ch [string index $Str1 $i]
    switch -exact [charcode $Ch] {
        132 { append Str2 "\{" }
        148 { append Str2 "|" }
        129 { append Str2 "\}" }
        142 { append Str2 "\[" }
        153 { append Str2 "\\" }
        154 { append Str2 "\]" }
        225 { append Str2 "~" }
        default { append Str2 $Ch }
    }
}

charcode is a procedure implemented in the C++ part of my application
which returns the decimal ASCII code of the character passed as its
argument.

I'll try and go without WinToUTF in my Eval function, thus passing
a true binary string.

Thanks for your help
Wolfram

Benjamin Riefenstahl

2004-07-28, 9:08 pm

Hi Wolfram,


Wolfram Roesler <wr@spam.la> writes:
> The Tcl program is invoked by a C++ program like this:
> for(i=0,Ptr=Lst.begin();Ptr!=Lst.end();++Ptr,++i)
> {
> string const UTF = WinToUTF(*Ptr);
> objv[i] = Tcl_NewStringObj(UTF.c_str(),UTF.size());
> }


Ah, we are talking about the C level. That changes the POV and the
rules ;-).

You don't want to convert to UTF-8. You just want to use
Tcl_NewByteArrayObj directly on your original bytes.

> Tcl_ExternalToUtfDString(NULL,Str.data(),Str.size(),&DString);


This line assumes that your input is in the system encoding (usually
cp1252 on Windows), which it isn't, so you get all kinds of unexpected
Unicode characters for your bytes. Your code would work if you used
the "binary" pseudo-encoding, but that would be pointless.

Your original characters survive because they are decoded to Unicode
characters that you don't test for (for example, the cp1252 character
"\x84" maps to Unicode "\u201E"), and then, after your processing, they
are probably re-encoded to the same cp1252 characters. The round trip
(cp1252 -> Unicode -> cp1252) works fine in your case.
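
The effect can be reproduced at the script level. The following is only a sketch: it assumes the system encoding is cp1252 and uses [encoding convertfrom] to stand in for what Tcl_ExternalToUtfDString does on the C side. The sample bytes and variable names are made up for illustration.

```tcl
# Two sample bytes as they might arrive from the C++ side.
set raw "\x84\xE1"

# Decode them as cp1252 text (assumed system encoding).
set decoded [encoding convertfrom cp1252 $raw]

# Show the resulting code points.
set points {}
foreach ch [split $decoded ""] {
    scan $ch %c code
    lappend points [format "U+%04X" $code]
}
puts $points   ;# prints: U+201E U+00E1

# The key \x84 means the character U+0084, which is no longer in the
# decoded string, so only \xE1 is replaced.
set mapped [string map {\x84 \{ \xE1 ~} $decoded]
```

Here \xE1 still matches because cp1252 maps it to the same code point, U+00E1, while \x84 has become U+201E and is left untouched.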

> charcode is a procedure implemented in the C++ part of my
> application which returns the decimal ASCII code of the character
> passed as its argument.


<nitpicking>
None of your test characters is ASCII; ASCII only defines the
character codes from 0 to 127.
</nitpicking>

Other than that, I don't know why your Tcl version of [string map]
seems to work. But then I don't know what [charcode] does exactly; it
may re-encode into cp1252.


benny

Wolfram Roesler

2004-07-29, 3:57 am

Hi,

> You don't want to convert to UTF-8. You just want to use
> Tcl_NewByteArrayObj directly on your original bytes.


Yes, this would indeed solve the problem. I'm using the UTF
translation because my functions' arguments sometimes contain
international characters which would get messed up when using
pure byte array objects. Seems like a short blanket problem.
Unfortunately, in my C++ program I don't know if the string
I'm passing to Tcl is to be used as a character string or as
a byte array. The only solution seems to be to document that
it's the former, and have the Tcl program go the hard way if
it wants it to be the latter.
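
One possible reading of "the hard way", sketched in Tcl: the script re-encodes the text it received back into bytes before mapping. This assumes the C++ side decoded the argument as cp1252; the sample value is hypothetical.

```tcl
# What the script receives after the C++ side decoded the bytes
# \x84 \x94 \xE1 as cp1252 (simulated here for illustration).
set Str1 [encoding convertfrom cp1252 "\x84\x94\xE1"]

# Recover the original byte values.
set bytes [encoding convertto cp1252 $Str1]

# Now the byte-valued keys match again.
set Str2 [string map {\x84 \{ \x94 | \xE1 ~} $bytes]
puts $Str2   ;# prints: {|~
```

This only recovers the input exactly for bytes that survive the cp1252 round trip.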

> Other than that, I don't know why your Tcl version of [string map]
> seems to work. But then I don't know what [charcode] does exactly; it
> may re-encode into cp1252.


You are right, it uses UTF-to-Windows conversion internally.

Thanks for your help
Wolfram

Benjamin Riefenstahl

2004-07-29, 3:58 pm

Hi Wolfram,


Wolfram Roesler <wr@spam.la> writes:
> Yes, this would indeed solve the problem. I'm using the UTF
> translation because my functions' arguments sometimes contain
> international characters which would get messed up when using pure
> byte array objects.


You can postpone the text/binary decision. Just use
Tcl_NewByteArrayObj initially and use the Tcl command [encoding
convertfrom] later, when you know that you want to use it as text.

Make sure that you have the reverse conversion (or non-conversion)
covered, too.
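
A rough Tcl sketch of that idea, assuming the arguments arrive as raw bytes (as Tcl_NewByteArrayObj would deliver them) and that text arguments are encoded in cp1252. The variable names and sample values are illustrative only.

```tcl
# Argument used as binary data: map the byte values directly.
set binArg "\x84\x94\x81"
set mapped [string map {\x84 \{ \x94 | \x81 \}} $binArg]
puts $mapped   ;# prints: {|}

# Argument used as text: decode it only when it is known to be text.
set textArg "\xE4\xF6\xFC"
set text [encoding convertfrom cp1252 $textArg]

# Reverse conversion before handing text back as bytes.
set back [encoding convertto cp1252 $text]
```

The decision is made by the script that knows how each argument is used, and the C++ side does not have to guess.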


benny