
Question on string bytelength
Stephen Huntley

2004-07-28, 9:08 pm

My system encoding is cp1252, which is a one byte -> one character
encoding, so I expected the results of [string length] and [string
bytelength] to be the same. However, when I read in arbitrary binary
data to a variable, [string bytelength] always seems to return a
larger number. For example:

%encoding system cp1252
%set f [open binary.file r]
%fconfigure $f -translation binary
%set binstr [read $f 10]
%string length $binstr
10
%string bytelength $binstr
15
%binary scan $binstr c* binlist
%llength $binlist
10
%puts $binlist
73 73 42 0 8 0 0 0 16 0

According to the doc the values can differ when the encoding is
unicode, which I can understand, since unicode can use multiple bytes
to represent a character. But why the discrepancy in values when the
encoding is of the single-byte kind?
Benjamin Riefenstahl

2004-07-29, 3:57 am

Hi Donald,

Donald Arseneau <asnd@triumf.ca> writes:
> Such behaviour I would expect to give the correct string bytelength
> but a meaningless string length.


AFAIK, [string bytelength] is never meaningful on binary data,
regardless of how you get to it. It counts the bytes of a UTF-8 string,
which binary data never is at the Tcl language level.

benny
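
The 15 reported in the original transcript is consistent with this. A minimal sketch, assuming Tcl's internal form is the modified UTF-8 in which a NUL is stored as the two bytes 0xC0 0x80 and every value from 0x80 to 0xFF also takes two bytes. internalUtf8Length is a hypothetical helper written for this illustration, not a Tcl built-in:

```tcl
# Hypothetical helper: estimate how many bytes Tcl's internal UTF-8 form
# needs for a block of binary data (one character per byte, 0..255).
proc internalUtf8Length {bytes} {
    set values {}
    binary scan $bytes c* values
    set n 0
    foreach v $values {
        # "c" yields signed values; mask to get 0..255.
        set v [expr {$v & 0xff}]
        if {$v == 0 || $v >= 0x80} {
            # Assumption: NUL is stored as C0 80; 0x80..0xFF need two bytes.
            incr n 2
        } else {
            incr n 1
        }
    }
    return $n
}

# The ten bytes shown in the original post: five of them are zero.
set binstr [binary format c* {73 73 42 0 8 0 0 0 16 0}]
puts [string length $binstr]        ;# 10
puts [internalUtf8Length $binstr]   ;# 15
```

Ten bytes plus one extra byte for each of the five NULs gives 15, which matches what [string bytelength] printed.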
Don Porter

2004-07-29, 3:57 am

Donald Arseneau wrote:
> The "-translation binary" doesn't do what you and I expect. It shocked
> me the first time, and now I have to check things carefully; and I mostly
> avoid it.


Mostly avoid [string bytelength] I hope, and not -translation binary.

> It does *not* cause a direct copying of the external data into the memory
> allocated for a string. Such behaviour I would expect to give the
> correct string bytelength but a meaningless string length. Instead,
> it translates to Tcl's internal UTF-8 string encoding, using a reversible
> mapping.


Note, though, that the creation of the UTF-8 encoded version of the
data will only happen if commands (like [string bytelength] or [puts])
need the data in that form.

If you stick to commands that are more suitable for binary data (notably
the [binary] command), that UTF-8 encoded string will never get created.

For those folks familiar with the C level interfaces, this is the
two-representation nature of Tcl_Obj's, applied specifically to the
"bytearray" Tcl_ObjType.

--
| Don Porter Mathematical and Computational Sciences Division |
| donald.porter@nist.gov Information Technology Laboratory |
| http://math.nist.gov/~DPorter/ NIST |
|______________________________________________________________________|
Stephen Huntley

2004-07-29, 3:58 pm

It seems the money quote appeared in the docs sometime between 8.3 and
8.4: "If the object is a ByteArray object (such as those returned from
reading a binary encoded channel), then this will return the actual
byte length of the object." Until this quote appeared it was by no
means clear that [string length] was a reliable tool for measuring
data block lengths in bytes.
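
A short sketch of that usage, for illustration only. The scratch file name is an assumption made up for this example, and the ten bytes are the ones from the first post:

```tcl
# Write ten known bytes to a scratch file (file name is hypothetical).
set path [file join [pwd] bytelength-demo.bin]
set out [open $path w]
fconfigure $out -translation binary
puts -nonewline $out [binary format c* {73 73 42 0 8 0 0 0 16 0}]
close $out

# Read them back through a binary channel.
set in [open $path r]
fconfigure $in -translation binary
set data [read $in]
close $in
file delete $path

# [string length] reports the number of bytes read.
puts [string length $data]   ;# 10
```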

Of course this seems to contradict the concept that "everything is a
string," since sometimes the thing is a byte array.

I expected "string length" to be an encoding-aware command and "string
bytelength" to be a binary operation. I think the fact that "string
length" can shift mode to binary operation under the hood is
confusing. Something like the following would be more clear:

Say file eucjp.txt contained hexadecimal chars \xA4\xCF.

%encoding system euc-jp
%set f [open eucjp.txt r]
%fconfigure $f -translation binary
%set s [read $f]
%string length $s
1
%string bytelength $s
2
%string internalstoragelength
3
Benjamin Riefenstahl

2004-07-29, 3:58 pm

Hi Stephen,


blacksqr@usa.net (Stephen Huntley) writes:
> I expected "string length" to be an encoding-aware command


Tcl is fully Unicode inside. No other encoding is ever used, and nothing
in Tcl is or needs to be "encoding-aware". Not needing
encoding-awareness is actually a major advantage of Unicode
compared to older systems.

Encodings in Tcl are usually applied when data is exchanged with the
outside of Tcl, e.g. in I/O.

> and "string bytelength" to be a binary operation.


To repeat, [string bytelength] counts the bytes needed to represent a
string in UTF-8. Thus it is definitely not a "binary" operation; quite
the contrary, it only makes sense for text.

> I think the fact that "string length" can shift mode to binary
> operation under the hood is confusing.


There are no shifts AFAIK. If you can show any place in Tcl where
Tcl-visible "mode shifting" of any kind takes place, you should post a
bug report. The internal difference between string objects and binary
array objects inside the Tcl C library is not supposed to be visible
from Tcl. It's supposed to be an internal optimization.

This doesn't mean that there isn't a difference between text and
binary, but that is not in the Tcl_Obj on the C level, it's a semantic
difference on the Tcl level.

On the C level, while ByteArrays are better suited for binary data, a
string object created by Tcl_NewStringObj can be used to represent
binary data (the opposite is not true in general, because the
ByteArray object representation cannot contain elements > 255).
It's just needlessly complicated to create such C level string objects
or to work with them through the C level UTF-8 string interfaces.

> Say file eucjp.txt contained hexadecimal chars \xA4\xCF.
>
> %encoding system euc-jp
> %set f [open eucjp.txt r]
> %fconfigure $f -translation binary


"-translation binary" implies "-encoding binary" which overrides the
global [encoding system] setting. So your first line is not relevant
any more at this point.

> %set s [read $f]


You are reading two bytes without any interpretation. The result is
not text, even though Tcl represents the result as a string.

> %string length $s
> 1


No. Tcl-wise you have a string containing the uninterpreted byte
values 0xA4 and 0xCF. The result should be 2. Think about it: You
never told Tcl that this is EUC-JP text. How is Tcl supposed to know
this?

> %string bytelength $s
> 2


This doesn't make sense, because we are talking about bytes, not
actual characters. (O.k., so both U+00A4 and U+00CF are incidentally
valid Unicode characters, but in general you could have any byte
values in that string at this point.)

> %string internalstoragelength
> 3


That is what [string bytelength] probably should have been called.
But then again, I'm really asking myself whether we shouldn't remove it
altogether, as I can't see any valid use for it, and it obviously
creates a lot of confusion.


benny
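
For comparison, a rough sketch of how the numbers 1 and 3 from the hypothetical transcript can be obtained once Tcl is told that the two bytes are EUC-JP text. It assumes the euc-jp encoding that ships with a standard Tcl installation is available, and it uses [encoding convertfrom] on the raw bytes rather than a channel setting:

```tcl
# The two raw bytes from the example, with no interpretation applied.
set raw [binary format H* a4cf]
puts [string length $raw]    ;# 2 (two uninterpreted bytes)

# Explicitly decode the bytes as EUC-JP text.
set text [encoding convertfrom euc-jp $raw]
puts [string length $text]   ;# 1 (one character)

# Encode that character as UTF-8 and count the resulting bytes.
set utf8 [encoding convertto utf-8 $text]
puts [string length $utf8]   ;# 3
```

The same decoding could presumably be done on the channel with fconfigure -encoding euc-jp instead of -translation binary.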

Donald Arseneau

2004-07-30, 3:58 am

Don Porter <dgp@email.nist.gov> writes:

> Donald Arseneau wrote:
>
> Mostly avoid [string bytelength] I hope, and not -translation binary.


Mostly [string bytelength], but I am very nervous of -translation binary
too. As you could tell, I never fully understood the workings.

The comments here have been good for me.

Donald Arseneau asnd@triumf.ca

Copyright 2008 codecomments.com