Question on string bytelength
| Stephen Huntley 2004-07-28, 9:08 pm |
| My system encoding is cp1252, which is a one byte -> one character
encoding, so I expected the results of [string length] and [string
bytelength] to be the same. However, when I read in arbitrary binary
data to a variable, [string bytelength] always seems to return a
larger number. For example:
%encoding system cp1252
%set f [open binary.file r]
%fconfigure $f -translation binary
%set binstr [read $f 10]
%string length $binstr
10
%string bytelength $binstr
15
%binary scan $binstr c* binlist
%llength $binlist
10
%puts $binlist
73 73 42 0 8 0 0 0 16 0
According to the doc the values can differ when the encoding is
unicode, which I can understand, since unicode can use multiple bytes
to represent a character. But why the discrepancy in values when the
encoding is of the single-byte kind?
| |
| Benjamin Riefenstahl 2004-07-29, 3:57 am |
| Hi Donald,
Donald Arseneau <asnd@triumf.ca> writes:
> Such behaviour I would expect to give the correct string bytelength
> but a meaningless string length.
AFAIK, [string bytelength] is never meaningful on binary data,
regardless of how you get to it. It counts the bytes of a UTF-8 string,
which binary data never is at the Tcl language level.
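The 15 reported in the original post is consistent with this. What follows is a minimal sketch, assuming a Tcl 8.x interpreter where [string bytelength] is available; the ten bytes are rebuilt from the posted [binary scan] output rather than read from binary.file. Tcl's internal UTF-8 stores a NUL byte as the two-byte sequence \xC0\x80, and the data contains five NULs, so 10 + 5 = 15.

```tcl
# Assumption: Tcl 8.x. The byte values are copied from the [puts $binlist]
# output in the original post, not read from the poster's file.
set binstr [binary format c* {73 73 42 0 8 0 0 0 16 0}]

# One element per byte of data.
set charLen [string length $binstr]

# Bytes in the internal UTF-8 form; each NUL is stored as \xC0\x80.
set utfLen [string bytelength $binstr]

puts "string length:     $charLen"   ;# 10
puts "string bytelength: $utfLen"    ;# 15
```

Any byte value of 0x80 or above would also take two bytes in that form, so the gap between the two numbers depends on the data, not on the system encoding.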
benny
| |
| Don Porter 2004-07-29, 3:57 am |
| Donald Arseneau wrote:
> The "-translation binary" doesn't do what you and I expect. It shocked
> me the first time, and now I have to check things carefully; and I mostly
> avoid it.
Mostly avoid [string bytelength] I hope, and not -translation binary.
> It does *not* cause a direct copying of the external data into the memory
> allocated for a string. Such behaviour I would expect to give the
> correct string bytelength but a meaningless string length. Instead,
> it translates to Tcl's internal UTF-8 string encoding, using a reversible
> mapping.
Note, though, that the creation of the UTF-8 encoded version of the
data will only happen if commands (like [string bytelength] or [puts])
need the data in that form.
If you stick to commands that are more suitable for binary data (notably
the [binary] command), that UTF-8 encoded string will never get created.
For those folks familiar with the C level interfaces, this is the
two-representation nature of Tcl_Obj's, applied specifically to the
"bytearray" Tcl_ObjType.
--
| Don Porter Mathematical and Computational Sciences Division |
| donald.porter@nist.gov Information Technology Laboratory |
| http://math.nist.gov/~DPorter/ NIST |
|______________________________________________________________________|
| |
| Stephen Huntley 2004-07-29, 3:58 pm |
| It seems the money quote appeared in the docs sometime between 8.3 and
8.4: "If the object is a ByteArray object (such as those returned from
reading a binary encoded channel), then this will return the actual
byte length of the object." Until this quote appeared it was by no
means clear that [string length] was a reliable tool for measuring
data block lengths in bytes.
Of course this seems to contradict the concept that "everything is a
string," since sometimes the thing is a byte array.
I expected "string length" to be an encoding-aware command and "string
bytelength" to be a binary operation. I think the fact that "string
length" can shift mode to binary operation under the hood is
confusing. Something like the following would be more clear:
Say file eucjp.txt contained hexadecimal chars \xA4\xCF.
%encoding system euc-jp
%set f [open eucjp.txt r]
%fconfigure $f -translation binary
%set s [read $f]
%string length $s
1
%string bytelength $s
2
%string internalstoragelength
3
| |
| Benjamin Riefenstahl 2004-07-29, 3:58 pm |
| Hi Stephen,
blacksqr@usa.net (Stephen Huntley) writes:
> I expected "string length" to be an encoding-aware command
Tcl is fully Unicode inside; no other encoding is ever used, and nothing
in Tcl is or needs to be "encoding-aware". Not needing
encoding-awareness is actually a major advantage of Unicode as
compared to older systems.
Encodings in Tcl are usually applied when data is exchanged with the
outside of Tcl, e.g. in I/O.
> and "string bytelength" to be a binary operation.
To repeat, [string bytelength] counts the bytes needed to represent a
string in UTF-8. Thus it is definitely not a "binary" operation; quite
the contrary, it only makes sense for text.
> I think the fact that "string length" can shift mode to binary
> operation under the hood is confusing.
There are no shifts AFAIK. If you can show any place in Tcl where
Tcl-visible "mode shifting" of any kind takes place, you should post a
bug report. The internal difference between string objects and binary
array objects inside the Tcl C library is not supposed to be visible
from Tcl. It's supposed to be an internal optimization.
This doesn't mean that there isn't a difference between text and
binary, but that is not in the Tcl_Obj on the C level, it's a semantic
difference on the Tcl level.
On the C level, while ByteArrays are better suited for binary data, a
string object created by Tcl_NewStringObj can be used to represent
binary data (the opposite is not true in general, because the
ByteArray object representation cannot contain elements > 255).
It's just needlessly complicated to create such C level string objects
or to work with them through the C level UTF-8 string interfaces.
> Say file eucjp.txt contained hexadecimal chars \xA4\xCF.
>
> %encoding system euc-jp
> %set f [open eucjp.txt r]
> %fconfigure $f -translation binary
"-translation binary" implies "-encoding binary" which overrides the
global [encoding system] setting. So your first line is not relevant
any more at this point.
> %set s [read $f]
You are reading two bytes without any interpretation. The result is
not text, even though Tcl represents the result as a string.
> %string length $s
> 1
No. Tcl-wise you have a string containing the uninterpreted byte
values 0xA4 and 0xCF. The result should be 2. Think about it: You
never told Tcl that this is EUC-JP text. How is Tcl supposed to know
this?
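A rough sketch of the difference, assuming the euc-jp encoding is installed with the interpreter. The two bytes are built with [binary format] here instead of being read from eucjp.txt, and the variable names are only illustrative:

```tcl
# The same two bytes that a binary-mode [read] would return, uninterpreted.
set raw [binary format H* a4cf]
set rawLen [string length $raw]              ;# 2 bytes

# Tell Tcl explicitly that these bytes are EUC-JP text.
set text [encoding convertfrom euc-jp $raw]
set textLen [string length $text]            ;# 1 character

# Size of that one character when written out as UTF-8.
set utf8Len [string length [encoding convertto utf-8 $text]]   ;# 3 bytes

puts "$rawLen $textLen $utf8Len"
```

Opening the file with [fconfigure $f -encoding euc-jp] instead of -translation binary would do the same decoding at read time.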
> %string bytelength $s
> 2
This doesn't make sense, because we are talking about bytes, not
actual characters. (O.k., so both U+00A4 and U+00CF are incidentally
valid Unicode characters, but in general you could have any byte
values in that string at this point.)
> %string internalstoragelength
> 3
That is what [string bytelength] probably should have been called.
But then again, I'm really asking myself whether we shouldn't remove it
altogether, as I can't see any valid use for it, and it obviously
creates a lot of confusion.
benny
| |
| Donald Arseneau 2004-07-30, 3:58 am |
| Don Porter <dgp@email.nist.gov> writes:
> Donald Arseneau wrote:
>
> Mostly avoid [string bytelength] I hope, and not -translation binary.
Mostly [string bytelength], but I am very nervous of -translation binary
too. As you could tell, I never fully understood the workings.
The comments here have been good for me.
Donald Arseneau asnd@triumf.ca