Code Comments
My system encoding is cp1252, which is a one-byte-per-character encoding, so I expected the results of [string length] and [string bytelength] to be the same. However, when I read arbitrary binary data into a variable, [string bytelength] always seems to return a larger number. For example:

    % encoding system cp1252
    % set f [open binary.file r]
    % fconfigure $f -translation binary
    % set binstr [read $f 10]
    % string length $binstr
    10
    % string bytelength $binstr
    15
    % binary scan $binstr c* binlist
    % llength $binlist
    10
    % puts $binlist
    73 73 42 0 8 0 0 0 16 0

According to the docs, the values can differ when the encoding is Unicode, which I can understand, since Unicode can use multiple bytes to represent a character. But why the discrepancy in values when the encoding is of the single-byte kind?

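The 15 in the session above can be reproduced by simple arithmetic. The following is a minimal sketch, assuming Tcl 8.x's internal string form (a modified UTF-8 in which the NUL byte is stored as the two bytes \xC0\x80 and byte values 128..255 become two-byte sequences). The proc name expectedByteLength is hypothetical and not part of Tcl.

```tcl
# Hypothetical helper: estimate how many bytes Tcl 8.x's internal
# (modified) UTF-8 form needs for a list of byte values.
proc expectedByteLength {byteValues} {
    set total 0
    foreach b $byteValues {
        # [binary scan ... c*] yields signed values; fold them into 0..255.
        set b [expr {$b & 0xff}]
        if {$b == 0 || $b >= 0x80} {
            # Assumption: NUL is stored as \xC0\x80, and 0x80..0xFF
            # each take a two-byte UTF-8 sequence.
            incr total 2
        } else {
            # 0x01..0x7F take a single byte.
            incr total 1
        }
    }
    return $total
}

# The byte values printed in the question.
set binlist {73 73 42 0 8 0 0 0 16 0}
puts [expectedByteLength $binlist]   ;# 5 zero bytes * 2 + 5 other bytes * 1 = 15
```

Under that assumption the five zero bytes account for the five extra bytes, independent of the cp1252 system encoding.
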
Hi Donald,

Donald Arseneau <asnd@triumf.ca> writes:
> Such behaviour I would expect to give the correct string bytelength
> but a meaningless string length.

AFAIK, [string bytelength] is never meaningful on binary data, regardless of how you get to it. It counts the bytes of a UTF-8 string, which binary data never is at the Tcl language level.

benny

Donald Arseneau wrote:
> The "-translation binary" doesn't do what you and I expect. It shocked
> me the first time, and now I have to check things carefully; and I mostly
> avoid it.

Mostly avoid [string bytelength] I hope, and not -translation binary.

> It does *not* cause a direct copying of the external data into the memory
> allocated for a string. Such behaviour I would expect to give the
> correct string bytelength but a meaningless string length. Instead,
> it translates to Tcl's internal UTF-8 string encoding, using a reversible
> mapping.

Note, though, that the creation of the UTF-8 encoded version of the data will only happen if commands (like [string bytelength] or [puts]) need the data in that form. If you stick to commands that are more suitable for binary data (notably the [binary] command), that UTF-8 encoded string will never get created.

For those folks familiar with the C level interfaces, this is the two-representation nature of Tcl_Obj's, applied specifically to the "bytearray" Tcl_ObjType.

--
| Don Porter          Mathematical and Computational Sciences Division |
| donald.porter@nist.gov             Information Technology Laboratory |
| http://math.nist.gov/~DPorter/                                  NIST |
|______________________________________________________________________|

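A minimal sketch of staying with binary-oriented commands, as suggested above. The data is built in memory with [binary format] as a stand-in for [read $f 10] on a channel configured with -translation binary. Reading the ten bytes as a little-endian TIFF-style header is only an assumption made for illustration, and the variable names are hypothetical.

```tcl
# Ten bytes built in memory (assumption: stands in for data read from
# a channel configured with -translation binary).
set data [binary format c* {73 73 42 0 8 0 0 0 16 0}]

# One unit per byte: [string length] gives the size of the data block.
puts [string length $data]          ;# 10

# Pull the individual byte values out again.
binary scan $data c* values
puts $values                        ;# 73 73 42 0 8 0 0 0 16 0

# Assumed layout, for illustration only: two ASCII characters, a
# little-endian 16-bit integer, then a little-endian 32-bit integer.
binary scan $data "a2 s i" order magic offset
puts "$order $magic $offset"        ;# II 42 8
```

None of these commands needs a byte count of the internal string form, so [string bytelength] never enters the picture.
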
It seems the money quote appeared in the docs sometime between 8.3 and 8.4:

"If the object is a ByteArray object (such as those returned from reading a binary encoded channel), then this will return the actual byte length of the object."

Until this quote appeared it was by no means clear that [string length] was a reliable tool for measuring data block lengths in bytes.

Of course this seems to contradict the concept that "everything is a string," since sometimes the thing is a byte array.

I expected "string length" to be an encoding-aware command and "string bytelength" to be a binary operation. I think the fact that "string length" can shift mode to binary operation under the hood is confusing. Something like the following would be clearer:

Say file eucjp.txt contained the bytes \xA4\xCF.

    % encoding system euc-jp
    % set f [open eucjp.txt r]
    % fconfigure $f -translation binary
    % set s [read $f]
    % string length $s
    1
    % string bytelength $s
    2
    % string internalstoragelength
    3

Hi Stephen,

blacksqr@usa.net (Stephen Huntley) writes:
> I expected "string length" to be an encoding-aware command

Tcl is fully Unicode inside; no other encoding is ever used, so nothing in Tcl is or needs to be "encoding-aware". Not needing encoding awareness is actually a major advantage of Unicode compared to older systems. Encodings in Tcl are usually applied when data is exchanged with the outside of Tcl, e.g. in I/O.

> and "string bytelength" to be a binary operation.

To repeat, [string bytelength] counts the bytes needed to represent a string in UTF-8. Thus it is definitely not a "binary" operation; quite the contrary, it only makes sense for text.

> I think the fact that "string length" can shift mode to binary
> operation under the hood is confusing.

There are no shifts AFAIK. If you can show any place in Tcl where Tcl-visible "mode shifting" of any kind takes place, you should post a bug report.

The internal difference between string objects and byte array objects inside the Tcl C library is not supposed to be visible from Tcl. It's supposed to be an internal optimization.

This doesn't mean that there isn't a difference between text and binary, but that difference is not in the Tcl_Obj on the C level; it's a semantic difference on the Tcl level.

On the C level, while ByteArrays are better suited for binary data, a string object created by Tcl_NewStringObj can be used to represent binary data (the opposite is not true in general, because the ByteArray object representation cannot contain elements > 255). It's just needlessly complicated to create such C level string objects or to work with them through the C level UTF-8 string interfaces.

> Say file eucjp.txt contained hexadecimal chars \xA4\xCF.
>
> %encoding system euc-jp
> %set f [open eucjp.txt r]
> %fconfigure $f -translation binary

"-translation binary" implies "-encoding binary", which overrides the global [encoding system] setting. So your first line is not relevant any more at this point.

> %set s [read $f]

You are reading two bytes without any interpretation. The result is not text, even though Tcl represents the result as a string.

> %string length $s
> 1

No. Tcl-wise you have a string containing the uninterpreted byte values 0xA4 and 0xCF. The result should be 2. Think about it: you never told Tcl that this is EUC-JP text. How is Tcl supposed to know this?

> %string bytelength $s
> 2

This doesn't make sense, because we are talking about bytes, not actual characters. (O.k., so both U+00A4 and U+00CF are incidentally valid Unicode characters, but in general you could have any byte values in that string at this point.)

> %string internalstoragelength
> 3

That is what [string bytelength] probably should have been called. But then again, I'm really asking myself whether we shouldn't remove it altogether, as I can't see any valid use for it, and it obviously creates a lot of confusion.

benny

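The point about telling Tcl that the bytes are EUC-JP text can be sketched as follows. This is an illustration only: the two bytes are built in memory rather than read from eucjp.txt, and it assumes the euc-jp encoding that ships with a standard Tcl installation is available.

```tcl
# The two raw bytes 0xA4 0xCF, built in memory (assumption: stands in
# for [read $f] on a channel configured with -translation binary).
set raw [binary format H* a4cf]
puts [string length $raw]           ;# 2 (two uninterpreted bytes)

# Explicitly interpret the bytes as EUC-JP text.
set text [encoding convertfrom euc-jp $raw]
puts [string length $text]          ;# 1 (one character)
```

When reading from a file, the same interpretation can be requested up front with fconfigure $f -encoding euc-jp in place of -translation binary, so that [read] returns text rather than raw bytes.
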
Don Porter <dgp@email.nist.gov> writes:
> Donald Arseneau wrote:
>
> Mostly avoid [string bytelength] I hope, and not -translation binary.

Mostly [string bytelength], but I am very nervous about -translation binary too. As you could tell, I never fully understood the workings. The comments here have been good for me.

Donald Arseneau
asnd@triumf.ca