Code Comments

Question on string bytelength
My system encoding is cp1252, which is a one byte -> one character
encoding, so I expected the results of [string length] and [string
bytelength] to be the same.  However, when I read in arbitrary binary
data to a variable, [string bytelength] always seems to return a
larger number.  For example:

%encoding system cp1252
%set f [open binary.file r]
%fconfigure $f -translation binary
%set binstr [read $f 10]
%string length $binstr
10
%string bytelength $binstr
15
%binary scan $binstr c* binlist
%llength $binlist
10
%puts $binlist
73 73 42 0 8 0 0 0 16 0

According to the doc the values can differ when the encoding is
unicode, which I can understand, since unicode can use multiple bytes
to represent a character.  But why the discrepancy in values when the
encoding is of the single-byte kind?

Stephen Huntley
07-29-04 02:08 AM


Re: Question on string bytelength
Hi Donald,

Donald Arseneau <asnd@triumf.ca> writes:
> Such behaviour I would expect to give the correct string bytelength
> but a meaningless string length.

AFAIK, [string bytelength] is never meaningful on binary data,
regardless of how you get to it.  It counts the bytes of a UTF-8 string,
which binary data never is at the Tcl language level.
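
To make the arithmetic concrete, here is a minimal sketch using the ten byte values from the original post.  It assumes a Tcl 8.x interpreter, where the internal form is a modified UTF-8 that stores NUL as the two-byte sequence \xC0\x80 and every byte above 0x7F as two bytes as well.  The variable names are only illustrative, and [string bytelength] is wrapped in [catch] because newer releases may not provide it.

```tcl
# The ten byte values from the original post, built without reading a file.
set binstr [binary format c* {73 73 42 0 8 0 0 0 16 0}]

# Count the bytes that need two bytes in the internal modified UTF-8 form:
# NUL (stored as \xC0\x80) and every value above 0x7F.
binary scan $binstr c* values
set twoByte 0
foreach v $values {
    set v [expr {$v & 0xff}]          ;# c* yields signed values, so mask them
    if {$v == 0 || $v > 127} { incr twoByte }
}

set predicted [expr {[string length $binstr] + $twoByte}]
puts "string length:        [string length $binstr]"
puts "predicted bytelength: $predicted"

# Only call [string bytelength] where the interpreter still has it.
if {![catch {string bytelength $binstr} actual]} {
    puts "string bytelength:    $actual"
}
```

The data holds five NUL bytes, so the prediction is 10 + 5 = 15, which matches the value reported in the original post.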

benny

Benjamin Riefenstahl
07-29-04 08:57 AM


Re: Question on string bytelength
Donald Arseneau wrote:
> The "-translation binary" doesn't do what you and I expect.  It shocked
> me the first time, and now I have to check things carefully; and I mostly
> avoid it.

Mostly avoid [string bytelength] I hope, and not -translation binary.

> It does *not* cause a direct copying of the external data into the memory
> allocated for a string.  Such behaviour I would expect to give the
> correct string bytelength but a meaningless string length.  Instead,
> it translates to Tcl's internal UTF-8 string encoding, using a reversible
> mapping.

Note, though, that the creation of the UTF-8 encoded version of the
data will only happen if commands (like [string bytelength] or [puts])
need the data in that form.

If you stick to commands that are more suitable for binary data (notably
the [binary] command), that UTF-8 encoded string will never get created.
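
As a rough illustration of staying with binary-friendly commands, the sketch below writes the ten bytes from the original post to a throwaway file, reads them back, measures them with [string length], and picks fields apart with [binary scan].  The file name and the field names (order, magic, offset) are made up for the example.

```tcl
# Write a small throwaway file so the example is self-contained.
set path [file join [pwd] bytelength-demo.bin]   ;# hypothetical file name
set out [open $path w]
fconfigure $out -translation binary
puts -nonewline $out [binary format c* {73 73 42 0 8 0 0 0 16 0}]
close $out

# Read it back as binary data.
set in [open $path r]
fconfigure $in -translation binary
set data [read $in]
close $in
file delete $path

# For data read from a binary channel, [string length] is the byte count.
set nbytes [string length $data]
puts "bytes read: $nbytes"

# Decode two characters, a 16-bit and a 32-bit little-endian integer.
binary scan $data a2si order magic offset
puts "order=$order magic=$magic offset=$offset"
```

Nothing here asks for the UTF-8 form of the data, so only the byte array representation should be needed.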

For those folks familiar with the C level interfaces, this is the
two-representation nature of Tcl_Obj's, applied specifically to the
"bytearray" Tcl_ObjType.

--
| Don Porter          Mathematical and Computational Sciences Division |
| donald.porter@nist.gov             Information Technology Laboratory |
| http://math.nist.gov/~DPorter/                                  NIST |
|______________________________________________________________________|

Don Porter
07-29-04 08:57 AM


Re: Question on string bytelength
It seems the money quote appeared in the docs sometime between 8.3 and
8.4: "If the object is a ByteArray object (such as those returned from
reading a binary encoded channel), then this will return the actual
byte length of the object." Until this quote appeared it was by no
means clear that [string length] was a reliable tool for measuring
data block lengths in bytes.

Of course this seems to contradict the concept that "everything is a
string," since sometimes the thing is a byte array.

I expected "string length" to be an encoding-aware command and "string
bytelength" to be a binary operation.  I think the fact that "string
length" can shift mode to binary operation under the hood is
confusing.  Something like the following would be more clear:

Say file eucjp.txt contained hexadecimal chars \xA4\xCF.

%encoding system euc-jp
%set f [open eucjp.txt r]
%fconfigure $f -translation binary
%set s [read $f]
%string length $s
1
%string bytelength $s
2
%string internalstoragelength
3

Stephen Huntley
07-29-04 08:58 PM


Re: Question on string bytelength
Hi Stephen,


blacksqr@usa.net (Stephen Huntley) writes:
> I expected "string length" to be an encoding-aware command

Tcl is full Unicode inside, no other encoding is ever used, and nothing
in Tcl is or needs to be "encoding-aware".  Not needing
encoding-awareness is actually a major advantage of Unicode
compared to older systems.

Encodings in Tcl are usually applied when data is exchanged with the
outside of Tcl, e.g. in I/O.

> and "string bytelength" to be a binary operation.

To repeat, [string bytelength] counts the bytes needed to represent a
string in UTF-8.  Thus it is definitely not a "binary" operation; quite
the contrary, it only makes sense for text.

> I think the fact that "string length" can shift mode to binary
> operation under the hood is confusing.

There are no shifts AFAIK.  If you can show any place in Tcl where
Tcl-visible "mode shifting" of any kind takes place, you should post a
bug report.  The internal difference between string objects and binary
array objects inside the Tcl C library is not supposed to be visible
from Tcl.  It's supposed to be an internal optimization.

This doesn't mean that there isn't a difference between text and
binary, but that is not in the Tcl_Obj on the C level, it's a semantic
difference on the Tcl level.

On the C level, while ByteArrays are better suited for binary data, a
string object created by Tcl_NewStringObj can be used to represent
binary data (the opposite is not true in general, because the
ByteArray object representation cannot contain elements > 255).
It's just needlessly complicated to create such C level string objects
or to work with them through the C level UTF-8 string interfaces.

> Say file eucjp.txt contained hexadecimal chars \xA4\xCF.
>
> %encoding system euc-jp
> %set f [open eucjp.txt r]
> %fconfigure $f -translation binary

"-translation binary" implies "-encoding binary" which overrides the
global [encoding system] setting.  So your first line is not relevant
any more at this point.
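
A quick way to check this on your own interpreter is to query the channel's -encoding option before and after the [fconfigure] call.  This is only a sketch: it opens the interpreter's own executable simply because that file is known to exist, and the exact name reported afterwards may depend on the Tcl version (8.x releases are expected to report "binary").

```tcl
# Open any readable file; the running interpreter's executable is a
# convenient choice because it certainly exists.
set f [open [info nameofexecutable] r]

# Freshly opened channels normally start with the system encoding.
set before [fconfigure $f -encoding]

# Switching to binary translation also replaces the channel encoding.
fconfigure $f -translation binary
set after [fconfigure $f -encoding]
close $f

puts "encoding before: $before"
puts "encoding after:  $after"
```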

> %set s [read $f]

You are reading two bytes without any interpretation.  The result is
not text, even though Tcl represents the result as a string.

> %string length $s
> 1

No.  Tcl-wise you have a string containing the uninterpreted byte
values 0xA4 and 0xCF.  The result should be 2.  Think about it: You
never told Tcl that this is EUC-JP text.  How is Tcl supposed to know
this?
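
One way to tell Tcl that the bytes are EUC-JP text is [encoding convertfrom].  The sketch below builds the two bytes directly instead of reading eucjp.txt and assumes the euc-jp encoding that ships with a standard Tcl installation is available; the variable names are illustrative.

```tcl
# The two uninterpreted bytes 0xA4 0xCF from the example.
set raw [binary format H* a4cf]
set rawLen [string length $raw]
puts "bytes:       $rawLen"       ;# 2

# Interpret those bytes as EUC-JP text; the result is one character.
set text [encoding convertfrom euc-jp $raw]
set textLen [string length $text]
puts "characters:  $textLen"      ;# 1

# Size of the same character when encoded as UTF-8.
set utf8Len [string length [encoding convertto utf-8 $text]]
puts "utf-8 bytes: $utf8Len"      ;# 3
```

The same effect on a channel would come from "fconfigure $f -encoding euc-jp" instead of "-translation binary", so that [read] returns text rather than raw bytes.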

> %string bytelength $s
> 2

This doesn't make sense, because we are talking about bytes, not
actual characters.  (O.k., so both U+00A4 and U+00CF are incidentally
valid Unicode characters, but in general you could have any byte
values in that string at this point.)

> %string internalstoragelength
> 3

That is what [string bytelength] probably should have been called.
But then again, I really ask myself whether we shouldn't remove it
altogether, as I can't see any valid use for it, and it obviously
creates a lot of confusion.


benny


Benjamin Riefenstahl
07-29-04 08:58 PM


Re: Question on string bytelength
Don Porter <dgp@email.nist.gov> writes:

> Donald Arseneau wrote: 
>
> Mostly avoid [string bytelength] I hope, and not -translation binary.

Mostly [string bytelength], but I am very nervous of -translation binary
too.  As you could tell, I never fully understood the workings.

The comments here have been good for me.

Donald Arseneau                          asnd@triumf.ca

Donald Arseneau
07-30-04 08:58 AM

