
Re: Is there a way to dynamically determine (in 'c') if an Tcl_Obj is in fact a Unic
Author: Joe English
Date: 2007-11-26, 7:22 pm

Todd Helfter wrote:
> Ok, let me try to ask the question another way.
>
> If for instance I have a database table with a single column of
> varchar2(4000).
>
> In the ASCII case, I can use a memory buffer of 4000 + 1 (for the null
> byte). In the case when I would use Unicode,


[ You mean "UCS2", not "Unicode" ]

> I must use a memory buffer of 8002: (column size + 1) *
> sizeof(utext).
>
> It seems wasteful to me, to always use a larger buffer unless it is
> necessary.


[ Side note: isn't it also wasteful to allocate a 4001-byte array
to hold a VARCHAR2(4000) value? If memory pressure is a concern,
wouldn't it be better to allocate just enough space to hold
the value in question, instead of always allocating enough
space to hold the largest possible value? ]
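A minimal sketch of the "just enough space" idea from the side note, assuming the length of the fetched value is already known. The helper name copy_value is hypothetical and not part of Tcl or any database API:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy a value of known length into a buffer
 * sized for that value, not for the column maximum.
 * Returns NULL if allocation fails; the caller frees the result. */
char *copy_value(const char *src, size_t len)
{
    char *buf = malloc(len + 1);    /* +1 for the terminating NUL */
    if (buf == NULL) {
        return NULL;
    }
    memcpy(buf, src, len);
    buf[len] = '\0';
    return buf;
}
```

With this, a 5-byte value costs 6 bytes rather than 4001, at the price of one allocation per value.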

> I guess my question is: given a pointer to a byte array, is it
> possible to easily determine whether the contents of the array can be
> represented by the ASCII character set or not?


What is the encoding of the contents of the byte array?

If it's an ASCII-superset 8-bit encoding like ISO8859-*
or the various Microsoft code pages, then you can scan for
any bytes with the high-order bit (0x80) set. If it's a
stateful encoding like ISO2022, you can scan for the presence
of escape sequences. For SHIFT-JIS, KOI8-*, GB*, BIG5,
and others -- I don't know offhand, but can probably find
out with some digging.
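
The high-bit scan for the ASCII-superset case might look roughly like the sketch below. The function name is_all_ascii is made up for illustration. It assumes the buffer holds an 8-bit ASCII-superset encoding (or UTF-8) and says nothing useful about UCS-2 or escape-sequence-based encodings:

```c
#include <stddef.h>

/* Returns 1 if every byte in buf[0..len) is below 0x80, i.e. the
 * contents are plain 7-bit ASCII; returns 0 as soon as a byte with
 * the high-order bit set is found. An empty buffer counts as ASCII. */
int is_all_ascii(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] & 0x80) {
            return 0;
        }
    }
    return 1;
}
```

Note that this costs a full pass over the data before any conversion happens.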

But that's the wrong question. What you want to do is
get it into UTF-8.

If it's already UTF-8, you're golden: just pass it to
Tcl_NewStringObj(). (A quick googling indicates that
UTF-8 is Oracle's preferred encoding too, so that's
probably the best way to go). If it's not, then you
can use Tcl_ExternalToUtf* to convert it first.

You could also convert to UCS-2 and use Tcl_NewUnicodeObj(),
but -- depending on what happens downstream -- that's going
to be more expensive overall, since Tcl is likely to convert
the value to UTF-8 before doing anything with it anyway.

> If Tcl_GetUnicodeFromObj() really returns a UCS-2 string then it
> should be Tcl_GetUCS2FromObj() :)


Yes, that's what it should have been called.


--Joe English