COBOL (zOS) and Unicode
| Bertram 2005-02-03, 8:55 am |
| Is there a possibility to convert the e.g. html-notation of unicode like
&#12345; to a "normal" cobol structure like PIC N(...) and can COBOL convert
it back?
How does COBOL store data in PIC N if the unicode needs more than 2 bytes?
| |
|
| Bertram wrote:
> Is there a possibility to convert the e.g. html-notation of unicode like
> &#12345; to a "normal" cobol structure like PIC N(...) and can COBOL convert
> it back?
You can look at Function Char - strip off the "&#" and ";", and it'll give you the
character. I'm not sure how high up those codes go, though, and whether
it only supports ASCII output, or Unicode. It would be where I would
start, though. :)
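
For anyone who wants to see the round trip spelled out, here is a minimal sketch in Python (not COBOL, and no claim that FUNCTION CHAR can do this for large values). It assumes the notation is a decimal numeric character reference such as &#12345; and that the target is big-endian UTF-16. The names ncr_to_utf16 and utf16_to_ncr are made up for illustration.

```python
import re

# Matches decimal numeric character references such as "&#12345;".
NCR_PATTERN = re.compile(r"&#(\d+);")


def ncr_to_utf16(text: str) -> bytes:
    """Replace each decimal reference with its character, then encode as big-endian UTF-16."""
    decoded = NCR_PATTERN.sub(lambda m: chr(int(m.group(1))), text)
    return decoded.encode("utf-16-be")


def utf16_to_ncr(data: bytes) -> str:
    """Decode big-endian UTF-16 and write every non-ASCII character as a decimal reference."""
    text = data.decode("utf-16-be")
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


stored = ncr_to_utf16("&#12345;")
print(stored.hex())          # 3039
print(utf16_to_ncr(stored))  # &#12345;
```

Decimal 12345 is hex 3039, so this particular character fits in a single 2-byte position.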
--
Live from Montgomery, AL!
daniel@thebelowdomain
http://www.djs-consulting.com
| |
| Lueko Willms 2005-02-03, 3:55 pm |
| On 03.02.05, bfj.geiger@tiscali.de (Bertram) wrote
on /COMP/LANG/COBOL in ctsv2d$1il0$1@ulysses.news.tiscali.de
about COBOL (zOS) and Unicode:
bg> How does COBOL store data in PIC N if the unicode needs more than 2
bg> bytes?
A Unicode code point fits in 32 bits, but there are
transformation formats that represent it in 16-bit units (UTF-16) and
8-bit units (UTF-8), even 7-bit units (UTF-7).
I would think that your compiler can be directed to use the one or
the other.
Yours,
Lüko Willms http://www.willms-edv.de
/--------- L.WILLMS@jpberlin.de -- All rights reserved --
He cannot hold his ink, and when the urge takes him to besmirch someone, he usually besmirches himself the most. -G.C.Lichtenberg
| |
| Joe Zitzelberger 2005-02-03, 3:55 pm |
| In article <ctsv2d$1il0$1@ulysses.news.tiscali.de>,
"Bertram" <bfj.geiger@tiscali.de> wrote:
> Is there a possibility to convert the e.g. html-notation of unicode like
> &#12345; to a "normal" cobol structure like PIC N(...) and can COBOL convert
> it back?
>
> How does COBOL store data in PIC N if the unicode needs more than 2 bytes?
You might want to look at the new TRANSLATE instructions. There is a
translate-two-to-one that would let you convert a string of hex digits
into a string of the same byte values.
But you would need a bit of assembler to pull it off.
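
As a rough illustration of what a two-to-one conversion achieves (in Python rather than assembler, and with no claim about how the TRANSLATE instructions are actually coded): each pair of hex digit characters becomes one byte. This assumes the digits are already hexadecimal; the decimal form in the question would need a base conversion first. The function name is hypothetical.

```python
def hex_digits_to_bytes(digits: str) -> bytes:
    """Pack pairs of hex digit characters into single bytes (two characters in, one byte out)."""
    return bytes.fromhex(digits)


packed = hex_digits_to_bytes("3039")
print(len("3039"), "characters ->", len(packed), "bytes")  # 4 characters -> 2 bytes
print(packed.decode("utf-16-be") == "\u3039")              # True
```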
| |
| Clark F. Morris, Jr. 2005-02-03, 8:55 pm |
| Top posted with nothing in the attached message.
UTF-16 is one encoding form of Unicode (ISO-10646), whose encoding forms use
units of 1, 2, or 4 bytes IIRC. I don't think HTML notation is Unicode, although
the character set used to express the notation may be Unicode, where each
character in &#12345; is 16 bits (2 bytes), for a total of 16 bytes.
Joe Zitzelberger wrote:
> In article <ctsv2d$1il0$1@ulysses.news.tiscali.de>,
> "Bertram" <bfj.geiger@tiscali.de> wrote:
>
>
>
>
> You might want to look at the new TRANSLATE instructions. There is a
> translate-two-to-one that would let you convert a string of hex digits
> into a string of the same byte values.
>
> But you would need a bit of assembler to pull it off.
>
| |
| William M. Klein 2005-02-03, 8:55 pm |
| The NSYMBOL compiler option determines whether PIC N data (without a USAGE
clause) is treated as
- "traditional" IBM DBCS
or
- UTF-16 (Unicode / ISO-10646)
See:
http://publibz.boulder.ibm.com/cgi-...igy3pg20/2.4.32
The NATIONAL-OF and DISPLAY-OF functions convert PIC N to X and vice versa (with
an optional CCSID specification). See:
http://publibz.boulder.ibm.com/cgi-...IGY3LR20/7.1.20
and
http://publibz.boulder.ibm.com/cgi-...IGY3LR20/7.1.36
and some example code at:
http://publibz.boulder.ibm.com/cgi-...gy3pg20/1.7.3.5
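
For readers without the manuals at hand, here is a loose Python analogue of the two conversions. It is only a sketch: the functions national_of and display_of below are hypothetical Python stand-ins, not the IBM intrinsics, and the codec cp037 is an assumption standing in for whatever EBCDIC CCSID the program actually uses.

```python
EBCDIC_CODEC = "cp037"  # assumption: stands in for the program's EBCDIC CCSID


def national_of(display_bytes: bytes, codec: str = EBCDIC_CODEC) -> bytes:
    """EBCDIC bytes (PIC X style) -> big-endian UTF-16 bytes (PIC N style)."""
    return display_bytes.decode(codec).encode("utf-16-be")


def display_of(national_bytes: bytes, codec: str = EBCDIC_CODEC) -> bytes:
    """Big-endian UTF-16 bytes -> EBCDIC bytes; raises if a character has no EBCDIC equivalent."""
    return national_bytes.decode("utf-16-be").encode(codec)


ebcdic = "COBOL".encode(EBCDIC_CODEC)
utf16 = national_of(ebcdic)
print(ebcdic.hex())                 # c3d6c2d6d3
print(utf16.hex())                  # 0043004f0042004f004c
print(display_of(utf16) == ebcdic)  # True
```

In this sketch a character with no equivalent in the target code page makes display_of raise UnicodeEncodeError rather than substitute a replacement character.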
NOTE WELL:
The IBM DISPLAY-OF function does NOT conform to the definition provided in
the '02 ANSI/ISO Standard - as it doesn't provide the optional "default" bad
conversion character.
--
Bill Klein
wmklein <at> ix.netcom.com
"Bertram" <bfj.geiger@tiscali.de> wrote in message
news:ctsv2d$1il0$1@ulysses.news.tiscali.de...
> Is there a possibility to convert the e.g. html-notation of unicode like
> &#12345; to a "normal" cobol structure like PIC N(...) and can COBOL convert
> it back?
>
> How does COBOL store data in PIC N if the unicode needs more than 2 bytes?
>
>
| |
| S Comstock 2005-02-03, 8:55 pm |
| LX-i lxi0007 writes ...
>Bertram wrote:
>convert
Hmm. Strange, I can't seem to find the original
post in this thread.
>
>You can look at Function Char - strip off the "&#" and ";", and it'll give you the
>character. I'm not sure how high up those codes go, though, and whether
>it only supports ASCII output, or Unicode. It would be where I would
>start, though. :)
>
Enterprise COBOL on z/OS supports the National
data type, which is stored as UTF-16. The intrinsic
functions NATIONAL-OF and DISPLAY-OF convert
between classic DISPLAY (EBCDIC) characters
and Unicode characters. The intrinsic functions
CHAR and ORD can convert between character
values and integer values but only between 1
and 256.
If your notation is encountered in XML, however,
the XML PARSE statement might be able to help
you.
Be aware that Unicode can be encoded in three
formats:
UTF-32 - 4 bytes per character
UTF-16 - 2 bytes per character, except that some
         characters use two 2-byte units (called
         a surrogate pair), which is how Unicode
         can support more than 64K characters
UTF-8  - 1, 2, 3, or 4 bytes per character
Enterprise COBOL supports UTF-16 (except for
surrogate pairs) and UTF-8 (to some degree).
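
The byte counts above are easy to check. The following is a small Python sketch (the function name encoded_sizes is made up for this post) that reports how many bytes one character occupies in each format, using a plain letter, an accented letter, the character from the original question (U+3039), and one character outside the 64K range (U+1D11E).

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character occupies in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-be")),
        "UTF-32": len(ch.encode("utf-32-be")),
    }


for ch in ("A", "\u00e9", "\u3039", "\U0001d11e"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# U+0041  takes 1 / 2 / 4 bytes (UTF-8 / UTF-16 / UTF-32)
# U+00E9  takes 2 / 2 / 4 bytes
# U+3039  takes 3 / 2 / 4 bytes
# U+1D11E takes 4 / 4 / 4 bytes (a surrogate pair in UTF-16)
```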
<ad>
For details and hands-on practice, consider our one-day course "Enterprise
COBOL Update II: Unicode and XML Support".
Course description here:
http://www.trainersfriend.com/COBOL...s/d705descr.htm
which contains a link to the detailed topical outline.
</ad>
Kind regards,
-Steve Comstock
800-993-9716
303-393-8716
www.trainersfriend.com
email: steve@trainersfriend.com
256-B S. Monaco Parkway
Denver, CO 80224
USA
| |
| Schroeder 2005-02-04, 8:55 pm |
| This topic came up in conversation with a friend of mine recently as well. I
think the confusion is that some Unicode characters require more than a
single UTF-8 or UTF-16 character position to hold them (when converting from
UTF-32 down to UTF-16 or UTF-8). So if you have a PIC N(1) field in COBOL,
if that is supposed to hold a national character, clearly there are some
characters that won't fit.
My understanding is that COBOL gets around this by letting the implementor
define whether UTF-16 or UTF-8 is used, so that only characters which fit
in a single character position are ever supported by that implementation.
I don't think the '02 Standard allowed for UTF-32 as a native
implementation (I would have to dig it up to check). So within a purely
COBOL world, there shouldn't be a problem, as no offending characters (in
theory) could ever get in.
But I think this user brings up a potential problem when reading XML or some
other data which potentially includes national characters which might
require multiple character positions to represent one glyph. If you MOVE
such a character to a PIC N field, it might not fit.
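
To make the "might not fit" case concrete, here is a short Python sketch. It assumes each character position is one 16-bit UTF-16 code unit; the helper names are hypothetical and this says nothing about what a given COBOL compiler does on such a MOVE.

```python
def utf16_units(text: str) -> int:
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-be")) // 2


def fits(text: str, positions: int) -> bool:
    """True if text fits in a field of `positions` 16-bit character positions."""
    return utf16_units(text) <= positions


def surrogate_pair(ch: str) -> tuple:
    """Return the high and low surrogate code units of a character outside the 64K range."""
    raw = ch.encode("utf-16-be")
    if len(raw) != 4:
        raise ValueError("character does not need a surrogate pair")
    return int.from_bytes(raw[:2], "big"), int.from_bytes(raw[2:], "big")


print(fits("\u3039", 1))       # True: one code unit is enough
print(fits("\U0001d11e", 1))   # False: needs two code units
print([hex(u) for u in surrogate_pair("\U0001d11e")])  # ['0xd834', '0xdd1e']
```

So a single-position field could hold the first character but only half of the second.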
I don't remember if the 02 Standard has some sort of exception condition
defined for that case, but if not there probably should be. What do you
think?
Jeff Friedman
"Bertram" <bfj.geiger@tiscali.de> wrote in message
news:ctsv2d$1il0$1@ulysses.news.tiscali.de...
> Is there a possibility to convert the e.g. html-notation of unicode like
> &#12345; to a "normal" cobol structure like PIC N(...) and can COBOL
> convert it back?
>
> How does COBOL store data in PIC N if the unicode needs more than 2
> bytes?
>
>
| |
| William M. Klein 2005-02-04, 8:55 pm |
| z/OS (Enterprise COBOL) and ANSI/ISO 2002 COBOL are quite different issues.
(Enterprise COBOL does NOT claim to support '02 "internationalization"
features - and in fact its DISPLAY-OF feature is incompatible with the '02
Standard - which defines a different 2nd argument).
The '02 Standard is ABSOLUTELY clear that "USAGE NATIONAL" (and the
implementor's "native national character set") *may* be ISO-10646 (or Unicode)
but NEED NOT be. If it is Unicode and/or 10646, then it may be any UTF format.
Enterprise COBOL *does* claim to support UTF-16 but I can find no evidence that
it supports UTF-8. It *does* support the previous IBM DBCS format which is
"similar" to UTF-8 - but is certainly NOT identical.
I *believe* that the current Enterprise COBOL supports both
PIC N Usage Display-1
and
PIC N Usage National
within the same program and would allow "MOVE" statements between the 2. I
*know* that it supports multiple CCSID values in both the CODEPAGE compiler
option and the DISPLAY-OF and NATIONAL-OF intrinsic functions.
Finally, the SPECIAL-NAMES paragraph in the '02 Standard (see page 175) supports
LOCALE [ locale-name-2 ]
NATIVE
UCS-4
UTF-8
UTF-16
code-name-2
Therefore, I believe (but won't swear to it) that all of those character
repertories can be (semi-) supported within the same '02 conforming program
(depending upon "processor dependent" implementations). Other than "LOCALE" - I
do *not* think this impacts the meaning of PIC N data items. Page 317 of the
'02 Standard states,
"Each symbol 'N' represents a national character position that shall contain a
character from the computer's national character set".
I *believe* the '02 Standard is written "assuming" (requiring?) a SINGLE
"computer's national character set" (which may be UTF/UCS- ??? - or "full" ISO
10646/Unicode - or something else). Therefore, how you "convert" (handle) mixed
values is dependent upon the computer's national character set "rules".
NOTE WELL:
I am semi-weak on "all of this" - so I am more than willing to admit that
anything stated above MAY be in error.
--
Bill Klein
wmklein <at> ix.netcom.com