Code Comments

[OT]The Unicode Rant
In response to several people who asked for it, I'm posting the
Unicode Rant.

Since this is not particularly about scheme, I've included a followup-to
my email address in the headers; please don't flame in the newsgroup.

I've also toned it down a bit, given it more technical justification
and less anger, in hopes that it won't just trigger flames.



The Unicode Rant

-----

I remember hearing about unicode back in, I think, around
1993.

It was supposed to be an expanded character set, more or
less a sixteen bit replacement for ascii, that would
simplify our lives as coders by allowing us to not worry
about code pages, character set translations, etc; every
character should have a single, simple, sixteen-bit code,
and there'd finally be one universal format with a universal
code width across all kinds of different platforms.  So,
finally, we'd have one unambiguous way to write things that
used an expanded character set.  That was a good idea.

So I looked forward to it.  And when more information was
available, I checked it out.  And when I checked it out, I
began to have doubts. You see, the delivered standard was
not actually very much like the promised one.

There'd be one way to write things, was the promise; well,
unless you used accented characters, in which case you had
several different ways you could write most of them.  You
could use the precomposed character, or decompose it into a
character followed by the accent. Why are these precomposed
"compatibility" characters in the standard?  In olden days,
the only way to have an accented character was to have a
precomposed accented character.  Since unicode requires that
a character-followed-by-accent is valid, the complication of
implementation for a more general encoding scheme exists
anyway; why do we also have precomposed characters?
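
To make the two spellings concrete, here is a minimal C sketch.
The byte values are the UTF-8 encodings of U+00E9 (precomposed)
and of U+0065 followed by U+0301 (decomposed); the helper name
same_bytes is made up for the illustration.  A plain bytewise
comparison sees two different strings:

```c
#include <stddef.h>
#include <string.h>

/* "e with acute" written two ways, both as UTF-8 bytes. */
static const unsigned char precomposed[] = { 0xC3, 0xA9 };       /* U+00E9 */
static const unsigned char decomposed[]  = { 0x65, 0xCC, 0x81 }; /* U+0065 U+0301 */

/* Hypothetical helper: plain bytewise equality, no normalization. */
static int same_bytes(const unsigned char *a, size_t alen,
                      const unsigned char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}
```

Calling same_bytes on the two arrays returns 0, although both
are meant to be the same character.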

We have them for codepoint compatibility with preexisting
character set standards -- those "code pages" we were
supposed to be able to forget about.  We have allowed
redundant characters into the so-called universal standard,
for the sole purpose of being compatible with the mess we
were trying to be better than.  The idea was that text
written in any particular codepage set could be converted to
Unicode by the simple expedient of adding an integer offset
to all character codes over 127.

Except that a lot of different codepages used many of the
same accented characters, and we didn't want multiple copies
of the same precomposed character in the standard.  So what
do we do with all the codepoints that mapped to the same
accented character in different codepages (at their
respective offsets into the unicode set)?  We let one of
them, for the most popular codepage, have that mapping, and
then leave holes in the codepoint to character mapping at
all the other locations.

Holes.  There are codepoints that do not map to characters.
In fact, there are lots and lots of them.  These are mostly
codepoints that will NEVER map to characters.  And they're
scattered haphazardly all the way through the character set,
not gathered into a few sensible blocks for future
expansion.  That means you can't even do something as simple
as iterating over the set without constantly consulting
tables; if you do, then some of the numbers in your
iteration will not correspond to valid codepoints.  In an
era when cache misses are known to be the single most
expensive thing you can do in a computer program, a standard
which could have avoided them made data tables mandatory for
something as simple as iteration.  That sucks.

Now, one effect of having all these codepages preserved into
unicode was that since the codepoints within codepages
aren't changed, differently accented versions of the same
letter are scattered throughout the codespace.  If you want
to collate all A's before all B's and so on, you must use
another external table to find your way through the
codespace.  Furthermore, different characters have different
subsets of accented versions, and even with the same
accents, do not usually appear in the same order.

Another promise that was made to us was that Unicode would
represent characters, not fonts.  Looking through the
current unicode standard, I see fraktur, monospace, sans
serif, italic, bold, small, bold script, and double width
versions of the entire latin alphabet, plus others.  Not
only did they fail to hold the line, they failed
spectacularly, by duplicating the entire alphabet dozens of
times.  I accept the same argument that the unicode
committee accepted; in many cases, these different forms are
part of basic expression in some realms like mathematics.  I
disagree with the conclusion that these alphabets needed
these repetitions; acknowledging the need for fonts, I'd
have made modifier codepoints to express fonts.  Instead, we
have many many wasted codepoints and many many repetitions
of characters that, for most purposes, ought to be
considered the same character.  Moreover, we have admitted
font differences into the standard but have not made them
generally applicable across alphabets; we've dedicated
thousands of codepoints to miscellaneous versions of the
latin alphabet to accommodate fonts, but we still can't apply
fonts to accented characters or to non-latin characters.
Treating fonts as a modifier character instead would have
used only a dozen codepoints and at the same time would have
allowed the uniform application of fonts across alphabets
and accents.

Let's move on to another promise; remember the idea of a
uniform-width codespace?  Great idea, wasn't it?  But along
the way, Unicode attempted to swallow several very large
character sets whole and wound up needing more than sixteen
bits.  Oops. It's now a 21-bit standard.  And we set aside
thousands of codepoints in the 16-bit codespace that will
never map to characters, because they are used to express
the first half and last half of codepoints larger than 16
bits. These are called "surrogates" - I suppose they are a
"surrogate" for a standard with a uniform character width.
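
The arithmetic behind a surrogate pair is short.  A sketch in
C, with a made-up function name and no validation of the
input:

```c
#include <stdint.h>

/* Split a codepoint into UTF-16 code units.
   Returns 2 if a surrogate pair was written, 1 if the codepoint
   fit in a single 16-bit unit. */
static int to_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high (lead) surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low (trail) surrogate */
    return 2;
}
```

For example, U+1D11E comes out as the pair D834 DD1E.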

But that's not the only failure of the set to have a uniform
width; Unicode has no fewer than seven different encodings,
named UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE,
and UTF-32LE.  UTF-8 allows codepoints to be represented as
units of four different widths.  The three UTF-16 forms
allow codepoints of two different widths.  The three UTF-32
forms are in fact uniform-width encodings, but an extra
codepoint (the byte order marker) is required to tell the
system which encoding you're using if it's not a
"whatever-BE" or "whatever-LE" form.

The result is that two strings known to be "unicode" cannot
be compared to one another in a straightforward bitwise
comparison for equality or collation.  Each must be decoded
into a sequence of uniformly represented codepoints, trimmed
of the byte order marker if necessary, and then compared --
and this is pure overhead.
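
As an illustration of that decoding step, here is a sketch of
a UTF-8 decoder for a single codepoint, showing the four
widths.  It assumes well-formed input and does not validate
continuation bytes; utf8_decode is a name invented here:

```c
#include <stdint.h>

/* Decode one codepoint from UTF-8 at s; store it in *cp and return
   the number of bytes consumed (1 to 4), or 0 on a bad lead byte. */
static int utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {                 /* 0xxxxxxx */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {       /* 110xxxxx + 1 continuation byte */
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {       /* 1110xxxx + 2 continuation bytes */
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
              ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0) {       /* 11110xxx + 3 continuation bytes */
        *cp = ((uint32_t)(s[0] & 0x07) << 18) |
              ((uint32_t)(s[1] & 0x3F) << 12) |
              ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    return 0;                          /* stray continuation or invalid byte */
}
```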

If you want to know whether the three-thousand and first
codepoint in two different files is identical, it should be
possible, in my opinion, to seek to the three thousand and
first codepoint (uniform widths are nice that way) and see
whether the bit patterns found there are identical.
Wouldn't that be nice?  But no, instead you have to process
six thousand characters worth of overhead through
decompression, not tripping if there's a "fake" codepoint
inserted in one to show byte order, and only then will you
be able to make your comparison.
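
For UTF-8, finding the n-th codepoint turns into a scan from
the start of the buffer.  A sketch in C, assuming well-formed
input; utf8_offset is a name invented here:

```c
#include <stddef.h>

/* Byte offset of the n-th codepoint (0-based) in a UTF-8 buffer, or
   (size_t)-1 if the buffer holds fewer codepoints than that. */
static size_t utf8_offset(const unsigned char *s, size_t len, size_t n)
{
    size_t i;
    for (i = 0; i < len; i++) {
        /* Bytes of the form 10xxxxxx continue a codepoint;
           every other byte starts one. */
        if ((s[i] & 0xC0) != 0x80) {
            if (n == 0)
                return i;
            n--;
        }
    }
    return (size_t)-1;
}
```

In a fixed-width 32-bit encoding the same answer is just n * 4.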

Now, I've already talked a little about accented characters,
which are implemented in unicode as precomposed characters
with accents built-in *and* as character-combiner sequences.
But let me revisit that point: this doesn't just mean there
is more than one way to write a given character, this means
that there are ways of differing widths as measured in
codepoints to write a given character. Unicode requires
these different ways of writing a character to be recognized
as identical for (canonical) character comparisons, but this
opens up a huge can of worms in the process of comparing
characters and strings.  In the first place, it means that
the same codepoint index in two different strings does not
necessarily refer to the same character index, even if the
two strings are otherwise canonically identical.  In order
to do a substring comparison, you have to process the entire
string up to the end of the substrings you're interested in
in order to know the character indexes.  This throws out
every possibility of efficient text comparisons.

Also, if you are parsing a (formal or natural) language,
then any character that can be represented by more than one
codepoint sequence doubles (or more) the size of the
resulting state machine for processing codepoints, so the
possibility of efficient parsing is also out the window with
this polymorphism.  If your parser intends to support
multiple encodings, it gets even worse, because then
absolutely EVERY codepoint has multiple possible
representations.  There is no taking input directly from a
file for parsing any more; instead, you have to preprocess
it to deal with its encoding forms and combining codepoints
in order to get a "logical" stream of unicode characters in
some known encoding before you have even a prayer of getting
a parser to run efficiently; and having done that, your
count of characters in a known encoding bears no reliable
relationship to the actual location in the file of the
corresponding codepoints.  The result is that when your
parser finds something it's supposed to change, it cannot
simply seek to the appropriate codepoint in the file, based
on the character count in the parser, and change it.  This
needlessly complicates the design of stream editors such as
'sed' - in a Unicode world, they require overhead to work
that can easily add an order of magnitude to their cost.

Now let's talk about different single codepoints that all
represent the exact same character.  In lots of writing
systems, a character can appear as any of a dozen or more
different glyphs depending on context.  The best known
example of this is the "final sigma" form in Greek, where
the lowercase character sigma took a different shape or
glyph if it was the final character in a word.  And in the
history of computing, the only way to get a different
character shape on the screen was to use a different
codepoint.  And Unicode inherits everything that ever got
different codepoints in any codepage.  As a result,
lowercase sigma has two different codepoints in unicode; one
for its "normal" form and one for its "final" form.

It is however far from the only character that this is done
to; others include most of the characters in heavily
contextual scripts such as myanmar and arabic. The problem
is that despite looking different, these are distinguished
solely by the written context in which they appear, and
recognized by their user communities as being the same
character.  If a Greek user searches a document for "sigma"
and instances of "final sigma" don't show up, it will be
unexpected for that Greek user.  Likewise for an Arabic user
searching for a given character and not finding its
isolated, medial, initial, or final forms.  If the Greek
user cuts "final sigma" and pastes into a non-final context,
he expects the regular form of sigma rather than the final
form.  Once again, we are forced to go to massive code
tables to figure out all the things we ought to be looking
for or replacing with rather than just taking an input and
going looking.  In a world where cache misses are the single
most expensive thing you can do on a computer, gigantic data
tables are required just for a cut-and-paste operation or a
"search" operation.  It would be a better solution to leave
distinctions between forms solely dependent on their written
context to the rendering engine that prints their
glyphs. Then we could have the same codepoint representing
the same character and do the simple efficient thing in text
searches and search-and-replace operations.
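
With the two sigma codepoints as they stand, a search has to
fold one onto the other before comparing.  A sketch in C that
covers only this one pair (U+03C2 final sigma, U+03C3 sigma);
the helper names are made up, and every other contextual pair
would need its own table entry:

```c
#include <stddef.h>
#include <stdint.h>

/* Treat final sigma (U+03C2) as ordinary lowercase sigma (U+03C3). */
static uint32_t fold_sigma(uint32_t cp)
{
    return cp == 0x03C2 ? 0x03C3 : cp;
}

/* Count occurrences of needle in an array of codepoints, ignoring
   the final/non-final distinction. */
static size_t count_char(const uint32_t *text, size_t len, uint32_t needle)
{
    size_t i, hits = 0;
    for (i = 0; i < len; i++)
        if (fold_sigma(text[i]) == fold_sigma(needle))
            hits++;
    return hits;
}
```

Searching the three codepoints 03C3 03B1 03C2 for sigma finds
two hits this way; an unfolded comparison finds one.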

I guess the next thing to mention is unicode's
"bidirectional algorithm."  The first thing I want to point
out is the writing system of China, whose characters are
normally written from top to bottom, in columns from the
right side of the page to the left.  The second thing I want
to point out is the scripts of some south sea islanders,
which are normally written from bottom to top, in columns
from the right side of the page to the left.  The third
thing I want to point out is classical Greek, which was
often written in "boustrophedon" form, that is, in lines
alternating left-to-right and right-to-left, from the top of
the page to the bottom.  I guess my point here is that
"bidirectional" doesn't even begin to cover the universe of
writing systems.

Moreover, it's not properly part of a character set standard
so much as a proper part of a display standard.  Characters
are always transmitted or recorded in computer files in the
order in which they'd be written or read by a user of the
natural language recorded; that's a reasonable "given," and
I believe a sufficient one.  In mandating the behavior in
the "bidi" algorithm, Unicode has made illegal the actual
preferred behavior used in several texts and historical
periods.  And it has once again made unpredictable knowledge
about individual codepoints an absolute requirement for
correct character handling, meaning anything that deals with
this aspect of unicode has to be driven by large data tables
that will cause lots and lots of cache misses.  While it's
worthwhile for display purposes to be able to know which
characters are l2r and r2l, the character set can support
this much better by having an r2l subrange and an l2r
subrange.  Then querying for r2l or l2r could be done using
a simple predicate on the codepoint rather than using yet
more hideously expensive table lookups.
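
A sketch of what such a predicate could look like in C.  The
block boundaries below are invented purely for illustration
and correspond to nothing in Unicode or any other standard:

```c
#include <stdint.h>

/* HYPOTHETICAL layout: these bounds are made up for the example. */
#define R2L_FIRST 0x4000u
#define R2L_LAST  0x7FFFu

/* Directionality becomes a range check rather than a table lookup. */
static int is_r2l(uint32_t cp)
{
    return cp >= R2L_FIRST && cp <= R2L_LAST;
}
```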

The next Unicode mistake is the presence of precomposed
ligatures. Ligation, in most cases, is more properly a
function of the rendering engine rather than the character
standard. In the cases where we don't want to leave choices
about ligation to the rendering engine, precomposed
ligatures are the wrong approach.  A zero-width non-breaking
space can be used to inhibit ligation between two normally
joining characters; a corresponding codepoint forcing
ligation would be an appropriate complement.  Thus, you'd
have A, ligation joiner, e, instead of the Ae ligature.  The
benefits of this procedure are manifold.

First, it leaves common ligation to the rendering engine;
this leaves it primarily under locale control, which is
appropriate.  Most people who don't use an alphabet natively
can't look at heavily ligated scripts like arabic or myanmar
and even be able to pick out individual characters, which
will prevent them from being able to work with text in that
script in even the most basic of ways.  In this case the
ligation works against usability.  In precisely these
locales, the ligation will be left undone, leaving people
with a much simpler image wherein non-natives can at least
count characters, look for morphological patterns, etc.  In
locales where the heavy ligation will be understood and
useful to most people, on the other hand, ligation works for
usability, and in precisely those cases it will be present.

Second, it allows explicit ligation or non-ligation in the
character standard for when it is actually essential, and
does not constrict, as the current approach does, choices
about which characters can be ligated.

Third, it eliminates the need for "Titlecase" which has
needlessly complicated the Unicode standard (and added yet
another table lookup to working with these characters).

Fourth, it simplifies case conversions on ligatures - one
simply converts the cases of the component characters - and
leaves case-converted ligatures, with the sole exception of
eszett, taking exactly the same number of codepoints to
express as the original ligature (of which more anon;
having strings change length just because of a case change
is madness, for reasons I will explain soon).

Fifth, it reduces the number of different ways to write the
same string, which, together with a bunch of other such
changes, can help make efficient lexers and parsers possible
again.


Now, let's talk about case conversion.  In Unicode,
converting a cased character from lowercase to uppercase and
back is insane, for several reasons.  First of all, it
requires table lookups which can for the most part be
eliminated by judicious layout of codepoints.  Second, the
length of the string (as measured in codepoints) is likely
to change if there are any accented characters or ligations.
Third, the operation is not reversible, since there exist
many cased characters which are not the preferred
opposite-case elements of their own opposite-case elements.
Fourth, there is Titlecase and its associated complexities
to worry about if ligatures are involved.  Let's examine
each of these issues in turn, in the light of possible other
ways of handling things.

First of all the table lookups required for case conversion
can be largely eliminated.  The 'c' code for converting an
ascii character to uppercase is fairly simple and involves
no table lookups.

    if (islower(ch))
        ch += ('A' - 'a');    /* constant offset, no table lookup */

'islower' here is taken to be a compiler macro that checks
to see whether the codepoint 'ch' is between two known
codepoints representing the beginning and end of the
lowercase range, and since the expression ('A' - 'a') is the
difference of two scalar constants, it generates no runtime
code; the compiler simply replaces it with a constant.  This
is simple, efficient, and relatively bugproof, and something
a lot like it can be used with an extended character set.
In the first place, for the moment all cased alphabets are
also left-to-right alphabets, so there is no conflict in
codepoint layout with the separation of left-to-right and
right-to-left into separate blocks.  All upper and
lower-case characters are members of the left-to-right
block.  We could simply have two sub-blocks, one of which
contains lowercase characters and the other of which
contains uppercase characters, and use much the same logic
as above.

The reasons why this is not feasible in Unicode are largely
eliminated by the changes already proposed.  The problem with
precomposed accented characters with no altercase equivalent
is eliminated along with precomposed accented characters.
The problem with ligatures having no altercase equivalent is
eliminated along with precomposed ligatures.  The
complexities of titlecasing are eliminated along with
precomposed ligatures as well.  There remain two and only
two problems that we must address.  The first is the
singular character eszett, which is problematic in several
other ways and which I will deal with separately.  The
second is that some case mappings, particularly those
involving the letter i, depend on the locale.  Our modified
code for uppercasing in an extended character set, then,
looks like this:

    if (islower(ch)) {
        if (locale.has_case_exceptions && locale_case_exception(ch))
            ch = locale.toupper(ch);    /* small per-locale table */
        else
            ch += ('A' - 'a');          /* constant offset */
    }


Since most locales won't have any case exceptions, the macro
locale.has_case_exceptions will usually be false, which
means table lookups can usually be completely avoided.
Given a comprehensive 'default' for
extended-character-set casing, the case exceptions for a
given locale are usually going to be no more than two or
three characters.  The tables consulted in lines 2 and 3
will therefore normally be trivial in size rather than
gigantic, and therefore should not frequently cause cache
misses. This code will run about two orders of magnitude
faster than the cheapest possible implementation of case
switching in unicode.
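
A self-contained version of the same idea, as a sketch.  The
struct layout, the names, and the use of a-z as the "lowercase
block" are all assumptions made for the example; the
Turkish-style entry stands in for a locale that uppercases i
to dotted capital I (U+0130):

```c
#include <stddef.h>
#include <stdint.h>

/* One locale-specific exception: a lowercase codepoint and its uppercase. */
struct case_exception { uint32_t lower, upper; };

/* Hypothetical locale record; most locales would have count == 0. */
struct locale_info {
    const struct case_exception *exceptions;
    size_t count;
};

static const struct case_exception tr_exceptions[] = { { 'i', 0x0130 } };
static const struct locale_info tr_locale = { tr_exceptions, 1 };
static const struct locale_info c_locale  = { NULL, 0 };

/* Uppercase one codepoint.  Only a-z is treated as "the lowercase
   block" here; a redesigned set would substitute its own bounds. */
static uint32_t to_upper(const struct locale_info *loc, uint32_t ch)
{
    size_t i;
    for (i = 0; i < loc->count; i++)        /* tiny table, often empty */
        if (loc->exceptions[i].lower == ch)
            return loc->exceptions[i].upper;
    if (ch >= 'a' && ch <= 'z')
        ch -= 'a' - 'A';                    /* constant offset */
    return ch;
}
```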

The second problem with unicode case operations is that they
are likely to change the length of the string.  With the
elimination of precomposed ligations and precomposed
accented characters, they are dramatically less likely to do
so.  In fact, the only remaining case where the length of
the string would change is, once again, the problematic
character eszett.  The special dementia of the character
eszett is that it's not a ligature, but it is a lowercase
character with no corresponding uppercase character.  When
it changes case it changes into two capital letter 'S'
characters.  It is also a poster child for non-reversible
case changes; if you take the german word that is spelled m,
a, eszett, e, and bump it to uppercase, you get M, A, S, S,
E.  When you bump that string back to lowercase, you get m,
a, s, s, e, which is a different word with a different
meaning than m, a, eszett, e.  In order to lowercase M, A,
S, S, E correctly in German, you have to know from context
which word was intended, so something as simple as case
operations winds up requiring human-level conceptual
knowledge from outside the text.
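
The round trip can be traced in a few lines of C.  This is a
sketch that handles only a-z and eszett (U+00DF), working on
arrays of codepoints; the function names are invented:

```c
#include <stddef.h>
#include <stdint.h>

/* Uppercase text held as codepoints, expanding eszett to "SS".
   Returns the output length; out must have room for 2 * len entries. */
static size_t upper_de(const uint32_t *in, size_t len, uint32_t *out)
{
    size_t i, n = 0;
    for (i = 0; i < len; i++) {
        if (in[i] == 0x00DF) {
            out[n++] = 'S';
            out[n++] = 'S';
        } else if (in[i] >= 'a' && in[i] <= 'z') {
            out[n++] = in[i] - ('a' - 'A');
        } else {
            out[n++] = in[i];
        }
    }
    return n;
}

/* Lowercase A-Z.  It cannot know whether "SS" was once an eszett. */
static size_t lower_de(const uint32_t *in, size_t len, uint32_t *out)
{
    size_t i;
    for (i = 0; i < len; i++)
        out[i] = (in[i] >= 'A' && in[i] <= 'Z') ? in[i] + ('a' - 'A') : in[i];
    return len;
}
```

Feeding m, a, eszett, e through upper_de gives five
codepoints, and lower_de on that result gives m, a, s, s, e.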

This is demented. In one fell swoop, if not somehow fixed,
this singular insane character makes case operations in the
redesigned extended character set nonreversible, makes for
uppercase forms that are ambiguous as to which word they
mean and what is the proper lowercase for them, and makes
case operations in the redesigned extended character set
capable of changing the length of the string; in short, all
the craziness of unicode case operations in one character.
This is so astonishingly stupid that it warrants a kluge to
fix it, and I name that kluge capital eszett.

When printed, capital eszett looks like two 'S' characters
side by side, but it is a single character, and when made
lowercase, it lowercases properly to eszett, meaning that
the uppercase form of the string is not ambiguous as to
which word it's the uppercase form of and has the same
length in codepoints as the lowercase. This leaves the
titlecase form 'Ss' unaccounted for, but I'm willing to
regard it as having the same value as other contextual forms
like 'final-sigma'; when printing uppercase eszett in an
initial context in a word and followed by a lowercase
character, you'd print the 'Ss' form instead of the 'SS'
form.  The decision can be, and properly should be, left to
a locale-aware rendering engine.

The third problem with unicode case changing operations is
that they are nonreversible because there are many cased
characters which are not the preferred opposite case of
their own opposite case.  Most of these characters are
ligatures or accented characters, and the proposed redesign
eliminates them.  By representing these combinations in
terms of simply cased characters, it becomes possible to
perform case operations simply by operating on the
components of the aggregates.

The fourth problem was properly representing titlecased
ligatures.  This ceases to be a problem when precomposed
ligatures are removed from the character set design.  What
unicode calls a "titlecase ligature", in the redesigned
character set, is a capital character, a ligation joiner,
and a lowercase character.  The simple algorithm for
capitalizing a word - changing the first cased character to
uppercase and all others to lowercase - is in fact
sufficient even when words begin with ligatures, and
produces forms corresponding to Unicode's titlecase forms.

The next thing I want to address is Hangul.  This script is
represented in Unicode in two different ways, and one of
them ought to be eliminated for the sake of reducing
multiplicity so as to be able to produce efficient parsers
and lexers and for the sake of doing efficient and
unambiguous string comparisons.  For the same reasons as I
advocated getting rid of ligatures and precomposed accented
characters above, plus the fact that the Jamo form is more
general, I'd advocate getting rid of the precomposed Hangul
Syllables.
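
The two Hangul forms are related by plain arithmetic, which
the Unicode standard itself specifies.  A sketch of the
syllable-to-Jamo direction in C; the constants are the ones
the standard gives, the function name is made up:

```c
#include <stdint.h>

/* Constants from the Unicode standard's Hangul syllable arithmetic. */
enum {
    S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7,
    V_COUNT = 21, T_COUNT = 28,
    N_COUNT = V_COUNT * T_COUNT,   /* 588 */
    S_COUNT = 19 * N_COUNT         /* 11172 syllables */
};

/* Decompose a precomposed syllable into 2 or 3 Jamo.
   Returns the number of Jamo written, or 0 if s is not a syllable. */
static int hangul_to_jamo(uint32_t s, uint32_t out[3])
{
    uint32_t idx;
    if (s < S_BASE || s >= S_BASE + S_COUNT)
        return 0;
    idx = s - S_BASE;
    out[0] = L_BASE + idx / N_COUNT;               /* leading consonant */
    out[1] = V_BASE + (idx % N_COUNT) / T_COUNT;   /* vowel */
    if (idx % T_COUNT == 0)
        return 2;                                  /* no trailing consonant */
    out[2] = T_BASE + idx % T_COUNT;               /* trailing consonant */
    return 3;
}
```

For example, U+D55C decomposes to U+1112, U+1161, U+11AB.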

And finally, there are the sinogram blocks: the CJK ideogram
block, the CJK ideograph extension A, the CJK ideograph
extension B, and the many thousands of CJK compatibility
ideographs each of which is merely another way to write an
existing ideograph.  This is nuts for several reasons.
First, it's nuts because these aren't allocated in a
contiguous block.  Second, it's nuts because this is a
snapshot of the vocabulary of several living languages, and
as such is bound to continue to change. Third, it's nuts
because it's woefully incomplete; even with all these
hundreds of thousands of ideograms, the average Chinese
person still can't correctly write his or her own address
using these characters.

Addresses using Sinograms are particularly problematic.
Because this is essentially the vocabulary of place names,
it contains a lot of proper nouns.  In a universe where a
character stands for a word, the "character set" is
effectively the dictionary, and it's uncommon to find proper
nouns in the dictionary.  And Chinese culture uses a lot
more place names than American culture does; for example in
Chinese cities, every intersection has a name, whereas in
American cities, it is the streets that are named instead.
This means that the number of place names in a Chinese city
is on the order of the square of the number of place names
in an American city of the same size.  Codification of a set
of sinograms that includes all the place names has not yet
been done, by anyone, and opinions vary on whether it is a
worthwhile task.

Japan adopted a sensible solution to this problem; there are
auxiliary syllabary alphabets that are used to encode names
whose Kanji form isn't available.  In fact, these auxiliary
alphabets are rapidly overtaking Kanji as the preferred form
of written communication in Japan.  Korea also adopted a
very sensible solution to this problem, with the system of
Hangul writing and the Jamo, and the ideogram characters are
seen there as well with increasing rarity.  But China itself
is a land of deep traditions, and the ideographic writing
system looks as though it will be used there for several
more generations at least.

This problem goes deep, and unlike most of the other
problems I've mentioned, there just isn't a complete and
clever solution that will make everybody happy.  One
possibility that I think works for the redesigned extended
character set, but which won't make everyone happy, is to
allow (say) 4096 characters which are ideograph stroke
modifier characters, and have a sequence of strokes "modify"
a space character to build any ideogram.  You'd leave final
rendering of the ideogram to the rendering engine, which
ought to be able to tell what ideogram you mean from the
stroke sequence if it's a common one, and at least have
enough information to make an attempt at a rendering if it's
unknown.  The question here is whether 4096 strokes is the
right number.  The consideration involved was this:

+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+

In the above grid, there are eight columns of '+' signs and
eight rows.  A stroke between any two '+' signs can be
expressed approximately as 3 bits for beginning row, 3 bits
for beginning column, 3 bits for ending row, and 3 bits for
ending column. That's a total of 12 bits, giving 4096
possible strokes.  The ending points of the strokes can be
displaced by half-a-row down and half-a-column right for
slightly more control.  If we double the number of rows and
columns, we move up to 16 bits for 65536 possible strokes,
which is probably too many.
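
The 12-bit arithmetic can be written out directly.  A sketch
in C; the order of the four fields within the 12 bits is an
assumption, since only their widths are fixed above:

```c
#include <stdint.h>

/* Pack a stroke on the 8x8 grid into 12 bits:
   [begin row:3][begin col:3][end row:3][end col:3]. */
static uint16_t pack_stroke(unsigned r0, unsigned c0, unsigned r1, unsigned c1)
{
    return (uint16_t)(((r0 & 7u) << 9) | ((c0 & 7u) << 6) |
                      ((r1 & 7u) << 3) |  (c1 & 7u));
}

/* Recover the four coordinates from a packed stroke. */
static void unpack_stroke(uint16_t s, unsigned rc[4])
{
    rc[0] = (s >> 9) & 7u;   /* begin row */
    rc[1] = (s >> 6) & 7u;   /* begin column */
    rc[2] = (s >> 3) & 7u;   /* end row */
    rc[3] = s & 7u;          /* end column */
}
```

The largest packed value, for a stroke from (7,7) to (7,7), is
4095, so the strokes fill exactly the 4096-codepoint block.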

The problem here is what is it that we're calling a
character?  Is it in fact a stroke rather than an ideogram?
Are we comfortable with moving from a single ideogram to a
sequence of perhaps a dozen or more strokes?  Are we
comfortable with the idea that a "word" (or ideograph) may
be "misspelled" by using the wrong stroke and shifting a
line in the middle one column to the right or left?  I for
one don't find it more unlikely or more disturbing than the
fact that English speakers are perfectly capable of
misspelling words with our alphabet too, and usually don't,
and it's completely in line with the way accented characters
are built from multiple codepoints.  So it's a consistent and
reasonably complete design.

The unicode standard aimed not to get into rendering issues,
but here I'm advocating an approach which is based on the
physical shape of a glyph for the character involved.  There
are many examples of typography where a given sinogram is
rendered differently in a different font, in a way that
would defeat identifying it with this scheme; a stroke
slants to the left rather than to the right, for example.
But there are also many regional differences in spelling, as
for example between US and UK English, when it is
recognizable to all natives that the same word is intended.
Further, a standardized "spelling" of a given Sinogram need
not dictate how it is rendered; remember the final glyph is
provided by the rendering engine, after it looks up the
combination of strokes.  So subtleties of typography and
character design for sinograms can be preserved in this
system, even while promoting a standard spelling for
"words."

To sum up: The unicode committee did not design a standard
for use.  They designed a standard for adoption.  Rather
than developing a new and sensible way to do things taking
into account all the world's writing systems, they co-opted
all the ways people were already doing things, mostly driven
by hardware and software limitations that unicode
conformance demands overcoming anyway, and duplicated all of
them, with all their kluges, redundancies, and problems,
into unicode, making a standard much more complex than it
needed to be and much more difficult to implement or work
with than it needed to be, promoting errors and
inefficiencies in character handling.  A much simpler system
for encoding writing is capable of encoding all the same
scripts, using far fewer codepoints, while minimizing table
lookups and cache misses, eliminating most case asymmetries,
providing far better support for lexing and parsing by
eliminating most multiplicities and ambiguities, and
providing better coverage.  It is true that under the
redesigned character set more codepoints would generally be
used to express the same strings, especially in ideographic
languages, but it is also true that "wasting" space in the
data flow is far less harmful to the functioning of programs
than complicating the algorithms used and bringing in big
data tables, and that the "wasted" space can be mostly
recouped, as far as the commodities of disk space and line
bandwidth are concerned, by standard compression algorithms
(NOT encoding schemes) and uncompressed for applications
where we want random-access.





Ray Dillinger
05-10-05 01:59 AM


Re: [OT]The Unicode Rant
Ray Dillinger <bear@sonic.net> writes:
> Since this is not particularly about scheme, I've included a followup-to
> my email address in the headers; please don't flame in the newsgroup.

Let the discussion work in the newsgroup. A healthy discussion is good, the
reason I follow usenet. The topic started in c.l.s so keeping it here is not a
problem. People can always ignore threads, or put an ObScheme note.

ObScheme: unicode is a problem in scheme and in general. discuss why or why not.

Fascinating article, BTW.
--
Cheers,                                        The Rhythm is around me,
                                               The Rhythm has control.
Ray Blaak                                      The Rhythm is inside me,
rAYblaaK@STRIPCAPStelus.net                    The Rhythm has my soul.

Ray Blaak
05-10-05 09:01 AM


Re: [OT]The Unicode Rant
Ray Dillinger writes: 

Ray Blaak wrote:
> Let the discussion work in the newsgroup. A healthy discussion is
> good, the reason I follow usenet. The topic started in c.l.s so
> keeping it here is not a problem. People can always ignore threads, or
> put an ObScheme note.

I agree.

> ObScheme: unicode is a problem in scheme and in general. discuss why
> or why not.

Unicode can be tricky to implement and work with, but I think that's a
symptom of the underlying problem rather than the problem itself.
Natural language is messy, and Unicode attempts to tackle it all at
once. As a result, dealing with Unicode is harder than dealing with any
single encoding. However, in my opinion it's much easier than trying to
tackle each encoding separately, one at a time, and is a success from
that point of view.

Also, I think Scheme has the potential to handle Unicode more gracefully
than many other languages, because Scheme's superior abstraction
mechanisms can help to hide some of the ugliness.
--
Bradd W. Szonye
http://www.szonye.com/bradd

Bradd W. Szonye
05-10-05 09:01 AM


Re: [OT]The Unicode Rant
Ray Blaak wrote:
> Let the discussion work in the newsgroup. A healthy discussion is good, the
> reason I follow usenet. The topic started in c.l.s so keeping it here is not a
> problem.

I concur.

I thought it was an interesting rant and I'm curious to hear solutions
to the problems proposed.

--
.i mi rodo roda fraxu

Sunnan
05-10-05 09:01 AM


Re: [OT]The Unicode Rant
Ray Dillinger wrote:
> In response to several people who asked for it, I'm posting the
> Unicode Rant.

Thanks! I have a few remarks in response, mainly to correct some factual
errors and misconceptions.

> There'd be one way to write things, was the promise; well, unless you
> used accented characters, in which case you had several different ways
> you could write most of them.  You could use the precomposed
> character, or decompose it into a character followed by the accent.
> Why are these precomposed "compatibility" characters in the standard?
> ... We have them for codepoint compatibility with preexisting
> character set standards -- those "code pages" we were supposed to be
> able to forget about.

You /can/ forget about code pages if you use only Unicode, but not
everyone has that luxury. Real applications must deal with legacy issues
like non-upgradable toolchains and round-trip conversion. While it's
possible to keep track of the necessary information with auxiliary
state (e.g., attach an "original encoding" field to strings), Unicode
becomes less of an advantage in that scenario.

In contrast, by providing compatibility characters, the standard offers
a smoother upgrade path. Furthermore, by specifying a canonical version
of each compatibility character, the standard ensures that everyone
eventually upgrades to the same place.

> We have allowed redundant characters into the so-called universal
> standard, for the sole purpose of being compatible with the mess we
> were trying to be better than.

But when it comes to judging the value of an encoding standard, the
degree of acceptance is an important part of "better." An encoding is
useless if compatibility issues prevent adoption.

> Holes.  There are codepoints that do not map to characters. In fact,
> there are lots and lots of them ... That means you can't even do
> something as simple as iterating over the set without constantly
> consulting tables ....

What kind of program needs to iterate over all Unicode characters
without knowing what they mean? You can't even sensibly display them or
make a list of their names without tables. (For that matter, what kind
of program needs to iterate over the whole set in the first place?)
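
To make the "tables" point concrete, here is a minimal sketch in Python (the language is my choice for illustration; the thread itself only quotes C). It uses the standard `unicodedata` module, which is exactly the kind of table being discussed, to tell real characters from holes. The helper name `is_assigned_character` is hypothetical, and private-use code points are counted as assigned here for simplicity.

```python
import unicodedata

def is_assigned_character(cp: int) -> bool:
    """Return True if code point `cp` is not a "hole".

    General category 'Cn' covers unassigned code points and noncharacters,
    and 'Cs' covers the surrogate range; neither maps to a character.
    """
    return unicodedata.category(chr(cp)) not in ("Cn", "Cs")

# 'A' is a character, U+D800 is a surrogate, U+FFFF is a permanent noncharacter.
for cp in (0x41, 0xD800, 0xFFFF):
    print(f"U+{cp:04X}", unicodedata.category(chr(cp)), is_assigned_character(cp))
```

Without the category lookup there is no arithmetic test that separates the three cases, which is the objection in the rant; whether any real program needs to make that distinction while iterating is the question raised above.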

> Now, one effect of having all these codepages preserved into unicode
> was that since the codepoints within codepages aren't changed,
> differently accented versions of the same letter are scattered
> throughout the codespace.  If you want to collate all A's before all
> B's and so on, you must use another external table to find your way
> through the codespace.

That collation method is not generally valid anyway, since accented
letters often collate differently from unaccented letters. It would work
for purely English-language texts, but then you're screwed if you ever
decide to globalize the program, because you'll need to switch all
instances of this hack to the more general form using tables.

When you globalize, the tables are inevitable. Why encourage hacks that
only make it harder to use the encoding as intended?

By the way, this also relates to the pre-composed characters issue. What
would be a "base letter + accent" in some languages is a unique letter
in others. This kind of ambiguity crops up throughout Unicode, and the
standard generally errs in favor of treating them as different
characters (albeit with a compatibility decomposition). You can treat
them as separate characters or not, with a simple conversion if you
don't get your preferred form. That's the reality that users want, even
if it makes the standard a bit "messier" algorithmically.

> Let's move on to another promise; remember the idea of a uniform-width
> codespace? [But the standard] set aside thousands of numbers in the
> 16-codespace that will never map to codepoints, because they are used
> to express the first half and last half of codepoints larger than 16
> bits ....

The surrogate characters cannot appear in the fixed-width encodings of
Unicode. For example, a UTF-32 application need not cope with variable-
width characters or surrogates at all; it's merely a big "hole" in the
encoding.
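
A small Python sketch of that difference, assuming nothing beyond the built-in codecs: a code point above U+FFFF takes a surrogate pair in UTF-16 but a single fixed-width unit in UTF-32, and a lone surrogate is simply not encodable in UTF-32.

```python
ch = chr(0x10400)  # a code point outside the 16-bit range

utf16 = ch.encode("utf-16-be")  # two 16-bit code units: a surrogate pair
utf32 = ch.encode("utf-32-be")  # one 32-bit code unit

print(utf16.hex())  # d801dc00
print(utf32.hex())  # 00010400

# A surrogate on its own is not a character, so the UTF-32 codec rejects it.
try:
    "\ud800".encode("utf-32-be")
    lone_surrogate_encodable = True
except UnicodeEncodeError:
    lone_surrogate_encodable = False

print(lone_surrogate_encodable)  # False
```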

> But that's not the only failure of the set to have a uniform width;
> Unicode has no fewer than seven different encodings, named UTF-8,
> UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.

It's somewhat disingenuous to describe these as "seven different
encodings," since four of them are just minor specializations of UTF-16
and UTF-32. Also, while I would agree that the UTF-16 series is an
unfortunate historical accident, the UTF-8 encoding exists for reasons
that would still be valid even if Unicode were originally conceived in
its 21-bit form.
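
To illustrate the "minor specializations" remark, here is a hedged Python sketch using only the built-in codecs. It shows that the generic "utf-16" form is the same code units as one of the LE/BE forms, preceded by a byte order mark; which of the two you get from the generic codec depends on the machine, so the checks below accept either.

```python
import codecs

text = "A"
little = text.encode("utf-16-le")  # b'A\x00'
big = text.encode("utf-16-be")     # b'\x00A'
generic = text.encode("utf-16")    # BOM, then native-endian code units

bom, payload = generic[:2], generic[2:]
print(bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True
print(payload in (little, big))                           # True
```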

> If you want to know whether the three-thousand and first codepoint in
> two different files is identical --

Why would you want to do this? In the presence of combining accents,
it's just as useless as checking the thousandth byte of an Asian "shift"
encoding. For most meaningful tasks, you need a whole character, and you
can't get that with simple code-point indices.

> Also, if you are parsing a (formal or natural) language, then any
> character that can be represented by more than one codepoint sequence
> doubles (or more) the size of the resulting state machine for
> processing codepoints ....

That's trivially avoidable by canonicalizing the data before parsing.
Indeed, many of your objections are trivially addressed by canonicalizing
data upon input and always using the canonical form internally.
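
A minimal sketch of canonicalizing on input, in Python with the standard `unicodedata` module. The function name `canonical` is hypothetical, and NFC is one reasonable choice of canonical form rather than the only one.

```python
import unicodedata

def canonical(s: str) -> str:
    """Normalize to NFC so equivalent spellings become one code point sequence."""
    return unicodedata.normalize("NFC", s)

precomposed = "\u00e9"   # e-acute as a single code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)                        # False
print(canonical(precomposed) == canonical(decomposed))  # True
print(len(decomposed), len(canonical(decomposed)))      # 2 1
```

After this step a lexer or parser sees one spelling per character, so its state machine does not need to branch on the alternatives.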

> Now let's talk about different single codepoints that all represent
> the exact same character [e.g., medial sigma vs final sigma].

Again, compatibility is important here. Unicode has to cope with the
reality of fonts, which assign code points by glyph rather than by
character. While the Unicode ideal of ignoring purely graphical
differences is interesting, it clashes badly with the reality of how
characters get onto a screen or a sheet of paper. Ignoring the
difference between roman and italic A is OK, but if you sweep the sigma
issue under the rug, you'll just inspire competing and incompatible
standards, which is worse than useless.

> If a Greek user searches a document for "sigma" and instances of
> "final sigma" don't show up, it will be unexpected for that Greek
> user.

That depends on the context (i.e., whether you're writing a letter or
typesetting a book). In contexts where they should match: just don't do
that. The standard provides the information necessary to fold the two
letters in "typography-insensitive" matching situations.
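
One way to do that folding, sketched in Python: `str.casefold()` applies Unicode case folding, which maps final sigma (U+03C2) and medial sigma (U+03C3) to the same code point. The helper `contains_folded` is a hypothetical name for illustration.

```python
def contains_folded(haystack: str, needle: str) -> bool:
    """Substring test after case folding, so both sigma forms match."""
    return needle.casefold() in haystack.casefold()

FINAL_SIGMA = "\u03c2"
MEDIAL_SIGMA = "\u03c3"
word = "\u03bf\u03b4\u03bf" + FINAL_SIGMA  # a word written with a final sigma

print(MEDIAL_SIGMA in word)                 # False: different code points
print(contains_folded(word, MEDIAL_SIGMA))  # True: both fold to U+03C3
```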

> The next Unicode mistake is the presence of precomposed ligatures.

Perhaps, although this is arguably just another case of the issue
underlying the sigma problem.

> Now, let's talk about case conversion.  In Unicode, converting a cased
> character from lowercase to uppercase and back is insane, for several
> reasons.

It's mainly because case conversion is insane in natural languages. The
Unicode Consortium can't fix that, but can only deal with it.

> First of all the table lookups required for case conversion can be
> largely eliminated.  The 'c' code for converting an ascii character to
> uppercase is fairly simple and involves no table lookups.
>
> if (islower(ch)) ch += ('A' - 'a');

It'd be nice if you could apply this to all natural languages, but it
just doesn't work. What happens when you throw Japanese text at this
kind of code? It breaks horribly unless (1) you introduce a table lookup
to figure out whether the character has case variants, or (2) you create
fake "upper" and "lower" cases for Japanese, which creates the same kind
of comparison problems you complained about for Greek sigma.

This is another example of code that must die in globalized software.
The table lookup is inevitable, so use it to do things right instead of
relying on bit-pattern hacks.
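
A hedged Python sketch of the failure mode. `ascii_hack_upper` is a hypothetical helper that applies the quoted offset blindly, without the `islower` guard, to show what happens when nothing consults a table first; the built-in `str.upper()` is table-driven and leaves caseless characters alone.

```python
def ascii_hack_upper(ch: str) -> str:
    """Shift by ('A' - 'a') with no lookup, as in the quoted C idiom."""
    return chr(ord(ch) + (ord("A") - ord("a")))

hiragana_a = "\u3042"  # HIRAGANA LETTER A, which has no case variants

print(ascii_hack_upper("b"))                       # B
print(hiragana_a.upper() == hiragana_a)            # True: table says "no case"
print(ascii_hack_upper(hiragana_a) == hiragana_a)  # False: a different character
```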

> Our modified code for uppercasing in an extended character set, then,
> looks like this:
>
> if (locale.has_case_exceptions &&
>     !locale_case_exception(ch))
>     ch = locale.tolower(ch);
> else
>     ch += ('A' - 'a');
>
> Since most locales won't have any case exceptions --

That's not true unless you disallow Japanese text in Western locales.

> [German eszett] is demented.

German writers deal with it just fine. Unfortunately, computers are not
nearly as good at natural language.

> This is so astonishingly stupid that it warrants a kluge to fix it,
> and I name that kluge capital eszett. When printed, capital eszett
> looks like two 'S' characters side by side, but it is a single
> character ....

Unless you can somehow convince German typists to enter "capital eszett"
instead of "SS" on their keyboards, this is not a solution. It just
pushes the problem onto input devices/routines, which must guess whether
you meant "MASSE" or "MA<SS>E" when you typed it into your word
processor. While you might reasonably get professional typesetters to
explicitly indicate ligatures in their galleys, this "capital eszett"
idea seems doomed to failure. It would permit round-trip casing in one
direction, but it would not actually solve the problem.
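
For reference, the round-trip loss under discussion is easy to reproduce. This is a minimal Python sketch using the built-in string methods, which follow the Unicode case mappings:

```python
word = "ma\u00dfe"      # m, a, eszett, e

upper = word.upper()    # eszett expands to two letters
back = upper.lower()    # nothing records that "SS" was once an eszett

print(upper)            # MASSE
print(back)             # masse
print(back == word)     # False
```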

> To sum up: The unicode committee did not design a standard for use.
> They designed a standard for adoption.

The latter is necessary to have the former. Character encodings are
literally useless unless you can share them with other people, and that
requires widespread adoption.
--
Bradd W. Szonye
http://www.szonye.com/bradd

Bradd W. Szonye
05-10-05 09:01 AM


Re: [OT]The Unicode Rant
Ray Dillinger <bear@sonic.net> wrote:
> The special dementia of the character
> eszett is that it's not a ligature, but it is a lowercase
> character with no corresponding uppercase character.

This is because - at least in German - there are no words
that begin with an eszett. Are there other languages using
this character?

> When
> it changes case it changes into two capital letter 'S'
> characters.  It is also a poster child for non-reversible
> case changes; if you take the german word that is spelled m,
> a, eszett, e, and bump it to uppercase, you get M, A, S, S,
> E.  When you bump that string back to lowercase, you get m,
> a, s, s, e, which is a different word with a different
> meaning than m, a, eszett, e. [...]

This can be avoided by replacing the eszett with SZ (which
is pronounced es-zett in German, BTW): MASZE. 'Masze' used to be
an accepted way to type this word when lacking an eszett
character, but few people use it today. I do, because it
eliminates ambiguity.

Nils

--
Nils M Holm <nmh@despammed.com>         http://www.holm-und-jeschag.de/nils/
Symbolic Computing - an Introduction to Pure LISP: http://www.t3x.org/scipl/

Nils M Holm
05-10-05 01:58 PM


Re: [OT]The Unicode Rant
Ray Dillinger <bear@sonic.net> writes:

> There'd be one way to write things, was the promise; well, unless
> you used accented characters, in which case you had several
> different ways you could write most of them. You could use the
> precomposed character, or decompose it into a character followed by
> the accent. Why are these precomposed "compatibility" characters in
> the standard?

Because various simple programs and environments are not prepared for
the complexity of composing characters, so they can at least handle
important simple scripts (e.g. all European languages).

> We have them for codepoint compatibility with preexisting character
> set standards -- those "code pages" we were supposed to be able to
> forget about.

They can't be forgotten immediately. Data should be migrated. When
only a part of a system understands Unicode, data should be converted
on the boundary. If data in a legacy encoding cannot be losslessly
represented in Unicode, it's harder to adopt Unicode.

> The idea was that text written in any particular codepage set could
> be converted to Unicode by the simple expedient of adding an integer
> offset to all character codes over 127.

This sentence is false. Except for ISO-8859-1, this has never been a
constraint on assigning code points.
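
This can be checked with Python's built-in codecs (a sketch; the codec names `iso8859_1` and `iso8859_2` are Python's, and ISO-8859-2 is just one example of a non-Latin-1 code page). Every ISO-8859-1 byte maps to the code point with the same number, while two ISO-8859-2 bytes above 127 land at different distances from their byte values, so no single offset converts that page.

```python
# ISO-8859-1: byte value and code point coincide for all 256 bytes.
latin1 = bytes(range(256)).decode("iso8859_1")
print(all(ord(ch) == i for i, ch in enumerate(latin1)))  # True

# ISO-8859-2: no constant offset for bytes above 127.
a_ogonek = b"\xa1".decode("iso8859_2")  # U+0104
a_acute = b"\xe1".decode("iso8859_2")   # U+00E1
offsets = {ord(a_ogonek) - 0xA1, ord(a_acute) - 0xE1}
print(sorted(offsets))  # [0, 99]
```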

If some exotic script has this property, it's because there is no
point in randomly permuting characters which have already been encoded
in a different encoding.

There are no duplicate characters caused by the desire to have a
particular order of code points. "Duplicates" happen only because some
other encoding had both.

> Except that a lot of different codepages used many of the same
> accented characters, and we didn't want multiple copies of the same
> precomposed character in the standard. So what do we do with all the
> codepoints that mapped to the same accented character in different
> codepages (at their respective offsets into the unicode set)? We let
> one of them, for the most popular codepage, have that mapping, and
> then leave holes in the codepoint to character mapping at all the
> other locations.

Could you give an example? I believe this is completely false.

> Holes. There are codepoints that do not map to characters. In fact,
> there are lots and lots of them. These are mostly codepoints that
> will NEVER map to characters. And they're scattered haphazardly all
> the way through the character set, not gathered into a few sensible
> blocks for future expansion.

This makes it easier to group characters by scripts and other kinds of
blocks. If code points were allocated sequentially, a given block
would be scattered over many places unless it was encoded all at
once.

It would be harder to find characters which relate to one another
(not that they are *always* near, but with sequential allocation this
would be far worse).

> That means you can't even do something as simple as iterating over
> the set without constantly consulting tables; if you do, then some
> of the numbers in your iteration will not correspond to valid
> codepoints.

Iterating over all assigned code points is not a very useful thing to do.

> UTF-8 allows codepoints to be represented as units of four different
> widths.

It's not a bug, it's a feature. It lets ASCII stay ASCII, and it makes
text converted from encodings like ISO-8859-x to UTF-8 only a bit
larger instead of 4 times larger. Without ASCII compatibility UTF-8
would not have been adopted as an encoding of emails and usenet.
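
The size argument can be made concrete with a short Python sketch. The sample strings are my own, chosen only to have a known number of non-ASCII letters.

```python
ascii_text = "plain ASCII"
latin_text = "na\u00efve caf\u00e9"  # 10 characters, 2 of them non-ASCII

# ASCII text is byte-for-byte identical in UTF-8.
print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))  # True

print(len(latin_text.encode("iso8859_1")))  # 10
print(len(latin_text.encode("utf-8")))      # 12
print(len(latin_text.encode("utf-32-le")))  # 40
```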

> The result is that two strings known to be "unicode" cannot
> be compared to one another in a straightforward bitwise
> comparison for equality or collation.  Each must be decoded
> into a sequence of uniformly represented codepoints, trimmed
> of the byte order marker if necessary, and then compared --
> and this is pure overhead.

A given system uses a consistent representation for all its strings,
it translates them on the boundary with other systems. There is no
need to constantly handle strings in all those forms.
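
A minimal sketch of "translate on the boundary" in Python, where `str` plays the role of the single internal representation. `decode_at_boundary` is a hypothetical helper, and the caller is assumed to know the encoding of the incoming bytes.

```python
import codecs

def decode_at_boundary(data: bytes, encoding: str) -> str:
    """Convert external bytes to the one internal form; BOMs are consumed here."""
    return data.decode(encoding)

text = "caf\u00e9"
from_utf8_with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
from_utf16 = text.encode("utf-16")

a = decode_at_boundary(from_utf8_with_bom, "utf-8-sig")
b = decode_at_boundary(from_utf16, "utf-16")

print(from_utf8_with_bom == from_utf16)  # False: the wire bytes differ
print(a == b == text)                    # True: internal strings compare directly
```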

> If you want to know whether the three-thousand and first
> codepoint in two different files is identical, it should be
> possible, in my opinion, to seek to the three thousand and
> first codepoint (uniform widths are nice that way) and see
> whether the bit patterns found there are identical.

In other words you would use only UTF-32 with some fixed endianness.
Sorry, Unicode would never have been adopted if it made all data
4 times larger than legacy encodings. You don't offer a viable
alternative.

When data is transmitted over the network, it's important that it's
not too large. Most Internet protocols don't use compression.

> If your parser intends to support multiple encodings, it gets even
> worse, because then absolutely EVERY codepoint has multiple possible
> representations.

How is that a fault of Unicode?

> As a result, lowercase sigma has two different codepoints in
> unicode; one for its "normal" form and one for its "final" form.

Given that all Greek encodings do this, how is that a fault of
Unicode?

The requirement of contextual shaping for Greek would rule out
many simple rendering engines (e.g. terminal emulators), so it's
understandable that they preferred to have a simpler engine at the
cost of adding only a single character.

> I guess the next thing to mention is unicode's "bidirectional
> algorithm." The first thing I want to point out is the writing
> system of China, whose characters are normally written from top to
> bottom, in columns from the right side of the page to the left.

http://www.unicode.org/notes/tn22/

Left to right mixed with right to left is already a complex issue.
Almost no systems support vertical scripts because it's too hard.

> Moreover, it's not properly part of a character set standard
> so much as a proper part of a display standard.

It is a property of a character set standard if you want to be able to
encode a mixture of latin and arabic in a plain text file.

> Characters are always transmitted or recorded in computer files
> in the order in which they'd be written or read by a user of the
> natural language recorded; that's a reasonable "given," and I
> believe a sufficient one. In mandating the behavior in the "bidi"
> algorithm, Unicode has made illegal the actual preferred behavior
> used in several texts and historical periods.

I have no idea what you are talking about. Unicode "bidi" relies on
logical ordering, i.e. it assumes that characters are encoded in the
order they'd be written or read.

> And it has once again made unpredictable knowledge about individual
> codepoints an absolute requirement for correct character handling,
> meaning anything that deals with this aspect of unicode has to be
> driven by large data tables that will cause lots and lots of cache
> misses.

This is an inherent complexity of the problem, not an unnecessary
complexity in a solution.

> While it's worthwhile for display purposes to be able to know which
> characters are l2r and r2l, the character set can support this much
> better by having an r2l subrange and an l2r subrange.

There are too many properties which could determine the order of
allocation of code points. You can't have all of them.

Besides, bidi properties are more complex than a mere split into l2r
and r2l classes. For example some characters can be either depending
on the context. Since all systems dealing with Arabic or Hebrew use
a single character for e.g. space, Unicode should not have been
changing that.
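
The bidi classes can be inspected with Python's standard `unicodedata` module. This sketch only prints a handful of them to show that there are more than two, and that the space is a neutral shared by all scripts.

```python
import unicodedata

samples = {
    "Latin a": "a",
    "Hebrew alef": "\u05d0",
    "Arabic alef": "\u0627",
    "space": " ",
    "digit one": "1",
}

bidi = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}
for label, cls in bidi.items():
    print(f"{label:12} {cls}")
# Latin a      L
# Hebrew alef  R
# Arabic alef  AL
# space        WS
# digit one    EN
```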

> The next Unicode mistake is the presence of precomposed ligatures.

It's only because of the desire of lossless representation of texts
converted from some other encodings.

> First of all it requires table lookups which can for the most part
> be eliminated by judicious layout of codepoints.

You said that you wanted code points to be allocated sequentially with
no holes. This is incompatible with having all cased characters in one
place, unless all characters are encoded at once.

Case mapping is not that important to constrain the order of
allocation of code points. There are other issues: whether it's a
combining character, character width in monospace fonts (i.e. whether
it's 0, 1, or 2 cells), whether it's an important live character or
some obscure extinct hieroglyph, or which script it belongs to.
Unicode took the last property as the primary guideline. You can't
have everything at once.
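
Several of those properties are exposed by Python's standard `unicodedata` module, and a short sketch shows that they vary independently of code point order. The module has no direct script lookup, so the character name is used below as a rough stand-in for the script.

```python
import unicodedata

acute = "\u0301"       # COMBINING ACUTE ACCENT
ideograph = "\u4e2d"   # a CJK ideograph
sigma = "\u03c3"       # GREEK SMALL LETTER SIGMA

print(unicodedata.combining(acute))             # 230 (nonzero: a combining mark)
print(unicodedata.combining("a"))               # 0
print(unicodedata.east_asian_width(ideograph))  # W  (two cells in monospace)
print(unicodedata.east_asian_width("a"))        # Na (one cell)
print(unicodedata.name(sigma))                  # GREEK SMALL LETTER SIGMA
```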

> And finally, there are the sinogram blocks; The CJK ideogram block,
> the CJK ideograph extension A, the CJK ideograph extension B, and
> the many thousands of CJK compatibility ideographs each of which is
> merely another way to write an existing ideograph. This is nuts for
> several reasons; First, it's nuts because these aren't allocated in
> a contiguous block.

Because they haven't been encoded at the same time. Would you prefer
moving existing code points to make room for new ones?

> Second, it's nuts because this is a snapshot of the vocabulary
> of several living languages, and as such is bound to continue to
> change.

Blame Chinese, which has a script that depends on the vocabulary,
not Unicode.

> Third, it's nuts because it's woefully incomplete; even with all
> these hundreds of thousands of ideograms, the average Chinese person
> still can't correctly write his or her own address using these
> characters.

If Unicode had waited to encode Chinese until it was sure that no
characters were missing, it would still not have done it.

> Rather than developing a new and sensible way to do things taking
> into account all the world's writing systems, they co-opted all the
> ways people were already doing things,

This is useful for migrating existing systems and their data to Unicode.

There are definitely various details in Unicode which could have been
done a bit better, but there is no serious alternative which actually
did it. It's better to use Unicode as it is, than to dream about an
ideal world remade from scratch, with code points arranged according
to your chosen criterion, or even several separate criteria at once,
where Germans, Greeks and Chinese change important assumptions about
encoding their texts.

--
__("<         Marcin Kowalczyk
\__/       qrczak@knm.org.pl
^^     http://qrnik.knm.org.pl/~qrczak/

Marcin 'Qrczak' Kowalczyk
05-10-05 01:58 PM


Re: [OT]The Unicode Rant
Marcin 'Qrczak' Kowalczyk <qrczak@knm.org.pl> wrote:
> There are definitely various details in Unicode which could have been
> done a bit better, but there is no serious alternative which actually
> did it. It's better to use Unicode as it is, than to dream about an
> ideal world remade from scratch, with code points arranged according
> to your chosen criterion, or even several separate criteria at once,
> where Germans, Greeks and Chinese change important assumptions about
> encoding their texts.

Excellent article, and good summary. Unicode would be a lot simpler if
it weren't for the characters, but (unfortunately? luckily?) programmers
don't get to mandate how people write text. Until they do, we've no
choice but to implement it the best we can, and Unicode is a reasonable
attempt.

I do recommend that (programming) language designers learn at least the
basics of Unicode. While much of the encoding is straightforward, it's
not too hard to define a programming interface that makes Unicode harder
than it needs to be.

I also recommend that operating-system designers learn Unicode basics
and, at the very least, provide a simple mechanism to access the various
character properties from any program. That way, the data can all go
into some fixed part of memory accessed by the whole system, instead of
having each application dragging its own tables along and exacerbating
the lookup & caching problems.
--
Bradd W. Szonye
http://www.szonye.com/bradd

Bradd W. Szonye
05-10-05 01:58 PM


Re: [OT]The Unicode Rant
* Ray Dillinger (bear@sonic.net)
....
> E.  When you bump that string back to lowercase, you get m,
> a, s, s, e, which is a different word with a different
> meaning than m, a, eszett, e.  In order to lowercase M, A,
> S, S, E correctly in German, you have to know from context
> which word was intended,

Unless you live in/program for Switzerland where these two
words *are* spelled the same.

...
> This is demented.

Thanks, coming from germany.

> In one fell swoop, if not somehow fixed,
> this singular insane character makes case operations in the
> redesigned extended character set nonreversible,

Not that ISO-8859-1 is any better in this regard.

...
> This is so astonishingly stupid that it warrants a kluge to
> fix it, and I name that kluge capital eszett.

Ouch. Wrong solution domain. Would you also like an uppercase '&'?

To stay OnT: The proper way to do a case-insensitive language
is to declare that a-z are the only allowed letters in names
that are not quoted.

Andreas

--
np: 4'33

Andreas Krey
05-10-05 09:00 PM


Re: [OT]The Unicode Rant
Marcin 'Qrczak' Kowalczyk wrote:
> Ray Dillinger <bear@sonic.net> writes:
>
> Because various simple programs and environments are not prepared for
> the complexity of composing characters, so they can at least handle
> important simple scripts (e.g. all European languages).

IMVHO, this doesn't have to be handled at app level.
A character encoding system should primarily be concerned with
reading/writing all characters losslessly, not displaying them.

> It's not a bug, it's a feature. It lets ASCII stay ASCII, and it makes
> text converted from encodings like ISO-8859-x to UTF-8 only a bit
> larger instead of 4 times larger. Without ASCII compatibility UTF-8
> would not have been adopted as an encoding of emails and usenet.

Yes, I agree with this. I'm a former iso-8859-1 user and I think that
having differing widths in characters is OK.

> Blame Chinese, which has a script that depends on the vocabulary,
> not Unicode.

Couldn't sinograms be encoded on a stroke basis rather than a glyph
basis? Just guessing.

Sunnan
--
.i mi rodo roda fraxu

Sunnan
05-10-05 09:00 PM


Copyrights CodeComments.com 2004 - 2006