Code Comments
In response to several people who asked for it, I'm posting the Unicode Rant. Since this is not particularly about scheme, I've included a followup-to my email address in the headers; please don't flame in the newsgroup. I've also toned it down a bit, given it more technical justification and less anger, in hopes that it won't just trigger flames.

The Unicode Rant
-----

I remember hearing about unicode back in, I think, around 1993. It was supposed to be an expanded character set, more or less a sixteen bit replacement for ascii, that would simplify our lives as coders by allowing us to not worry about code pages, character set translations, etc; every character should have a single, simple, sixteen-bit code, and there'd finally be one universal format with a universal code width across all kinds of different platforms. So, finally, we'd have one unambiguous way to write things that used an expanded character set.

That was a good idea. So I looked forward to it. And when more information was available, I checked it out. And when I checked it out, I began to have doubts. You see, the delivered standard was not actually very much like the promised one.

There'd be one way to write things, was the promise; well, unless you used accented characters, in which case you had several different ways you could write most of them. You could use the precomposed character, or decompose it into a character followed by the accent.

Why are these precomposed "compatibility" characters in the standard? In olden days, the only way to have an accented character was to have a precomposed accented character. Since unicode requires that a character-followed-by-accent is valid, the complication of implementation for a more general encoding scheme exists anyway; why do we also have precomposed characters?
We have them for codepoint compatibility with preexisting character set standards -- those "code pages" we were supposed to be able to forget about. We have allowed redundant characters into the so-called universal standard, for the sole purpose of being compatible with the mess we were trying to be better than.

The idea was that text written in any particular codepage set could be converted to Unicode by the simple expedient of adding an integer offset to all character codes over 127. Except that a lot of different codepages used many of the same accented characters, and we didn't want multiple copies of the same precomposed character in the standard. So what do we do with all the codepoints that mapped to the same accented character in different codepages (at their respective offsets into the unicode set)? We let one of them, for the most popular codepage, have that mapping, and then leave holes in the codepoint to character mapping at all the other locations.

Holes. There are codepoints that do not map to characters. In fact, there are lots and lots of them. These are mostly codepoints that will NEVER map to characters. And they're scattered haphazardly all the way through the character set, not gathered into a few sensible blocks for future expansion. That means you can't even do something as simple as iterating over the set without constantly consulting tables; if you do, then some of the numbers in your iteration will not correspond to valid codepoints. In an era when cache misses are known to be the single most expensive thing you can do in a computer program, a standard which could have avoided them made data tables mandatory for something as simple as iteration. That sucks.

Now, one effect of having all these codepages preserved into unicode was that since the codepoints within codepages aren't changed, differently accented versions of the same letter are scattered throughout the codespace.
If you want to collate all A's before all B's and so on, you must use another external table to find your way through the codespace. Furthermore, different characters have different subsets of accented versions, and even with the same accents, do not usually appear in the same order.

Another promise that was made to us was that Unicode would represent characters, not fonts. Looking through the current unicode standard, I see Fraktur, monospace, sans serif, italic, bold, small, bold script, and double width versions of the entire Latin alphabet, plus others. Not only did they fail to hold the line, they failed spectacularly, by duplicating the entire alphabet dozens of times. I accept the same argument that the unicode committee accepted; in many cases, these different forms are part of basic expression in some realms like mathematics. I disagree with the conclusion that these alphabets needed these repetitions; acknowledging the need for fonts, I'd have made modifier codepoints to express fonts. Instead, we have many many wasted codepoints and many many repetitions of characters that, for most purposes, ought to be considered the same character.

Moreover, we have admitted font differences into the standard but have not made them generally applicable across alphabets; we've dedicated thousands of codepoints to miscellaneous versions of the Latin alphabet to accommodate fonts, but we still can't apply fonts to accented characters or to non-Latin characters. Treating fonts as a modifier character instead would have used only a dozen codepoints and at the same time would have allowed the uniform application of fonts across alphabets and accents.

Let's move on to another promise; remember the idea of a uniform-width codespace? Great idea, wasn't it? But along the way, Unicode attempted to swallow several very large character sets whole and wound up needing more than sixteen bits. Oops. It's now a 21-bit standard.
And we set aside thousands of numbers in the 16-bit codespace that will never map to codepoints, because they are used to express the first half and last half of codepoints larger than 16 bits. These are called "surrogates" - I suppose they are a "surrogate" for a standard with a uniform character width.

But that's not the only failure of the set to have a uniform width; Unicode has no fewer than seven different encodings, named UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. UTF-8 allows codepoints to be represented as units of four different widths. The three UTF-16 forms allow codepoints of two different widths. The three UTF-32 forms are in fact uniform-width encodings, but an extra codepoint (the byte order marker) is required to tell the system which encoding you're using if it's not a "whatever-BE" or "whatever-LE" form.

The result is that two strings known to be "unicode" cannot be compared to one another in a straightforward bitwise comparison for equality or collation. Each must be decoded into a sequence of uniformly represented codepoints, trimmed of the byte order marker if necessary, and then compared -- and this is pure overhead. If you want to know whether the three-thousand and first codepoint in two different files is identical, it should be possible, in my opinion, to seek to the three thousand and first codepoint (uniform widths are nice that way) and see whether the bit patterns found there are identical. Wouldn't that be nice? But no, instead you have to process six thousand characters worth of overhead through decompression, not tripping if there's a "fake" codepoint inserted in one to show byte order, and only then will you be able to make your comparison.

Now, I've already talked a little about accented characters, which are implemented in unicode as precomposed characters with accents built-in *and* as character-combiner sequences.
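Both complaints so far, variable-width encodings and the two spellings of an accented character, show up in a few lines of C. What follows is only a sketch under stated assumptions: the input is well-formed, NUL-terminated UTF-8, and the names utf8_offset_of, utf8_length, precomposed, and decomposed are made up for this illustration rather than taken from any library. It shows that finding the n-th codepoint takes a linear scan, and that the same visible text can have different codepoint counts and byte offsets depending on how the accent was written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte offset of the n-th codepoint (0-based) in a well-formed,
   NUL-terminated UTF-8 string, or -1 if the string is too short.
   Continuation bytes look like 10xxxxxx; every other byte starts a
   codepoint. The scan is linear: there is no way to jump straight to n. */
long utf8_offset_of(const char *s, size_t n)
{
    assert(s != NULL);
    size_t seen = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {  /* start of a codepoint */
            if (seen == n)
                return (long)i;
            seen++;
        }
    }
    return -1;
}

/* Number of codepoints in a well-formed UTF-8 string. */
size_t utf8_length(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}

/* The same visible text, "café!", spelled two ways. */
const char precomposed[] = "caf\xC3\xA9!";   /* ... U+00E9 '!'        */
const char decomposed[]  = "cafe\xCC\x81!";  /* ... U+0065 U+0301 '!' */
```

With these definitions, the '!' is codepoint 4 at byte offset 5 in the precomposed string but codepoint 5 at byte offset 6 in the decomposed one, so neither a codepoint index nor a byte offset taken from one spelling can be reused for the other, and strcmp reports the two strings as different.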
But let me revisit that point; this doesn't just mean there is more than one way to write a given character, this means that there are ways of differing widths as measured in codepoints to write a given character. Unicode requires these different ways of writing a character to be recognized as identical for (canonical) character comparisons, but this opens up a huge can of worms in the process of comparing characters and strings.

In the first place, it means that the same codepoint index in two different strings does not necessarily refer to the same character index, even if the two strings are otherwise canonically identical. In order to do a substring comparison, you have to process the entire string up to the end of the substrings you're interested in in order to know the character indexes. This throws out every possibility of efficient text comparisons.

Also, if you are parsing a (formal or natural) language, then any character that can be represented by more than one codepoint sequence doubles (or more) the size of the resulting state machine for processing codepoints, so the possibility of efficient parsing is also out the window with this polymorphism. If your parser intends to support multiple encodings, it gets even worse, because then absolutely EVERY codepoint has multiple possible representations. There is no taking input directly from a file for parsing any more; instead, you have to preprocess it to deal with its encoding forms and combining codepoints in order to get a "logical" stream of unicode characters in some known encoding before you have even a prayer of getting a parser to run efficiently; and having done that, your count of characters in a known encoding bears no reliable relationship to the actual location in the file of the corresponding codepoints. The result is that when your parser finds something it's supposed to change, it cannot simply seek to the appropriate codepoint in the file, based on the character count in the parser, and change it. This needlessly complicates the design of stream editors such as 'sed' - in a Unicode world, they require overhead to work that can easily add an order of magnitude to their cost.

Now let's talk about different single codepoints that all represent the exact same character. In lots of writing systems, a character can appear as any of a dozen or more different glyphs depending on context. The best known example of this is the "final sigma" form in Greek, where the lowercase character sigma took a different shape or glyph if it was the final character in a word. And in the history of computing, the only way to get a different character shape on the screen was to use a different codepoint. And Unicode inherits everything that ever got different codepoints in any codepage. As a result, lowercase sigma has two different codepoints in unicode; one for its "normal" form and one for its "final" form. It is however far from the only character that this is done to; others include most of the characters in heavily contextual scripts such as Myanmar and Arabic.

The problem is that despite looking different, these are distinguished solely by the written context in which they appear, and recognized by their user communities as being the same character. If a Greek user searches a document for "sigma" and instances of "final sigma" don't show up, it will be unexpected for that Greek user. Likewise for an Arabic user searching for a given character and not finding its isolated, medial, initial, or final forms. If the Greek user cuts "final sigma" and pastes into a non-final context, he expects the regular form of sigma rather than the final form. Once again, we are forced to go to massive code tables to figure out all the things we ought to be looking for or replacing with rather than just taking an input and going looking. In a world where cache misses are the single most expensive thing you can do on a computer, gigantic data tables are required just for a cut-and-paste operation or a "search" operation.

It would be a better solution to leave distinctions between forms solely dependent on their written context to the rendering engine that prints their glyphs. Then we could have the same codepoint representing the same character and do the simple efficient thing in text searches and search-and-replace operations.

I guess the next thing to mention is unicode's "bidirectional algorithm." The first thing I want to point out is the writing system of China, whose characters are normally written from top to bottom, in columns from the right side of the page to the left. The second thing I want to point out is the scripts of some south sea islanders, which are normally written from bottom to top, in columns from the right side of the page to the left. The third thing I want to point out is classical Greek, which was often written in "boustrophedon" form, that is, in lines alternating left-to-right and right-to-left, from the top of the page to the bottom.

I guess my point here is that "bidirectional" doesn't even begin to cover the universe of writing systems. Moreover, it's not properly part of a character set standard so much as a proper part of a display standard. Characters are always transmitted or recorded in computer files in the order in which they'd be written or read by a user of the natural language recorded; that's a reasonable "given," and I believe a sufficient one. In mandating the behavior in the "bidi" algorithm, Unicode has made illegal the actual preferred behavior used in several texts and historical periods. And it has once again made unpredictable knowledge about individual codepoints an absolute requirement for correct character handling, meaning anything that deals with this aspect of unicode has to be driven by large data tables that will cause lots and lots of cache misses.

While it's worthwhile for display purposes to be able to know which characters are l2r and r2l, the character set can support this much better by having an r2l subrange and an l2r subrange. Then querying for r2l or l2r could be done using a simple predicate on the codepoint rather than using yet more hideously expensive table lookups.

The next Unicode mistake is the presence of precomposed ligatures. Ligation, in most cases, is more properly a function of the rendering engine rather than the character standard. In the cases where we don't want to leave choices about ligation to the rendering engine, precomposed ligatures are the wrong approach. A zero-width non-breaking space can be used to inhibit ligation between two normally joining characters; a corresponding codepoint forcing ligation would be an appropriate complement. Thus, you'd have A, ligation joiner, e, instead of the Ae ligature. The benefits of this procedure are manifold.
First, it leaves common ligation to the rendering engine; this leaves it primarily under locale control, which is appropriate. Most people who don't use an alphabet natively can't look at heavily ligated scripts like Arabic or Myanmar and even be able to pick out individual characters, which will prevent them from being able to work with text in that script in even the most basic of ways. In this case the ligation works against usability. In precisely these locales, the ligation will be left undone, leaving people with a much simpler image wherein non-natives can at least count characters, look for morphological patterns, etc. In locales where the heavy ligation will be understood and useful to most people, on the other hand, ligation works for usability, and in precisely those cases it will be present.

Second, it allows explicit ligation or non-ligation in the character standard for when it is actually essential, and does not constrict, as the current approach does, choices about which characters can be ligated.

Third, it eliminates the need for "Titlecase" which has needlessly complicated the Unicode standard (and added yet another table lookup to working with these characters).

Fourth, it simplifies case conversions on ligatures - one simply converts the cases of the component characters - and leaves case-converted ligatures, with the sole exception of eszett, taking exactly the same number of codepoints to express as the original ligature. (Of which more anon; having strings change length just because of a case change is madness, for reasons I will explain soon.)

Fifth, it reduces the number of different ways to write the same string, which, together with a bunch of other such changes, can help make efficient lexers and parsers possible again.

Now, let's talk about case conversion. In Unicode, converting a cased character from lowercase to uppercase and back is insane, for several reasons.
First of all, it requires table lookups which can for the most part be eliminated by judicious layout of codepoints. Second, the length of the string (as measured in codepoints) is likely to change if there are any accented characters or ligations. Third, the operation is not reversible since there exist many cased characters which are not the preferred opposite-case elements of their own opposite-case elements. Fourth, there is Titlecase and its associated complexities to worry about if ligatures are involved. Let's examine each of these issues in turn, in the light of possible other ways of handling things.

First of all, the table lookups required for case conversion can be largely eliminated. The C code for converting an ASCII character to uppercase is fairly simple and involves no table lookups.

    if (islower(ch)) ch += ('A' - 'a');

'islower' here is taken to be a compiler macro that checks to see whether the codepoint 'ch' is between two known codepoints representing the beginning and end of the lowercase range, and since the expression ('A' - 'a') is a difference of two scalar constants, it generates no runtime code; the compiler simply replaces it with a constant. This is simple, efficient, and relatively bugproof, and something a lot like it can be used with an extended character set.

In the first place, for the moment all cased alphabets are also left-to-right alphabets, so there is no conflict in codepoint layout with the separation of left-to-right and right-to-left into separate blocks. All upper and lower-case characters are members of the left-to-right block. We could simply have two sub-blocks, one of which contains lowercase characters and the other of which contains uppercase characters, and use much the same logic as above.
The reasons why this is not feasible in Unicode are largely eliminated by the changes already proposed. The problem with precomposed accented characters with no altercase equivalent is eliminated along with precomposed accented characters. The problem with ligatures having no altercase equivalent is eliminated along with precomposed ligatures. The complexities of titlecasing are eliminated along with precomposed ligatures as well. There remain two and only two problems that we must address. The first is the singular character eszett, which is problematic in several other ways and which I will deal with separately. The second is that some case mappings, particularly those involving the letter i, depend on the locale.

Our modified code for uppercasing in an extended character set, then, looks like this:

    if (locale.has_case_exceptions
        && locale_case_exception(ch))
        ch = locale.toupper(ch);
    else ch += ('A' - 'a');

Since most locales won't have any case exceptions, the macro locale.has_case_exceptions will usually be false, which means table lookups can usually be completely avoided. Given a comprehensive 'default' for extended-character-set casing, the case exceptions for a given locale are usually going to be no more than two or three characters, so the tables consulted in lines 2 and 3 will normally be trivial in size rather than gigantic, and therefore should not frequently cause cache misses. This code will run about two orders of magnitude faster than the cheapest possible implementation of case switching in unicode.

The second problem with unicode case operations is that they are likely to change the length of the string. With the elimination of precomposed ligations and precomposed accented characters, they are dramatically less likely to do so. In fact, the only remaining case where the length of the string would change is, once again, the problematic character eszett.
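As a rough illustration of the scheme argued for above, here is a minimal C sketch of uppercasing with a small per-locale exception table. Everything in it is an assumption made for the example: the struct and function names are hypothetical, ASCII a-z stands in for the redesigned set's contiguous lowercase block, and the single Turkish-style exception (lowercase i to U+0130) is just one sample entry. Text is held as an array of codepoints with accents as separate combining codepoints, so case conversion touches only the base letters and the length stays the same.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One locale-specific override of the default uppercase mapping. */
struct case_exception {
    uint32_t lower;
    uint32_t upper;
};

/* Hypothetical locale record; most locales would leave the table empty. */
struct case_locale {
    bool has_case_exceptions;
    const struct case_exception *exceptions;
    size_t count;
};

/* Uppercase one codepoint. The tiny exception table is searched only when
   the locale has one; otherwise the mapping is pure arithmetic. Codepoints
   outside a-z (including combining accents) are returned unchanged. */
uint32_t upcase_codepoint(const struct case_locale *loc, uint32_t cp)
{
    assert(loc != NULL);
    if (loc->has_case_exceptions) {
        for (size_t i = 0; i < loc->count; i++)
            if (loc->exceptions[i].lower == cp)
                return loc->exceptions[i].upper;
    }
    if (cp >= 'a' && cp <= 'z')
        return cp - ('a' - 'A');
    return cp;
}

/* Uppercase a codepoint sequence in place; its length never changes. */
void upcase_sequence(const struct case_locale *loc, uint32_t *text, size_t len)
{
    for (size_t i = 0; i < len; i++)
        text[i] = upcase_codepoint(loc, text[i]);
}

/* Default locale: no exceptions, so no table is ever consulted. */
const struct case_locale default_locale = { false, NULL, 0 };

/* Turkish-style locale: 'i' uppercases to U+0130 (capital I with dot above). */
const struct case_exception turkish_table[] = { { 0x0069, 0x0130 } };
const struct case_locale turkish_locale = { true, turkish_table, 1 };
```

Uppercasing the sequence e, U+0301, i with default_locale gives E, U+0301, I: three codepoints in, three out. With turkish_locale the final codepoint becomes U+0130 instead, and the only table involved has a single entry.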
The special dementia of the character eszett is that it's not a ligature, but it is a lowercase character with no corresponding uppercase character. When it changes case it changes into two capital letter 'S' characters. It is also a poster child for non-reversible case changes; if you take the German word that is spelled m, a, eszett, e, and bump it to uppercase, you get M, A, S, S, E. When you bump that string back to lowercase, you get m, a, s, s, e, which is a different word with a different meaning than m, a, eszett, e. In order to lowercase M, A, S, S, E correctly in German, you have to know from context which word was intended, so something as simple as case operations winds up requiring human-level conceptual knowledge from outside the text. This is demented.

In one fell swoop, if not somehow fixed, this singular insane character makes case operations in the redesigned extended character set nonreversible, makes for uppercase forms that are ambiguous as to which word they mean and what is the proper lowercase for them, and makes case operations in the redesigned extended character set capable of changing the length of the string; in short, all the craziness of unicode case operations in one character.

This is so astonishingly stupid that it warrants a kluge to fix it, and I name that kluge capital eszett. When printed, capital eszett looks like two 'S' characters side by side, but it is a single character, and when made lowercase, it lowercases properly to eszett, meaning that the uppercase form of the string is not ambiguous as to which word it's the uppercase form of and has the same length in codepoints as the lowercase. This leaves the titlecase form 'Ss' unaccounted for, but I'm willing to regard it as having the same value as other contextual forms like 'final-sigma'; when printing uppercase eszett in an initial context in a word and followed by a lowercase character, you'd print the 'Ss' form instead of the 'SS' form.
The decision can be, and properly should be, left to a locale-aware rendering engine.

The third problem with unicode case changing operations is that they are nonreversible because there are many cased characters which are not the preferred opposite case of their own opposite case. Most of these characters are ligatures or accented characters, and the proposed redesign eliminates them. By representing these combinations in terms of simply cased characters, it becomes possible to perform case operations simply by operating on the components of the aggregates.

The fourth problem was properly representing titlecased ligatures. This ceases to be a problem when precomposed ligatures are removed from the character set design. What unicode calls a "titlecase ligature", in the redesigned character set, is a capital character, a ligation joiner, and a lowercase character. The simple algorithm for capitalizing a word - changing the first cased character to uppercase and all others to lowercase - is in fact sufficient even when words begin with ligatures, and produces forms corresponding to Unicode's titlecase forms.

The next thing I want to address is Hangul. This script is represented in Unicode in two different ways, and one of them ought to be eliminated for the sake of reducing multiplicity so as to be able to produce efficient parsers and lexers and for the sake of doing efficient and unambiguous string comparisons. For the same reasons as I advocated getting rid of ligatures and precomposed accented characters above, plus the fact that the Jamo form is more general, I'd advocate getting rid of the precomposed Hangul Syllables.

And finally, there are the sinogram blocks: the CJK ideogram block, the CJK ideograph extension A, the CJK ideograph extension B, and the many thousands of CJK compatibility ideographs each of which is merely another way to write an existing ideograph.
This is nuts for several reasons. First, it's nuts because these aren't allocated in a contiguous block. Second, it's nuts because this is a snapshot of the vocabulary of several living languages, and as such is bound to continue to change. Third, it's nuts because it's woefully incomplete; even with all these hundreds of thousands of ideograms, the average Chinese person still can't correctly write his or her own address using these characters.

Addresses using Sinograms are particularly problematic. Because this is essentially the vocabulary of place names, it contains a lot of proper nouns. In a universe where a character stands for a word, the "character set" is effectively the dictionary, and it's uncommon to find proper nouns in the dictionary. And Chinese culture uses a lot more place names than American culture does; for example, in Chinese cities, every intersection has a name, whereas in American cities, it is the streets that are named instead. This means that the number of place names in a Chinese city is on the order of the square of the number of place names in an American city of the same size. Codification of a set of sinograms that includes all the place names has not yet been done, by anyone, and opinions vary on whether it is a worthwhile task.

Japan adopted a sensible solution to this problem; there are auxiliary syllabary alphabets that are used to encode names whose Kanji form isn't available. In fact, these auxiliary alphabets are rapidly overtaking Kanji as the preferred form of written communication in Japan. Korea also adopted a very sensible solution to this problem, with the system of Hangul writing and the Jamo, and the ideogram characters are seen there as well with increasing rarity. But China itself is a land of deep traditions, and the ideographic writing system looks as though it will be used there for several more generations at least.
This problem goes deep, and unlike most of the other problems I've mentioned, there just isn't a complete and clever solution that will make everybody happy. One possibility that I think works for the redesigned extended character set, but which won't make everyone happy, is to allow (say) 4096 characters which are ideograph stroke modifier characters, and have a sequence of strokes "modify" a space character to build any ideogram. You'd leave final rendering of the ideogram to the rendering engine, which ought to be able to tell what ideogram you mean from the stroke sequence if it's a common one, and at least have enough information to make an attempt at a rendering if it's unknown.

The question here is whether 4096 strokes is the right number. The consideration involved was this:

    +-+-+-+-+-+-+-+
    +-+-+-+-+-+-+-+
    +-+-+-+-+-+-+-+
    +-+-+-+-+-+-+-+
    +-+-+-+-+-+-+-+
    +-+-+-+-+-+-+-+
    +-+-+-+-+-+-+-+
    +-+-+-+-+-+-+-+

In the above grid, there are eight columns of '+' signs and eight rows. A stroke between any two '+' signs can be expressed approximately as 3 bits for beginning row, 3 bits for beginning column, 3 bits for ending row, and 3 bits for ending column. That's a total of 12 bits, giving 4096 possible strokes. The ending points of the strokes can be displaced by half-a-row down and half-a-column right for slightly more control. If we double the number of rows and columns, we move up to 16 bits for 65536 possible strokes, which is probably too many.

The problem here is what is it that we're calling a character? Is it in fact a stroke rather than an ideogram? Are we comfortable with moving from a single ideogram to a sequence of perhaps a dozen or more strokes? Are we comfortable with the idea that a "word" (or ideograph) may be "misspelled" by using the wrong stroke and shifting a line in the middle one column to the right or left?
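The bit arithmetic here is easy to check in code. A minimal C sketch, assuming the 8x8 grid above with rows and columns numbered 0 to 7 and ignoring the half-step displacement; the struct stroke layout, the function names, and the order of the four fields within the 12 bits are arbitrary choices made for this example, not part of any proposal.

```c
#include <assert.h>
#include <stdint.h>

/* A stroke on the 8x8 grid: each coordinate is 0..7, i.e. 3 bits. */
struct stroke {
    unsigned start_row, start_col, end_row, end_col;
};

/* Pack four 3-bit coordinates into one 12-bit code in the range 0..4095. */
uint16_t stroke_pack(struct stroke s)
{
    assert(s.start_row < 8 && s.start_col < 8 &&
           s.end_row < 8 && s.end_col < 8);
    return (uint16_t)((s.start_row << 9) | (s.start_col << 6) |
                      (s.end_row << 3) | s.end_col);
}

/* Recover the four coordinates from a 12-bit stroke code. */
struct stroke stroke_unpack(uint16_t code)
{
    assert(code < 4096);
    struct stroke s = {
        (code >> 9) & 7u,   /* start_row */
        (code >> 6) & 7u,   /* start_col */
        (code >> 3) & 7u,   /* end_row   */
        code & 7u           /* end_col   */
    };
    return s;
}
```

A stroke from the top-left corner (0,0) to the bottom-right corner (7,7) packs to 63, and the largest code, 4095, is the degenerate stroke from (7,7) to (7,7). Doubling the grid to 16x16 would need 4 bits per coordinate, which is where the 16-bit, 65536-stroke figure comes from.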
I for one don't find it more unlikely or more disturbing than the fact that English speakers are perfectly capable of misspelling words with our alphabet too, and usually don't, and it's completely in line with the way accented characters are built from multiple codepoints; so it's a consistent and reasonably complete design.

The unicode standard aimed not to get into rendering issues, but here I'm advocating an approach which is based on the physical shape of a glyph for the character involved. There are many examples of typography where a given sinogram is rendered differently in a different font, in a way that would defeat identifying it with this scheme; a stroke slants to the left rather than to the right, for example. But there are also many regional differences in spelling, as for example between US and UK English, when it is recognizable to all natives that the same word is intended. Further, a standardized "spelling" of a given Sinogram need not dictate how it is rendered; remember the final glyph is provided by the rendering engine, after it looks up the combination of strokes. So subtleties of typography and character design for sinograms can be preserved in this system, even while promoting a standard spelling for "words."

To sum up: The unicode committee did not design a standard for use. They designed a standard for adoption. Rather than developing a new and sensible way to do things taking into account all the world's writing systems, they co-opted all the ways people were already doing things, mostly driven by hardware and software limitations that unicode conformance demands overcoming anyway, and duplicated all of them, with all their kluges, redundancies, and problems, into unicode, making a standard much more complex than it needed to be and much more difficult to implement or work with than it needed to be, promoting errors and inefficiencies in character handling.
A much simpler system for encoding writing is capable of encoding all the same scripts, using far fewer codepoints, while minimizing table lookups and cache misses, eliminating most case asymmetries, providing far better support for lexing and parsing by eliminating most multiplicities and ambiguities, and providing better coverage. It is true that under the redesigned character set more codepoints would generally be used to express the same strings, especially in ideographic languages, but it is also true that "wasting" space in the data flow is far less harmful to the functioning of programs than complicating the algorithms used and bringing in big data tables, and that the "wasted" space can be mostly recouped, as far as the commodities of disk space and line bandwidth are concerned, by standard compression algorithms (NOT encoding schemes) and uncompressed for applications where we want random-access.

Ray Dillinger <bear@sonic.net> writes:
> Since this is not particularly about scheme, I've included a followup-to
> my email address in the headers; please don't flame in the newsgroup.

Let the discussion work in the newsgroup. A healthy discussion is good, the reason I follow usenet. The topic started in c.l.s so keeping it here is not a problem. People can always ignore threads, or put an ObScheme note.

ObScheme: unicode is a problem in scheme and in general. discuss why or why not.

Fascinating article, BTW.

--
Cheers,                          The Rhythm is around me,
                                 The Rhythm has control.
Ray Blaak                        The Rhythm is inside me,
rAYblaaK@STRIPCAPStelus.net      The Rhythm has my soul.

Ray Dillinger writes:
Ray Blaak wrote:
> Let the discussion work in the newsgroup. A healthy discussion is
> good, the reason I follow usenet. The topic started in c.l.s so
> keeping it here is not a problem. People can always ignore threads, or
> put an ObScheme note.

I agree.

> ObScheme: unicode is a problem in scheme and in general. discuss why
> or why not.

Unicode can be tricky to implement and work with, but I think that's a symptom of the underlying problem rather than the problem itself. Natural language is messy, and Unicode attempts to tackle it all at once. As a result, dealing with Unicode is harder than dealing with any single encoding. However, in my opinion it's much easier than trying to tackle each encoding separately, one at a time, and is a success from that point of view.

Also, I think Scheme has the potential to handle Unicode more gracefully than many other languages, because Scheme's superior abstraction mechanisms can help to hide some of the ugliness.

--
Bradd W. Szonye
http://www.szonye.com/bradd
Ray Blaak wrote: > Let the discussion work in the newsgroup. A healthy discussion is good, the > reason I follow usenet. The topic started in c.l.s so keeping it here is not a > problem. I concur. I thought it was an interesting rant and I'm curious to hear solutions to the problems proposed. -- .i mi rodo roda fraxu
Ray Dillinger wrote: > In response to several people who asked for it, I'm posting the > Unicode Rant. Thanks! I have a few remarks in response, mainly to correct some factual errors and misconceptions. > There'd be one way to write things, was the promise; well, unless you > used accented characters, in which case you had several different ways > you could write most of them. You could use the precomposed > character, or decompose it into a character followed by the accent. > Why are these precomposed "compatibility" characters in the standard? > ... We have them for codepoint compatibility with preexisting > character set standards -- those "code pages" we were supposed to be > able to forget about. You /can/ forget about code pages if you use only Unicode, but not everyone has that luxury. Real applications must deal with legacy issues like non-upgradeable toolchains and round-trip conversion. While it's possible to keep track of the necessary information with auxiliary state (e.g., attach an "original encoding" field to strings), Unicode becomes less of an advantage in that scenario. In contrast, by providing compatibility characters, the standard offers a smoother upgrade path. Furthermore, by specifying a canonical version of each compatibility character, the standard ensures that everyone eventually upgrades to the same place. > We have allowed redundant characters into the so-called universal > standard, for the sole purpose of being compatible with the mess we > were trying to be better than. But when it comes to judging the value of an encoding standard, the degree of acceptance is an important part of "better." An encoding is useless if compatibility issues prevent adoption. > Holes. There are codepoints that do not map to characters. In fact, > there are lots and lots of them ... That means you can't even do > something as simple as iterating over the set without constantly > consulting tables .... 
What kind of program needs to iterate over all Unicode characters without knowing what they mean? You can't even sensibly display them or make a list of their names without tables. (For that matter, what kind of program needs to iterate over the whole set in the first place?) > Now, one effect of having all these codepages preserved into unicode > was that since the codepoints within codepages aren't changed, > differently accented versions of the same letter are scattered > throughout the codespace. If you want to collate all A's before all > B's and so on, you must use another external table to find your way > through the codespace. That collation method is not generally valid anyway, since accented letters often collate differently from unaccented letters. It would work for purely English-language texts, but then you're screwed if you ever decide to globalize the program, because you'll need to switch all instances of this hack to the more general form using tables. When you globalize, the tables are inevitable. Why encourage hacks that only make it harder to use the encoding as intended? By the way, this also relates to the pre-composed characters issue. What would be a "base letter + accent" in some languages is a unique letter in others. This kind of ambiguity crops up throughout Unicode, and the standard generally errs in favor of treating them as different characters (albeit with a compatibility decomposition). You can treat them as separate characters or not, with a simple conversion if you don't get your preferred form. That's the reality that users want, even if it makes the standard a bit "messier" algorithmically. > Let's move on to another promise; remember the idea of a uniform-width > codespace? [But the standard] set aside thousands of numbers in the > 16-codespace that will never map to codepoints, because they are used > to express the first half and last half of codepoints larger than 16 > bits .... 
The surrogate characters cannot appear in the fixed-width encodings of Unicode. For example, a UTF-32 application need not cope with variable-width characters or surrogates at all; it's merely a big "hole" in the encoding. > But that's not the only failure of the set to have a uniform width; > Unicode has no fewer than seven different encodings, named UTF-8, > UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. It's somewhat disingenuous to describe these as "seven different encodings," since four of them are just minor specializations of UTF-16 and UTF-32. Also, while I would agree that the UTF-16 series is an unfortunate historical accident, the UTF-8 encoding exists for reasons that would still be valid even if Unicode were originally conceived in its 21-bit form. > If you want to know whether the three-thousand and first codepoint in > two different files is identical -- Why would you want to do this? In the presence of combining accents, it's just as useless as checking the thousandth byte of an Asian "shift" encoding. For most meaningful tasks, you need a whole character, and you can't get that with simple code-point indices. > Also, if you are parsing a (formal or natural) language, then any > character that can be represented by more than one codepoint sequence > doubles (or more) the size of the resulting state machine for > processing codepoints .... That's trivially avoidable by canonicalizing the data before parsing. Indeed, many of your objections are trivially addressed by canonicalizing data upon input and always using the canonical form internally. > Now let's talk about different single codepoints that all represent > the exact same character [e.g., medial sigma vs final sigma]. Again, compatibility is important here. Unicode has to cope with the reality of fonts, which assign code points by glyph rather than by character. 
While the Unicode ideal of ignoring purely graphical differences is interesting, it clashes badly with the reality of how characters get onto a screen or a sheet of paper. Ignoring the difference between roman and italic A is OK, but if you sweep the sigma issue under the rug, you'll just inspire competing and incompatible standards, which is worse than useless. > If a Greek user searches a document for "sigma" and instances of > "final sigma" don't show up, it will be unexpected for that Greek > user. That depends on the context (i.e., whether you're writing a letter or typesetting a book). In contexts where they should match: just don't do that. The standard provides the information necessary to fold the two letters in "typography-insensitive" matching situations. > The next Unicode mistake is the presence of precomposed ligatures. Perhaps, although this is arguably just another case of the issue underlying the sigma problem. > Now, let's talk about case conversion. In Unicode, converting a cased > character from lowercase to uppercase and back is insane, for several > reasons. It's mainly because case conversion is insane in natural languages. The Unicode Consortium can't fix that, but can only deal with it. > First of all the table lookups required for case conversion can be > largely eliminated. The 'c' code for converting an ascii character to > uppercase is fairly simple and involves no table lookups. > > if (islower(ch)) ch += ('A' - 'a'); It'd be nice if you could apply this to all natural languages, but it just doesn't work. What happens when you throw Japanese text at this kind of code? It breaks horribly unless (1) you introduce a table lookup to figure out whether the character has case variants, or (2) you create fake "upper" and "lower" cases for Japanese, which creates the same kind of comparison problems you complained about for Greek sigma. This is another example of code that must die in globalized software. The table lookup is inevitable, so use it to do things right instead of relying on bit-pattern hacks. > Our modified code for uppercasing in an extended character set, then, > looks like this: > > if (locale.has_case_exceptions && > !locale_case_exception(ch)) > ch = locale.tolower(ch); > else ch += ('A' - 'a'); > > Since most locales won't have any case exceptions -- That's not true unless you disallow Japanese text in Western locales. > [German eszett] is demented. German writers deal with it just fine. Unfortunately, computers are not nearly as good at natural language. > This is so astonishingly stupid that it warrants a kluge to fix it, > and I name that kluge capital eszett. When printed, capital eszett > looks like two 'S' characters side by side, but it is a single > character .... Unless you can somehow convince German typists to enter "capital eszett" instead of "SS" on their keyboards, this is not a solution. It just pushes the problem onto input devices/routines, which must guess whether you meant "MASSE" or "MA<SS>E" when you typed it into your word processor. While you might reasonably get professional typesetters to explicitly indicate ligatures in their galleys, this "capital eszett" idea seems doomed to failure. It would permit round-trip casing in one direction, but it would not actually solve the problem. > To sum up: The unicode committee did not design a standard for use. > They designed a standard for adoption. The latter is necessary to have the former. Character encodings are literally useless unless you can share them with other people, and that requires widespread adoption. -- Bradd W. Szonye http://www.szonye.com/bradd
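A side note on the "canonicalize on input" advice in the post above: here is a minimal Python sketch of what that might look like, using the standard `unicodedata` module. The helper names `canonical` and `match_key` are made up for illustration, and using NFC plus `str.casefold()` as the matching key is an assumption about what "typography-insensitive" matching would mean in practice, not something the post prescribes.

```python
import unicodedata


def canonical(text: str) -> str:
    """Normalize to NFC so composed and decomposed spellings compare equal."""
    return unicodedata.normalize("NFC", text)


def match_key(text: str) -> str:
    """Illustrative key for case- and typography-insensitive matching."""
    return unicodedata.normalize("NFC", text).casefold()


precomposed = "\u00e9"   # e with acute accent as a single code point
decomposed = "e\u0301"   # plain 'e' followed by a combining acute accent

# The raw code point sequences differ, but the canonical forms agree.
print(precomposed == decomposed)
print(canonical(precomposed) == canonical(decomposed))

medial_sigma = "\u03c3"
final_sigma = "\u03c2"

# Case folding maps final sigma onto medial sigma, so a search matches both.
print(match_key(final_sigma) == match_key(medial_sigma))
```

The idea is that a parser or search routine only ever sees the output of `canonical` or `match_key`, so the multiple spellings never reach its state machine.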
Ray Dillinger <bear@sonic.net> wrote: > The special dementia of the character > eszett is that it's not a ligature, but it is a lowercase > character with no corresponding uppercase character. This is because - at least in German - there are no words that begin with an eszett. Are there other languages using this character? > When > it changes case it changes into two capital letter 'S' > characters. It is also a poster child for non-reversible > case changes; if you take the german word that is spelled m, > a, eszett, e, and bump it to uppercase, you get M, A, S, S, > E. When you bump that string back to lowercase, you get m, > a, s, s, e, which is a different word with a different > meaning than m, a, eszett, e. [...] This can be avoided by replacing the eszett with SZ (which is pronounced es-zett in German, BTW): MASZE. 'Masze' used to be an accepted way to type this word when lacking an eszett character, but few people use it today. I do, because it eliminates ambiguity. Nils -- Nils M Holm <nmh@despammed.com> http://www.holm-und-jeschag.de/nils/ Symbolic Computing - an Introduction to Pure LISP: http://www.t3x.org/scipl/
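The non-reversible round trip described above is easy to reproduce. The following is a small Python sketch relying on the built-in `str.upper()` and `str.lower()`; the helper `upper_with_sz` is hypothetical and only illustrates the SZ convention mentioned in the post, it is not a standard function.

```python
def upper_with_sz(text: str) -> str:
    """Hypothetical helper: spell eszett as 'sz' before uppercasing."""
    return text.replace("\u00df", "sz").upper()


word = "ma\u00dfe"             # m, a, eszett, e
shouted = word.upper()         # the eszett expands to two capital S
round_trip = shouted.lower()   # the eszett is not restored

print(shouted)
print(round_trip == word)

# With the SZ convention the uppercase form stays distinct from "MASSE".
print(upper_with_sz(word))
print(upper_with_sz(word) != "masse".upper())
```

Note that the SZ spelling only removes the ambiguity on the way up; mapping "SZ" back to an eszett on the way down would still need knowledge of the word.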
Ray Dillinger <bear@sonic.net> writes: > There'd be one way to write things, was the promise; well, unless > you used accented characters, in which case you had several > different ways you could write most of them. You could use the > precomposed character, or decompose it into a character followed by > the accent. Why are these precomposed "compatibility" characters in > the standard? Because various simple programs and environments are not prepared for the complexity of composing characters, and this way they can at least handle important simple scripts (e.g. all European languages). > We have them for codepoint compatibility with preexisting character > set standards -- those "code pages" we were supposed to be able to > forget about. They can't be forgotten immediately. Data should be migrated. When only a part of a system understands Unicode, data should be converted on the boundary. If data in a legacy encoding cannot be losslessly represented in Unicode, it's harder to adopt Unicode. > The idea was that text written in any particular codepage set could > be converted to Unicode by the simple expedient of adding an integer > offset to all character codes over 127. This sentence is false. Except for ISO-8859-1, this has never been a constraint on assigning code points. If some exotic script has this property, it's because there is no point in randomly permuting characters which have already been encoded in a different encoding. There are no duplicate characters caused by the desire to have a particular order of code points. "Duplicates" happen only because some other encoding had both. > Except that a lot of different codepages used many of the same > accented characters, and we didn't want multiple copies of the same > precomposed character in the standard. So what do we do with all the > codepoints that mapped to the same accented character in different > codepages (at their respective offsets into the unicode set)? 
We let > one of them, for the most popular codepage, have that mapping, and > then leave holes in the codepoint to character mapping at all the > other locations. Could you give an example? I believe this is completely false. > Holes. There are codepoints that do not map to characters. In fact, > there are lots and lots of them. These are mostly codepoints that > will NEVER map to characters. And they're scattered haphazardly all > the way through the character set, not gathered into a few sensible > blocks for future expansion. This makes it easier to group characters by scripts and other kinds of blocks. If code points were allocated sequentially, a given block would be scattered over many places if it's not encoded all at once. It would be harder to find characters which relate to one another (not that they are *always* near, but with sequential allocation this would be far worse). > That means you can't even do something as simple as iterating over > the set without constantly consulting tables; if you do, then some > of the numbers in your iteration will not correspond to valid > codepoints. Iterating over all assigned code points is not a very useful thing to do. > UTF-8 allows codepoints to be represented as units of four different > widths. It's not a bug, it's a feature. It lets ASCII stay ASCII, and it makes text converted from encodings like ISO-8859-x to UTF-8 only a bit larger instead of 4 times larger. Without ASCII compatibility UTF-8 would not have been adopted as an encoding of emails and usenet. > The result is that two strings known to be "unicode" cannot > be compared to one another in a straightforward bitwise > comparison for equality or collation. Each must be decoded > into a sequence of uniformly represented codepoints, trimmed > of the byte order marker if necessary, and then compared -- > and this is pure overhead. 
A given system uses a consistent representation for all its strings; it translates them on the boundary with other systems. There is no need to constantly handle strings in all those forms. > If you want to know whether the three-thousand and first > codepoint in two different files is identical, it should be > possible, in my opinion, to seek to the three thousand and > first codepoint (uniform widths are nice that way) and see > whether the bit patterns found there are identical. In other words you would use only UTF-32 with some fixed endianness. Sorry, Unicode would never have been adopted if it made all data 4 times larger than legacy encodings. You don't offer a viable alternative. When data is transmitted over the network, it's important that it's not too large. Most Internet protocols don't use compression. > If your parser intends to support multiple encodings, it gets even > worse, because then absolutely EVERY codepoint has multiple possible > representations. How is that a fault of Unicode? > As a result, lowercase sigma has two different codepoints in > unicode; one for its "normal" form and one for its "final" form. Given that all Greek encodings do this, how is that a fault of Unicode? The requirement of contextual shaping for Greek would rule out many simple rendering engines (e.g. terminal emulators), so it's understandable that they preferred to have a simpler engine at the cost of adding only a single character. > I guess the next thing to mention is unicode's "bidirectional > algorithm." The first thing I want to point out is the writing > system of China, whose characters are normally written from top to > bottom, in columns from the right side of the page to the left. http://www.unicode.org/notes/tn22/ Left to right mixed with right to left is already a complex issue. Almost no systems support vertical scripts because it's too hard. > Moreover, it's not properly part of a character set standard > so much as a proper part of a display standard. It is a property of a character set standard if you want to be able to encode a mixture of Latin and Arabic in a plain text file. > Characters are always transmitted or recorded in computer files > in the order in which they'd be written or read by a user of the > natural language recorded; that's a reasonable "given," and I > believe a sufficient one. In mandating the behavior in the "bidi" > algorithm, Unicode has made illegal the actual preferred behavior > used in several texts and historical periods. I have no idea what you are talking about. Unicode "bidi" relies on logical ordering, i.e. it assumes that characters are encoded in the order they'd be written or read. > And it has once again made unpredictable knowledge about individual > codepoints an absolute requirement for correct character handling, > meaning anything that deals with this aspect of unicode has to be > driven by large data tables that will cause lots and lots of cache > misses. This is an inherent complexity of the problem, not an unnecessary complexity in a solution. > While it's worthwhile for display purposes to be able to know which > characters are l2r and r2l, the character set can support this much > better by having an r2l subrange and an l2r subrange. 
There are too many properties which could determine the order of allocation of code points. You can't have all of them. Besides, bidi properties are more complex than a mere split into l2r and r2l classes. For example some characters can be either depending on the context. Since all systems dealing with Arabic or Hebrew use a single character for e.g. space, Unicode should not have changed that. > The next Unicode mistake is the presence of precomposed ligatures. It's only because of the desire for lossless representation of texts converted from some other encodings. > First of all it requires table lookups which can for the most part > be eliminated by judicious layout of codepoints. You said that you wanted code points to be allocated sequentially with no holes. This is incompatible with having all cased characters in one place, unless all characters are encoded at once. Case mapping is not important enough to constrain the order of allocation of code points. There are other issues: whether it's a combining character, character width in monospace fonts (i.e. whether it's 0, 1, or 2 cells), whether it's an important live character or some obscure extinct hieroglyph, or which script it belongs to. Unicode took the last property as the primary guideline. You can't have everything at once. > And finally, there are the sinogram blocks; The CJK ideogram block, > the CJK ideograph extension A, the CJK ideograph extension B, and > the many thousands of CJK compatibility ideographs each of which is > merely another way to write an existing ideograph. This is nuts for > several reasons; First, it's nuts because these aren't allocated in > a contiguous block. Because they haven't been encoded at the same time. Would you prefer moving existing code points to make room for new ones? > Second, it's nuts because this is a snapshot of the vocabulary > of several living languages, and as such is bound to continue to > change. 
Blame Chinese, which has a script that depends on the vocabulary, not Unicode. > Third, it's nuts because it's woefully incomplete; even with all > these hundreds of thousands of ideograms, the average Chinese person > still can't correctly write his or her own address using these > characters. If Unicode had waited to encode Chinese until it was sure that no characters were missing, it would still not have done it. > Rather than developing a new and sensible way to do things taking > into account all the world's writing systems, they co-opted all the > ways people were already doing things, This is useful for migrating existing systems and their data to Unicode. There are definitely various details in Unicode which could have been done a bit better, but there is no serious alternative which actually did it. It's better to use Unicode as it is, than to dream about an ideal world remade from scratch, with code points arranged according to your chosen criterion, or even several separate criteria at once, where Germans, Greeks and Chinese change important assumptions about encoding their texts. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/
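To put rough numbers on the size argument in the post above, here is a short Python sketch that encodes a few sample strings in three Unicode encoding forms and counts the bytes. The sample strings and the helper name `encoded_sizes` are arbitrary choices made for illustration; only the built-in `str.encode()` is used, with the explicit little-endian codecs so that no byte order mark is added.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes for each encoding form."""
    return {name: len(text.encode(name)) for name in ENCODINGS}


samples = {
    "ascii": "hello",
    "latin-1 range": "na\u00efve",
    "cjk": "\u6f22\u5b57",
    "beyond 16 bits": "\U0001d11e",   # needs a surrogate pair in UTF-16
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

Pure ASCII text stays the same size in UTF-8 and quadruples in UTF-32, which is the trade-off the post is pointing at; the last sample takes four bytes in all three forms.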
Marcin 'Qrczak' Kowalczyk <qrczak@knm.org.pl> wrote: > There are definitely various details in Unicode which could have been > done a bit better, but there is no serious alternative which actually > did it. It's better to use Unicode as it is, than to dream about an > ideal world remade from scratch, with code points arranged according > to your chosen criterion, or even several separate criteria at once, > where Germans, Greeks and Chinese change important assumptions about > encoding their texts. Excellent article, and good summary. Unicode would be a lot simpler if it weren't for the characters, but (unfortunately? luckily?) programmers don't get to mandate how people write text. Until they do, we've no choice but to implement it the best we can, and Unicode is a reasonable attempt. I do recommend that (programming) language designers learn at least the basics of Unicode. While much of the encoding is straightforward, it's not too hard to define a programming interface that makes Unicode harder than it needs to be. I also recommend that operating-system designers learn Unicode basics and, at the very least, provide a simple mechanism to access the various character properties from any program. That way, the data can all go into some fixed part of memory accessed by the whole system, instead of having each application dragging its own tables along and exacerbating the lookup & caching problems. -- Bradd W. Szonye http://www.szonye.com/bradd
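As one illustration of what a "simple mechanism to access the various character properties" could look like from a program's point of view, here is a Python sketch built on the standard `unicodedata` module. The wrapper `describe` is a made-up convenience function, and this shows a language runtime shipping the tables, which is an assumption about the shape of such an interface rather than the operating-system service the post asks for.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few per-character properties in one place."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),       # e.g. Lu, Ll, Mn, Zs
        "bidi": unicodedata.bidirectional(ch),      # e.g. L, R, WS, NSM
        "combining": unicodedata.combining(ch),     # 0 for non-combining
    }


for ch in ("A", "\u05d0", " ", "\u0301"):
    print(repr(ch), describe(ch))
```

The space character reports a bidi class of its own rather than left-to-right or right-to-left, which matches the earlier remark in the thread that bidi properties are more than a two-way split.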
* Ray Dillinger (bear@sonic.net) .... > E. When you bump that string back to lowercase, you get m, > a, s, s, e, which is a different word with a different > meaning than m, a, eszett, e. In order to lowercase M, A, > S, S, E correctly in German, you have to know from context > which word was intended, Unless you live in/program for Switzerland where these two words *are* spelled the same. ... > This is demented. Thanks, coming from Germany. > In one fell swoop, if not somehow fixed, > this singular insane character makes case operations in the > redesigned extended character set nonreversible, Not that ISO-8859-1 is any better in this regard. ... > This is so astonishingly stupid that it warrants a kluge to > fix it, and I name that kluge capital eszett. Ouch. Wrong solution domain. Would you also like an uppercase '&'? To stay OnT: The proper way to do a case-insensitive language is to declare that a-z are the only allowed letters in names that are not quoted. Andreas -- np: 4'33
Marcin 'Qrczak' Kowalczyk wrote: > Because various simple programs and environments are not prepared to > the complexity of composing characters, so they can at least handle > important simple scripts (e.g. all European languages). IMVHO, this doesn't have to be handled at app level. A character encoding system should primarily be concerned with reading/writing all characters losslessly, not displaying them. > It's not a bug, it's a feature. It lets ASCII stay ASCII, and it makes > text converted from encodings like ISO-8859-x to UTF-8 only a bit > larger instead of 4 times larger. Without ASCII compatibility UTF-8 > would not have been adopted as an encoding of emails and usenet. Yes, I agree with this. I'm a former iso-8859-1 user and I think that having differing widths in characters is OK. > Blame Chinese which have a script which depends on the vocabulary, > not Unicode. Couldn't sinograms be encoded on a stroke basis rather than a glyph basis? Just guessing. Sunnan -- .i mi rodo roda fraxu