Author case-sensitivity
H.

2006-02-15, 7:02 pm

I heard through the webcast grapevine that the next official Scheme
standard might include case-sensitivity. Anyone know why this
consideration is under evaluation? Is it felt that computer science
students are somehow missing out by not having to worry about
case-sensitivity?

robert.corbett@sun.com

2006-02-15, 9:58 pm

> Anyone know why this consideration is under evaluation?

I don't know why the change is being considered. I do know
that for languages other than English case insensitivity can
be problematic.

Bob Corbett

Ray Dillinger

2006-02-16, 7:58 am

H. wrote:
> I heard through the webcast grapevine that the next official Scheme
> standard might include case-sensitivity. Anyone know why this
> consideration is under evaluation? Is it felt that computer science
> students are somehow missing out by not having to worry about
> case-sensitivity?
>


The next official scheme standard is concerned with eliminating
requirements that would prevent (or make difficult) the support
of the Unicode character set. One of these requirements is
case insensitivity. The R5RS case insensitivity requirements only
make sense with an alphabet where there is a bijective mapping
between lowercase and uppercase characters, and this is not
generally true outside of the ascii range.
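
A minimal sketch of the point about bijectivity, assuming Python 3, whose built-in `str.upper` and `str.lower` follow the Unicode case-mapping tables. The Greek sigma is an illustrative choice and does not come from the thread:

```python
# Two distinct lowercase letters share a single uppercase letter, so the
# lowercase -> uppercase mapping cannot be inverted.
small_sigma = "\u03c3"   # GREEK SMALL LETTER SIGMA
final_sigma = "\u03c2"   # GREEK SMALL LETTER FINAL SIGMA


def round_trip(ch: str) -> str:
    """Uppercase a character, then lowercase the result."""
    return ch.upper().lower()


# Both lowercase forms map to U+03A3 GREEK CAPITAL LETTER SIGMA.
print(small_sigma.upper() == final_sigma.upper())
# The round trip does not give the final sigma back.
print(round_trip(final_sigma) == final_sigma)
```
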

The short version of the story is that requiring case-insensitive
identifiers under Unicode is difficult to implement, has lots of
corner and edge cases where "distinct" identifiers might be mistaken
for one another, and no matter how you do it, people who use some
language or locale will believe you did it wrong. Case sensitivity
is much, much easier than sorting out the problems.

Bear

William D Clinger

2006-02-16, 7:03 pm

Ray Dillinger wrote:
> The next official scheme standard is concerned with eliminating
> requirements that would prevent (or make difficult) the support
> of the Unicode character set. One of these requirements is
> case insensitivity. The R5RS case insensitivity requirements only
> make sense with an alphabet where there is a bijective mapping
> between lowercase and uppercase characters, and this is not
> generally true outside of the ascii range.


Bijectivity is not required. Any idempotent mapping from characters
to characters would suffice. The Unicode standard defines such a
mapping for case conversion, so Unicode is not the issue.
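
As a rough illustration of why idempotence is enough, here is a sketch in Python 3. It uses `str.casefold` as a stand-in for the Unicode folding; note that `casefold` works on whole strings rather than character by character, so it only approximates the mapping described above. The names `fold` and `same_identifier` are hypothetical:

```python
def fold(identifier: str) -> str:
    """Map an identifier to a canonical caseless form."""
    return identifier.casefold()


def same_identifier(a: str, b: str) -> bool:
    """Two spellings name the same identifier when their folded forms match."""
    return fold(a) == fold(b)


samples = ["LAMBDA", "Stra\u00dfe", "\u03a3\u03c3\u03c2", "call/CC"]
for s in samples:
    # Folding an already folded name changes nothing further. The comparison
    # never needs to map a folded name back to an "original" spelling.
    assert fold(fold(s)) == fold(s)
```
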

> The short version of the story is that requiring case-insensitive
> identifiers under Unicode is difficult to implement,


But implementations will probably be required to implement the
difficult case conversions for Unicode anyway, so that is not the
issue. See SRFI 75.

> has lots of
> corner and edge cases where "distinct" identifiers might be mistaken
> for one another, and no matter how you do it, people who use some
> language or locale will believe you did it wrong.


Those, I believe, are the issues.

The proposed change from case-insensitivity to case-sensitivity
may have several justifications, but one of those justifications is
a perception that many, and perhaps most, Scheme programmers
would prefer case-sensitive identifiers and symbols, and that the
inevitable conversion to Unicode provides the best opportunity
we will have to change to case-sensitivity if we're going to do it
at all.

If Scheme programmers actually prefer case-insensitivity, now
would be a good time to find out. A reasonably objective survey
on this would be more helpful than arguing about it. Is anyone
willing to conduct such a survey?

Will

Ray Dillinger

2006-02-16, 9:57 pm

William D Clinger wrote:
> Ray Dillinger wrote:


>
>
> But implementations will probably be required to implement the
> difficult case conversions for Unicode anyway, so that is not the
> issue. See SRFI 75.


Is it the case that if you're going to embed something
in a device with an 8-bit (or 5-bit!) character set,
language standards are out the window anyway and just
need to be ignored? Because if schemata are *required*
to implement unicode case conversions, then the language
(or at least that part of it) becomes useless and
non-embeddable in non-unicode environments.

Or, perhaps, like Java, it becomes the mission of scheme
implementations to carry Unicode with them into every
environment to which they're ported?

> If Scheme programmers actually prefer case-insensitivity, now
> would be a good time to find out. A reasonably objective survey
> on this would be more helpful than arguing about it. Is anyone
> willing to conduct such a survey?


I don't think I'm ready to conduct the straw poll myself, but
here's my (somewhat paradoxical) answer to it. For my
*personal* comfort, I'd prefer case-insensitivity for the
characters A-Z(and a-z) ONLY. But I'd never recommend that
as a standard, because it's culturecentric against those
people whose "first and most familiar" alphabet isn't the
roman alphabet. It caters to my particular weaknesses in
reading other cased alphabets, where I am mostly unable to
tell that characters differ only by case, and caters to my
particular strengths in the roman alphabet, where I look
at a glyph and identify it more strongly as its (caseless)
character than by its case.

For the standard, I think I'd recommend case-sensitivity,
just because I don't want to have to figure out whether
identifiers in a character set I'm unfamiliar with are
"the same" identifier under case mapping rules I don't
know.

Otherwise (if case-insensitivity is preserved) then if I
ever work with code that has non-ascii identifiers, I'm
going to have to write a "code sanitizer" that smashes
case deliberately in order to make all identifiers that
are *logically* the same *look* the same.
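
One possible shape for such a sanitizer, sketched in Python 3 under the assumption that Unicode case folding (`str.casefold`) is the rule in force. The function names and the sample identifiers are made up for illustration:

```python
from collections import defaultdict


def smash_case(identifier: str) -> str:
    """Rewrite an identifier into one canonical, caseless spelling."""
    return identifier.casefold()


def lookalike_groups(identifiers):
    """Group spellings that a case-insensitive reader would treat as one name."""
    groups = defaultdict(set)
    for name in identifiers:
        groups[smash_case(name)].add(name)
    # Only report names that actually appear under more than one spelling.
    return {folded: sorted(names)
            for folded, names in groups.items() if len(names) > 1}


names = ["Gr\u00f6\u00dfe", "GR\u00d6SSE", "car", "CAR", "cdr"]
for folded, spellings in lookalike_groups(names).items():
    print(folded, spellings)
```
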

Bear


Pascal Bourguignon

2006-02-16, 9:57 pm

Ray Dillinger <bear@sonic.net> writes:
> I don't think I'm ready to conduct the straw poll myself, but
> here's my (somewhat paradoxical) answer to it. For my
> *personal* comfort, I'd prefer case-insensitivity for the
> characters A-Z(and a-z) ONLY. But I'd never recommend that
> as a standard, because it's culturecentric against those
> people whose "first and most familiar" alphabet isn't the
> roman alphabet.


The English & Latin alphabets.

I don't know any other roman language with no accent.

boîte -> BOîTE, or BOÎTE -> boÎte are unfortunate.

So now you may want to extend it to ISO-8859-1, but then you'll hit ß.

> For the standard, I think I'd recommend case-sensitivity,
> just because I don't want to have to figure out whether
> identifiers in a character set I'm unfamiliar with are
> "the same" identifier under case mapping rules I don't
> know.


Indeed it's the best.


> Otherwise (if case-insensitivity is preserved) then if I
> ever work with code that has non-ascii identifiers, I'm
> going to have to write a "code sanitizer" that smashes
> case deliberately in order to make all identifiers that
> are *logically* the same *look* the same.


Some Pascal Pretty Printers did that.

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
__Pascal Bourguignon__ http://www.informatimago.com/
Ray Dillinger

2006-02-16, 9:57 pm

Pascal Bourguignon wrote:
> Ray Dillinger <bear@sonic.net> writes:
>
>
>
> The English & Latin alphabets.
>
> I don't know any other roman language with no accent.


Y'know, it took me a moment to realize what you were talking
about. I don't think of accents as a feature of a language,
and I'd been happily identifying just about all european
languages as belonging to the group that used the roman alphabet.

If "boîte" and "boite" *weren't* the same identifier, I'd find
it jarring and annoying, because to me these words are spelled
"the same." But I'd get over it pretty quick, because at least
there *is* a visual difference.

In English, we steal words all the time. "Mañana" in Spanish
got imported whole, albeit with a slightly different meaning
having to do with procrastination or schedule unpredictability,
and whether the third letter is ñ or n just depends on the
typing skills and personal preference of the person typing it.

I think of it as correctly spelled either way (no, more
strongly, I at least think of these as the SAME SPELLING),
and if I were telling someone how it were spelled I probably
wouldn't bother to mention the accent unless there were a
reason for being pedantic. If typesetting or password entry
or something visual demanded it, yes. But just spelling?
Probably not.

Likewise, when I'm *reading* Spanish, or French, I don't
think of the accents. At all. If they weren't there I
wouldn't notice their absence. I guess I'm blind to the stuff
that my native language doesn't use.

And a system that distinguishes identifiers depending only
on different accents is going to require me to shift gears
in a slightly more jarring way than distinguishing identifiers
based only on case, or even in unfamiliar alphabets. But
as you say, it's for the best.

*sigh.*


Bear

Brian Harvey

2006-02-17, 3:58 am

Ray Dillinger <bear@sonic.net> writes:
>William D Clinger wrote:
>
>I don't think I'm ready to conduct the straw poll myself, but
>here's my (somewhat paradoxical) answer to it. For my
>*personal* comfort, I'd prefer case-insensitivity for the
>characters A-Z(and a-z) ONLY. But I'd never recommend that
>as a standard, because it's culturecentric against those
>people whose "first and most familiar" alphabet isn't the
>roman alphabet.


I'm not volunteering either. But case-insensitivity is so vital to me
as a teacher that I would have to freeze our Scheme package with its
current version if it changed. I am constantly presenting students with
a modified version of code they've seen before, with the changed part in
all capitals. If I couldn't do that, I'd feel as if I had both hands tied
behind my back and had to operate the keyboard with my nose.

I concede that the needs of teachers may be different from the needs of
programmers, and that the needs of ASCII-land teachers may be different
from the needs of rest-of-the-world teachers. But I humbly beg for, at
least, a backward compatibility mode at least until I retire. :-)

P.S. The following is probably just curmudgeonliness, but even wearing my
programmer hat I prefer case-insensitivity, because when reading other
people's code I can never keep track of the difference between thisVariable
and ThisVariable when people do such things.

P.P.S. Sorry, I should read the SRFI instead of asking, but is it proposed
to eliminate the STRING-CI=? primitive? Because, if not, the Unicode case
problem has to be solved or punted regardless of how symbols are compared.
(And, notice how easily you could parse which word of two sentences back was
intended as the name of a procedure?)
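
To make "solved or punted" concrete, here is a sketch of the two choices in Python 3. The function names are hypothetical; the "solved" variant assumes that `str.casefold` is an acceptable definition of caseless equality, which is itself a judgment call:

```python
def ascii_fold(s: str) -> str:
    """Fold only A-Z to a-z; every other character is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def string_ci_equal_ascii(a: str, b: str) -> bool:
    """The 'punt': case-insensitive for the ASCII letters only."""
    return ascii_fold(a) == ascii_fold(b)


def string_ci_equal_unicode(a: str, b: str) -> bool:
    """The 'solve': compare under Unicode case folding."""
    return a.casefold() == b.casefold()


upper_word = "\u00c9T\u00c9"   # capital E acute, T, capital E acute
lower_word = "\u00e9t\u00e9"   # the same word in lowercase
print(string_ci_equal_ascii(upper_word, lower_word))
print(string_ci_equal_unicode(upper_word, lower_word))
```
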
Pascal Bourguignon

2006-02-17, 3:58 am

Ray Dillinger <bear@sonic.net> writes:

> Pascal Bourguignon wrote:
>
> Y'know, it took me a moment to realize what you were talking
> about. I don't think of accents as a feature of a language,
> and I'd been happily identifying just about all european
> languages as belonging to the group that used the roman alphabet.
>
> If "boîte" and "boite" *weren't* the same identifier, I'd find
> it jarring and annoying, because to me these words are spelled
> "the same." But I'd get over it pretty quick, because at least
> there *is* a visual difference.


They aren't the same. Only a barbarian would think so.

Some examples:

"Les moines aiment les jeûnes !" The monks like the fastings.
"Les moines aiment les jeunes !" The monks like the youngs.

"Le poisson est sale." The fish is dirty.
"Le poisson est salé." The fish is salted.

mais = but
maïs = corn

etc...


> In English, we steal words all the time. "Mañana" in Spanish
> got imported whole, albeit with a slightly different meaning
> having to do with procrastination or schedule unpredictability,
> and whether the third letter is ñ or n just depends on the
> typing skills and personal preference of the person typing it.


We tend to think it depends on the depths of the backwaters the
specific American comes from.


> And a system that distinguishes identifiers depending only
> on different accents is going to require me to shift gears
> in a slightly more jarring way than distinguishing identifiers
> based only on case, or even in unfamiliar alphabets. But
> as you say, it's for the best.


Don't worry, international software will still be written only with
English identifiers. But in lisp, symbols are used as well for data
as for variable names.


--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
__Pascal Bourguignon__ http://www.informatimago.com/
Anton van Straaten

2006-02-17, 3:58 am

Pascal Bourguignon wrote:
> Ray Dillinger <bear@sonic.net> writes:

....
>
>
> They aren't the same. Only a barbarian would think so.


This from someone who thinks the car of nil should be nil?
Pascal Bourguignon

2006-02-17, 7:57 am

Anton van Straaten <anton@appsolutions.com> writes:

> Pascal Bourguignon wrote:
> ...
>
> This from someone who thinks the car of nil should be nil?


:-)

--
__Pascal Bourguignon__ http://www.informatimago.com/

HANDLE WITH EXTREME CARE: This product contains minute electrically
charged particles moving at velocities in excess of five hundred
million miles per hour.
Ray Dillinger

2006-02-17, 7:57 am

Pascal Bourguignon wrote:
> Ray Dillinger <bear@sonic.net> writes:


>
>
> They aren't the same. Only a barbarian would think so.


Heh. Bar, bar, bar... to me, accents are a feature of individual
words, not languages - and a feature of a particular *way* of writing
those words at that, not a feature of spelling. I am that barbarian.

> Some examples:
>
> "Les moines aiment les jeûnes !" The monks like the fastings.
> "Les moines aiment les jeunes !" The monks like the youngs.
>
> "Le poisson est sale." The fish is dirty.
> "Le poisson est salé." The fish is salted.


Hmmm... I'd always read "sale" in French for rotten or stale,
and figured "salé" for salted derived from salting things to
mask their bad odor.

> mais = but
> maïs = corn
>
> etc...


Yep. And in English we have

lead = a very heavy metal
lead = going first and showing others the way

leading = the space between letters in typesetting
leading = eminent or noteworthy
leading = first in a traveling group

butt = the rear portion of the anatomy atop the legs
butt = the object of a joke
butt = to hit someone with your head

bit = a small amount of something
bit = the cutting part of a drill

etc....

There is plenty of room in our framework for words
that are spelled the same and mean different things.

>
>
> We tend to think it depends on the depths of the backwaters the
> specific American comes from.


Heh. You're right, of course... It doesn't get much deeper
than where I'm from. I have a lot of relatives who still use
"thee" and "thou" (correctly, I might add), because no radios
or televisions or newspapers for that matter have ever been
admitted to introduce them to modern usages. I don't think my
g'grandfather's library (mostly a religious library) had a
single book in it that had been published after 1890.

Bear

Pascal Bourguignon

2006-02-17, 7:57 am

Ray Dillinger <bear@sonic.net> writes:

Moreover, ñ and n are two distinct letters in Spanish, just as ll
is one letter, and l another letter. In French, you could argue that
e, é and è are the same letter because they're equivalent for the
lexical sort order. Not so in Spanish for ñ and n, or ll and l.
(ñandú comes after nutritivo, and llaga after litro).

So writing manana for mañana is equivalent to writing norning for morning.

--
__Pascal Bourguignon__ http://www.informatimago.com/
You never feed me.
Perhaps I'll sleep on your face.
That will sure show you.
Lauri Alanko

2006-02-17, 7:57 am

In article <43f5a1e4$0$58104$742ec2ed@news.sonic.net>,
Ray Dillinger <bear@sonic.net> wrote:
> Heh. Bar, bar, bar... to me, accents are a feature of individual
> words, not languages - and a feature of a particular *way* of writing
> those words at that, not a feature of spelling. I am that barbarian.


Here "barbarian" seems to be a particular way of writing "wrong".

Diacritical marks are used for quite a number of purposes: to indicate
stress, umlaut, nasalization, palatalization, length, tones, and
various other phonetic features. Usually when these features are shown
in the orthography of the language, it's because they are phonemically
significant, i.e. their absence or presence makes a _real difference_
about the meaning of a word.

Saying that diacritics are a matter of taste is kind of like saying
that it doesn't really matter whether you write "Do you like
orange-flavored gum?" or "Do you like orange-flavored cum?" since,
after all, the letter G is really just the letter C with a fancy
diacritical stroke that the Romans invented since they had this funny
voicing contrast that the Etruscans didn't...

>
> Yep. And in English we have


[Snip various polysemous words]

> There is plenty of room in our framework for words
> that are spelled the same and mean different things.


How is this relevant when the issue is words that are spelled
_differently_ and mean different things?

FWIW, I think it's a Really Bad Idea to use non-ASCII characters in
identifiers, but if it must be allowed, then case-insensitivity must
be done either Right (which is impossible) or not at all.


Lauri
Marcin 'Qrczak' Kowalczyk

2006-02-17, 7:57 am

Ray Dillinger <bear@sonic.net> writes:

> Heh. Bar, bar, bar... to me, accents are a feature of individual
> words, not languages - and a feature of a particular *way* of writing
> those words at that, not a feature of spelling. I am that barbarian.


Perhaps in French. In Polish ąćęłńóśźż are not marks added to
acelnoszz to disambiguate words, nor hints for pronunciation, but
letters on their own, with separate entries in lexicographic ordering.

--
__("< Marcin Kowalczyk
\__/ qrczak@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Gabriel Dos Reis

2006-02-17, 6:59 pm

Pascal Bourguignon <usenet@informatimago.com> writes:

| They aren't the same. Only a barbarian would think so.
|
| Some examples:
|
| "Les moines aiment les jeûnes !" The monks like the fastings.
| "Les moines aiment les jeunes !" The monks like the youngs.

some might think it is a barbarian language then :-p

-- Gaby
Matthias Blume

2006-02-17, 6:59 pm

bh@abbenay.CS.Berkeley.EDU (Brian Harvey) writes:

> Ray Dillinger <bear@sonic.net> writes:
>
> I'm not volunteering either. But case-insensitivity is so vital to me
> as a teacher that I would have to freeze our Scheme package with its
> current version if it changed. I am constantly presenting students with
> a modified version of code they've seen before, with the changed part in
> all capitals. If I couldn't do that, I'd feel as if I had both hands tied
> behind my back and had to operate the keyboard with my nose.


I don't buy it.

First of all, if the language is case-sensitive, you can write a
trivial pre-processor that turns your code with CAPITAL-AS-HIGHLIGHT
into normal code before having it processed by the compiler. Oh, you
think this would be ad-hoc? Well, how come you ask this ad-hoc
pre-processor be part of the language definition then?

Anyway, this is really a red herring. For presentation to students,
there are any number of other (and better!) ways for highlighting
differences: boldface, underscore, different fonts, color. And, the
same pre-processing trick could easily be made to work for the purpose
of feeding the code to a compiler.

> I concede that the needs of teachers may be different from the needs of
> programmers, and that the needs of ASCII-land teachers may be different
> from the needs of rest-of-the-world teachers. But I humbly beg for, at
> least, a backward compatibility mode at least until I retire. :-)


Write the above pre-processor. This should not take more than 2
minutes if you are on a Unix system. And bingo: backward-compatibility!

> P.S. The following is probably just curmudgeonliness, but even wearing my
> programmer hat I prefer case-insensitivity, because when reading other
> people's code I can never keep track of the difference between thisVariable
> and ThisVariable when people do such things.


I cannot relate to the problem you describe. "ThisVariable" and
"thisVariable" look totally different to me.

Matthias
Brian Harvey

2006-02-17, 7:00 pm

Matthias Blume <find@my.address.elsewhere> writes:
>First of all, if the language is case-sensitive, you can write a
>trivial pre-processor that turns your code with CAPITAL-AS-HIGHLIGHT
>into normal code before having it processed by the compiler. Oh, you
>think this would be ad-hoc? Well, how come you ask this ad-hoc
>pre-processor be part of the language definition then?


Sure, and then I have to tell the students to run this extra program in order
to get their Scheme code to work. It's hard enough getting them to run
Emacs! I don't need any more obstacles.

As for the ad-hocness of putting it in the language: This might be an argument
if we were designing a language from scratch. But it's already in the
language. This really does make a difference. I don't ask anyone to make
C or Java case-sensible; I just refrain from teaching with them.

>Anyway, this is really a red herring. For presentation to students,
>there are any number of other (and better!) ways for highlighting
>differences: boldface, underscore, different fonts, color. And, the
>same pre-processing trick could easily be made to work for the purpose
>of feeding the code to a compiler.


I don't know of any Scheme systems that accept Microsoft Word files, or
Open Office files, or even RTF files, as input. So, once again, you want
my students to have one more mysterious reason for their programs not to
work -- and, on top of that, a reason not to use God's Editor.
Jens Axel Søgaard

2006-02-17, 7:00 pm

Brian Harvey wrote:
> Ray Dillinger <bear@sonic.net> writes:
> I'm not volunteering either. But case-insensitivity is so vital to me
> as a teacher that I would have to freeze our Scheme package with its
> current version if it changed. I am constantly presenting students with
> a modified version of code they've seen before, with the changed part in
> all capitals. If I couldn't do that, I'd feel as if I had both hands tied
> behind my back and had to operate the keyboard with my nose.


From an implementation point I don't think it is an either-or. In order
to support r5rs-code, implementations need to keep some kind of
case-insensitive mode.

--
Jens Axel Søgaard
Alexander Schmolck

2006-02-17, 7:00 pm

Matthias Blume <find@my.address.elsewhere> writes:
I am no Brian Harvey,

however

>
> I don't buy it.
>
> First of all, if the language is case-sensitive, you can write a
> trivial pre-processor that turns your code with CAPITAL-AS-HIGHLIGHT
> into normal code before having it processed by the compiler. Oh, you
> think this would be ad-hoc? Well, how come you ask this ad-hoc
> pre-processor be part of the language definition then?


Maybe because that way it integrates seamlessly into the development
environment. The other advantage of course is that it reduces people's desire
to make up braindamaged conventions for using case to separate words, connote
certain semantics or introduce ad hoc namespaces.

> Anyway, this is really a red herring. For presentation to students,
> there are any number of other (and better!) ways for highlighting
> differences: boldface, underscore, different fonts, color.


In a better world in which not 99% of software could only reasonably
handle plain ascii text, yes. How many programming languages and tools do you
know that allow you to highlight anything in the above fashion and still treat
it as executable program code?

> And, the same pre-processing trick could easily be made to work for the
> purpose of feeding the code to a compiler.
>
>
> Write the above pre-processor. This should not take more than 2
> minutes if you are on a Unix system. And bingo: backward-compatibility!


Since it'll only take 2 minutes, can I see that preprocessor? I ask because I
wonder how you'd write something in 2 minutes that will leave column and line
numbers as well as comments, characters and strings intact (so that you still
get the same semantics and proper error reporting). The only way I can see to
achieve this (modulo sexp-comments) is to pipe through some nasty
perl regexp substitution (but you'd have to be pretty fluent in perl to write
it in 2 minutes).
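
For comparison, here is a sketch of such a preprocessor in Python 3 rather than perl. It is an assumption-laden illustration, not anyone's actual tool: the name `fold_case` is hypothetical, it folds only the ASCII letters A-Z so that every character keeps its line and column, and it does not handle `#|...|#` block comments, sexp-comments, or `|...|` quoted symbols:

```python
def fold_case(source: str) -> str:
    """Lowercase ASCII letters in Scheme source, leaving string literals,
    ';' line comments and #\\x character literals untouched."""
    out = []
    i, n = 0, len(source)
    while i < n:
        c = source[i]
        if c == '"':
            # String literal: copy through the closing quote, skipping
            # over backslash escapes such as \" inside the string.
            j = i + 1
            while j < n and source[j] != '"':
                j += 2 if source[j] == "\\" else 1
            j = min(j + 1, n)
            out.append(source[i:j])
            i = j
        elif c == ";":
            # Line comment: copy verbatim up to (not including) the newline.
            j = source.find("\n", i)
            j = n if j == -1 else j
            out.append(source[i:j])
            i = j
        elif source.startswith("#\\", i):
            # Character literal: keep the character itself, plus the rest
            # of a name such as "space" or "newline".
            j = min(i + 3, n)
            while j < n and source[j].isalpha():
                j += 1
            out.append(source[i:j])
            i = j
        else:
            # Ordinary code: fold A-Z only, one character in, one out.
            out.append(c.lower() if "A" <= c <= "Z" else c)
            i += 1
    return "".join(out)


example = '(DEFINE (Greet Name) ; Say HELLO\n  (DISPLAY "Hello, World") #\\A)'
print(fold_case(example))
```

Because each input character produces exactly one output character, error messages from the compiler still point at the right line and column of the original file.
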

>
> I cannot relate to the problem you describe. "ThisVariable" and
> "thisVariable" look totally different to me.


Do they sound different to you, too?

'as
William D Clinger

2006-02-17, 7:00 pm

Ray Dillinger wrote:
> William D Clinger wrote:
>
>
> Is it the case that if you're going to embed something
> in a device with an 8-bit (or 5-bit!) character set,
> language standards are out the window anyway and just
> need to be ignored? Because if schemata are *required*
> to implement unicode case conversions, then the language
> (or at least that part of it) becomes useless and
> non-embeddable in non-unicode environments.


You were talking about "under Unicode". Now you're talking
about "non-unicode environments". No wonder I'm confused.

> Otherwise (if case-insensitivity is preserved) then if I
> ever work with code that has non-ascii identifiers, I'm
> going to have to write a "code sanitizer" that smashes
> case deliberately in order to make all identifiers that
> are *logically* the same *look* the same.


LOL! I wish I had a tool like that for reading Usenet.

Will

Marcin 'Qrczak' Kowalczyk

2006-02-17, 7:00 pm

Alexander Schmolck <a.schmolck@gmail.com> writes:

> The other advantage of course is that it reduces people's desire to
> make up braindamaged conventions for using case to separate words,
> connote certain semantics or introduce ad hoc namespaces.


It's actually a braindamage to make this impossible.

--
__("< Marcin Kowalczyk
\__/ qrczak@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Matthias Blume

2006-02-17, 7:00 pm

bh@abbenay.CS.Berkeley.EDU (Brian Harvey) writes:

> I don't ask anyone to make C or Java case-sensible


They already are. It's Scheme that isn't. :-)
Ray Dillinger

2006-02-17, 7:00 pm

William D Clinger wrote:
> Ray Dillinger wrote:
> You were talking about "under Unicode". Now you're talking
> about "non-unicode environments". No wonder I'm confused.


I was talking about *removing* requirements that made unicode
hard or impossible, so that we *could* reasonably have scheme
versions that work in unicode environments. You brought up the
idea of requiring the difficult unicode case conversions,
which is *adding* a requirement that makes it unreasonable to
have scheme versions that run in any *other* kind of environment.
So I objected. I see nothing confusing in the above.

Bear

Joe Marshall

2006-02-17, 7:00 pm


William D Clinger wrote:
>
> If Scheme programmers actually prefer case-insensitivity, now
> would be a good time to find out. A reasonably objective survey
> on this would be more helpful than arguing about it. Is anyone
> willing to conduct such a survey?


People who want to argue about case-sensitivity should read Unicode
Standard Annex #31, Identifier and Pattern Syntax (
http://www.unicode.org/reports/tr31/ ).

It seems to me that it would be a huge mistake to *not* normalize and
fold characters in symbols in some way. Many code points and
code-point sequences in Unicode are difficult or impossible to
distinguish visibly. As an example, consider
<U+212B Angstrom Sign>
<U+00C5 Latin Capital Letter A with Ring Above>
<U+0041 Latin Capital Letter A><U+030A Combining Ring Above>

or

"Henry IV" vs. "Henry \u2163" (code point 2163 is Roman Numeral 4)

If no normalization is done, then symbols that apparently are the same
may actually not be EQ?. Furthermore, `grepping' for a symbol may fail
if the wrong code points are used.

Fortunately, Unicode has several normalization forms designed to deal
with these problems.
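
The Angstrom example can be checked with a short Python 3 sketch using the standard `unicodedata` module. The helper name `nfc` is just a local convenience:

```python
import unicodedata

angstrom_sign = "\u212b"           # U+212B ANGSTROM SIGN
a_with_ring = "\u00c5"             # U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
a_plus_combining_ring = "A\u030a"  # U+0041 followed by U+030A COMBINING RING ABOVE


def nfc(s: str) -> str:
    """Return the string in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)


spellings = [angstrom_sign, a_with_ring, a_plus_combining_ring]
print(len(set(spellings)))               # 3 distinct code-point sequences
print(len({nfc(s) for s in spellings}))  # 1 once each is normalized to NFC
```
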

It has been the tradition in Lisp and Scheme to allow any string to be
interned as a symbol even if that symbol could not be normally read.
Common Lisp and most Scheme implementations define escape sequences
that allow non-standard symbols to be read. I believe this is an
important feature, so I suggest that symbols be kept in Unicode
Normalization Form C (NFC). NFC is the standard used by the W3C
Character Model, too.

I suggest that the Scheme reader apply Unicode Normalization Form KC
(NFKC) before interning (unless the symbol is escaped). This will
eliminate most of the ambiguity between characters that appear similar
but have different code points.
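
A sketch of that reader policy in Python 3, again with `unicodedata`. The function `read_symbol_name` and its `escaped` flag are hypothetical stand-ins for whatever a real reader would do. The sketch assumes escaped symbols are still kept in NFC, as suggested above:

```python
import unicodedata


def read_symbol_name(token: str, escaped: bool = False) -> str:
    """Normalize a symbol's name before it is interned.

    Unescaped tokens get NFKC, which also folds compatibility characters;
    escaped tokens get only NFC, so unusual names can still be written.
    """
    form = "NFC" if escaped else "NFKC"
    return unicodedata.normalize(form, token)


plain = "Henry IV"
fancy = "Henry \u2163"   # U+2163 ROMAN NUMERAL FOUR

print(read_symbol_name(fancy) == plain)                # folded by NFKC
print(read_symbol_name(fancy, escaped=True) == fancy)  # preserved under NFC
```
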

As far as case-sensitivity goes, I've weighed in before, but my
argument is this:
1. Most everyone agrees that you should not use identifiers that
differ only in case.
2. If your symbols are case-sensitive, then you need to define a
policy about which case is used by standard libraries. (In fact, when
MzScheme switched to case-sensitive, the very first act was to define a
policy that all system code was to use lowercase.) If the purpose of
case-sensitivity is to allow for mixed case, such purpose is
immediately negated by the policy of forbidding it!
3. If case is used to distinguish symbols, it cannot be used for
anything else, like distinguishing student text from teacher text or
pattern text from match text.

Finally, the arguments above for folding characters based on
compatibility still apply to case-folding.

Pascal Costanza

2006-02-17, 7:00 pm

Pascal Bourguignon wrote:
> Ray Dillinger <bear@sonic.net> writes:
>
>
>
> The English & Latin alphabets.
>
> I don't know any other roman language with no accent.
>
> boîte -> BOîTE, or BOÎTE -> boÎte are unfortunate.
>
> So now you may want to extend it to ISO-8859-1, but then you'll hit ß.


If the ß is a reason for case insensitivity being problematic, then it's
not a problem. The convention to convert ß to SS when converting it to
upper case is only one convention. The other (not so common, but to my
knowledge perfectly acceptable) convention is to leave it as it is.

So instead of converting "Straße" to "STRASSE", you can as well convert
it to "STRAßE". Converting it back to lower case or mixed case is then
no problem...
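
The two conventions can be compared in a small Python 3 sketch. Python's own `str.upper` follows the "SS" convention; `upper_keep_eszett` is a hypothetical helper that implements the leave-it-alone convention:

```python
def upper_keep_eszett(s: str) -> str:
    """Uppercase a string but leave U+00DF (sharp s) as it is."""
    return "".join(c if c == "\u00df" else c.upper() for c in s)


word = "Stra\u00dfe"

# Convention 1: sharp s becomes SS, and lowercasing cannot restore it.
print(word.upper())
print(word.upper().lower() == word.lower())

# Convention 2: sharp s is kept, so the lowercase form is recovered.
print(upper_keep_eszett(word))
print(upper_keep_eszett(word).lower() == word.lower())
```
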


Pascal

--
My website: http://p-cos.net
Closer to MOP & ContextL:
http://common-lisp.net/project/closer/
Greg Buchholz

2006-02-17, 7:00 pm

Matthias Blume wrote:
> Anyway, this is really a red herring. For presentation to students,
> there are any number of other (and better!) ways for highlighting
> differences: boldface, underscore, different fonts, color. And, the
> same pre-processing trick could easily be made to work for the purpose
> of feeding the code to a compiler.


Just think of the fun you could have with HTML...

http://sleepingsquirrel.org/scheme/fun.html

....and use 'lynx -dump' to convert it back to ASCII to feed to your
scheme compiler.


http://lynx.isc.org/lynx2.8.5/lynx2...sers_guide.html

Marcin 'Qrczak' Kowalczyk

2006-02-17, 7:00 pm

"Joe Marshall" <eval.apply@gmail.com> writes:

> If no normalization is done, then symbols that apparently are the same
> may actually not be EQ?. Furthermore, `grepping' for a symbol may fail
> if the wrong code points are used.


It's not different for filenames (except on MacOS) or grep that you
mentioned.

The solution is to avoid using different representations in different
places, not to force all programs to apply various equivalence
relations. The latter is unrealistic: there are too many places
where strings are compared.

Folding makes sense for user-oriented searches, e.g. browsing text
files. In programming unambiguity and precision is more important.

DOS/Windows made a mistake of case-insensitive filenames. Since
filenames are stored and processed differently in different
environments (DOS CP on FAT in short filenames - it's not stored
anywhere which encoding is used; Windows CP in WinAPI; UTF-16 in long
filenames, on NTFS and in newer subsets of WinAPI), filenames created
on some versions of Windows with some locales are inaccessible in some
other versions with different locales, or only for some programs
depending on which API flavor is used.
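
To make the code-page point concrete, here is a small Python sketch. It assumes the DOS code page is CP437 and the Windows code page is CP1252, which are only two of the many possible combinations; the file name is invented.

```python
# The same file name encoded under two legacy code pages.
name = "für"

dos_bytes = name.encode("cp437")    # DOS code page (assumed CP437)
win_bytes = name.encode("cp1252")   # Windows code page (assumed CP1252)

# Reading bytes back under the other code page yields a different name.
win_read_as_dos = win_bytes.decode("cp437")
dos_read_as_win = dos_bytes.decode("cp1252", errors="replace")

print(dos_bytes, win_bytes, win_read_as_dos, dos_read_as_win)
```

Since nothing on disk records which code page produced the bytes, a program has no reliable way to tell which reading was intended.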

--
__("< Marcin Kowalczyk
\__/ qrczak@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Lauri Alanko

2006-02-17, 7:00 pm

In article <1140199496.225018.266160@g44g2000cwa.googlegroups.com>,
Joe Marshall <eval.apply@gmail.com> wrote:
> If no normalization is done, then symbols that apparently are the same
> may actually not be EQ?.


I think this issue is not really about symbols. Symbols should be eq?
whenever the corresponding strings are string=?, so the question is
really how strings should be compared "by default".

> It has been the tradition in Lisp and Scheme to allow any string to be
> interned as a symbol even if that symbol could not be normally read.


I have always found this tradition more than a bit dubious. To me,
symbols are at heart a simple datatype with an infinite number of
constants and a single operation: equality comparison. This is, I dare
surmise, the most common way of using them: as program-internal tags.
When symbols are produced only from quoted literals and the only thing
that is done with them is see if they are eq? to something, they can
be implemented simply as unique addresses. Even their names can be
optimized away.

However, since Scheme includes a reader, we must be able to map input
tokens to symbols, so a run-time intern table is needed. If we are
going to have an intern table anyway, it's natural to provide direct
access to it via string->symbol.

But to my mind the purpose of the intern table is to be able to look
up symbols such as might be used in quotations. Being able to intern
_arbitrary_ strings seems like a pretty peripheral feature, and not
something that a core language should need to support.
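
The intern table described above can be sketched in a few lines of Python. This is an illustration only: the `Symbol` class, `string_to_symbol`, and the optional `normalize` hook are hypothetical names, not any Scheme system's API.

```python
class Symbol:
    """A tag whose only interesting operation is identity comparison."""
    __slots__ = ("name",)

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"Symbol({self.name!r})"


_intern_table = {}  # maps a name to its unique Symbol


def string_to_symbol(name, normalize=None):
    """Return the unique Symbol for name, creating it on first use.

    normalize is a hypothetical hook applied at intern time, e.g. a
    case-folding or Unicode normalization function.
    """
    key = normalize(name) if normalize else name
    sym = _intern_table.get(key)
    if sym is None:
        sym = _intern_table[key] = Symbol(key)
    return sym


a = string_to_symbol("foo")
b = string_to_symbol("foo")
c = string_to_symbol("Foo")
d = string_to_symbol("Foo", normalize=str.lower)
print(a is b, a is c, a is d)
```

With no hook, equal strings give identical symbols and nothing else does; whatever function is passed as `normalize` decides which spellings count as the same symbol.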

So, I question whether we really need unicode symbols at all. If you
have some piece of textual data where unicode is really required,
chances are that it should be a string and not a symbol.


Lauri
Joe Marshall

2006-02-17, 7:00 pm


Marcin 'Qrczak' Kowalczyk wrote:
> "Joe Marshall" <eval.apply@gmail.com> writes:
>
>
> It's not different for filenames (except on MacOS) or grep that you
> mentioned.
>
> The solution is to avoid using different representations in different
> places, not to force all programs to apply various equivalence
> relations. The latter is unrealistic: there are too many places
> where strings are compared.


I was only suggesting that it be applied at read and intern time.

> Folding makes sense for user-oriented searches, e.g. browsing text
> files. In programming unambiguity and precision is more important.


It is my argument that using an identifier to refer to something *is* a
user-oriented search.

>
> DOS/Windows made a mistake of case-insensitive filenames. Since
> filenames are stored and processed differently in different
> environments (DOS CP on FAT in short filenames - it's not stored
> anywhere which encoding is used; Windows CP in WinAPI; UTF-16 in long
> filenames, on NTFS and in newer subsets of WinAPI), filenames created
> on some versions of Windows with some locales are inaccessible in some
> other versions with different locales, or only for some programs
> depending on which API flavor is used.


The encoding problems of DOS and Windows are irrelevant: the problem
there arises from not using Unicode everywhere.

William D Clinger

2006-02-17, 7:00 pm

Ray Dillinger wrote:
> I was talking about *removing* requirements that made unicode
> hard or impossible, so that we *could* reasonably have scheme
> versions that work in unicode environments. You brought up the
> idea of requiring the difficult unicode case conversions,
> which is *adding* a requirement that makes it unreasonable to
> have scheme versions that run in any *other* kind of environment.
> So I objected. I see nothing confusing in the above.


Why, then, do you assume that the requirement I mentioned
would make it "unreasonable to have scheme versions that
run in any *other* kind of environment"?

I was talking about the possibility that the R6RS might require
its case-conversion operations to conform to the Unicode spec.
Related to this is the possibility that the R6RS might require its
CHAR->INTEGER and INTEGER->CHAR operations to use
a Unicode mapping.

Either or both of those could be required without requiring
implementations of Scheme to support anything beyond the
ASCII subset of Unicode. IMO, neither of those requirements
would be unreasonable.

At least one of us is confused about this, so I think there must
have been *something* confusing about our exchange.

Will

Marcin 'Qrczak' Kowalczyk

2006-02-17, 7:00 pm

"Joe Marshall" <eval.apply@gmail.com> writes:

> It is my argument that using an identifier to refer to something
> *is* a user-oriented search.


I disagree. These rules should be simple and consistent.

Program sources are processed by various tools: editors, documentation
generators, debuggers. It's unrealistic to expect all of them to
support Unicode case mapping and normalization.

Unicode case mapping has localized variants for Turkish, Azeri, and
Lithuanian. Should Scheme use the locale-neutral variant? What if
the given tool is even more user-oriented and applies the localized
variant? Authors might not realize that there is a problem before
it's used in Turkey, Azerbaijan, or Lithuania.
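
A short Python sketch of what the locale-neutral mappings do with the Turkish letters. Python's built-in string methods have no locale-specific variant, so this shows only the neutral behaviour; the constant names are invented for the example.

```python
# Built-in str methods use Unicode's locale-neutral (default) case mappings.
DOTTED_CAPITAL_I = "\u0130"   # İ
DOTLESS_SMALL_I = "\u0131"    # ı

# A Turkish reader would expect "TITLE" to lowercase with dotless ı.
neutral_lower = "TITLE".lower()

# ı -> I -> i: the round trip does not return to the dotless letter.
round_trip = DOTLESS_SMALL_I.upper().lower()

# İ lowercases to two code points: "i" followed by a combining dot above.
dotted_lower = DOTTED_CAPITAL_I.lower()

print(neutral_lower, round_trip == DOTLESS_SMALL_I, len(dotted_lower))
```

A tool that applies the Turkish rules would instead map I to ı and İ to i, so two tools can disagree about whether two identifiers are the same.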

> The encoding problems of DOS and Windows are irrelevant:
> the problem there arises from not using Unicode everywhere.


It's not that simple.

http://www.emacswiki.org/cgi-bin/em...bonEmacsPackage

"Q: Although I set file-name-coding-system to utf-8, some characters are
displayed as white squares in dired-mode. Why?

A: Mac OS X's filesystem uses an unpopular version of UTF-8 (NFD;
Normalization Form D), which is slightly different from the popular
version of UTF-8 (NFC; Normalization Form C). At present, Emacs
does not support NFD. In NFD, diacritical marks (accents, diaresis,
cedille, tilde, etc.) are decomposed into two sequences; for example,
ü is decomposed into the character u and the diaresis ¨ combining
character. In a conforming Unicode implementation, these two would be
combined back to ü. But Emacs doesn't support that, yet. This is why
you will see "u" and an empty box in the dired-mode buffer of Carbon
Emacs."

http://developer.apple.com/technote...icodeSubtleties

"Note:
Mac OS versions 8.1 through 10.2.x used decompositions based on
Unicode 2.1. Mac OS X version 10.3 and later use decompositions based
on Unicode 3.2. Most of the characters whose decomposition changed
are not used by any Mac encoding, so they are unlikely to occur on
an HFS Plus volume. The MacGreek encoding had the largest number of
decomposition changes."

So, case mapping and canonical decomposition from which Unicode
version should Scheme use?

--
__("< Marcin Kowalczyk
\__/ qrczak@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Alexander Schmolck

2006-02-17, 7:00 pm

"Joe Marshall" <eval.apply@gmail.com> writes:

> 1. Most everyone agrees that you should not use identifiers that
> differ only in case.


I doubt there is such agreement. Even I (as someone who much appreciates in
lisps the relative absence of the gazillions of brain-damaged and mutually
incompatible naming conventions that plague most languages) think it's
perfectly fine to use identifiers that differ only in case, although I can
only think of 3 contexts in which that applies: interfacing, code is data and
formulas.

For the latter you need short identifiers in order to not obscure the overall
structure and restricting yourself to just 26 letters is too limiting. Apart
from that one of my main gripes with case-clutter (no (approximate) bijection
to speech) doesn't really apply here:

You can sensibly read (.* X x) as "dot-product of big-x and x".

> 2. If your symbols are case-sensitive, then you need to define a
> policy about which case is used by standard libraries. (In fact, when
> MzScheme switched to case-sensitive, the very first act was to define a
> policy that all system code was to use lowercase.) If the purpose of
> case-sensitivity is to allow for mixed case, such purpose is
> immediately negated by the policy of forbidding it!


Well the policy could equally well have been something similarly gut-wrenching
as Shriram Krishnamurthi adopted for his programming languages text:

(define-type DefrdSub
[mtSub]
[aSub (name symbol?) (value FAE-Value?) (ds DefrdSub?)])


> 3. If case is used to distinguish symbols, it cannot be used for
> anything else, like distinguishing student text from teacher text or
> pattern text from match text.


True. But disregarding current technological constraints, you'd presumably
want something more flexible than uppercasing words anyway at some point.

'as
Jens Axel Søgaard

2006-02-17, 7:00 pm

Alexander Schmolck wrote:

> Well the policy could equally well have been something similarly gut-wrenching
> as Shriram Krishnamurthi adopted for his programming languages text:
>
> (define-type DefrdSub
> [mtSub]
> [aSub (name symbol?) (value FAE-Value?) (ds DefrdSub?)])


Books have very narrow pages compared to normal screens :-)

--
Jens Axel Søgaard
Alexander Schmolck

2006-02-17, 7:00 pm

Lauri Alanko <la@iki.fi> writes:

> FWIW, I think it's a Really Bad Idea to use non-ASCII characters in
> identifiers


It's bad enough that at the beginning of the 21st century our conception of
programming largely boils down to pushing around monospaced characters in
text-files with 30 year old editors, so let's at least not insist on drawing
these characters from a set that is inadequate for just about any imaginable
task.

'as



Bradd W. Szonye

2006-02-17, 7:00 pm

Marcin 'Qrczak' Kowalczyk <qrczak@knm.org.pl> wrote:
> Alexander Schmolck <a.schmolck@gmail.com> writes:
>
> It's actually a braindamage to make this impossible.


That depends on context. In some contexts (like comments and manual
pages written in English), I'd prefer identifiers to follow ordinary
English orthography, where case is generally insignificant. In
particular, I don't want to deal with "Classes" looking like proper
nouns but "objects" looking like common nouns, even though you could
make a reasonable argument for it by analogy, because you're still
screwed at the start of a sentence (or when you translate to German).

However, in other contexts (like mathematical formulas), it's common to
use casing as part of the naming convention. Heck, when coding formulas,
it'd be handy to make /typeface/ significant in identifiers, so that
upper-case A, lower-case A, italic A, and bold A are all different
variables.
--
Bradd W. Szonye
http://www.szonye.com/bradd
Joe Marshall

2006-02-17, 7:00 pm


Marcin 'Qrczak' Kowalczyk wrote:
> "Joe Marshall" <eval.apply@gmail.com> writes:
>
>
> I disagree. These rules should be simple and consistent.
>
> Program sources are processed by various tools: editors, documentation
> generators, debuggers. It's unrealistic to expect all of them to
> support Unicode case mapping and normalization.
>
> Unicode case mapping has localized variants for Turkish, Azeri, and
> Lithuanian. Should Scheme use the locale-neutral variant? What if
> the given tool is even more user-oriented and applies the localized
> variant? Authors might not realize that there is a problem before
> it's used in Turkey, Azerbaijan, or Lithuania.
>
>
> It's not that simple.
>
> http://www.emacswiki.org/cgi-bin/em...bonEmacsPackage
>
> "Q: Although I set file-name-coding-system to utf-8, some characters are
> displayed as white squares in dired-mode. Why?
>
> A: Mac OS X's filesystem uses an unpopular version of UTF-8 (NFD;
> Normalization Form D), which is slightly different from the popular
> version of UTF-8 (NFC; Normalization Form C). At present, Emacs
> does not support NFD. In NFD, diacritical marks (accents, diaresis,
> cedille, tilde, etc.) are decomposed into two sequences; for example,
> ü is decomposed into the character u and the diaresis ¨ combining
> character. In a conforming Unicode implementation, these two would be
> combined back to ü. But Emacs doesn't support that, yet. This is why
> you will see "u" and an empty box in the dired-mode buffer of Carbon
> Emacs."
>
> http://developer.apple.com/technote...icodeSubtleties
>
> "Note:
> Mac OS versions 8.1 through 10.2.x used decompositions based on
> Unicode 2.1. Mac OS X version 10.3 and later use decompositions based
> on Unicode 3.2. Most of the characters whose decomposition changed
> are not used by any Mac encoding, so they are unlikely to occur on
> an HFS Plus volume. The MacGreek encoding had the largest number of
> decomposition changes."
>
> So, case mapping and canonical decomposition from which Unicode
> version should Scheme use?


The alternative is to *not* specify a canonical form for Unicode
identifiers. Then we'd find that 'ü and 'ü were not EQ because one
symbol used the decomposed character and the other didn't. It was
foolish for the Mac to choose the unpopular canonical form (and I
suggested the popular one on purpose). It would be more foolish to not
canonicalize.

I suggested the NFKC canonicalization and I suggest the latest version
of Unicode.
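
The composed/decomposed mismatch, and what NFKC does about it, can be sketched with Python's standard `unicodedata` module. The helper `canonical_identifier` is a hypothetical name for the read-time step proposed here, and the result depends on the Unicode version bundled with the Python in use.

```python
import unicodedata

composed = "\u00fc"      # ü as a single code point
decomposed = "u\u0308"   # u followed by a combining diaeresis

# Without normalization the two spellings are different strings.
raw_equal = composed == decomposed


def canonical_identifier(name):
    """Hypothetical read-time step: normalize an identifier to NFKC."""
    return unicodedata.normalize("NFKC", name)


nfkc_equal = canonical_identifier(composed) == canonical_identifier(decomposed)

# NFKC also folds compatibility characters, e.g. the "fi" ligature U+FB01.
ligature_folded = canonical_identifier("\ufb01le")

print(raw_equal, nfkc_equal, ligature_folded)
```

NFD, the form attributed to the Mac filesystem earlier in the thread, would go the other way and turn the single code point into the two-code-point sequence.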

Bradd W. Szonye

2006-02-17, 7:00 pm

Marcin 'Qrczak' Kowalczyk <qrczak@knm.org.pl> wrote:
> "Joe Marshall" <eval.apply@gmail.com> writes:
>
> I disagree.


But Joe's argument is almost trivially proven, because identifiers exist
solely for the programmer's benefit. From the machine's point of view,
unique integers are entirely sufficient, but they'd be disastrous for
programmers.

> These rules should be simple and consistent.


Joe is proposing a simple and consistent rule: Different identifiers
must be distinguishable by programmers. The problem is that
/programmers/ are not simple and consistent, which means that you can't
have a simple algorithm without either (1) making some distinct
identifiers indistinguishable or (2) making insignificant glyphic
differences significant.

> Program sources are processed by various tools: editors, documentation
> generators, debuggers. It's unrealistic to expect all of them to
> support Unicode case mapping and normalization.


On the contrary, I think processing Unicode text with Unicode-ignorant
tools is a recipe for disaster. The toolchain must support the encoding
or there's no point in using them together. To put it more bluntly, it's
nuts to make programmers accommodate the tools instead of the other way
around. Doing it the wrong way around always causes trouble, and good
designers eventually regret it (e.g., Ken Thompson's remarks about
"creat").

And it's not like it's that difficult for tools to support Unicode
properly, assuming a system big enough that Unicode is a reasonable
option in the first place. Nor is it a new standard. There's no good
excuse for part of a toolchain to get it wrong. Of course, that doesn't
change the fact that some tools get it wrong anyway. But do you really
want to plan around that kind of limitation if you can avoid it? It's
like the C language catering to linkers that could only handle
6-character identifiers. Maybe the designers felt that they /had/ to
accommodate the crufty old tools, but in the end it was still painful
for programmers.

Unfortunately, Unicode doesn't solve all of the problems:

> Unicode case mapping has localized variants for Turkish, Azeri, and
> Lithuanian. Should Scheme use the locale-neutral variant? What if the
> given tool is even more user-oriented and applies the localized
> variant? Authors might not realize that there is a problem before it's
> used in Turkey, Azerbaijan, or Lithuania.


This is a good example of "programmers are not simple and consistent."
Either way, somebody's going to get screwed by this. One group of
programmers gets identifiers that they can't easily distinguish by
sight. No algorithm can solve this, because the requirements vary from
programmer to programmer.
--
Bradd W. Szonye
http://www.szonye.com/bradd
Anton van Straaten

2006-02-17, 9:56 pm

Lauri Alanko wrote:
> So, I question whether we really need unicode symbols at all. If you
> have some piece of textual data where unicode is really required,
> chances are that it should be a string and not a symbol.


XML supports Unicode tags. SXML embeds XML in Scheme. If Scheme is to
fully support such embedded languages, it needs to support Unicode symbols.

Anton
Ray Dillinger

2006-02-17, 9:56 pm

Pascal Bourguignon wrote:
> Ray Dillinger <bear@sonic.net> writes:
>
>
>
> Moreover, ñ and n are two distinct letters in Spanish, as well as ll
> is one letter, and l another letter. In French, you could argue that
> e, and are the same letter because they're equivalent for the
> lexical sort order. Not so in Spanish for ñ and n, or ll and l.
> (and comes after nutritivo, and llaga after litro).
>
> So writing manana for mañana is equivalent to write norning for morning.
>



Before you get all upset, Pascal, please remember that I'm
making no defense of this failing of mine. It's a failing,
pure and simple. A product of my background that I haven't
overcome as yet. I wasn't saying that it's the "right" way
to read these languages; only that this is how reading those
languages, subjectively, works for me. Learning to be
sensitive to diacriticals and regard them as distinguishing,
will be more difficult to me than distinguishing identifiers
by case.

I suspect that a lot of people have similar failings,
depending on the treatment of diacriticals they've learned
as children. It hasn't prevented me from learning
enough French to be able to read for enjoyment or decipher
tech manuals, although I wouldn't presume to attempt to write
or speak French except in emergencies, because I know I'd get
the grammar and accents wrong enough to embarrass myself
and annoy Francophones everywhere.

And I'm reporting to you as a bare fact that when English
swiped the word "mañana" from Spanish, the diacritical became
optional in the English word that was thereby formed (which
is, incidentally, a different word with a slightly different
meaning anyway).

The same thing happens to *any* word that English swipes
from another language, for that matter. Look around for these
stolen words and when you find them in English texts you'll
mostly find them without the accents that were part of the
spelling of the other-language roots that the English words
are based on. We learn foreign languages already knowing
slightly-skewed versions of half or more of their vocabulary,
but the vocabulary we know is all either unaccented or
optionally accented, depending on how long the word has been
part of English.

Bear

Brian Harvey

2006-02-18, 4:01 am

How many people participating in this thread have ever actually tried to read
a computer program written by a native speaker of a different natural
language? Over in comp.lang.logo we have a very active software developer
whose native language is Spanish, and his procedure names are in Spanish.

Let me tell you, orthography (case, accents, whatever) is the least of the
difficulty I have in reading his code! *I don't speak the language!*
Someone in this thread proposed using numeric identifiers as a straw-man
counterexample to something -- well, for me, this guy might as well be using
numeric identifiers, for all the mnemonic value his names have. *He* can
read *my* code, because he speaks English.

So I don't see much hope, or any need, for a general resolution of folding
issues. It seems to me that what's needed is a mechanism for local plugin
of a folding module written by and for native speakers of each language.
That's the general principle that designers should design around.
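
One way to picture such a plugin mechanism is the Python sketch below. It is only an illustration of the idea: the registry, `register_folder`, `fold_identifier`, and the two sample folding functions are all hypothetical, and the Turkish rule is deliberately simplified.

```python
# Registry of per-language folding functions (all names are hypothetical).
FOLDERS = {}


def register_folder(lang, fold):
    """Install the folding function that speakers of lang have supplied."""
    FOLDERS[lang] = fold


def fold_identifier(name, lang):
    """Fold an identifier with the module registered for lang."""
    return FOLDERS[lang](name)


def fold_english(name):
    # Plain lowercasing is enough for ASCII identifiers.
    return name.lower()


def fold_turkish(name):
    # Simplified Turkish rule: I -> dotless ı, İ -> i, then lowercase the rest.
    return name.replace("I", "\u0131").replace("\u0130", "i").lower()


register_folder("en", fold_english)
register_folder("tr", fold_turkish)

print(fold_identifier("TITLE", "en"), fold_identifier("TITLE", "tr"))
```

The same source text folds to different identifiers depending on which module is installed, which is the price of letting each language community define its own rule.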

And in *that* context I feel perfectly entitled to ask for case folding
in North American English, without feeling guiltily cultural imperialist.

Now for the part that is, ly, cultural imperialist: For the foreseeable
future, internationally cooperative programming is going to be done in
English, I predict. So, as a pragmatic matter, it is probably more
important for the Chinese dialect of Scheme (or whatever) to be able to do
English folding correctly than for the American dialect to be able to do
Chinese folding correctly. I'm not saying this is how the world should be.

P.S. Oh well, if George Bush has his way, pretty soon it'll be illegal to
speak anything but English (makes things too hard for the NSA) anywhere on
Earth, and then we can go back to ASCII. A silver lining in every cloud.
Pascal Bourguignon

2006-02-18, 4:01 am

bh@abbenay.CS.Berkeley.EDU (Brian Harvey) writes:

> How many people participating in this thread have ever actually tried to read
> a computer program written by a native speaker of a different natural
> language? Over in comp.lang.logo we have a very active software developer
> whose native language is Spanish, and his procedure names are in Spanish.
>
> Let me tell you, orthography (case, accents, whatever) is the least of the
> difficulty I have in reading his code! *I don't speak the language!*
> Someone in this thread proposed using numeric identifiers as a straw-man
> counterexample to something -- well, for me, this guy might as well be using
> numeric identifiers, for all the mnemonic value his names have. *He* can
> read *my* code, because he speaks English.


But I don't understand. If English is 40% Latin, and 20% French which
is itself 70% Latin, and if Spanish is itself 70% Latin too, why don't
you understand Spanish words?



> So I don't see much hope, or any need, for a general resolution of folding
> issues. It seems to me that what's needed is a mechanism for local plugin
> of a folding module written by and for native speakers of each language.
> That's the general principle that designers should design around.


Well, perhaps for lisp there's a little hope in that, but some have
done localized basics or localized pascals and I assure you, it's not
pretty.


> And in *that* context I feel perfectly entitled to ask for case folding
> in North American English, without feeling guiltily cultural imperialist.
>
> Now for the part that is, ly, cultural imperialist: For the foreseeable
> future, internationally cooperative programming is going to be done in
> English, I predict. So, as a pragmatic matter, it is probably more
> important for the Chinese dialect of Scheme (or whatever) to be able to do
> English folding correctly than for the American dialect to be able to do
> Chinese folding correctly. I'm not saying this is how the world should be.
>
> P.S. Oh well, if George Bush has his way, pretty soon it'll be illegal to
> speak anything but English (makes things too hard for the NSA) anywhere on
> Earth, and then we can go back to ASCII. A silver lining in every cloud.


:-)


--
__Pascal Bourguignon__ http://www.informatimago.com/

PUBLIC NOTICE AS REQUIRED BY LAW: Any use of this product, in any
manner whatsoever, will increase the amount of disorder in the
universe. Although no liability is implied herein, the consumer is
warned that this process will ultimately lead to the heat death of
the universe.
Nils M Holm

2006-02-18, 4:01 am

Alexander Schmolck <a.schmolck@gmail.com> wrote:
> It's bad enough that at the beginning of the 21st century our conception of
> programming largely boils down to pushing around monospaced characters in
> text-files with 30 year old editors, so let's at least not insist on drawing
> these characters from a set that is inadequate for just about any imaginable
> task.


I think we "push around mono-spaced characters in text-files" because
this method is a local optimum. If it was not, better methods would have
taken over.

Using ASCII exclusively in program text[1] is a very, very good idea,
too, because it allows people to swap code over national boundaries and
language boundaries. Imagine you wanted to read a Scheme program written
by someone writing kanji. Even if you could read kanji, now try editing
the program with a US or European keyboard.

[1] Excluding, of course, the values of strings and chars.

--
Nils M Holm <n m h @ t 3 x . o r g> -- http://www.t3x.org/nmh/
Nils M Holm

2006-02-18, 4:01 am

Brian Harvey <bh@abbenay.cs.berkeley.edu> wrote:
> Now for the part that is, ly, cultural imperialist: For the foreseeable
> future, internationally cooperative programming is going to be done in
> English, I predict. [...]


Even if English is not my first language: I really hope so. Having
such a quasi standard simplifies things a lot.

> P.S. Oh well, if George Bush has his way, pretty soon it'll be illegal to
> speak anything but English (makes things too hard for the NSA) anywhere on
> Earth, and then we can go back to ASCII. A silver lining in every cloud.


This paragraph just made it to my fortune file.

--
Nils M Holm <n m h @ t 3 x . o r g> -- http://www.t3x.org/nmh/
Ulrich Hobelmann

2006-02-18, 7:56 am

Nils M Holm wrote:
> I think we "push around mono-spaced characters in text-files" because
> this method is a local optimum. If it was not, better methods would have
> taken over.


On the non-file side there's Smalltalk/Squeak, and on the
not-too-much-character-oriented side there's Paredit mode for Emacs.

Of course identifiers, which are the most important element of
programming languages, still need character input.

> Using ASCII exclusively in program text[1] is a very, very good idea,
> too, because it allows people to swap code over national boundaries and
> language boundaries. Imagine you wanted to read a Scheme program written
> by someone writing kanji. Even if you could read kanji, now try editing
> the program with a US or European keyboard.


ASCII strongly encourages English, which I think is good. OTOH many
non-Anglosaxons often use the wrong English words, because they don't
know any better. Maybe the future for Anglosaxons in the worldwide
economy is to be word-use-counselors for the software architects? ;)

As to keyboards, I'm not sure you need one. Koreans type
kind-of-letters, and have a software method to pop up Chinese characters
matching those spelled-out words (i.e. the Chinese characters are
software-only). I have Dvorak kb-layout (basically an American keyboard
with most keys switched), but with my Mac's Alt-key I can type lots of
other letters (European accents, Umlauts, ). X Window has a compose
key for the same purpose (and you can set that key yourself, if you're
not working at a Sun). I could probably also call up some kind of
on-screen-menu of Japanese characters, or switch my kb-layout to allow
me to type them.

The question is: with software going international, and components and
libraries being trade commodities, does it make sense to use another
language than English? The other question is: should languages still be
flexible enough to allow non-English words as identifiers for those
people who don't want/need International?

> [1] Excluding, of course, the values of strings and chars.


--
Suffering from Gates-induced brain leakage...
Ulrich Hobelmann

2006-02-18, 7:56 am

Bradd W. Szonye wrote:
> Therefore, it's not too surprising for diacritical marks to seem
> "invisible" to a native English speaker. We just don't rely on them. For
> a similar example, I wouldn't expect a native German speaker to
> intuitively understand the importance of L vs LL in Spanish and Welsh.


Germans, like Americans, increasingly pronounce Vanilla as "vanilla"
instead of "vanillya", and Mallorca as "mallorca" instead of "mayorca".
So you make sense.

> To the untrained eye, it just looks like a minor spelling variant rather
> than a wholly different sound and meaning.


Exactly.

> To put it another way, each language has allographs, orthographic
> analogs to allophones. Just as English speakers perceive "tin" and "cat"
> to have the same T sound, we perceive "TIN" and "tin" to have the same
> letter T. I dislike case-oriented naming conventions for that reason:
> "Foo" and "foo" are too similar, because the two Fs are allographs (in
> English -- I'd expect it to cause less trouble in German, for example).


Well, in those languages where Foo and foo are different, it's quite
German: capitalized words often name classes, while non-caps identify
method names or variables.

--
Suffering from Gates-induced brain leakage...
Ulrich Hobelmann

2006-02-18, 7:56 am

Brian Harvey wrote:
> P.S. Oh well, if George Bush has his way, pretty soon it'll be illegal to
> speak anything but English (makes things too hard for the NSA) anywhere on
> Earth, and then we can go back to ASCII. A silver lining in every cloud.


And then we could compress text files from 8bit down to 7bit. Imagine
the amount of storage saved ;)

We could even build 28bit-CPUs for text processing.

--
Suffering from Gates-induced brain leakage...
Lauri Alanko

2006-02-18, 7:56 am

In article <3uuJf.15309$NS6.7381@newssvr30.news.prodigy.com>,
Anton van Straaten <anton@appsolutions.com> wrote:
> XML supports Unicode tags. SXML embeds XML in Scheme. If Scheme is to
> fully support such embedded languages, it needs to support Unicode symbols.


Bummer.

Well, it wouldn't be a big deal to extend SXML to allow element names
to be specified with strings, too.

Incidentally, the XML spec says this about what it means for element
names to match:

Two strings or names being compared MUST be identical.
Characters with multiple possible representations in ISO/IEC
10646 (e.g. characters with both precomposed and
base+diacritic forms) match only if they have the same
representation in both strings. No case folding is performed.

If I read this correctly, this means that it is in principle possible
for an XML document to have two distinct element names that read the
same and would have the same normalized form, but are nevertheless
considered distinct since they are represented with different code
point sequences.

To make all distinct XML names representable with distinct symbols,
symbols would just be interned code point sequences with no
normalization whatsoever. That seems rather unwise.

XML is a mess. Scheme shouldn't be made a mess, too, just to be
compatible.


Lauri
Andreas Eder

2006-02-18, 7:56 am

Hi Pascal,

Pascal> If the ß is a reason for case insensitivity being problematic, then
Pascal> it's not a problem. The convention to convert to SS when converting
Pascal> it to upper case is only one convention. The other (not so common, but
Pascal> to my knowledge perfectly acceptable) convention is to leave it as it
Pascal> is.

Pascal> So instead of converting "Straße" to "STRASSE", you can as well
Pascal> convert it to "STRAßE". Converting it back to lower case or mixed case
Pascal> is then no problem...

Unfortunately I don't have my Duden at hand, but as a native speaker, I
don't think it is correct, and - by the way - never have seen it written
that way.

Andreas
--
Wherever I lay my .emacs, there's my $HOME.
Pascal Costanza

2006-02-18, 7:56 am

Andreas Eder wrote:
> Hi Pascal,
>
>
>
>
> Pascal> If the ß is a reason for case insensitivity being problematic, then
> Pascal> it's not a problem. The convention to convert to SS when converting
> Pascal> it to upper case is only one convention. The other (not so common, but
> Pascal> to my knowledge perfectly acceptable) convention is to leave it as it
> Pascal> is.
>
> Pascal> So instead of converting "Straße" to "STRASSE", you can as well
> Pascal> convert it to "STRAßE". Converting it back to lower case or mixed case
> Pascal> is then no problem...
>
> Unfortunately I don't have my Duden at hand, but as a native speaker, I
> don't think it is correct, and - by the way - never have seen it written
> that way.


OK - I swear I have seen it before but apparently, it's wrong:
http://www.canoo.net/services/Germa...t/pgf25-26.html

I still think that solution would be simpler...
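
For what it's worth, the round-trip problem with the SS convention can be sketched in a few lines of Python (used here only for illustration; the variable names are made up). Python's built-in string methods follow the Unicode case mappings, which map ß to SS on upcasing.

```python
word = "Straße"

# Upcasing follows the SS convention: one character becomes two.
upper = word.upper()

# Downcasing the result cannot know that "SS" used to be "ß".
back = upper.lower()

# The round trip does not restore the original lowercase spelling.
round_trip_ok = back == word.lower()

# Case folding sidesteps the question by mapping both spellings to "strasse".
folded_equal = word.casefold() == upper.casefold()

print(upper, back, round_trip_ok, folded_equal)  # STRASSE strasse False True
```

Leaving the ß alone when upcasing would make the mapping reversible, which is the simplicity argument, whatever the orthography rules say.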


Pascal

--
My website: http://p-cos.net
Closer to MOP & ContextL:
http://common-lisp.net/project/closer/
Pascal Bourguignon

2006-02-18, 6:58 pm

Ulrich Hobelmann <u.hobelmann@web.de> writes:
> The question is: with software going international, and components and
> libraries being trade commodities, does it make sense to use another
> language than English? The other question is: should languages still
> be flexible enough to allow non-English words as identifiers for those
> people who don't want/need International?


IMO the majority of programs are still written in other languages than
English, both for comments and identifiers, and with zero provision
for localization. Merely by the fact that the majority of the
programs are proprietary and mandated and owned by corporations that
are not international. (Not counting that their domain may also be
specific to the country and language, like software implemented
to match local laws).

--
__Pascal Bourguignon__ http://www.informatimago.com/

There is no worse tyranny than to force a man to pay for what he does not
want merely because you think it would be good for him. -- Robert Heinlein
Brian Harvey

2006-02-18, 6:58 pm

Pascal Bourguignon <usenet@informatimago.com> writes:
>There is no worse tyranny than to force a man to pay for what he does not
>want merely because you think it would be good for him. -- Robert Heinlein


.... such as, for example, case-sensitive identifiers!
Glad you've come around to my side on this. :-)
Ray Dillinger

2006-02-18, 6:58 pm

Brian Harvey wrote:
> How many people participating in this thread have ever actually tried to read
> a computer program written by a native speaker of a different natural
> language? Over in comp.lang.logo we have a very active software developer
> whose native language is Spanish, and his procedure names are in Spanish.
>


One of the most involved jobs I've had (and some of the most involved
code) was with a company where the other "resident mad scientist" was
Swiss. A fair amount of his code - especially where it got complex or
wizardly - was written using Swiss-german identifiers and comments.

It was -- challenging. I could distinguish identifiers easily, but
to me they had no mnemonic value, and I had to analyze everything very
closely to figure out what each was used for. The most immediate
analogue I had for the experience was reading debugger output. It did
help my comprehension a lot that he used a very rigid design discipline,
even where the stuff was getting fiercely complex.

I'm not going to claim that the design discipline carried to his
extremes helped him program and understand his own code, but I'd recognize
things and say "what would a bondage-and-discipline programmer use this
thing for?" and that was usually what was going on.

Nobody else at the company would even attempt to read any of the code
he'd done in Swiss German; anything that "mere mortals" would comprehend
he'd do in English anyway. A lot of people considered it to be "job
security code" and resented him for it, but given the complexity of the
tasks, I would say that except for the human-language barrier it was
remarkably clear.

Bear

Matthias Blume

2006-02-19, 3:56 am

Nils M Holm <before-2006-03-01@online.de> writes:

> [ ...] Imagine you wanted to read a Scheme program written
> by someone writing kanji. Even if you could read kanji, now try editing
> the program with a US or European keyboard.


Actually, if you can read kanji, then this (editing them using a US
keyboard) is the easy part.
Matthias Blume

2006-02-19, 3:56 am

Ulrich Hobelmann <u.hobelmann@web.de> writes:

> [ ... ] I could probably also
> call up some kind of on-screen-menu of Japanese characters, or switch
> my kb-layout to allow me to type them.


Since you say you are using a Mac, just turn on the Kotoeri input
method. You can tell Kotoeri to use the Dvorak layout. (My Mac is
configured that way.)
Nils M Holm

2006-02-19, 3:56 am

Matthias Blume <find@my.address.elsewhere> wrote:
> Actually, if you can read kanji, then this (editing them using a US
> keyboard) is the easy part.


I would be really interested in hearing how this works.

--
Nils M Holm <n m h @ t 3 x . o r g> -- http://www.t3x.org/nmh/
Ulrich Hobelmann

2006-02-19, 3:56 am

Nils M Holm wrote:
> Matthias Blume <find@my.address.elsewhere> wrote:
>
> I would be really interested in hearing how this works.


I just fired up the Kotoeri on my Mac as Matthias suggested. Say, in
Hiragana mode (the main Japanese syllabic script) you can type syllables
using the standard ASCII transcription (so "hon" yields ほん. If you
type space or down after those three letters, you get a pop-up with
several Kanji, so you can choose for instance 本; IIRC that means "book"
(I haven't done any Japanese in more than ten years)). AFAIK Windows
machines, or at least standard Asian word processors on Windows have
used this input method for at least a decade. Not sure how the Chinese
do it, though. My Mac lists several methods of which only "Pinyin" (a
standard transcription for Traditional Chinese) rings a bell.
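
Roughly, the two steps (typed syllables to Hiragana, then Hiragana to a Kanji candidate list) can be sketched like this in Python. This is only a toy illustration and not how Kotoeri is actually implemented: the tables `ROMAJI_TO_KANA` and `KANJI_CANDIDATES` and the function `to_hiragana` are hypothetical, and a real input method has far larger dictionaries.

```python
# Hypothetical, tiny romaji-to-hiragana table (a real one has many more entries).
ROMAJI_TO_KANA = {"ho": "ほ", "n": "ん", "ni": "に"}

# Hypothetical candidate lists shown in the pop-up after pressing space.
KANJI_CANDIDATES = {"ほん": ["本"], "にほん": ["日本"]}


def to_hiragana(romaji: str) -> str:
    """Convert ASCII syllables to hiragana using greedy longest match."""
    out = []
    i = 0
    while i < len(romaji):
        # Try the longest chunk first so "ni" wins over a bare "n".
        for length in (3, 2, 1):
            chunk = romaji[i:i + length]
            if chunk in ROMAJI_TO_KANA:
                out.append(ROMAJI_TO_KANA[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no kana for input at position {i}: {romaji[i:]!r}")
    return "".join(out)


kana = to_hiragana("hon")
candidates = KANJI_CANDIDATES.get(kana, [])
print(kana, candidates)  # ほん ['本']
```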

Interesting is that the input menu lists Chinese and Japanese as not
outputting Unicode, but Japanese and Chinese encodings. Not sure what
that means for multilanguage code ;)

--
Suffering from Gates-induced brain leakage...
Nils M Holm

2006-02-19, 7:56 am

Ulrich Hobelmann <u.hobelmann@web.de> wrote:
> Nils M Holm wrote:
>
> I just fired up the Kotoeri on my Mac as Matthias suggested. Say, in
> Hiragana mode (the main Japanese syllabic script) you can type syllables
> using the standard ASCII transcription [...]


But Matthias said that editing kanji using a US keyboard was the
/easy/ part. I do not own a Mac. What if I want to edit source
code containing kanji with vi or emacs or (heaven forbid!) notepad?

--
Nils M Holm <n m h @ t 3 x . o r g> -- http://www.t3x.org/nmh/
Ulrich Hobelmann

2006-02-19, 7:56 am

Nils M Holm wrote:
> Ulrich Hobelmann <u.hobelmann@web.de> wrote:
>
> But Matthias said that editing kanji using a US keyboard was the
> /easy/ part. I do not own a Mac. What if I want to edit source
> code containing kanji with vi or emacs or (heaven forbid!) notepad?


If those apps use the normal input method... The Terminal emulator on
my Mac simply passes whatever keycodes I create, through. So in vi I
can probably type Japanese if I want to. Aquamacs Emacs - sure (it pops
up an extra editing bar, because it seems like Aquamacs uses its own
editor view; when I press enter in the editing bar, whatever is in it is
inserted into the Emacs buffer)!

I don't know what MS did with Notepad, as it could (not too long ago at
least) not edit files larger than a few megabytes.

If MS did things right, all their apps can use those input methods. I
suppose the big X11 environments could do things similarly. Maybe
xterm+vi don't work, but gnome-terminal or gedit might.

--
Suffering from Gates-induced brain leakage...
Aaron Hsu

2006-02-19, 7:56 am

Ulrich Hobelmann <u.hobelmann@web.de> writes:

> Nils M Holm wrote:

[...]
> I don't know what MS did with Notepad, as it could (not too long ago
> at least) not edit files larger than a few megabytes.
>
> If MS did things right, all their apps can use those input methods. I
> suppose the big X11 environments could do things similarly. Maybe
> xterm+vi don't work, but gnome-terminal or gedit might.


I cannot speak for all of MS's programs. But I know that Notepad
supports the UTF style of inputting characters. On my Windows XP
machine, the way it works (with a Chinese IME), is that the characters
are determined through some means, and then passed as a Unicode value
to the editor. On editors like Boxer, this doesn't work because they
don't support Unicode (though Windows has another option which allows
for passing such values to such programs as do not support UTF, IIRC),
but on a program like Notepad, it works just fine.

One thing that you can run into is if you end up with an editor not
supporting UTF, then UTF doesn't work, and the other options are (for
Chinese) things like GB and whatnot. The problem with ASCII editors
and those charsets is that you often have to delete twice to remove a
single character.
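
The "delete twice" effect can be sketched in Python (used only for illustration; the variable names are made up). In a legacy double-byte charset such as GB2312 a Chinese character occupies two bytes, so an editor that treats every byte as one character removes only half of it per keystroke.

```python
ch = "中"                      # a single Chinese character

gb = ch.encode("gb2312")       # legacy double-byte encoding
utf8 = ch.encode("utf-8")      # UTF-8 needs three bytes for this character

print(len(ch), len(gb), len(utf8))  # 1 2 3

# A byte-oriented editor deleting "one character" removes one byte,
# leaving half of a double-byte sequence behind.
after_one_delete = gb[:-1]
try:
    after_one_delete.decode("gb2312")
    half_is_valid = True
except UnicodeDecodeError:
    half_is_valid = False

# Only the second delete removes the character completely.
after_two_deletes = gb[:-2]
print(half_is_valid, after_two_deletes)  # False b''
```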

--
Aaron Hsu <spam@sacrificumdeo.net> Jabber: arcfide@xmpp.us
<http://www.sacrificumdeo.net> "Extend beyond the Mortal . . . ."
"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." - Benjamin Franklin
Aaron Hsu

2006-02-19, 7:56 am

Ulrich Hobelmann <u.hobelmann@web.de> writes:

> Nils M Holm wrote:
>
> I just fired up the Kotoeri on my Mac as Matthias suggested. Say, in
> Hiragana mode (the main Japanese syllabic script) you can type
> syllables using the standard ASCII transcription (so "hon" yields
> ほん. If you type space or down after those three letters, you get a
> pop-up with several Kanji, so you can choose for instance
> 本; IIRC that means "book" (I haven't done any Japanese in
> more than ten years)). AFAIK Windows machines, or at least standard
> Asian word processors on Windows have used this input method for at
> least a decade. Not sure how the Chinese do it, though. My Mac lists
> several methods of which only "Pinyin" (a standard transcription for
> Traditional Chinese) rings a bell.
>
> Interesting is that the input menu lists Chinese and Japanese as not
> outputting Unicode, but Japanese and Chinese encodings. Not sure what
> that means for multilanguage code ;)


On Windows machines (and Macs, IIRC), you use IME's for Chinese, one
of many. In my case, and I think most non-Chinese occasional Chinese
typists, the Pinyin IME is the one to use. And I believe that both
Windows XP and Mac send the characters to the editor using UTF unless
specifically told to use something else. There are other third-party
IME's which send the characters in the other character sets.

For some people the Chinese IMEs they prefer are not the ones which map
1-1 to a QWERTY US keyboard, but the ones that actually have a key
representing things such as shape of the character or particular (what
are they called now?) trigraphs (??).

--
Aaron Hsu <spam@sacrificumdeo.net> Jabber: arcfide@xmpp.us
<http://www.sacrificumdeo.net> "Extend beyond the Mortal . . . ."
"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." - Benjamin Franklin
Nils M Holm

2006-02-19, 7:56 am

Ulrich Hobelmann <u.hobelmann@web.de> wrote:
> If those apps use the normal input method... The Terminal emulator on
> my Mac simply passes whatever keycodes I create, through. So in vi I


I do not say that you cannot pass key codes to programs. However, you
have to create these key codes in some way. For example, I create an
'a' by pressing the key labelled 'a' on my keyboard. Generating kanji
or cyrillic symbols is harder because my keyboard lacks keys with the
corresponding labels.

> If MS did things right, all their apps can use those input methods. I
> suppose the big X11 environments could do things similarly. Maybe
> xterm+vi don't work, but gnome-terminal or gedit might.


So editing non-ASCII code requires special software, too. This is not
what I would call "easy".

--
Nils M Holm <n m h @ t 3 x . o r g> -- http://www.t3x.org/nmh/
Nils M Holm

2006-02-19, 7:56 am

Aaron Hsu <spam@sacrificumdeo.net> wrote:
> I cannot speak for all of MS's programs. But I know that Notepad
> supports the UTF style of inputting characters. On my Windows XP


I do not doubt that notepad does accept UTF input, but this does
not help you much when /editing/ Chinese or Japanese text with
a US/European keyboard. You still need a way to create the UTF
sequences, namely a keyboard with the proper labels and/or a
special keyboard driver.

--
Nils M Holm <n m h @ t 3 x . o r g> -- http://www.t3x.org/nmh/
Ulrich Hobelmann

2006-02-19, 7:56 am

Nils M Holm wrote:
> Ulrich Hobelmann <u.hobelmann@web.de> wrote:
>
> I do not say that you cannot pass key codes to programs. However, you
> have to create these key codes in some way. For example, I create an
> 'a' by pressing the key labelled 'a' on my keyboard. Generating kanji
> or cyrillic symbols is harder because my keyboard lacks keys with the
> corresponding labels.


That's exactly why you switch to Hiragana for instance, or why you
switch your keyboard into cyrillic mode for that. If you run Win XP,
there's a little applet down right that allows you to switch keyboard
layouts (unless you turned it off).

>
> So editing non-ASCII code requires special software, too. This is not
> what I would call "easy".


Typing requires special software. Accessing your floppy requires
special software. What's wrong with switching keyboard layouts, or with
using an input manager that turns your typed syllables into Japanese
characters?

--
Suffering from Gates-induced brain leakage...
Aaron Hsu

2006-02-19, 7:00 pm

Nils M Holm <before-2006-03-01@online.de> writes:

> Aaron Hsu <spam@sacrificumdeo.net> wrote:
>
> I do not doubt that notepad does accept UTF input, but this does
> not help you much when /editing/ Chinese or Japanese text with
> a US/European keyboard. You still need a way to create the UTF
> sequences, namely a keyboard with the proper labels and/or a
> special keyboard driver.


While I don't support the use of non-ascii character sequences in source
code for current day programming (not to say anything about the future
or whether I would ever support it), I also have to question how hard
it is to edit files that already have these characters in the code? For
what I would call the vast majority of most modern day heavily used
development platforms of which I am aware, there exists relatively easy
to use IME software compatible with most programming environments which
will provide the ability to create these special characters with a US
keyboard, usually this is a matter of simply configuring your OS to use
a different keyboard layout. For UNIX platforms, it can be a little more
tricky, but even these are fairly straightforward. RHEL, Mac OS X,
Debian, and (no experience, just hearsay) Ubuntu all make it very easy
to support other non-native character sets on one's computer using just
a US Keyboard.

I use Emacs for Scheme programming, so maybe I'll give it a try and see
how easy it is to get LEIM working . . . .

--
Aaron Hsu <spam@sacrificumdeo.net> Jabber: arcfide@xmpp.us
<http://www.sacrificumdeo.net> "Extend beyond the Mortal . . . ."
"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." - Benjamin Franklin
Pascal Bourguignon

2006-02-19, 7:00 pm

Nils M Holm <before-2006-03-01@online.de> writes:

> Ulrich Hobelmann <u.hobelmann@web.de> wrote:
>
> But Matthias said that editing kanji using a US keyboard was the
> /easy/ part. I do not own a Mac. What if I want to edit source
> code containing kanji with vi or emacs or (heaven forbid!) notepad?


What if I want to edit an ASCII file with a magnet?

--
__Pascal Bourguignon__ http://www.informatimago.com/

"A TRUE Klingon warrior does not comment his code!"
Nils M Holm

2006-02-19, 7:00 pm

Pascal Bourguignon <usenet@informatimago.com> wrote:
> What if I want to edit an ASCII file with a magnet?


The wish to be able to edit a /text/ file with a /text/ editor
is reasonable. The wish to do the same with a magnet? Well... :-)

--
Nils M Holm <n m h @ t 3 x . o r g> -- http://www.t3x.org/nmh/
Lauri Alanko

2006-02-19, 7:00 pm

In article <dt9k3i$ige$1@online.de>,
Nils M Holm <before-2006-03-01@online.de> wrote:
> But Matthias said that editing kanji using a US keyboard was the
> /easy/ part. I do not own a Mac. What if I want to edit source
> code containing kanji with vi or emacs or (heaven forbid!) notepad?


With emacs you just use any of the variety of input methods available.
In debian, you can just say

sudo apt-get install xemacs21-mule-canna-wnn canna
xemacs21-mule-canna-wnn

and in xemacs

M-x load-library
canna
M-x canna
M-x canna-toggle-japanese-mode

and you're ready to go.

By the way, I've lost track here: is this question about inputting CJK
via a standard keyboard supposed to illustrate a point relevant to
Scheme, or have we simply drifted far into the realm of off-topic?


Lauri
Nils M Holm

2006-02-19, 7:00 pm

Lauri Alanko <la@iki.fi> wrote:
> By the way, I've lost track here: is this question about inputting CJK
> via a standard keyboard supposed to illustrate a point relevant to
> Scheme, or have we simply drifted far into the realm of off-topic?


The point I was trying to make was that program text in pure,
unextended ASCII is a good thing, because you can edit it with
any text editor you like and without having to install additional
hardware and/or software. It (the point) was related to the
question whether or not R6RS should support Unicode identifiers
and/or should be case-sensitive.

For the record:

I think that Unicode identifiers make things worse for the reasons
I tried to explain, and I think that case-sensitivity is a bad
thing for similar reasons: it requires additional software in order
to emphasize parts of programs and it makes writing about Scheme
harder. See also: <d5304q$2qp$1@online.de>

--
Nils M Holm <n m h @ t 3 x . o r g> -- http://www.t3x.org/nmh/
Pascal Bourguignon

2006-02-19, 7:00 pm

Nils M Holm <before-2006-03-01@online.de> writes:

> Pascal Bourguignon <usenet@informatimago.com> wrote:
>
> The wish to be able to edit a /text/ file with a /text/ editor
> is reasonable. The wish to do the same with a magnet? Well... :-)


The wish to be able to edit a unicode text file with a unicode text
editor is reasonable.

The wish to be able to edit a unicode text file with an ASCII text
editor? Well...

As for NotePad, I think it can handle unicode, and for the input
method, they're usually provided by the OS, not by the application.
emacs is a counter example, because it provided the feature long
before OSes did. Or perhaps, it's just another example of the lemma,
since emacs is considered to be an OS in itself.

--
__Pascal Bourguignon__ http://www.informatimago.com/
Grace personified,
I leap into the window.
I meant to do that.
Jens Axel Søgaard

2006-02-19, 7:00 pm

Pascal Bourguignon wrote:

> As for NotePad, I think it can handle unicode, and for the input
> method, they're usually provided by the OS, not by the application.
> emacs is a counter example, because it provided the feature long
> before OSes did. Or perhaps, it's just another example of the lemma,
> since emacs is considered to be an OS in itself.


Fortunately Windows provides WordPad. Unlike NotePad it handles
Unix line breaks!

--
Jens Axel Søgaard
Alexander Schmolck

2006-02-19, 7:00 pm

Nils M Holm <before-2006-03-01@online.de> writes:

> Alexander Schmolck <a.schmolck@gmail.com> wrote:
>
> I think we "push around mono-spaced characters in text-files" because
> this method is a local optimum. If it was not, better methods would have
> taken over.


According to this logic, Indo-Arabic numerals are at best marginally better
than Roman ones. I'm wary of spending centuries in such "optima".

> Using ASCII exclusively in program text[1] is a very, very good idea,
> too, because it allows people to swap code over national boundaries and
> language boundaries.