CLisp case sensitivity
Thomas Gagne

2004-12-14, 3:57 am

I've read that Common Lisp is case sensitive, but have also noticed that
Allegro has a way of creating a case-sensitive image. Can the same thing be
done with clisp (on GNU/Linux)?
Pascal Bourguignon

2004-12-14, 3:57 am

Thomas Gagne <tgagne@wide-open-west.com> writes:

> I've read that Common Lisp is case sensitive, but have also noticed
> that Allegro has a way of creating a case-sensitive image. Can the
> same thing be done with clisp (on GNU/Linux)?


clisp is not Common Lisp: clisp is but one implementation of the
language named Common Lisp.


Common Lisp IS case sensitive, BUT its reader can be configured, and
its default configuration is to upcase every symbol, which means that
it's case insensitive. Other configurations preserve case,
rendering it effectively case sensitive, or even _invert_ case,
rendering it completely schizophrenic about case.

Read about READTABLE-CASE in the CLHS.
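
As a rough illustration of the reader configurations described above, the
sketch below reads the same text under different READTABLE-CASE modes and
returns the name of the symbol that results. SYMBOL-NAME-UNDER is a name made
up for this example; it works on a copy of the standard readtable, so the
global one is left untouched.

```lisp
;; SYMBOL-NAME-UNDER is a hypothetical helper, not part of any library.
(defun symbol-name-under (mode text)
  "Read TEXT as a symbol with readtable case MODE and return its name."
  (let ((*readtable* (copy-readtable nil))  ; fresh copy of the standard readtable
        (*package* (find-package "COMMON-LISP-USER")))
    (setf (readtable-case *readtable*) mode)
    (symbol-name (read-from-string text))))

(symbol-name-under :upcase   "FooBar")  ; => "FOOBAR" (the default mode)
(symbol-name-under :preserve "FooBar")  ; => "FooBar"
(symbol-name-under :invert   "foobar")  ; => "FOOBAR"
```

The symbols are interned in COMMON-LISP-USER as a side effect of reading.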


--
__Pascal Bourguignon__ http://www.informatimago.com/
Cats meow out of angst
"Thumbs! If only we had thumbs!
We could break so much!"
Adam Warner

2004-12-14, 3:57 am

Hi Thomas Gagne,

> I've read that Common Lisp is case sensitive, but have also noticed that
> Allegro has a way of creating a case-sensitive image. Can the same thing be
> done with clisp (on GNU/Linux)?


If the question is: Do any of the free Common Lisp implementations provide
a build-time option to intern all symbols in the COMMON-LISP package in
lower case so that :PRESERVE is a suitable readtable option?

The answer is: No.

The next best alternative is to use the :INVERT readtable mode. This
inverts the symbol name of all lowercase or all uppercase symbols as they
are being read while leaving the symbol name of mixed-case symbols alone.

This is the only suitable readtable option that maintains case information
because the ANSI Common Lisp committee decided backwards compatibility
with traditional uppercasing Lisps was most important. The decision hasn't
stood the test of time. If they'd made a better choice the pain of
transition would have been long over.

Readtable case should be deprecated. Symbols should be interned as
written in source code and implementors should not have the burden of
implementing "historical" baggage that is difficult to get 100% right
(e.g. ABCL is continuing to squash :INVERT mode read and print errors).
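
To make the read and print behaviour of :INVERT concrete, here is a small
sketch of what a conforming implementation should do. INVERT-ROUND-TRIP is a
hypothetical helper written for this example; it returns both the stored
symbol name and the printed form.

```lisp
;; INVERT-ROUND-TRIP is a hypothetical helper for illustration only.
(defun invert-round-trip (text)
  "Read TEXT under :INVERT; return the symbol name and its printed form."
  (let ((*readtable* (copy-readtable nil))
        (*package* (find-package "COMMON-LISP-USER")))
    (setf (readtable-case *readtable*) :invert)
    (let ((symbol (read-from-string text)))
      (values (symbol-name symbol)
              (prin1-to-string symbol)))))

(invert-round-trip "car")     ; => "CAR", "car"
(invert-round-trip "FooBar")  ; => "FooBar", "FooBar"
```

The single-case name is inverted on the way in and again on the way out, so
the text survives the round trip while the standard symbols stay uppercase
internally.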

Note that the ANSI Common Lisp specification is considered sacrosanct and
these comments heretical.

Regards,
Adam
Chris Capel

2004-12-14, 9:07 am

Adam Warner wrote:

> Hi Thomas Gagne,
>
>
> If the question is: Do any of the free Common Lisp implementations provide
> a build-time option to intern all symbols in the COMMON-LISP package in
> lower case so that :PRESERVE is a suitable readtable option?
>
> The answer is: No.
>
> The next best alternative is to use the :INVERT readtable mode. This
> inverts the symbol name of all lowercase or all uppercase symbols as they
> are being read while leaving the symbol name of mixed-case symbols alone.
>
> This is the only suitable readtable option that maintains case information
> because the ANSI Common Lisp committee decided backwards compatibility
> with traditional uppercasing Lisps was most important. The decision hasn't
> stood the test of time. If they'd made a better choice the pain of
> transition would have been long over.


Another example of the ramifications of this decision: inconsistent
function names. For example, the convention of using an "f" suffix on
functions that operate on places isn't followed everywhere (getf, push). The
convention of using a p or -p suffix with many type testing functions, but
not ATOM or NULL! I'm sure others can point out other examples.

Be thankful it isn't as bad as the C standard library. (Atoi? What's
Eh-toy? Some sort of faerie name?)

Chris Capel
Adam Warner

2004-12-14, 9:07 am

Hi Chris Capel,

> Another example of the ramifications of this decision: inconsistent
> function names. For example, the convention of using an "f" suffix on
> functions that operate on places isn't followed everywhere (getf, push). The
> convention of using a p or -p suffix with many type testing functions, but
> not ATOM or NULL! I'm sure others can point out other examples.


Look at the character predicates:

characterp
alpha-char-p
digit-char-p
graphic-char-p
standard-char-p

You can guess why we don't have `charp'. I find using ? to denote
predicates helpful. I don't start a symbol with a non-alphanumeric so the
punctuation marks can still be used as non-terminating dispatching macro
characters [just as # can still be used within symbol names, e.g.
-#x00FF and abc#|this-is-not-a-comment|#def are both symbols].
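
A quick way to check the two examples above is to read them with the
standard readtable and look at the resulting symbol names. This is only a
sketch; READ-SYMBOL-NAME is a name invented here.

```lisp
;; READ-SYMBOL-NAME is a hypothetical helper for this example.
(defun read-symbol-name (text)
  "Read TEXT with the standard readtable and return the symbol's name."
  (let ((*readtable* (copy-readtable nil))
        (*package* (find-package "COMMON-LISP-USER")))
    (let ((object (read-from-string text)))
      (check-type object symbol)
      (symbol-name object))))

(read-symbol-name "-#x00FF")
;; => "-#X00FF"
(read-symbol-name "abc#|this-is-not-a-comment|#def")
;; => "ABC#this-is-not-a-comment#DEF"
```

In the second case the vertical bars act as multiple escape characters inside
the token, which is why the text between them keeps its case.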

> Be thankful it isn't as bad as the C standard library. (Atoi? What's
> Eh-toy? Some sort of faerie name?)


ANSI Common Lisp's naming inconsistency no longer bothers me. At least
when annoyed I can resolve the issue using the package system.

Perhaps the most overlooked inconsistency is that LENGTH returns an
implementation-specific value for strings [largely depending upon whether
strings are implemented as sequences of octets (CMUCL, GCL, historically
SBCL), 16-bit values (ABCL) or 32-bit values (CLISP, SBCL)].

Regards,
Adam
Barry Margolin

2004-12-14, 4:08 pm

In article <pan.2004.12.14.10.49.34.276021@consulting.net.nz>,
Adam Warner <usenet@consulting.net.nz> wrote:

> Hi Chris Capel,
>
>
> Look at the character predicates:
>
> characterp
> alpha-char-p
> digit-char-p
> graphic-char-p
> standard-char-p


What's the problem there? The convention, which I think is even
explained explicitly in CLTL, is that "p" is appended to single words
(e.g. "character"), and "-p" is appended to multiple words (e.g.
"alpha-char").
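
The convention is mechanical enough to write down. The sketch below uses a
made-up helper, PREDICATE-NAME, that applies it to a name given as a string
or symbol.

```lisp
;; PREDICATE-NAME is a hypothetical helper that applies the convention.
(defun predicate-name (name)
  "Append P to a single-word NAME and -P to a hyphenated one."
  (let ((base (string-upcase (string name))))
    (concatenate 'string base (if (find #\- base) "-P" "P"))))

(predicate-name "character")   ; => "CHARACTERP"
(predicate-name "alpha-char")  ; => "ALPHA-CHAR-P"
```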

> Perhaps the most overlooked inconsistency is that LENGTH returns an
> implementation-specific value for strings [largely depending upon whether
> strings are implemented as sequences of octets (CMUCL, GCL, historically
> SBCL), 16-bit values (ABCL) or 32-bit values (CLISP, SBCL)].


What are you talking about? It returns the number of array elements,
regardless of their size.

--
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
Bruno Haible

2004-12-14, 4:08 pm

> have also noticed that Allegro has a way of creating a case-sensitive image.
> Can the same thing be done with clisp (on GNU/Linux)?


Yes, it can: Just start "clisp -modern". It uses the same memory image as
normal "clisp".

And it is even better than Allegro: In CLISP you can mix old-style source
code with modern case-sensitive source code. Thus you can migrate your big
applications to the modern case-sensitive mode slowly, package by package;
you're not forced to do it all at once.

The feature is in CLISP CVS and will be part of clisp-2.34; the
implementation follows the lines presented at LSM 2004 [1].

Bruno


[1] http://www-jcsu.jesus.cam.ac.uk/~cs...ning.html#htoc4
Chris Riesbeck

2004-12-14, 4:09 pm

In article <barmar-B18246.08360914122004@comcast.dca.giganews.com>,
Barry Margolin <barmar@alum.mit.edu> wrote:

> In article <pan.2004.12.14.10.49.34.276021@consulting.net.nz>,
> Adam Warner <usenet@consulting.net.nz> wrote:
>
> What's the problem there? The convention, which I think is even
> explained explicitly in CLTL, is that "p" is appended to single words
> (e.g. "character"), and "-p" is appended to multiple words (e.g.
> "alpha-char").


Correct, though let us not forget our old friend
string-lessp, which is NOT an exception, but does
invoke an additional rule:

http://www.cliki.net/Naming%20conventions
Adam Warner

2004-12-14, 9:01 pm

Hi Barry Margolin,

>> You can guess why we don't have `charp'.
>
> What's the problem there? The convention, which I think is even
> explained explicitly in CLTL, is that "p" is appended to single words
> (e.g. "character"), and "-p" is appended to multiple words (e.g.
> "alpha-char").


"You can guess why we don't have `charp'": All the other character
predicates and equality tests use the contracted form of character, char.
So does Scheme. It just so happens that charp looks stupid and would be
pronounced like see-harp, kar-pee or krap (because of the convention).

Pity the language designers forgot the convention for DEFSTRUCT! How hard
would it have been to parse the symbol name for #\-? An obvious problem is
that appending #\P to a symbol name can create strange combinations that
are best avoided by not applying the convention.
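
For readers who have not run into the DEFSTRUCT behaviour: the default
predicate always ends in -P, even for a one-word name, but the standard
:PREDICATE option lets you pick the name yourself. POINT and PIXEL below are
throwaway structure names chosen for this sketch.

```lisp
;; By default DEFSTRUCT names the predicate <name>-P, even for one word.
(defstruct point x y)
(point-p (make-point :x 1 :y 2))  ; => true

;; The :PREDICATE option lets you follow the one-word convention by hand.
(defstruct (pixel (:predicate pixelp)) x y)
(pixelp (make-pixel :x 1 :y 2))   ; => true
```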

>
> What are you talking about? It returns the number of array elements,
> regardless of their size.


Even with the same operating system and locale different implementations
of ANSI Common Lisp cons up strings of different lengths when the input is
identical. This means the position of corresponding characters is
implementation dependent and the length of any resulting string is
implementation dependent.

[In the table below I'm assuming ABCL has completed its Java string
support so that Unicode characters are correctly read and stored in 16-bit
strings. Input is UTF-8]

STRING   CLISP/SBCL   ABCL   CMUCL/GCL
"A"      1            1      1
"Ā"      1            1      2
"✐"      1            1      3
"𐀀"      1            2      4
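
The numbers in the table follow from simple arithmetic on the code point, so
they can be reproduced without relying on any implementation's character
support. The sketch below assumes the four rows are U+0041, a two-octet
character taken here to be U+0100, U+2710 and U+10000; the function names are
invented for the example.

```lisp
;; Hypothetical helpers; they work on integer code points, not characters.
(defun utf-8-octets (code-point)
  "Number of octets UTF-8 needs for CODE-POINT."
  (cond ((< code-point #x80) 1)
        ((< code-point #x800) 2)
        ((< code-point #x10000) 3)
        (t 4)))

(defun utf-16-units (code-point)
  "Number of 16-bit code units UTF-16 needs for CODE-POINT."
  (if (< code-point #x10000) 1 2))

;; One row per code point: 32-bit units, 16-bit units, octets.
(mapcar (lambda (code-point)
          (list 1 (utf-16-units code-point) (utf-8-octets code-point)))
        '(#x41 #x100 #x2710 #x10000))
;; => ((1 1 1) (1 1 2) (1 1 3) (1 2 4))
```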

Java dictates that all implementations have the same string representation
(a suboptimal one, but at least it's the same). Python has the same
issue as Common Lisp, but users can choose which way to build it:
<http://python.fyxm.net/peps/pep-0261.html>

Regards,
Adam
Julian Stecklina

2004-12-15, 3:57 am

Adam Warner <usenet@consulting.net.nz> writes:
> This is the only suitable readtable option that maintains case information
> because the ANSI Common Lisp committee decided backwards compatibility
> with traditional uppercasing Lisps was most important. The decision hasn't
> stood the test of time. If they'd made a better choice the pain of
> transition would have been long over.
>
> Readtable case should be deprecated. Symbols should be interned as
> written in source code and implementors should not have the burden of
> implementing "historical" baggage that is difficult to get 100% right
> (e.g. ABCL is continuing to squash :INVERT mode read and print errors).


What pain is it to have symbols be converted to upcase by default? Do
you want a case-sensitive Lisp? Do you want Read, read and READ to be
three distinct symbols?

> Note that the ANSI Common Lisp specification is considered sacrosanct and
> these comments heretical.


:)

Regards,
--
____________________________
Julian Stecklina / _________________________/
________________/ /
\_________________/ LISP - truly beautiful
Julian Stecklina

2004-12-15, 3:57 am

Bruno Haible <bruno@clisp.org> writes:

>
> Yes, it can: Just start "clisp -modern". It uses the same memory image as
> normal "clisp".


Ok, I do not get it. Why is case-sensitivity = modern? Looks like
clisp -old-school to me.

Regards,
--
____________________________
Julian Stecklina / _________________________/
________________/ /
\_________________/ LISP - truly beautiful
Adam Warner

2004-12-15, 3:57 am

Hi Julian Stecklina,

>
> What pain is it to have symbols be converted to upcase by default?


It's the reason _why_ they're uppercased which is the issue: Symbols in
the "COMMON-LISP" package are interned in uppercase.

> Do you want a case-sensitive Lisp?


Yes.

> Do you want Read, read and READ to be three distinct symbols?


Yes. With the symbol name to correspond with the textual name. This
eliminates many printing issues.

By the way I've started using uppercase text to refer to constants. It's
better than the +constant+ convention, especially when mixing constants
and arithmetic: (+ CONSTANT1 CONSTANT2) is simply more legible than
(+ +constant1+ +constant2+).

Also consider this: One Lisp is Unicode code point aware like CLISP and
SBCL and cannot uppercase a character such as #\ß within a string. So
when the symbol ß is interned it's given the symbol name "ß".

Another hypothetical Lisp implementation uppercases the string "ß" as
"SS". So the symbol "ß" is interned with the symbol name "SS".

When converting back from the internal encodings to a common external
encoding the symbols names /no longer correspond/. This is caused by the
unnecessary case conversion. If the case had been left alone the
differing Unicode capabilities of the implementations would have been
irrelevant to textual symbol identity.
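
For comparison with the hypothetical "SS" implementation: the standard
STRING-UPCASE converts character by character, as if by CHAR-UPCASE, so its
result always has the same length as its input. The sketch below assumes the
implementation has a character with code #xDF (sharp s in Latin-1 and
Unicode) and returns NIL otherwise; SHARP-S-UPCASED is a made-up name.

```lisp
;; SHARP-S-UPCASED is a hypothetical helper for this example.
(defun sharp-s-upcased ()
  "Return STRING-UPCASE of a one-character sharp-s string, or NIL if unsupported."
  (let ((sharp-s (and (< #xDF char-code-limit)
                      (code-char #xDF))))
    (and sharp-s (string-upcase (string sharp-s)))))

;; Whatever character comes back, the result is one character long,
;; so it can never be the two-character string "SS".
(let ((result (sharp-s-upcased)))
  (or (null result) (length result)))  ; => 1 when the character exists
```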

Regards,
Adam
Pascal Bourguignon

2004-12-15, 3:57 am

Adam Warner <usenet@consulting.net.nz> writes:
> Another hypothetical Lisp implementation uppercases the string "ß" as
> "SS". So the symbol "ß" is interned with the symbol name "SS".
>
> When converting back from the internal encodings to a common external
> encoding the symbols names /no longer correspond/. This is caused by the
> unnecessary case conversion. If the case had been left alone the
> differing Unicode capabilities of the implementations would have been
> irrelevant to textual symbol identity.


Of course. That's why you should put:
(setf (readtable-case *readtable*) :preserve)
in your ~/.clisprc, and use my emacs M-x upcase-lisp RET command
to update old code.
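
The reason old code needs upcasing under :PRESERVE is that the standard
symbols are still interned in uppercase. A minimal sketch, assuming this
definition itself is loaded with the default readtable; READS-AS-CL-CAR-P is
a name invented for the example.

```lisp
;; READS-AS-CL-CAR-P is a hypothetical helper for illustration only.
(defun reads-as-cl-car-p (text)
  "True if TEXT, read under :PRESERVE, is the standard symbol CAR."
  (let ((*readtable* (copy-readtable nil))
        (*package* (find-package "COMMON-LISP-USER")))
    (setf (readtable-case *readtable*) :preserve)
    (eq (read-from-string text) 'car)))

(reads-as-cl-car-p "CAR")  ; => T
(reads-as-cl-car-p "car")  ; => NIL, a new symbol named "car" is interned
```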


--
__Pascal Bourguignon__ http://www.informatimago.com/
Cats meow out of angst
"Thumbs! If only we had thumbs!
We could break so much!"
Cameron MacKinnon

2004-12-15, 3:57 am

Julian Stecklina wrote:
> What pain is it to have symbols be converted to upcase by default? Do
> you want a case-sensitive Lisp? Do you want Read, read and READ to be
> three distinct symbols?


Of course. Who wouldn't?

Are there people out there who use the case insensitivity of symbols to
advantage in their own code? No. Are there teams of programmers working
on projects where each programmer uses his own capitalization style?
Possibly, but we should not cater to errant ones.

Are there programmers who would like to aesthetically improve their code
(by their standards, not mine) or encode more information into their
symbols via selective capitalization? Yes. They should not be made to
feel that their choice is in any way unnatural or discouraged.

Or are you paranoid that one day YOU WILL BE STUCK IN AN ELECTRONIC
JUNKYARD IN A THIRD WORLD METROPOLIS, AND YOUR ONLY CONNECTION TO YOUR
LISP IMAGE WILL BE THROUGH A TERMINAL THAT DOES NOT SUPPORT MIXED CASE?
Barry Margolin

2004-12-15, 3:57 am

In article <pan.2004.12.14.22.46.30.996369@consulting.net.nz>,
Adam Warner <usenet@consulting.net.nz> wrote:

> Hi Barry Margolin,
>
>
> Even with the same operating system and locale different implementations
> of ANSI Common Lisp cons up strings of different lengths when the input is
> identical. This means the position of corresponding characters is
> implementation dependent and the length of any resulting string is
> implementation dependent.
>
> [In the table below I'm assuming ABCL has completed its Java string
> support so that Unicode characters are correctly read and stored in 16-bit
> strings. Input is UTF-8]
>
> STRING   CLISP/SBCL   ABCL   CMUCL/GCL
> "A"      1            1      1
> "Ā"      1            1      2
> "?"      1            1      3
> "?"      1            2      4


This looks like bugs to me. All those strings should have length 1,
since they just contain a single character. LENGTH is supposed to count
the number of characters, *not* the number of bytes.

Are you sure these are really strings you're creating, and not byte
arrays that you're filling in by reading a file as binary?

--
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
Adam Warner

2004-12-15, 3:57 am

Hi Barry Margolin,

[Snipped because you replied with a content type of text/plain]

> This looks like bugs to me. All those strings should have length 1,
> since they just contain a single character. LENGTH is supposed to count
> the number of characters, *not* the number of bytes.


It is not necessarily a bug for a string to be stored internally using a
particular encoding, whether that encoding be UTF-8, UTF-16 or UTF-32. All
three encodings can be thought of as variable width encodings because none
can store all grapheme clusters within one "character":
<http://www.unicode.org/reports/tr29...ster_Boundaries>

You told me in your previous reply that length "returns the number of
array elements, regardless of their size." As ANSI Common Lisp doesn't
define characters to be of a particular size please tell me what the
correct internal encoding should be. Then I'll tell you that every other
implementation now has to traverse each string to determine its length,
and that length will not necessarily equal the number of array elements in
the string.

And if you define the official character size as 32-bit/enough to hold a
Unicode code point then you'll have to explain why an implementation that
implements characters as grapheme clusters is non-conforming with respect
to LENGTH and CHAR.

> Are you sure these are really strings you're creating, and not byte
> arrays that you're filling in by reading a file as binary?


It all depends upon what an operating system or language defines a
character to be. A character in Java is a 16-bit unsigned value because it
is defined that way, not because it can hold all Unicode code points.
Because of this definition the length of an arbitrary string is the same
across implementations. ANSI Common Lisp doesn't define the size of a
character. Allegro for example corresponds with Java:
<http://www.franz.com/support/docume...#memory-usage-2>

Do you claim that the decision to store characters internally as 16-bit
values is non-conforming? Length and array references will differ from an
implementation with 32-bit characters. So who's right? Are they all wrong
and the only implementation capable of returning the correct answer is one
which implements strings as sequences of grapheme clusters (like the
Parrot virtual machine)?

I made a simple claim Barry: Since ANSI Common Lisp doesn't define the
size of a character the length of an arbitrary string will be
implementation specific. I am sure of this claim because no one has put
their foot down and told implementors, for better or worse, that
characters are a fixed size of n-bits or that characters must be handled
as grapheme clusters of variable size.

Regards,
Adam
Peter Seibel

2004-12-15, 3:57 am

Adam Warner <usenet@consulting.net.nz> writes:

> I made a simple claim Barry: Since ANSI Common Lisp doesn't define
> the size of a character the length of an arbitrary string will be
> implementation specific. I am sure of this claim because no one has
> put their foot down and told implementors, for better or worse, that
> characters are a fixed size of n-bits or that characters must be
> handled as grapheme clusters of variable size.


Why should the size of characters have anything at all to do with the
length of strings? Strings are measured in characters so whether you
use 8 bits or 8 megs to represent each character should have nothing
to do with the value LENGTH returns when passed a string. In those
implementations that return some number greater than 1 for a
"one-character" string, what do they return for (char s 1) (char s 2)
and (char s 3)?

-Peter

--
Peter Seibel peter@javamonkey.com

Lisp is the red pill. -- John Fraser, comp.lang.lisp
Carl Shapiro

2004-12-15, 8:57 am

Julian Stecklina <der_julian@web.de> writes:

> Bruno Haible <bruno@clisp.org> writes:
>
>
> Ok, I do not get it. Why is case-sensitivity = modern? Looks like
> clisp -old-school to me.


Modern mode ought to have been named Franzlisp mode, reflecting the
lineage of the case sensitive reader algorithm in Allegro Common Lisp.
Adam Warner

2004-12-15, 8:57 am

Hi Peter Seibel,

> Why should the size of characters have anything at all to do with the
> length of strings? Strings are measured in characters so whether you use
> 8 bits or 8 megs to represent each character should have nothing to do
> with the value LENGTH returns when passed a string.


It's the translation from a defined external encoding to the implementation's
internal encoding which determines the internal length of the string. The
internal length may differ between implementations because the size of a
"character" unit differs between implementations.

> In those implementations that return some number greater than 1 for a
> "one-character" string, what do they return for (char s 1) (char s 2)
> and (char s 3)?


Let's take a common example: Java/Windows/.NET/any implementation with
16-bit strings: Strings are stored in UTF-16, with code points >= 2^16
stored as high and low surrogates: <http://www.unicode.org/glossary/#UTF_16>

In such implementations a code point in the range #x10000 to #x10FFFF has
a length of two. Here are some tables setting out the translation:
<http://www.i18nguy.com/unicode/surrogatetable.html>

The 16-bit sequence #xD800 #xDC00 corresponds with the code point #x10000.
That's your (char s 0) and (char s 1) respectively. In a 32-bit character
implementation (char s 0) would be #x10000 and (char s 1) would be out of
range.
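
The surrogate arithmetic is short enough to sketch directly on integers. The
two function names below are made up for this example, and no range checking
is done.

```lisp
;; Hypothetical helpers; both work on integers rather than Lisp characters.
(defun code-point-to-surrogates (code-point)
  "Return the high and low UTF-16 surrogates for CODE-POINT (#x10000 to #x10FFFF)."
  (let ((offset (- code-point #x10000)))
    (values (+ #xD800 (ldb (byte 10 10) offset))   ; top 10 bits
            (+ #xDC00 (ldb (byte 10 0) offset))))) ; bottom 10 bits

(defun surrogates-to-code-point (high low)
  "Combine a HIGH and LOW surrogate back into a code point."
  (+ #x10000
     (ash (- high #xD800) 10)
     (- low #xDC00)))

(code-point-to-surrogates #x10000)        ; => 55296, 56320 (#xD800, #xDC00)
(surrogates-to-code-point #xD800 #xDC00)  ; => 65536 (#x10000)
```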

Regards,
Adam
Pascal Bourguignon

2004-12-15, 8:57 am

Adam Warner <usenet@consulting.net.nz> writes:

> Hi Peter Seibel,
>
>
> It's the translation from a defined external encoding to the implementation's
> internal encoding which determines the internal length of the string. The
> internal length may differ between implementations because the size of a
> "character" unit differs between implementations.
>
>
> Let's take a common example: Java/Windows/.NET/any implementation with
> 16-bit strings: Strings are stored in UTF-16, with code points >= 2^16
> stored as high and low surrogates: <http://www.unicode.org/glossary/#UTF_16>
>
> In such implementations a code point in the range #x10000 to #x10FFFF has
> a length of two. Here are some tables setting out the translation:
> <http://www.i18nguy.com/unicode/surrogatetable.html>
>
> The 16-bit sequence #xD800 #xDC00 corresponds with the code point #x10000.
> That's your (char s 0) and (char s 1) respectively. In a 32-bit character
> implementation (char s 0) would be #x10000 and (char s 1) would be out of
> range.


You have to distinguish characters (code points) and codes (integers).

If you want to encode full unicode (i.e., the 110000(hex) code points) in
16-bit, then use (vector (unsigned-byte 16)) and assume the
consequences (ie. (length vector-of-codes) is not the number of
characters, but the number of _codes_).

Otherwise, use a unicode-enabled lisp implementation, like clisp or
sbcl, put your unicode characters in a string and get the number of
characters with (LENGTH string).

Trying to hold an encoded sequence into a lisp string is a cheap
kludge inherited from the bad C char==8-bit-integer mentality that
should not occur in lisp.

If you want to process unicode data in a lisp implementation that can
handle only iso-8859-1 characters, then you must not use the string
and character types, but only (vector (unsigned-byte 8)) for utf-8;
(vector (unsigned-byte 16)) for utf-16
and (vector (integer 0 #x10ffff)) for the full unicode.
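
A minimal sketch of the vector-of-codes approach for UTF-8, assuming the
input is a list of integer code points. UTF-8-ENCODE is a name invented
here, and the function does not reject surrogates or out-of-range values.

```lisp
;; UTF-8-ENCODE is a hypothetical helper; it never touches the string type.
(defun utf-8-encode (code-points)
  "Encode a list of integer CODE-POINTS into a (vector (unsigned-byte 8))."
  (let ((octets (make-array 0 :element-type '(unsigned-byte 8)
                              :adjustable t :fill-pointer 0)))
    (flet ((emit (octet)
             (vector-push-extend octet octets))
           (continuation (code-point position)
             ;; 10xxxxxx: six payload bits taken from POSITION upwards
             (logior #x80 (ldb (byte 6 position) code-point))))
      (dolist (code-point code-points octets)
        (cond ((< code-point #x80)
               (emit code-point))
              ((< code-point #x800)
               (emit (logior #xC0 (ldb (byte 5 6) code-point)))
               (emit (continuation code-point 0)))
              ((< code-point #x10000)
               (emit (logior #xE0 (ldb (byte 4 12) code-point)))
               (emit (continuation code-point 6))
               (emit (continuation code-point 0)))
              (t
               (emit (logior #xF0 (ldb (byte 3 18) code-point)))
               (emit (continuation code-point 12))
               (emit (continuation code-point 6))
               (emit (continuation code-point 0))))))))

(length (utf-8-encode '(#x41 #x100 #x2710 #x10000)))  ; => 10 codes for 4 characters
```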



--
__Pascal Bourguignon__ http://www.informatimago.com/
Cats meow out of angst
"Thumbs! If only we had thumbs!
We could break so much!"
Bruno Haible

2004-12-15, 8:57 am

Carl Shapiro wrote:
> Modern mode ought to have been named Franzlisp mode, reflecting the
> lineage of the case sensitive reader algorithm in Allegro Common Lisp.


clisp is not using Allegro CL's algorithm, but a new one.
In Allegro, the case-sensitivity bit is in the readtable. In clisp, it
is per package.

Bruno
Barry Margolin

2004-12-15, 3:59 pm

In article <pan.2004.12.15.08.44.43.362497@consulting.net.nz>,
Adam Warner <usenet@consulting.net.nz> wrote:

> Hi Peter Seibel,
>
>
> It's the translation from a defined external encoding to the implementation's
> internal encoding which determines the internal length of the string. The
> internal length may differ between implementations because the size of a
> "character" unit differs between implementations.


But the internal representation is not supposed to be visible to the
user.

Consider the following:

(defvar *array*)
(setq *array* (make-array 1))
(setf (aref *array* 0) (expt 2 255))
(print (length *array*))

Even though the array contains a bignum whose representation takes at
least 32 bytes, this should print 1.

Lisp is a high-level language, and LENGTH is supposed to deal in the
high-level concept of characters, not bytes.

--
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
Harald Hanche-Olsen

2004-12-15, 3:59 pm

+ Barry Margolin <barmar@alum.mit.edu>:

| Lisp is a high-level language, and LENGTH is supposed to deal in the
| high-level concept of characters, not bytes.

But if I have understood Adam Warner's point correctly, it is that
Unicode has /two different/ high-level concepts of characters, one at
a higher level than the other: The higher level concept is that of
grapheme clusters, each of which contains one or more "ordinary"
unicode characters.

As far as I understand, end users should only ever see grapheme
clusters. But Common Lisp programmers are hardly end users, and it is
not at all obvious (to me) if Lisp characters should correspond to
grapheme clusters or simple characters.

--
* Harald Hanche-Olsen <URL:http://www.math.ntnu.no/~hanche/>
- Debating gives most of us much more psychological satisfaction
than thinking does: but it deprives us of whatever chance there is
of getting closer to the truth. -- C.P. Snow
Ingvar

2004-12-15, 3:59 pm

Harald Hanche-Olsen <hanche@math.ntnu.no> writes:

> + Barry Margolin <barmar@alum.mit.edu>:
>
> | Lisp is a high-level language, and LENGTH is supposed to deal in the
> | high-level concept of characters, not bytes.
>
> But if I have understood Adam Warner's point correctly, it is that
> Unicode has /two different/ high-level concepts of characters, one at
> a higher level than the other: The higher level concept is that of
> grapheme clusters, each of which contains one or more "ordinary"
> unicode characters.


One is "underlying encoding" and one is "character" (well, possibly
code-point, since I'm not entirely clear on how to deal with the
combining characters). If I remember Adam's discussion on this on the
SBCL list a week or three ago, it seems as if he actually wants the
"external format" exposed (UTF-8, from what I recall) and that is "the
obviously incorrect" implementation, since that combined with SETF can
cause non-compliant UTF-8 encoding.

> As far as I understand, end users should only ever see grapheme
> clusters. But Common Lisp programmers are hardly end users, and it is
> not at all obvious (to me) if Lisp characters should correspond to
> grapheme clusters or simple characters.


I don't think exposing either grapheme clusters or combining
characters can cause an "invalid" (though it can probably cause a
nonsensical) string, whereas exposing an underlying UTF-8 definitely
can cause "invalid" strings (it's simple, just look at the
wide-character exploits used to circumvent path-checking constraints
for earlier versions of web servers). This, in my not so humble
opinion, is *not* something you want any old lisp program to do.

//Ingvar
--
(defun p(i d)(cond((not i)(terpri))((car i)(let((l(cadr i))(d(nthcdr(car i)d
)))(princ(elt(string(car d))l))(p(cddr i)d)))(t(princ #\space)(p(cdr i)d))))
(p'(76 2 1 3 1 4 1 6()0 5()16 10 0 7 0 8 0 9()2 6 0 0 12 4 23 4 1 4 8 8)(sort
(loop for x being the external-symbols in :cl collect (string x)) #'string< ))
Peter Seibel

2004-12-15, 3:59 pm

Adam Warner <usenet@consulting.net.nz> writes:

> Hi Peter Seibel,
>
>
> It's the translation from a defined external encoding to the
> implementation's internal encoding which determines the internal
> length of the string. The internal length may differ between
> implementations because the size of a "character" unit differs
> between implementations.


But the "internal length" of the string has nothing to do with the
value of LENGTH, or should not. The LENGTH of a string is the number
of characters it contains.

>
> Let's take a common example: Java/Windows/.NET/any implementation
> with 16-bit strings: Strings are stored in UTF-16, with code points
> >= 2^16 stored as high and low surrogates:
> <http://www.unicode.org/glossary/#UTF_16>


Yes, I'm with you so far. That just means that LENGTH has to be
implemented in a smarter way--it has to scan the array of code-points
looking for surrogate pairs in order to determine how many characters
are in the string. (That Java's String.length() method doesn't do this
will no doubt cause no end of problems down the line.)
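
The scan described above can be sketched over a plain vector of 16-bit code
units: count every unit except low surrogates, since each of those is the
second half of one character. This assumes well-formed UTF-16 input, and
UTF-16-CHARACTER-COUNT is a name made up for the example.

```lisp
;; UTF-16-CHARACTER-COUNT is a hypothetical helper for illustration only.
(defun utf-16-character-count (code-units)
  "Count code points in a sequence of well-formed UTF-16 CODE-UNITS."
  (count-if (lambda (unit)
              ;; A low surrogate only completes a pair, so skip it.
              (not (<= #xDC00 unit #xDFFF)))
            code-units))

(utf-16-character-count #(#x41 #xD800 #xDC00))  ; => 2
(length #(#x41 #xD800 #xDC00))                  ; => 3
```

The price is that counting becomes linear in the length of the vector.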

> In such implementations a code point in the range #x10000 to
> #x10FFFF has a length of two.


A representational length. But it's still one character. Or ought to
be. Java blew this one and is now suffering the consequences. Some
Common Lisps may have taken the same approach but that seems wrong.

> Here are some tables setting out the translation:
> <http://www.i18nguy.com/unicode/surrogatetable.html>
>
> The 16-bit sequence #xD800 #xDC00 corresponds with the code point
> #x10000. That's your (char s 0) and (char s 1) respectively.


But that can't be because CHAR returns a character and (assuming a
Unicode capable Lisp) there is no char with the char-code #xD800,
right? Now if you're trying to process Unicode strings in a Lisp that
doesn't actually support Unicode, I'm not entirely surprised that it
doesn't work. But you've got all kinds of problems there; you can't,
for instance, say (code-char #x10000).
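
Whether a given Lisp can represent such a character at all can be checked
portably with CHAR-CODE-LIMIT before calling CODE-CHAR. The sketch below uses
a made-up helper, ONE-CHARACTER-STRING-LENGTH, which returns NIL when the
implementation has no character for the code.

```lisp
;; ONE-CHARACTER-STRING-LENGTH is a hypothetical helper for this example.
(defun one-character-string-length (code)
  "LENGTH of a string holding the character with code CODE, or NIL if unsupported."
  (let ((character (and (< code char-code-limit)
                        (code-char code))))
    (and character
         (length (string character)))))

(one-character-string-length #x10000)
;; => 1 where the character exists, NIL otherwise
```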

-Peter

--
Peter Seibel peter@javamonkey.com

Lisp is the red pill. -- John Fraser, comp.lang.lisp
jayessay

2004-12-15, 3:59 pm

Cameron MacKinnon <cmackin+nn@clearspot.net> writes:

> Julian Stecklina wrote:
>
> Of course. Who wouldn't?


A lot of people. Myself included. For symbols with such string-equal
names as the one indicated, case sensitivity is a broken mode.

> Are there programmers who would like to aesthetically improve their
> code (by their standards, not mine) or encode more information into
> their symbols via selective capitalization? Yes


But this is irrelevant (as you should understand).


The mode mechanism as provided in CLISP is much more on track.


/Jon

--
'j' - a n t h o n y at romeo/charley/november com
jayessay

2004-12-15, 3:59 pm

Bruno Haible <bruno@clisp.org> writes:

> Carl Shapiro wrote:
>
> clisp is not using Allegro CL's algorithm, but a new one.
> In Allegro, the case-sensitivity bit is in the readtable. In clisp, it
> is per package.


This smells pretty close to the right way to do it. Or have a
readtable per package and leave it in the readtable (or maybe that is
what you meant, and that Allegro only has global tables).


/Jon

--
'j' - a n t h o n y at romeo/charley/november com
Alexander Schmolck

2004-12-15, 3:59 pm

Bruno Haible <bruno@clisp.org> writes:

> Carl Shapiro wrote:
>
> clisp is not using Allegro CL's algorithm, but a new one.
> In Allegro, the case-sensitivity bit is in the readtable. In clisp, it
> is per package.


Do keywords work properly?

'as
sds

2004-12-15, 3:59 pm

Alexander Schmolck wrote:
> Bruno Haible <bruno@clisp.org> writes:
>
>
> Do keywords work properly?


yes.
<http://www.podval.org/~sds/clisp/im...l#cs-gensym-kwd>

Cameron MacKinnon

2004-12-15, 3:59 pm

jayessay wrote:
> Cameron MacKinnon <cmackin+nn@clearspot.net> writes:
>
>
>
>
> A lot of people. Myself included. For symbols with such string-equal
> names as the one indicated, case sensitivity is a broken mode.


Yes, it is. Do you have a body of code for which this is a problem? I
don't think there are any such codebases out there, whose owners
wouldn't a) admit that the random capitalization is unintentional cruft
which ought, ideally, to be more uniform and b) be able to fix it in
minutes with a quickie script written for the purpose.
Duane Rettig

2004-12-15, 8:57 pm

Adam Warner <usenet@consulting.net.nz> writes:

> Hi Barry Margolin,


> You told me in your previous reply that length "returns the number of
> array elements, regardless of their size." As ANSI Common Lisp doesn't
> define characters to be of a particular size please tell me what the
> correct internal encoding should be. Then I'll tell you that every other
> implementation now has to traverse each string to determine its length,


This is exactly true. And an implementation has a choice as to whether
to implement strings with a constant-width encoding, to make LENGTH
work efficiently, or to sacrifice LENGTH efficiency in order to use a
variable-width encoding. Either way, LENGTH must work correctly, and
it is very simply defined on character count, independent of
its internal representation for strings. Note that in a string where
the characters are of varying width, CHAR, AREF, and their setf inverses
also are no longer

> and that length will not necessarily equal the number of array elements in
> the string.


This cannot be true, by definition. A string is a vector of characters,
period. If the characters are implemented in a variable-width manner, then
the elements themselves are of varying width, but still have the same count.

One could get around the need for more than 8, 16, or 32 bits to represent
some characters by defining all characters to be boxed values, instead of
immediates. This wouldn't be efficient, but it would allow strings to be
implemented as lispval pointers to character boxed-objects. But then that
would make the strings have fixed-width elements, wouldn't it? :-)

> And if you define the official character size as 32-bit/enough to hold a
> Unicode code point then you'll have to explain why an implementation that
> implements characters as grapheme clusters is non-conforming with respect
> to LENGTH and CHAR.


As has been stated elsewhere in this thread, Allegro CL implements strings
internally as fixed-width arrays of characters. We provide versions of
the lisp that have 8-bit characters and 16-bit characters, but only one
in each lisp (it reduces type complexity and runtime discrimination
requirements). All other encodings of strings are treated as external-formats,
and are handled by streams. Since we use simple-streams to encode between
arrays of octets and arrays of characters, the translation from external
("native") to internal ("character", "string") is a simple matter of
using the external-format for the conversion. I've lost track of how
many external-formats we currently provide, but the representation of
grapheme clusters, etc., could be simply a matter of writing an
external format for it, if not already available.

The length of an external ("native") string is given in terms of
octets; it is calculable via excl:native-string-sizeof, which
returns the number of octets in the string argument (which is
not a Lisp string, but an external representation of a string;
we call it "native" because it is presumably native to the
operating system hosting our lisp). This function must indeed
traverse the string to figure out how many characters are in it.
But it is not the LENGTH function; it would be a nonconformance
to replace LENGTH with this function.
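
To make the two measures concrete, here is a small sketch that keeps them apart. It works on lists of code points given as integers, so it does not depend on any implementation's CHAR-CODE-LIMIT, and it does not use excl:native-string-sizeof; the names UTF8-OCTETS-FOR and UTF8-SIZE are invented for the example.

```lisp
;; Hypothetical helpers: element count versus external (UTF-8) size.
(defun utf8-octets-for (code-point)
  "Number of octets UTF-8 needs to encode CODE-POINT."
  (cond ((< code-point #x80) 1)
        ((< code-point #x800) 2)
        ((< code-point #x10000) 3)
        (t 4)))

(defun utf8-size (code-points)
  "Total UTF-8 size, in octets, of a sequence of code points."
  (reduce #'+ code-points :key #'utf8-octets-for))
```

For the three code points (#x41 #xDF #x10000), LENGTH of the list is 3 while UTF8-SIZE is 7. The first number is a count of elements, the second a property of one particular external encoding.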

>
> It all depends upon what an operating system or language defines a
> character to be. A character in Java is a 16-bit unsigned value because it
> is defined that way, not because it can hold all Unicode code points.
> Because of this definition the length of an arbitrary string is the same
> across implementations. ANSI Common Lisp doesn't define the size of a
> character. Allegro for example corresponds with Java:
> <http://www.franz.com/support/docume...#memory-usage-2>


Yes, internally, Allegro CL uses 16-bit characters for strings. An
8-bit version exists as well, and a 32-bit version could conceivably
be made available if demand were high (so far, it is not). The link
you quote is an explanation of the size increase when moving from an
8-bit representation to a 16-bit representation. It has nothing to do
with the interaction we provide for the external world.

> Do you claim that the decision to store characters internally as 16-bit
> values is non-conforming? Length and array references will differ from an
> implementation with 32-bit characters. So who's right? Are they all wrong
> and the only implementation capable of returning the correct answer is one
> which implements strings as sequences of grapheme clusters (like the
> Parrot virtual machine)?


I think you misunderstand Barry, because you are not allowing for a
split between internal representations and external formats. I didn't
see any such claim to nonconformance in his response.

> I made a simple claim Barry: Since ANSI Common Lisp doesn't define the
> size of a character the length of an arbitrary string will be
> implementation specific.


This claim is false, by definition, since length is specified in
terms of a count, and not in terms of widths in some other units
of measure.

> I am sure of this claim because no one has put
> their foot down and told implementors, for better or worse, that
> characters are a fixed size of n-bits or that characters must be handled
> as grapheme clusters of variable size.


The implementation decision is a choice, but the requirement to count
elements (i.e. characters) is not. That seals the tradeoff consideration.

--
Duane Rettig duane@franz.com Franz Inc. http://www.franz.com/
555 12th St., Suite 1450 http://www.555citycenter.com/
Oakland, Ca. 94607 Phone: (510) 452-2000; Fax: (510) 452-0182
Carl Shapiro

2004-12-15, 8:57 pm

jayessay <nospam@foo.com> writes:

> Bruno Haible <bruno@clisp.org> writes:
>
>
> This smells pretty close to the right way to do it. Or have a
> readtable per package and leave it in the readtable (or maybe that is
> what you meant, and that Allegro only has global tables).


The right way to handle different reader modes is to have a means to
declaratively specify a syntax. A syntax definition would include,
among other things, a readtable, symbol-lookup mechanism and package
mappings. Without at least these three features you are going to end
up with a zombie environment which is not self consistent, and users
will have to resort to offbeat idioms to write code which works
everywhere. Syntaxes can be invoked on a per-module basis, or switched
in and out of at the top-level. (In summary, associating this
behavior with packages is far from sufficient.)
Adam Warner

2004-12-15, 8:57 pm

Hi Duane Rettig,

>
> This claim is false, by definition, since length is specified in terms
> of a count, and not in terms of widths in some other units of measure.


Here is an arbitrary string encoded in UTF-8: "𐀀" [You may generate it
in CLISP using (string (code-char #x10000))]. It consists of a single code
point.

I expect (cl:length "𐀀") will NOT return 1 in a 16-bit character Allegro
yet it will return 1 in CLISP and SBCL. I expect:

(let ((s (copy-seq "𐀀")))
  (setf (char s 0) #\A)
  s)

will return additional garbage because it replaces the first half of a
surrogate character. I expect this will also be legal in Allegro but
not CLISP and SBCL:

(let ((s (copy-seq "𐀀")))
  (setf (char s 0) #\A)
  (setf (char s 1) #\B)
  s)

[I can only expect these things because I haven't licensed Allegro (and
telnetting into prompt.franz.com appears to be the 8-bit version)]

LENGTH currently returns implementation specific values since the values
it returns differ between some implementations for some identical external
strings in the same encoding. If the Lisp community can't accept the
present situation then it has to agree upon an internal encoding format
for characters, which is likely to be Unicode code points.

Regards,
Adam
Pascal Bourguignon

2004-12-16, 3:57 am

Adam Warner <usenet@consulting.net.nz> writes:

> Hi Duane Rettig,
>
>
> Here is an arbitrary string encoded in UTF-8: "π€Β€" [You
> may generate it in CLISP using (string (code-char #x10000))]. It
> consists of a single code point.


No. You have to specify an external format, you cannot generate it just
with (string (code-char #x10000)). For example in my case, it gives
this error:

Break 2 [2]> (print (string (code-char #x10000))))

*** - UNIX error 84 (EILSEQ): Invalid multibyte or wide character

Oops, that was with -E utf-16...

Rather try:

(with-open-file (out "test.utf-8" :direction :output
                     :if-does-not-exist :create
                     :if-exists :supersede
                     :external-format charset:utf-8)
  (princ (string (code-char #x10000)) out))


> I expect (cl:length "π€Β€") will NOT return 1 in a 16-bit
> character Allegro yet it will return 1 in CLISP and SBCL. I expect:



Not exactly. In all encodings with >= 8 bits in clisp, this string:
"π€Β€"
has a length of 4 characters.

In encodings with < 8 bits, it contains invalid characters:

$ /usr/local/bin/clisp -ansi -norc -q -E ascii
[1]> "π€Β€"

*** - invalid byte #xF0 in CHARSET:ASCII conversion
Break 1 [2]>

Now, even when you're using 7-bit encoding as default external format
for files, terminal, etc, a string containing the unicode character of
code #x10000 is always a string of one character:

[3]> (length (string (code-char #x10000)))
1

[4]> (string (code-char #x10000))

*** - Character #\u00010000 cannot be represented in the character set CHARSET:ASCII
Break 1 [5]>


> (let ((s (copy-seq "π€Β€")))
> (setf (char s 0) #\A)
> s)


You are abusing strings, using them to store _codes_ instead of
characters. This cannot be portable Common Lisp.


All this subject is silly, it's like asking that (length "SGVsbG8K")
returns 5 because (to-base64 "Hello") returns "SGVsbG8K".


--
__Pascal Bourguignon__ http://www.informatimago.com/
Cats meow out of angst
"Thumbs! If only we had thumbs!
We could break so much!"
Duane Rettig

2004-12-16, 3:57 am

Adam Warner <usenet@consulting.net.nz> writes:

> Hi Duane Rettig,
>
>
> Here is an arbitrary string encoded in UTF-8: "π

Adam Warner

2004-12-16, 3:57 am

Hi Pascal Bourguignon,

>
> You are abusing strings, using them to store _codes_ instead of
> characters. This cannot be portable Common Lisp.
>
>
> All this subject is silly, it's like asking that (length "SGVsbG8K")
> returns 5 because (to-base64 "Hello") returns "SGVsbG8K".


Please stop this nonsense! We are discussing Unicode encoding that
consists of code points in the range 0 to #x10FFFF. This is one of the
Universal conceptions of a "character." The code #x10000 is a single code
point in Unicode. `A' is a single code point in Unicode with code 65. By
destructively modifying a string with a single code point by inserting a
new single code point you should end up with a string of a single code
point. This isn't abusing strings. This is the whole point of Unicode
strings at implementation level 2:
<http://www.unicode.org/faq/char_combmark.html#7>

Please stop spreading misinformation and claiming this subject is silly
until you have some idea of what you're talking about.

Regards,
Adam
Julian Stecklina

2004-12-16, 3:57 am

Adam Warner <usenet@consulting.net.nz> writes:

>
> Yes. With the symbol name to correspond with the textual name. This
> eliminates many printing issues.


E.g.?

> Also consider this: One Lisp is Unicode code point aware like CLISP and
> SBCL and cannot uppercase a character such as #\ß within a string. So
> when the symbol ß is interned it's given the symbol name "ß".
> Another hypothetical Lisp implementation uppercases the string "ß" as
> "SS". So the symbol "ß" is interned with the symbol name "SS".


This is a problem of not supporting Unicode and getting file formats wrong.


Regards,
--
____________________________
Julian Stecklina / _________________________/
________________/ /
\_________________/ LISP - truly beautiful
Julian Stecklina

2004-12-16, 3:57 am

Cameron MacKinnon <cmackin+nn@clearspot.net> writes:

> Julian Stecklina wrote:
>
> Of course. Who wouldn't?


I would not. I think it's a pain in C. Why should it be better in CL?

> Are there programmers who would like to aesthetically improve their
> code (by their standards, not mine) or encode more information into
> their symbols via selective capitalization? Yes. They should not be
> made to feel that their choice is in any way unnatural or discouraged.


Why can't they?

> Or are you paranoid that one day YOU WILL BE STUCK IN AN ELECTRONIC
> JUNKYARD IN A THIRD WORLD METROPOLIS, AND YOUR ONLY CONNECTION TO YOUR
> LISP IMAGE WILL BE THROUGH A TERMINAL THAT DOES NOT SUPPORT MIXED CASE?


Come on...

Regards,
--
____________________________
Julian Stecklina / _________________________/
________________/ /
\_________________/ LISP - truly beautiful
Julian Stecklina

2004-12-16, 3:57 am

Carl Shapiro <cshapiro+spam@panix.com> writes:

> Modern mode ought to have been named Franzlisp mode, reflecting the
> lineage of the case sensitive reader algorithm in Allegro Common Lisp.


Is there any information on why Franz Inc. introduced this?

Regards,
--
____________________________
Julian Stecklina / _________________________/
________________/ /
\_________________/ LISP - truly beautiful
Ed Symanzik

2004-12-16, 3:57 am

Thomas A. Russ wrote:
> Cats and dogs cohabitating and the end of the world.


The end is near:

Utah Town Ends Law Barring Cat-Dog Cohabitation
http://www.local6.com/family/3984840/detail.html
Brian Downing

2004-12-16, 3:57 am

In article <873by8maii.fsf@thalassa.informatimago.com>,
Pascal Bourguignon <spam@mouse-potato.com> wrote:
> Adam Warner <usenet@consulting.net.nz> writes:
>
> Of course. That's why you should put:
> (setf (readtable-case *readtable*) :preserve)
> in your ~/.clisprc, and use my emacs M-x upcase-lisp RET command
> to update old code.


I'm glad that works for you. Personally, I think that's impossibly
ugly, and would not write or maintain code like that.

(I'm not arguing for a case-sensitive Lisp, either. I have no problem
personally with the current standard, but I can understand why some
would.)
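
For readers following along, the effect of the four standard READTABLE-CASE modes can be checked with a short sketch. It assumes standard reader behaviour in an ordinary (case-insensitive) package, binds a fresh copy of the standard readtable so the global one is left alone, and uses a made-up function name. Note that reading interns the symbol in the current package as a side effect.

```lisp
;; Hypothetical helper: see what name the reader gives a symbol in each mode.
(defun symbol-name-as-read (text mode)
  "Read TEXT as a symbol with READTABLE-CASE set to MODE and return
the resulting symbol's name."
  (let ((*readtable* (copy-readtable nil)))   ; fresh copy of the standard readtable
    (setf (readtable-case *readtable*) mode)
    (symbol-name (read-from-string text))))
```

Under the default :upcase mode "FooBar" is read as "FOOBAR", while :preserve keeps "FooBar". With :invert a mixed-case token is left unchanged, but an all-lowercase "foobar" becomes "FOOBAR", which is why that mode lets lowercase source refer to the uppercase symbols of the COMMON-LISP package.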

-bcd
--
*** Brian Downing <bdowning at lavos dot net>
Marcin 'Qrczak' Kowalczyk

2004-12-16, 3:57 am

tar@sevak.isi.edu (Thomas A. Russ) writes:

> Distinguishing symbol names just based on case is IMNSHO a really bad
> thing to do, and programming languages SHOULD discourage it.


I disagree.

Distinguishing function names from other variable names just based on
the position in a function application expression is IMNSHO a really
bad thing to do, and programming languages SHOULD discourage it.

It would be better to distinguish them by case. Or distinguish local
variables from global variables by case (this often coincides). This
is more readable than distinguishing by position in an expression.

> If you look at the direction of "modern" file systems (Windows
> and Macintosh), you will find that they are case-preserving but
> case-insensitive.


But modern Unix file systems are case-sensitive.

> That seems to be to be a reasonable approach to the issue, since the
> case of the letters is generally just too subtle a distinguishing item.


It's more important to have unambiguous rules. Case mapping in Unicode
is non-trivial: context-dependent (not character -> character but
string -> string) and locale-dependent. Common Lisp already gets this
wrong by insisting that uppercasing a string is the same as uppercasing
each of its characters separately.
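
The constraint being criticised can be stated as a check: since STRING-UPCASE is defined in terms of CHAR-UPCASE on each element, the result always has the same length as the input. The sketch below uses invented names and assumes that, on an implementation whose CHAR-CODE-LIMIT exceeds #xDF, code #xDF denotes the sharp s; that mapping is implementation-dependent.

```lisp
;; Hypothetical helpers illustrating per-character upcasing.
(defun upcase-keeps-length-p (string)
  "True if STRING-UPCASE returns a string as long as STRING."
  (= (length string) (length (string-upcase string))))

(defun sharp-s-string ()
  "Return a one-character string for code #xDF, or NIL if this
implementation has no such character."
  (when (< #xDF char-code-limit)
    (let ((ch (code-char #xDF)))
      (and ch (string ch)))))
```

A conforming implementation therefore cannot turn a one-character string into the two-character "SS" through STRING-UPCASE; a full Unicode case mapping would have to be offered as a separate string-to-string function.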

--
__("< Marcin Kowalczyk
\__/ qrczak@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Adam Warner

2004-12-16, 3:57 am

Hi Duane Rettig,

>
> You're still dealing with an internal representation of a string
> in the lisp you're running; you haven't yet understood the difference
> between an external format and an internal string representation.


Yes I have. The string above is an encoding of a single code point with
the code #x10000. It just so happens to have a length of 4 octets in
my current locale, UTF-8, a locale that a Unicode Lisp should understand.
To the Lisp user you should be presenting a single character with
CHAR-CODE of #x10000. If you can not there is no consistency between
Unicode 3.1+ implementations of ANSI Common Lisp.

(If you did not see the string correctly there is probably an issue with
your NNTP client. Gnome has amazing Unicode support and will display the
numeric code point for a Unicode code point without a corresponding glyph.
My posts display a string with a boxed [010][000] value, which is a very
good indication that I'm encoding this discussion correctly. The Content-Type
of your reply is bizarre. It should be something like text/plain;
charset=UTF-8 (or UTF-16, etc.). Yours is multipart/mixed;
boundary="=-=-=". Nowhere does it specify the encoding.)

If (char-code (char "𐀀" 0)) is not 65536 then there will be
inconsistent results between ANSI Common Lisp Unicode string
implementations.

I am impressed by your demonstration of Allegro correctly handling the
length of a string with a supplementary character ["The fact that the
character overflows the 16-bit value doesn't change the fact that the
length is properly preserved."] Will you also please confirm that SETF
CHAR correctly handles the destructive modification of a supplementary
character with a Basic Multilingual Plane character (and vice versa)
within your internal string representation. If you do correctly handle
this, how the heck did you achieve it? As you cannot fit a supplementary
character within the space allocated to a BMP character you would have to
cons up a new string with SETF CHAR.

Regards,
Adam
Alexander Schmolck

2004-12-16, 3:57 am

"sds" <sds@gnu.org> writes:
> Alexander Schmolck wrote:
>
> yes.
> <http://www.podval.org/~sds/clisp/im...l#cs-gensym-kwd>


Hmm, as to the "limited negative impact" of (eq :KeyWord :keyword) in
case-sensitive packages: keywords also seem to be quite popular for things
where case *does* matter (like xml tags) so isn't this likely to cause
compatibility problems with acl-modern or inverted readtable-case packages?

'as
Duane Rettig

2004-12-16, 8:58 am


[As Adam noticed and I explain below, the encoding that ended up
being sent out was unspecified, and I've been informed by a colleague
that it crashed two of his X servers. To anyone else that my article
caused the same crash, I apologize. My colleague was able to display
with Windows, and I always use Redhat and gnus in xemacs, so I know
that at least those two should work. In this message I have elided
all characters that various X displays might interpret as control
characters. Again, I apologize for any crashes I have caused.]

Adam Warner <usenet@consulting.net.nz> writes:

> Hi Duane Rettig,
>
>
> Yes I have. The string above is an encoding of a single code point with
> the code #x10000.


No, it is only an encoding if your X display interprets it that way.
But what you actually sent to the lisp you were working with was
4 characters (not 4 octets) with char-code values of #xf0, #x90,
#x80, and #x80 respectively. If you inspect this string you are making
in whatever lisp you are using that you claim is nonconforming, you will
see that it consists of four _characters_ (and _not_ octets). I repeat,
you have failed to understand the difference between external and
internal representation. But before you become upset with this seemingly
argumentative response, read further...

> It just so happens to have a length of 4 octets in
> my current locale, UTF-8, a locale that a Unicode Lisp should understand.


No, Ansi Common Lisp does not define "locale", and makes no requirement
that any conforming implementation support it.

> To the Lisp user you should be presenting a single character with
> CHAR-CODE of #x10000. If you can not there is no consistency between
> Unicode 3.1+ implementations of ANSI Common Lisp.


Ansi Common Lisp makes no requirement that char-code-limit be any larger
than 96. Thus, it explicitly allows conforming lisps not to support
Unicode.

> (If you did not see the string correctly there is probably an issue with
> your NNTP client. Gnome has amazing Unicode support and will display the
> numeric code point for a Unicode code point without a corresponding glyph.
> My posts display a string with a boxed [010][000] value, which is a very good
> indication that I'm encoding this discussion correctly.


No, it is only an indication that Gnome is fooling you into thinking you
are seeing 1 character. After making the string in the "nonconforming"
implementation of your choice, switch locales, and see what the lisp
shows you. You will see four characters, possibly nearly unprintable.
Or, if your window system still insists on interpreting the four characters,
do a (dotimes (i 4) (describe (char s i))) to see that the string truly is
the four characters that you are insisting are one.

On the 16-bit Allegro CL, the characters show up as

CL-USER(1): (code-char #xf0)
#\latin_small_letter_eth
CL-USER(2): (code-char #x90)
#\%^p
CL-USER(3): (code-char #x80)
#\%null
CL-USER(4): (code-char #x80)
#\%null
CL-USER(5):

I.e. they are latin-1 characters each of whose high bit happens
to have been set.
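
Those four codes are exactly the UTF-8 octets for code point #x10000, which can be verified with a little arithmetic. The sketch below uses a hypothetical function name, works on integers only, and does no validation (surrogates and values above #x10FFFF are not rejected).

```lisp
;; Hypothetical helper: UTF-8 octets for one code point, as a list of integers.
(defun utf8-encode-code-point (cp)
  "Return the list of octets that encode code point CP in UTF-8."
  (flet ((tail (shift)
           ;; continuation octet: 10xxxxxx carrying six bits of CP
           (logior #x80 (logand (ash cp (- shift)) #x3F))))
    (cond ((< cp #x80)
           (list cp))
          ((< cp #x800)
           (list (logior #xC0 (ash cp -6)) (tail 0)))
          ((< cp #x10000)
           (list (logior #xE0 (ash cp -12)) (tail 6) (tail 0)))
          (t
           (list (logior #xF0 (ash cp -18)) (tail 12) (tail 6) (tail 0))))))
```

(utf8-encode-code-point #x10000) gives (#xF0 #x90 #x80 #x80). Read back one octet per character, as a latin-1 external format would do, that is a string of four characters; decoded as UTF-8, it is one.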

So why does it appear that you are looking at a single character? Well,
with your locale set to utf-8, any characters that are larger than
7 bits are going to be interpreted by your window system as if it
is a utf-8 encoding. This is as you might expect. However, what you've
just done was to take a string of (internal) characters from an 8-bit
lisp, and displayed them with an output device that only provides the
lower 7-bits as one-to-one mappings with the lower 7 bits of an 8-bit
latin-1 encoding, but it misinterprets the 8th bit. So what should have
been interpreted as an 8-bit code of #xf0 was misinterpreted as the
flag byte for the full 4-octet encoding of a unicode character. In
short, you thought you were dealing with an internal representation,
but your window manager was doing an extra external-format conversion
for you!

It is a tribute to utf-8 encoding style that 7-bit ascii translates
without change. Also, the bottom half of latin-1 is mapped onto
7-bit ascii, so for all of these systems, you can get away with a utf-8
locale to display them all. Once you get above 7-bits, though, you
must match locales with the intended formats.

> The Content-Type
> of your reply is bizarre. It should be something like text/plain;
> charset=UTF-8 (or UTF-16, etc.) Yours is multipart/mixed;
> boundary="=-=-=". Nowhere does it specify the encoding.


Yes. And for those of you for whom I crashed their X servers,
I apologize. I was trying to give you printed representations
of characters with no encodings, as seen by the lisps themselves.
Apparently it didn't work as I had expected.

> If (char-code (char [ ... ] 0)) is not 65536 then there will be

======================^^^^^^^ <== My editing
> inconsistent results between ANSI Common Lisp Unicode string
> implementations.


For 7 bit lisps, char-code-limit is likely to be 128. For 8-bit
lisps, that limit is likely 256. For 16-bit lisps, the limit is
likely to be 65536. For any of these lisps, if you try (code-char N)
where N >= char-code-limit, then you are writing nonportable code.

> I am impressed by your demonstration of Allegro correctly handling the
> length of a string with a supplementary character ["The fact that the
> character overflows the 16-bit value doesn't change the fact that the
> length is properly preserved."]


Thank you. It is a question of making sure the utf-8 external-format
gets 32-bit values right (i.e. it correctly rejects them because they
are larger than the char-code-limit).

> Will you also please confirm that SETF
> CHAR correctly handles the destructive modification of a supplementary
> character with a Basic Multilingual Plane character (and vice versa)
> within your internal string representation.


You're barking up the wrong tree. I will confirm this:

CL-USER(1): char-code-limit
65536
CL-USER(2): (code-char 65536)
NIL
CL-USER(3):

which is correct behavior. See
http://www.franz.com/support/docume...tr/code-cha.htm
and
http://www.franz.com/support/docume...tr/char-cod.htm


> If you do correctly handle
> this how the heck did you achieve it? As you cannot fit a supplementary
> character within the space allocated to a BMP character you would have to
> cons up a new string with SETF CHAR.


It is incorrect to assume that correct handling of external-formats too
large to fit into single-character spaces implies that characters are thus
created that are too large to fit into a character's space in a string.
The ANSI specification creates a consistent framework from which
this can be implemented correctly. _Please_ read and understand
char-code-limit.
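
For code that has to run on lisps with different character widths, one way to respect CHAR-CODE-LIMIT is to test against it before calling CODE-CHAR. This is only a sketch with an invented function name; it relies on nothing beyond CHAR-CODE-LIMIT being an exclusive upper bound and CODE-CHAR returning NIL when no character has the given code.

```lisp
;; Hypothetical helper: CODE-CHAR guarded by CHAR-CODE-LIMIT.
(defun code-char-or-nil (code)
  "Return the character with code CODE, or NIL if CODE is outside
this implementation's range or has no character assigned."
  (and (integerp code)
       (<= 0 code)
       (< code char-code-limit)   ; the limit itself is already out of range
       (code-char code)))
```

On a lisp whose CHAR-CODE-LIMIT is 65536, (code-char-or-nil #x10000) is NIL, matching the transcript above. Where the limit is larger, the same call may return a character.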

--
Duane Rettig duane@franz.com Franz Inc. http://www.franz.com/
555 12th St., Suite 1450 http://www.555citycenter.com/
Oakland, Ca. 94607 Phone: (510) 452-2000; Fax: (510) 452-0182
Svein Ove Aas

2004-12-16, 8:58 am

begin quoting Marcin 'Qrczak' Kowalczyk :

>
> But modern Unix file systems are case-sensitive.
>

There is no such thing as a modern unix filesystem in this respect; they're
still based on the original filesystem semantics, with only very small
changes.

It isn't case-sensitive because that is the right thing to do, or even a
good thing to do; it's case-sensitive because that was the *easiest* thing
to do, as with many other things in unix.
Pascal Bourguignon

2004-12-16, 8:58 am

Svein Ove Aas <svein.ove@aas.no> writes:

> begin quoting Marcin 'Qrczak' Kowalczyk :
>
> There is no such thing as a modern unix filesystem in this respect; they're
> still based on the original filesystem semantics, with only very small
> changes.
>
> It isn't case-sensitive because that is the right thing to do, or even a
> good thing to do; it's case-sensitive because that was the *easiest* thing
> to do, as with many other things in unix.


I still think case sensitivity is the best thing to do, the more so
with unicode, because of the impossibility of consistently upcasing or
downcasing letters like ß. It's better to leave the case alone.

Anyways, what's wrong in file systems is not the case sensitivity,
it's the file system itself (for the _users_). Instead of a strict
hierarchical organization, _users_ want a bag of _unnamed_ files, and
will retrieve them by icon position, by subject or project, or by color.


--
__Pascal Bourguignon__ http://www.informatimago.com/
Cats meow out of angst
"Thumbs! If only we had thumbs!
We could break so much!"
Adam Warner

2004-12-16, 8:58 am

Hi Duane Rettig,

Many thanks for the thoughtful reply.

>
> No, Ansi Common Lisp does not define "locale", and makes no requirement
> that any conforming implementation support it.
>
>
> Ansi Common Lisp makes no requirement that char-code-limit be any larger
> than 96. Thus, it explicitly allows conforming lisps not to support
> Unicode.


Of course. But we're discussing the parameters of what makes a conforming
Lisp implementation _also_ conforming with Unicode. If you're not claiming
that Allegro fully supports Unicode 3.1+ then there's no live issue
(because differing semantics are expected). However I don't think you're
claiming this. See below for what I suspect may describe your position.

[big snip]

> ======================^^^^^^^ <== My editing
>
> For 7 bit lisps, char-code-limit is likely to be 128. For 8-bit
> lisps, that limit is likely 256. For 16-bit lisps, the limit is
> likely to be 65536. For any of these lisps, if you try (code-char N)
> where N >= char-code-limit, then you are writing nonportable code.


[...]

>
> You're barking up the wrong tree. I will confirm this:
>
> CL-USER(1): char-code-limit
> 65536
> CL-USER(2): (code-char 65536)
> NIL
> CL-USER(3):
>
> which is correct behavior. See
> http://www.franz.com/support/docume...tr/code-cha.htm
> and
> http://www.franz.com/support/docume...tr/char-cod.htm


OK, you've demonstrated that "no such character [with code 65536] exists
and one cannot be created, [so] nil is returned." You therefore don't have
an ANSI defined Common Lisp _character_ interface to Unicode supplementary
code points. You can encode them in strings. You just can't represent them
as characters (and therefore one can't, for example, LOOP ACROSS a string
and expect to have a supplementary character of-type CHARACTER returned).

But this doesn't necessarily mean Allegro doesn't fully support the latest
Unicode standard because fully supporting Unicode is an extension to ANSI
Common Lisp. According to this interpretation an implementation is free to
choose any character code limit so long as internally strings can encode
Unicode code points and extensions are provided to, e.g., access those
code points.

Unfortunately this interpretation makes Unicode support vendor specific
and potentially subject to vendor lock in (I know this is furthest from
your mind and you've already raised the issue of making a "32-bit" version
of Allegro CL available, subject to customer demand).

It's unlikely to be in the interests of users to have fragmented Unicode
support when the ANSI standard defines a way to support all Unicode code
points via #\ notation, CODE-CHAR, CHAR-CODE, CHAR, CHAR-CODE-LIMIT, etc.
But so much else is already non-standard in Common Lisp that it would be
just another pity.

I hope we've reached a mutually acceptable understanding.

Regards,
Adam
Adam Warner

2004-12-16, 8:58 am

Hi Pascal Bourguignon,

> You are abusing strings, using them to store _codes_ instead of
> characters. This cannot be portable Common Lisp.
>
>
> All this subject is silly, it's like asking that (length "SGVsbG8K")
> returns 5 because (to-base64 "Hello") returns "SGVsbG8K".


I suspect my reply offended you just as much as your claims offended me. I
hope we can put these differences aside and proceed anew. I think you may
believe I'm abusing strings because I keep using the term "code point".
This is the technical term for what a computer programmer would consider a
"character". I was really working with characters above. I just use the
term code point to distinguish a character from its unit encoding and to
distinguish a character from its grapheme cluster.

<http://www.unicode.org/glossary/#C>

Character. (1) The smallest component of written language that has
semantic value; refers to the abstract meaning and/or shape, rather
than a specific shape (see also glyph), though in code tables some form
of visual representation is essential for the reader’s understanding.
(2) Synonym for abstract character. (3) The basic unit of encoding for
the Unicode character encoding. (4) The English name for the
ideographic written elements of Chinese origin. (See ideograph (2).)

Code Point. Any value in the Unicode codespace; that is, the range of
integers from 0 to 10FFFFh. (See definition D4b in Section 3.4,
Characters and Encoding.)

Using the term "Code Point" is more precise than discussing the storing of
characters even though it may give the impression I'm abusing strings to
store codes instead of characters.
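
To make the distinction between a code point and its unit encoding
concrete, here is a small sketch in portable Common Lisp. The function
names are made up for this illustration, and the functions work on
integers only, so they run the same way whatever CHAR-CODE-LIMIT is.

```lisp
;; Hypothetical helpers; CP is assumed to be a Unicode scalar value
;; (0..#x10FFFF, surrogates excluded). No validation is done.

(defun code-point-utf-8 (cp)
  "Return the list of 8-bit code units that encode CP in UTF-8."
  (cond ((< cp #x80) (list cp))
        ((< cp #x800)
         (list (logior #xC0 (ash cp -6))
               (logior #x80 (logand cp #x3F))))
        ((< cp #x10000)
         (list (logior #xE0 (ash cp -12))
               (logior #x80 (logand (ash cp -6) #x3F))
               (logior #x80 (logand cp #x3F))))
        (t
         (list (logior #xF0 (ash cp -18))
               (logior #x80 (logand (ash cp -12) #x3F))
               (logior #x80 (logand (ash cp -6) #x3F))
               (logior #x80 (logand cp #x3F))))))

(defun code-point-utf-16 (cp)
  "Return the list of 16-bit code units that encode CP in UTF-16."
  (if (< cp #x10000)
      (list cp)
      (let ((v (- cp #x10000)))
        (list (+ #xD800 (ash v -10))          ; high surrogate
              (+ #xDC00 (logand v #x3FF)))))) ; low surrogate
```

For the supplementary code point U+1D11E, (code-point-utf-8 #x1D11E)
gives four code units (#xF0 #x9D #x84 #x9E) and (code-point-utf-16 #x1D11E)
gives two (#xD834 #xDD1E). It is still one code point, and that count is
what LENGTH would report for a one-character string in a Lisp whose
characters cover the whole codespace.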

Regards,
Adam
Mario S. Mommer

2004-12-16, 4:06 pm


Duane Rettig <duane@franz.com> writes:
> [As Adam noticed and I explain below, the encoding that ended up
> being sent out was unspecified, and I've been informed by a colleague
> that it crashed two of his X servers. To anyone else that my article
> caused the same crash, I apologize. My colleague was able to display
> with Windows, and I always use Redhat and gnus in xemacs, so I know
> that at least those two should work. In this message I have elided
> all characters that various X displays might interpret as control
> characters. Again, I apologize for any crashes I have caused.]


I'd say that the crashes were caused by bad software on their end. If
you can crash X servers with a simple usenet post, then something is
rotten.

Cameron MacKinnon

2004-12-16, 8:57 pm

Thomas A. Russ wrote:

> Distinguishing symbol names just based on case is IMNSHO a really bad
> thing to do, and programming languages SHOULD discourage it. The
> problem is that it leads to various subtle bugs that are hard to
> visually pick up when one has to deal with the fact, that, for example
> the two symbols subclassOf and subClassOf are actually different.


By advocating a case-insensitive language, you force subtle writers to
interface with tin-eared listeners. It's an autistic machine interface;
one that gets the big picture, but doesn't understand the subtleties
which the writer introduced into his prose, and which human readers do
understand.

In no field outside of computers are there case-insensitive languages
which leave capitalization as a completely arbitrary yet
information-free choice of the writer. The posting to which I am
replying contains various examples of information encoded with selective
capitalization, for example.

Typographers, whose job it is to make text look good so it can quickly
and delightfully convey information, do not see the choice between
majuscules and minuscules as an arbitrary one, and they use it to great
effect. Your belief that the distinction is slight and should be ignored
by our machines belies centuries of typographic wisdom.

> One can perhaps make a weak case that all uppercase is visually
> different enough to be distinguished from mixed and lowercase, but that
> is a much more arcane restriction. If you look at the direction of
> "modern" file systems (Windows and Macintosh), you will find that they
> are case-preserving but case-insensitive. That seems to me to be a
> reasonable approach to the issue, since the case of the letters is
> generally just too subtle a distinguishing item.


File systems' designs have a lot more to do with being compatible with
other operating systems, past and present, than with user desires.
Besides which, users don't mind if their expression of ideas is somewhat
limited in the file's name, it is simply a tag which is consulted
briefly as an index. Users know that their ideas can be given full
expression inside the files, not in their names.

Case insensitivity in our computer systems has everything to do with
Baudot five-bit codes, low-resolution displays and nine-pin dot-matrix
printers. In that austere and harsh environment, ALL CAPS really was the
most legible and economical choice. Not anymore.


A few comments regarding your subclassOf and subClassOf example:

I easily notice capitalization errors when I'm reading prose. If a coder
really did have code where both subclassOf and subClassOf were defined
and in scope, well, how far are you willing to go to save him from
himself? The benefits to allowing writers more style in their works
outweigh, I think, the risks that bad writers will use the tools to
create eyesores -- bad writers will find ways of writing bad code no
matter how much you constrain them, but good writers can only surpass
mediocrity with good, case sensitive tools.

(HELP, MY LISP READER KEEPS SHOUTING AT ME) :-)
Cameron MacKinnon

2004-12-16, 8:57 pm

Adam Warner wrote:
> Here is an arbitrary string encoded in UTF-8:


Hilarious. Every time someone quoted this article, I got a different
character. Adam's original showed up as a question mark, Pascal's as a
diamond, and Duane's was, well, interesting.
Carl Shapiro

2004-12-16, 8:57 pm

jayessay <nospam@foo.com> writes:

> Carl Shapiro <cshapiro+spam@panix.com> writes:


> You've obviously thought about this more than I have. These are good
> points and sound on target. Thanks.


I did a little bit of thinking about this problem when I began to use
a half ML, half Prolog-like system which was built on top of Common
Lisp. (You could do all sorts of grody academic stuff, and then
escape to Common Lisp when real work had to be done.) In the Prolog
tradition variables and symbols were differentiated by the case of the
first character in a symbol's print-name. Also, the semicolon was no
longer the comment delimiter and was used rampantly in program code.

I had the good fortune to use a Lisp which supported the syntax
facility that I described. Syntaxes could be defined hierarchically,
so I subclassed the ANSI Common Lisp syntax, specified a hacked up
readtable, adjusted various package mappings (forward "common-lisp" to
"COMMON-LISP", use a different user package, etc.) and was good to go.
As long as I stuck "Syntax: <the name of my syntax>" at the top of
every source file, all of the development tools would automatically
adjust to the unique treatment of my program code.
Edi Weitz

2004-12-16, 8:57 pm

On 16 Dec 2004 15:28:01 -0500, Carl Shapiro <cshapiro+spam@panix.com> wrote:

> I had the good fortune to use a Lisp which supported the syntax
> facility that I described. Syntaxes could be defined
> hierarchically, so I subclassed the ANSI Common Lisp syntax,
> specified a hacked up readtable, adjusted various package mappings
> (forward "common-lisp" to "COMMON-LISP", use a different user
> package, etc.) and was good to go. As long as I stuck "Syntax: <the
> name of my syntax>" at the top of every source file, all of the
> development tools would automatically adjust to the unique treatment
> of my program code.


Which environment was that? Genera?

Thanks,
Edi.

--

Lisp is not dead, it just smells funny.

Real email: (replace (subseq "spamtrap@agharta.de" 5) "edi")
jayessay

2004-12-16, 8:57 pm

Carl Shapiro <cshapiro+spam@panix.com> writes:

> I had the good fortune to use a Lisp which supported the syntax
> facility that I described. Syntaxes could be defined hierarchically,
> so I subclassed the ANSI Common Lisp syntax, specified a hacked up
> readtable, adjusted various package mappings (forward "common-lisp" to
> "COMMON-LISP", use a different user package, etc.) and was good to go.
> As long as I stuck "Syntax: <the name of my syntax>" at the top of
> every source file, all of the development tools would automatically
> adjust to the unique treatment of my program code.



Nice. Which Lisp was this? Also, is "was" the operative word
here?...


/Jon

--
'j' - a n t h o n y at romeo/charley/november com
Carl Shapiro

2004-12-16, 8:57 pm

Edi Weitz <spamtrap@agharta.de> writes:

> On 16 Dec 2004 15:28:01 -0500, Carl Shapiro <cshapiro+spam@panix.com> wrote:
>
>
> Which environment was that? Genera?


It sho' was.

The amount of systems code needed to support the notion of a syntax is
incredibly tiny. There were only a few minor changes to the package
system interface so an optional argument could be passed along with
the usual parameters. The reader had to be taught about the triple
colon operator to make a syntax explicit when a symbol was being read.
You may occasionally see SYNTAX:::PACKAGE::SYMBOL in various obscure
places.

Package maps were the biggest win, and I wish that more people would
experiment with this rather than so-called "hierarchical packages".
The package structure adds two slots for relative name fields. The
reader would examine one of these two slots when reading a symbol
before resolving the package. A real hierarchy of names exists in
that one can look up names through other packages, recursively.

Here is an example of how this could work in practice. Say I wanted
to port code from one system which referenced a package that already
existed on my Lisp but whose exported interface was different. One
way to correct this problem would be to create a compatibility package
which had a unique name and the expected interfaces. Unfortunately,
in order to ensure that symbols are resolved to the right package, you
would have to hunt down every reference to the old package name in the
ported source code and replace it with the compatibility package name.
A relative mapping could fix this problem by instructing the reader
that whenever a symbol is referenced from my ported program's package
to the (for example) MP package, forward that reference to this other
package, the compatibility MP package.

A lot of thought and care went into the syntax facility. Its design
is quite complete and elegant.
Carl Shapiro

2004-12-16, 8:57 pm

jayessay <nospam@foo.com> writes:

> Carl Shapiro <cshapiro+spam@panix.com> writes:
>
>
>
> Nice. Which Lisp was this? Also, is "was" the operative word
> here?...


I was doing this work on Symbolics Genera. (I posted a more detailed
description of the system in another follow-up which arrived at my
NNTP server before yours did.) The syntax support required no Lisp
machine magic, just a few prescient design decisions.

If anybody wants to see this system in action, I'll make sure there is
at least one Lisp Machine and documentation set at the ILC 2005, this
June.
Pascal Bourguignon

2004-12-16, 8:57 pm

Adam Warner <usenet@consulting.net.nz> writes:

> Hi Pascal Bourguignon,
>
>
> I suspect my reply offended you just as much as your claims offended me. I
> hope we can put these differences aside and proceed anew. I think you may
> believe I'm abusing strings because I keep using the term "code point".
> This is the technical term for what a computer programmer would consider a
> "character". I was really working with characters above. I just use the
> term code point to distinguish a character from its unit encoding and to
> distinguish a character from its grapheme cluster.


No offense. I guess the problem was that my gnus setting showed your
single character unicode string as a string of four 8-bit characters
on my screen, so I assumed you wanted a 4-character string to be of
LENGTH 1.


> <http://www.unicode.org/glossary/#C>
>
> Character.
> (3) The basic unit of encoding for the Unicode character encoding.


I guess this Unicode-specific meaning is at odds with the Common Lisp
definition of a character. Common Lisp uses external bits
(:EXTERNAL-FORMAT) or small integers for encodings (CHAR-CODE/CODE-CHAR),
and a Common Lisp CHARACTER corresponds to a Unicode Code Point.


> Code Point. Any value in the Unicode codespace; that is, the range of
> integers from 0 to 10FFFFh. (See definition D4b in Section 3.4,
> Characters and Encoding.)
>
> Using the term "Code Point" is more precise than discussing the storing of
> characters even though it may give the impression I'm abusing strings to
> store codes instead of characters.
>
> Regards,
> Adam


--
__Pascal Bourguignon__ http://www.informatimago.com/
Cats meow out of angst
"Thumbs! If only we had thumbs!
We could break so much!"
Marcin 'Qrczak' Kowalczyk

2004-12-16, 8:57 pm

Cameron MacKinnon <cmackin+nn@clearspot.net> writes:

> Hilarious. Every time someone quoted this article, I got a different
> character. Adam's original showed up as a question mark, Pascal's as
> a diamond, and Duane's was, well, interesting.


I didn't see a correct character because of poor Unicode support in
Emacs Lisp.

--
__("< Marcin Kowalczyk
\__/ qrczak@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Pascal Bourguignon

2004-12-16, 8:57 pm

Cameron MacKinnon <cmackin+nn@clearspot.net> writes:

> Thomas A. Russ wrote:
>
>
> By advocating a case-insensitive language, you force subtle writers to
> interface with tin-eared listeners. It's an autistic machine
> interface; one that gets the big picture, but doesn't understand the
> subtleties which the writer introduced into his prose, and which human
> readers do understand.


I'm all in favor of case sensitivity. I've never lost a file on a
case-sensitive file system, but I did on a case-insensitive file system,
overwriting some File with some other file.

Anyway, what users want really is not case insensitivity, it's
phonetic spelling of file names.

--
__Pascal Bourguignon__ http://www.informatimago.com/
Cats meow out of angst
"Thumbs! If only we had thumbs!
We could break so much!"
Edi Weitz

2004-12-16, 8:57 pm

On 16 Dec 2004 17:39:29 -0500, Carl Shapiro <cshapiro+spam@panix.com> wrote:

> If anybody wants to see this system in action, I'll make sure there
> is at least one Lisp Machine and documentation set at the ILC 2005,
> this June.


That's a very good idea. How about a "tutorial" where a LispM wizard
demoes the system in action? I'm looking forward to that.

Cheers,
Edi.

--

Lisp is not dead, it just smells funny.

Real email: (replace (subseq "spamtrap@agharta.de" 5) "edi")
Kenny Tilton

2004-12-17, 3:57 am



Cameron MacKinnon wrote:
> Thomas A. Russ wrote:
>
>
>
> By advocating a case-insensitive language, you force subtle writers to
> interface with tin-eared listeners. It's an autistic machine interface;
> one that gets the big picture, but doesn't understand the subtleties
> which the writer introduced into his prose, and which human readers do
> understand.
>
> In no field outside of computers are there case-insensitive languages
> which leave capitalization as a completely arbitrary yet
> information-free choice of the writer. The posting to which I am
> replying contains various examples of information encoded with selective
> capitalization, for example.
>
> Typographers, whose job it is to make text look good so it can quickly
> and delightfully convey information, do not see the choice between
> majuscules and minuscules as an arbitrary one, and they use it to great
> effect. Your belief that the distinction is slight and should be ignored
> by our machines belies centuries of typographic wisdom.


I must say, that was a beautiful speech. But your correspondent offered
subclassOf vs. subClassOf, while you offered nothing. subclassOf vs.
subClassOf is silly because no two different things or functions could
ever end up with those names. which is the point:

what two different things could be named correctly with names differing
only in the case of one or more letters?

Safely assuming the null set to be your response...why introduce
gratuitous case sensitivity? Remember, some few of us are actually
trying to get some programming done.

kt


--
Cells? Cello? Celtik?: http://www.common-lisp.net/project/cells/
Why Lisp? http://alu.cliki.net/RtL%20Highlight%20Film

Carl Shapiro

2004-12-17, 3:57 am

Edi Weitz <spamtrap@agharta.de> writes:

> On 16 Dec 2004 17:39:29 -0500, Carl Shapiro <cshapiro+spam@panix.com> wrote:
>
>
> That's a very good idea. How about a "tutorial" where a LispM wizard
> demoes the system in action? I'm looking forward to that.


For what it's worth, past ILCs have always had several Lisp Machine
users in attendance. (That is to say, people who still use The
Machine every day as part of their job function, not just people with
a dump truck full of fond memories.) I know next year will be no
different.

Something past organizing chairs have consistently forgotten to do is
to secure a space with enough tables and chairs so people can
congregate around machines, show off programs, and work on code
together. This time around we will have space where people can play
on various Lisp systems. I cannot schedule a demonstration of the
Lisp Machine unless somebody would like to volunteer in advance...
However, there will be no shortage of people around who are capable of
giving informal tours.
Adam Warner

2004-12-17, 8:57 am

Hi Thomas Gagne,

> I've read that Common Lisp is case sensitive, but have also noticed that
> Allegro has a way of creating a case-sensitive image. Can the same thing be
> done with clisp (on GNU/Linux)?


If the question is: Do any of the free Common Lisp implementations provide
a build-time option to intern all symbols in the COMMON-LISP package in
lower case so that :PRESERVE is a suitable readtable option?

The answer is: No.

The next best alternative is to use the :INVERT readtable mode. This
inverts the symbol name of all lowercase or all uppercase symbols as they
are being read while leaving the symbol name of mixed-case symbols alone.
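
A short sketch of how the reader modes differ, using only standard
operators (COPY-READTABLE, READTABLE-CASE, READ-FROM-STRING). The helper
name SYMBOL-NAME-UNDER is invented for this example; note that it interns
whatever it reads into the current package.

```lisp
;; Hypothetical helper for experimenting at the REPL.
(defun symbol-name-under (mode text)
  "Read TEXT as a symbol with READTABLE-CASE set to MODE on a fresh copy
of the standard readtable, and return the resulting symbol's name."
  (let ((*readtable* (copy-readtable nil)))
    (setf (readtable-case *readtable*) mode)
    (symbol-name (read-from-string text))))

;; (symbol-name-under :upcase   "car")        => "CAR"
;; (symbol-name-under :preserve "car")        => "car"   ; not CL:CAR
;; (symbol-name-under :invert   "car")        => "CAR"   ; finds CL:CAR
;; (symbol-name-under :invert   "CAR")        => "car"
;; (symbol-name-under :invert   "subClassOf") => "subClassOf"
```

With :INVERT, source written in lowercase still resolves to the uppercase
symbols of the COMMON-LISP package, while mixed-case names keep their
spelling. That is why it is the usable case-preserving mode and :PRESERVE
is not, unless one is prepared to write CAR and DEFUN in capitals.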

This is the only suitable readtable option that maintains case information
because the ANSI Common Lisp committee decided backwards compatibility
with traditional uppercasing Lisps was most important. The decision hasn't
stood the test of time. If they'd made a better choice the pain of
transition would have been long over.

Readtable case should be deprecated. Symbols should be interned as
written in source code and implementors should not have the burden of
implementing "historical" baggage that is difficult to get 100% right
(e.g. ABCL is continuing to squash :INVERT mode read and print errors).

Note that the ANSI Common Lisp specification is considered sacrosanct and
these comments heretical.

Regards,
Adam