CLisp case sensitivity
| Thomas Gagne 2004-12-14, 3:57 am |
| I've read that Common Lisp is case sensitive, but have also noticed that
Allegro has a way of creating a case-sensitive image. Can the same thing be
done with clisp (on GNU/Linux)?
| |
| Pascal Bourguignon 2004-12-14, 3:57 am |
| Thomas Gagne <tgagne@wide-open-west.com> writes:
> I've read that Common Lisp is case sensitive, but have also noticed
> that Allegro has a way of creating a case-sensitive image. Can the
> same thing be done with clisp (on GNU/Linux)?
clisp is not Common Lisp: clisp is but one implementation of the
language named Common Lisp.
Common Lisp IS case sensitive, BUT its reader can be configured, and
its default configuration is to upcase every symbol, which means that
it's case insensitive. Other configurations allow you to preserve case,
rendering it effectively case sensitive, or even, to _invert_ case,
rendering it completely schizophrenic about case.
Read about READTABLE-CASE in CLHS.
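A minimal sketch of the reader configuration described above, using only standard Common Lisp. The helper name symbol-name-under is made up for this illustration; READTABLE-CASE is the standard accessor on a readtable object.

```lisp
;; Hypothetical helper: read TEXT as a symbol under the given
;; readtable case MODE and return the resulting symbol name.
(defun symbol-name-under (mode text)
  (with-standard-io-syntax
    ;; Work on a fresh copy of the standard readtable so the
    ;; global reader configuration is left untouched.
    (let ((*readtable* (copy-readtable nil)))
      (setf (readtable-case *readtable*) mode)
      (symbol-name (read-from-string text)))))

(symbol-name-under :upcase "Foo")    ; => "FOO"  (the default mode)
(symbol-name-under :preserve "Foo")  ; => "Foo"
(symbol-name-under :downcase "Foo")  ; => "foo"
```

With the default :upcase mode, Foo, foo and FOO all read as the same symbol; with :preserve they are three different symbols.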
--
__Pascal Bourguignon__ http://www.informatimago.com/
Cats meow out of angst
"Thumbs! If only we had thumbs!
We could break so much!"
| |
| Adam Warner 2004-12-14, 3:57 am |
| Hi Thomas Gagne,
> I've read that Common Lisp is case sensitive, but have also noticed that
> Allegro has a way of creating a case-sensitive image. Can the same thing be
> done with clisp (on GNU/Linux)?
If the question is: Do any of the free Common Lisp implementations provide
a build-time option to intern all symbols in the COMMON-LISP package in
lower case so that :PRESERVE is a suitable readtable option?
The answer is: No.
The next best alternative is to use the :INVERT readtable mode. This
inverts the symbol name of all lowercase or all uppercase symbols as they
are being read while leaving the symbol name of mixed-case symbols alone.
This is the only suitable readtable option that maintains case information
because the ANSI Common Lisp committee decided backwards compatibility
with traditional uppercasing Lisps was most important. The decision hasn't
stood the test of time. If they'd made a better choice the pain of
transition would have been long over.
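To make the :INVERT behaviour concrete, here is a small sketch in standard Common Lisp. The function name read-name-inverted is invented for the example; it is not part of any implementation.

```lisp
;; Hypothetical helper: read TEXT as a symbol with the readtable
;; case set to :INVERT and return the symbol's name.
(defun read-name-inverted (text)
  (let ((*readtable* (copy-readtable nil)))
    (setf (readtable-case *readtable*) :invert)
    (symbol-name (read-from-string text))))

(read-name-inverted "car")        ; => "CAR"  (all lowercase is inverted)
(read-name-inverted "CAR")        ; => "car"  (all uppercase is inverted)
(read-name-inverted "MixedCase")  ; => "MixedCase"  (mixed case is left alone)
```

Because lowercase source text still maps onto the uppercase names in the COMMON-LISP package, ordinary code keeps working while mixed-case names keep their case.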
Readtable case should be deprecated. Symbols should be interned as
written in source code and implementors should not have the burden of
implementing "historical" baggage that is difficult to get 100% right
(e.g. ABCL is continuing to squash :INVERT mode read and print errors).
Note that the ANSI Common Lisp specification is considered sacrosanct and
these comments heretical.
Regards,
Adam
| |
| Chris Capel 2004-12-14, 9:07 am |
| Adam Warner wrote:
> Hi Thomas Gagne,
>
>
> If the question is: Do any of the free Common Lisp implementations provide
> a build-time option to intern all symbols in the COMMON-LISP package in
> lower case so that :PRESERVE is a suitable readtable option?
>
> The answer is: No.
>
> The next best alternative is to use the :INVERT readtable mode. This
> inverts the symbol name of all lowercase or all uppercase symbols as they
> are being read while leaving the symbol name of mixed-case symbols alone.
>
> This is the only suitable readtable option that maintains case information
> because the ANSI Common Lisp committee decided backwards compatibility
> with traditional uppercasing Lisps was most important. The decision hasn't
> stood the test of time. If they'd made a better choice the pain of
> transition would have been long over.
Another example of the ramifications of this decision: inconsistent
function names. For example, the convention of using an "f" suffix on
functions using places that isn't followed everywhere (getf, push). The
convention of using a p or -p suffix with many type testing functions, but
not ATOM or NULL! I'm sure others can point out other examples.
Be thankful it isn't as bad as the C standard library. (Atoi? What's
Eh-toy? Some sort of faerie name?)
Chris Capel
| |
| Adam Warner 2004-12-14, 9:07 am |
| Hi Chris Capel,
> Another example of the ramifications of this decision: inconsistent
> function names. For example, the convention of using an "f" suffix on
> functions using places that isn't followed everywhere (getf, push). The
> convention of using a p or -p suffix with many type testing functions, but
> not ATOM or NULL! I'm sure others can point out other examples.
Look at the character predicates:
characterp
alpha-char-p
digit-char-p
graphic-char-p
standard-char-p
You can guess why we don't have `charp'. I find using ? to denote
predicates helpful. I don't start a symbol with a non-alphanumeric so the
punctuation marks can still be used as non-terminating dispatching macro
characters [just as # can still be used within symbol names, e.g.
-#x00FF and abc#|this-is-not-a-comment|#def are both symbols].
> Be thankful it isn't as bad as the C standard library. (Atoi? What's
> Eh-toy? Some sort of faerie name?)
ANSI Common Lisp's naming inconsistency no longer bothers me. At least
when annoyed I can resolve the issue using the package system.
Perhaps the most overlooked inconsistency is that LENGTH returns an
implementation-specific value for strings [largely depending upon whether
strings are implemented as sequences of octets (CMUCL, GCL, historically
SBCL), 16-bit values (ABCL) or 32-bit values (CLISP, SBCL)].
Regards,
Adam
| |
| Barry Margolin 2004-12-14, 4:08 pm |
| In article <pan.2004.12.14.10.49.34.276021@consulting.net.nz>,
Adam Warner <usenet@consulting.net.nz> wrote:
> Hi Chris Capel,
>
>
> Look at the character predicates:
>
> characterp
> alpha-char-p
> digit-char-p
> graphic-char-p
> standard-char-p
What's the problem there? The convention, which I think is even
explained explicitly in CLTL, is that "p" is appended to single words
(e.g. "character"), and "-p" is appended to multiple words (e.g.
"alpha-char").
> Perhaps the most overlooked inconsistency is that LENGTH returns an
> implementation-specific value for strings [largely depending upon whether
> strings are implemented as sequences of octets (CMUCL, GCL, historically
> SBCL), 16-bit values (ABCL) or 32-bit values (CLISP, SBCL)].
What are you talking about? It returns the number of array elements,
regardless of their size.
--
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
| |
| Bruno Haible 2004-12-14, 4:08 pm |
| > have also noticed that Allegro has a way of creating a case-sensitive image.
> Can the same thing be done with clisp (on GNU/Linux)?
Yes, it can: Just start "clisp -modern". It uses the same memory image as
normal "clisp".
And it is even better than Allegro: In CLISP you can mix old-style source
code with modern case-sensitive source code. Thus you can migrate your big
applications to the modern case-sensitive mode slowly, package by package;
you're not forced to do it all at once.
The feature is in CLISP CVS and will be part of clisp-2.34; the
implementation follows the lines presented at LSM 2004 [1].
Bruno
[1] http://www-jcsu.jesus.cam.ac.uk/~cs...ning.html#htoc4
| |
| Chris Riesbeck 2004-12-14, 4:09 pm |
| In article <barmar-B18246.08360914122004@comcast.dca.giganews.com>,
Barry Margolin <barmar@alum.mit.edu> wrote:
> In article <pan.2004.12.14.10.49.34.276021@consulting.net.nz>,
> Adam Warner <usenet@consulting.net.nz> wrote:
>
> What's the problem there? The convention, which I think is even
> explained explicitly in CLTL, is that "p" is appended to single words
> (e.g. "character"), and "-p" is appended to multiple words (e.g.
> "alpha-char").
Correct, though let us not forget our old friend
string-lessp, which is NOT an exception, but does
invoke an additional rule:
http://www.cliki.net/Naming%20conventions
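The basic rule Barry describes can be sketched as a tiny function. This is only an illustration: the name predicate-name is made up, and it covers just the single-word versus hyphenated-name rule, not the additional rule behind names like string-lessp.

```lisp
;; Hypothetical helper: build a predicate name from NAME by the
;; "p" / "-p" convention. Names that already contain a hyphen get
;; "-P"; single-word names get a bare "P".
(defun predicate-name (name)
  (concatenate 'string name (if (find #\- name) "-P" "P")))

(predicate-name "CHARACTER")   ; => "CHARACTERP"
(predicate-name "ALPHA-CHAR")  ; => "ALPHA-CHAR-P"
```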
| |
| Adam Warner 2004-12-14, 9:01 pm |
| Hi Barry Margolin,
> > You can guess why we don't have `charp'.
>
> What's the problem there? The convention, which I think is even
> explained explicitly in CLTL, is that "p" is appended to single words
> (e.g. "character"), and "-p" is appended to multiple words (e.g.
> "alpha-char").
"You can guess why we don't have `charp'": All the other character
predicates and equality tests use the contracted form of character, char.
So does Scheme. It just so happens that charp looks stupid and would be
pronounced like see-harp, kar-pee or krap (because of the convention).
Pity the language designers forgot the convention for DEFSTRUCT! How hard
would it have been to parse the symbol name for #\-? An obvious problem is
that appending #\P to a symbol name can create strange combinations that
are best avoided by not applying the convention.
>
> What are you talking about? It returns the number of array elements,
> regardless of their size.
Even with the same operating system and locale different implementations
of ANSI Common Lisp cons up strings of different lengths when the input is
identical. This means the position of corresponding characters is
implementation dependent and the length of any resulting string is
implementation dependent.
[In the table below I'm assuming ABCL has completed its Java string
support so that Unicode characters are correctly read and stored in 16-bit
strings. Input is UTF-8]
STRING   CLISP/SBCL   ABCL   CMUCL/GCL
"A"      1            1      1
"Δ"      1            1      2
"β"      1            1      3
"π"      1            2      4
Java dictates that all implementations have the same string representation
(a suboptimal one, but at least it's the same). Python has the same
issue as Common Lisp, but users can choose which way to build it:
<http://python.fyxm.net/peps/pep-0261.html>
Regards,
Adam
| |
| Julian Stecklina 2004-12-15, 3:57 am |
| Adam Warner <usenet@consulting.net.nz> writes:
> This is the only suitable readtable option that maintains case information
> because the ANSI Common Lisp committee decided backwards compatibility
> with traditional uppercasing Lisps was most important. The decision hasn't
> stood the test of time. If they'd made a better choice the pain of
> transition would have been long over.
>
> Readtable case should be deprecated. Symbols should be interned as
> written in source code and implementors should not have the burden of
> implementing "historical" baggage that is difficult to get 100% right
> (e.g. ABCL is continuing to squash :INVERT mode read and print errors).
What pain is it to have symbols be converted to upcase by default? Do
you want a case-sensitive Lisp? Do you want Read, read and READ to be
three distinct symbols?
> Note that the ANSI Common Lisp specification is considered sacrosanct and
> these comments heretical.
:)
Regards,
--
____________________________
Julian Stecklina / _________________________/
________________/ /
\_________________/ LISP - truly beautiful
| |
| Julian Stecklina 2004-12-15, 3:57 am |
| Bruno Haible <bruno@clisp.org> writes:
>
> Yes, it can: Just start "clisp -modern". It uses the same memory image as
> normal "clisp".
Ok, I do not get it. Why is case-sensitivity = modern? Looks like
clisp -old-school to me.
Regards,
--
____________________________
Julian Stecklina / _________________________/
________________/ /
\_________________/ LISP - truly beautiful
| |
| Adam Warner 2004-12-15, 3:57 am |
| Hi Julian Stecklina,
>
> What pain is it to have symbols be converted to upcase by default?
It's the reason _why_ they're uppercased which is the issue: Symbols in
the "COMMON-LISP" package are interned in uppercase.
> Do you want a case-sensitive Lisp?
Yes.
> Do you want Read, read and READ to be three distinct symbols?
Yes. With the symbol name to correspond with the textual name. This
eliminates many printing issues.
By the way I've started using uppercase text to refer to constants. It's
better than the +constant+ convention, especially when mixing constants
and arithmetic: (+ CONSTANT1 CONSTANT2) is simply more legible than
(+ +constant1+ +constant2+).
Also consider this: One Lisp is Unicode code point aware like CLISP and
SBCL and cannot uppercase a character such as #\ß within a string. So
when the symbol ß is interned it's given the symbol name "ß".
Another hypothetical Lisp implementation uppercases the string "ß" as
"SS". So the symbol "ß" is interned with the symbol name "SS".
When converting back from the internal encodings to a common external
encoding the symbols names /no longer correspond/. This is caused by the
unnecessary case conversion. If the case had been left alone the
differing Unicode capabilities of the implementations would have been
irrelevant to textual symbol identity.
Regards,
Adam
| |
| Pascal Bourguignon 2004-12-15, 3:57 am |
| Adam Warner <usenet@consulting.net.nz> writes:
> Another hypothetical Lisp implementation uppercases the string "ß" as
> "SS". So the symbol "ß" is interned with the symbol name "SS".
>
> When converting back from the internal encodings to a common external
> encoding the symbols names /no longer correspond/. This is caused by the
> unnecessary case conversion. If the case had been left alone the
> differing Unicode capabilities of the implementations would have been
> irrelevant to textual symbol identity.
Of course. That's why you should put:
(setf (readtable-case *readtable*) :preserve)
in your ~/.clisprc, and use my emacs M-x upcase-lisp RET command
to update old code.
--
__Pascal Bourguignon__ http://www.informatimago.com/
Cats meow out of angst
"Thumbs! If only we had thumbs!
We could break so much!"
| |
| Cameron MacKinnon 2004-12-15, 3:57 am |
| Julian Stecklina wrote:
> What pain is it to have symbols be converted to upcase by default? Do
> you want a case-sensitive Lisp? Do you want Read, read and READ to be
> three distinct symbols?
Of course. Who wouldn't?
Are there people out there who use the case insensitivity of symbols to
advantage in their own code? No. Are there teams of programmers working
on projects where each programmer uses his own capitalization style?
Possibly, but we should not accommodate errant ones.
Are there programmers who would like to aesthetically improve their code
(by their standards, not mine) or encode more information into their
symbols via selective capitalization? Yes. They should not be made to
feel that their choice is in any way unnatural or discouraged.
Or are you paranoid that one day YOU WILL BE STUCK IN AN ELECTRONIC
JUNKYARD IN A THIRD WORLD METROPOLIS, AND YOUR ONLY CONNECTION TO YOUR
LISP IMAGE WILL BE THROUGH A TERMINAL THAT DOES NOT SUPPORT MIXED CASE?
| |
| Barry Margolin 2004-12-15, 3:57 am |
| In article <pan.2004.12.14.22.46.30.996369@consulting.net.nz>,
Adam Warner <usenet@consulting.net.nz> wrote:
> Hi Barry Margolin,
>
>
> Even with the same operating system and locale different implementations
> of ANSI Common Lisp cons up strings of different lengths when the input is
> identical. This means the position of corresponding characters is
> implementation dependent and the length of any resulting string is
> implementation dependent.
>
> [In the table below I'm assuming ABCL has completed its Java string
> support so that Unicode characters are correctly read and stored in 16-bit
> strings. Input is UTF-8]
>
> STRING   CLISP/SBCL   ABCL   CMUCL/GCL
> "A"      1            1      1
> "Δ"      1            1      2
> "?"      1            1      3
> "?"      1            2      4
This looks like bugs to me. All those strings should have length 1,
since they just contain a single character. LENGTH is supposed to count
the number of characters, *not* the number of bytes.
Are you sure these are really strings you're creating, and not byte
arrays that you're filling in by reading a file as binary?
--
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
| |
| Adam Warner 2004-12-15, 3:57 am |
| Hi Barry Margolin,
[Snipped because you replied with a content type of text/plain]
> This looks like bugs to me. All those strings should have length 1,
> since they just contain a single character. LENGTH is supposed to count
> the number of characters, *not* the number of bytes.
It is not necessarily a bug for a string to be stored internally using a
particular encoding, whether that encoding be UTF-8, UTF-16 or UTF-32. All
three encodings can be thought of as variable width encodings because none
can store all grapheme clusters within one "character":
<http://www.unicode.org/reports/tr29...ster_Boundaries>
You told me in your previous reply that length "returns the number of
array elements, regardless of their size." As ANSI Common Lisp doesn't
define characters to be of a particular size please tell me what the
correct internal encoding should be. Then I'll tell you that every other
implementation now has to traverse each string to determine its length,
and that length will not necessarily equal the number of array elements in
the string.
And if you define the official character size as 32-bit/enough to hold a
Unicode code point then you'll have to explain why an implementation that
implements characters as grapheme clusters is non-conforming with respect
to LENGTH and CHAR.
> Are you sure these are really strings you're creating, and not byte
> arrays that you're filling in by reading a file as binary?
It all depends upon what an operating system or language defines a
character to be. A character in Java is a 16-bit unsigned value because it
is defined that way, not because it can hold all Unicode code points.
Because of this definition the length of an arbitrary string is the same
across implementations. ANSI Common Lisp doesn't define the size of a
character. Allegro for example corresponds with Java:
<http://www.franz.com/support/docume...#memory-usage-2>
Do you claim that the decision to store characters internally as 16-bit
values is non-conforming? Length and array references will differ from an
implementation with 32-bit characters. So who's right? Are they all wrong
and the only implementation capable of returning the correct answer is one
which implements strings as sequences of grapheme clusters (like the
Parrot virtual machine)?
I made a simple claim Barry: Since ANSI Common Lisp doesn't define the
size of a character the length of an arbitrary string will be
implementation specific. I am sure of this claim because no one has put
their foot down and told implementors, for better or worse, that
characters are a fixed size of n-bits or that characters must be handled
as grapheme clusters of variable size.
Regards,
Adam
| |
| Peter Seibel 2004-12-15, 3:57 am |
| Adam Warner <usenet@consulting.net.nz> writes:
> I made a simple claim Barry: Since ANSI Common Lisp doesn't define
> the size of a character the length of an arbitrary string will be
> implementation specific. I am sure of this claim because no one has
> put their foot down and told implementors, for better or worse, that
> characters are a fixed size of n-bits or that characters must be
> handled as grapheme clusters of variable size.
Why should the size of characters have anything at all to do with the
length of strings? Strings are measured in characters so whether you
use 8 bits or 8 megs to represent each character should have nothing
to do with the value LENGTH returns when passed a string. In those
implementations that return some number greater than 1 for a
"one-character" string, what do they return for (char s 1) (char s 2)
and (char s 3)?
-Peter
--
Peter Seibel peter@javamonkey.com
Lisp is the red pill. -- John Fraser, comp.lang.lisp
| |
| Carl Shapiro 2004-12-15, 8:57 am |
| Julian Stecklina <der_julian@web.de> writes:
> Bruno Haible <bruno@clisp.org> writes:
>
>
> Ok, I do not get it. Why is case-sensitivity = modern? Looks like
> clisp -old-school to me.
Modern mode ought to have been named Franzlisp mode, reflecting the
lineage of the case sensitive reader algorithm in Allegro Common Lisp.
| |
| Adam Warner 2004-12-15, 8:57 am |
| Hi Peter Seibel,
> Why should the size of characters have anything at all to do with the
> length of strings? Strings are measured in characters so whether you use
> 8 bits or 8 megs to represent each character should have nothing to do
> with the value LENGTH returns when passed a string.
It's the translation from a defined external encoding to the implementation's
internal encoding which determines the internal length of the string. The
internal length may differ between implementations because the size of a
"character" unit differs between implementations.
> In those implementations that return some number greater than 1 for a
> "one-character" string, what do they return for (char s 1) (char s 2)
> and (char s 3)?
Let's take a common example: Java/Windows/.NET/any implementation with
16-bit strings: Strings are stored in UTF-16, with code points >= 2^16
stored as high and low surrogates: <http://www.unicode.org/glossary/#UTF_16>
In such implementations a code point in the range #x10000 to #x10FFFF has
a length of two. Here are some tables setting out the translation:
<http://www.i18nguy.com/unicode/surrogatetable.html>
The 16-bit sequence #xD800 #xDC00 corresponds with the code point #x10000.
That's your (char s 0) and (char s 1) respectively. In a 32-bit character
implementation (char s 0) would be #x10000 and (char s 1) would be out of
range.
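The translation from a code point to UTF-16 code units is plain arithmetic, and can be sketched as follows. The function name utf-16-code-units is invented for this example, and the sketch assumes its argument is a valid code point outside the surrogate range.

```lisp
;; Hypothetical helper: return the list of 16-bit code units that
;; encode CODE-POINT in UTF-16. Code points below #x10000 fit in one
;; unit; higher ones are split into a high and a low surrogate.
(defun utf-16-code-units (code-point)
  (if (< code-point #x10000)
      (list code-point)
      (let ((offset (- code-point #x10000)))
        (list (+ #xD800 (ldb (byte 10 10) offset))     ; high surrogate
              (+ #xDC00 (ldb (byte 10 0) offset))))))  ; low surrogate

(utf-16-code-units #x0394)   ; => (916), one unit
(utf-16-code-units #x10000)  ; => (55296 56320), i.e. #xD800 #xDC00
```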
Regards,
Adam
| |
| Pascal Bourguignon 2004-12-15, 8:57 am |
| Adam Warner <usenet@consulting.net.nz> writes:
> Hi Peter Seibel,
>
>
> It's the translation from a defined external encoding to the implementation's
> internal encoding which determines the internal length of the string. The
> internal length may differ between implementations because the size of a
> "character" unit differs between implementations.
>
>
> Let's take a common example: Java/Windows/.NET/any implementation with
> 16-bit strings: Strings are stored in UTF-16, with code points >= 2^16
> stored as high and low surrogates: <http://www.unicode.org/glossary/#UTF_16>
>
> In such implementations a code point in the range #x10000 to #x10FFFF has
> a length of two. Here are some tables setting out the translation:
> <http://www.i18nguy.com/unicode/surrogatetable.html>
>
> The 16-bit sequence #xD800 #xDC00 corresponds with the code point #x10000.
> That's your (char s 0) and (char s 1) respectively. In a 32-bit character
> implementation (char s 0) would be #x10000 and (char s 1) would be out of
> range.
You have to distinguish characters (code points) and codes (integers).
If you want to encode full unicode (ie, the 110000(hex) code points) in
16-bit, then use (vector (unsigned-byte 16)) and assume the
consequences (ie. (length vector-of-codes) is not the number of
characters, but the number of _codes_).
Otherwise, use a unicode-enabled lisp implementation, like clisp or
sbcl, put your unicode characters in a string and get the number of
characters with (LENGTH string).
Trying to hold an encoded sequence into a lisp string is a cheap
kludge inherited from the bad C char==8-bit-integer mentality that
should not occur in lisp.
If you want to process unicode data in a lisp implementation that can
handle only iso-8859-1 characters, then you must not use the string
and character types, but only (vector (unsigned-byte 8)) for utf-8;
(vector (unsigned-byte 16)) for utf-16
and (vector (integer 0 #x10ffff)) for the full unicode.
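As a sketch of that separation between characters and codes, the following portable function encodes a single code point into a (vector (unsigned-byte 8)) of UTF-8 octets. The name utf-8-octets is made up for the example, and no validation of the input (surrogates, values above #x10FFFF) is attempted.

```lisp
;; Hypothetical helper: encode CODE-POINT as a vector of UTF-8 octets.
(defun utf-8-octets (code-point)
  (let ((octets
          (cond ((< code-point #x80)
                 (list code-point))
                ((< code-point #x800)
                 (list (logior #xC0 (ldb (byte 5 6) code-point))
                       (logior #x80 (ldb (byte 6 0) code-point))))
                ((< code-point #x10000)
                 (list (logior #xE0 (ldb (byte 4 12) code-point))
                       (logior #x80 (ldb (byte 6 6) code-point))
                       (logior #x80 (ldb (byte 6 0) code-point))))
                (t
                 (list (logior #xF0 (ldb (byte 3 18) code-point))
                       (logior #x80 (ldb (byte 6 12) code-point))
                       (logior #x80 (ldb (byte 6 6) code-point))
                       (logior #x80 (ldb (byte 6 0) code-point)))))))
    (make-array (length octets)
                :element-type '(unsigned-byte 8)
                :initial-contents octets)))

(length (utf-8-octets #x41))     ; => 1
(length (utf-8-octets #x394))    ; => 2
(length (utf-8-octets #x20AC))   ; => 3
(length (utf-8-octets #x10000))  ; => 4
```

Here LENGTH counts codes, as it should for a vector of integers; the number of characters is a separate question that only a real string (or a decoder) can answer.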
--
__Pascal Bourguignon__ http://www.informatimago.com/
Cats meow out of angst
"Thumbs! If only we had thumbs!
We could break so much!"
| |
| Bruno Haible 2004-12-15, 8:57 am |
| Carl Shapiro wrote:
> Modern mode ought to have been named Franzlisp mode, reflecting the
> lineage of the case sensitive reader algorithm in Allegro Common Lisp.
clisp is not using Allegro CL's algorithm, but a new one.
In Allegro, the case-sensitivity bit is in the readtable. In clisp, it
is per package.
Bruno
| |
| Barry Margolin 2004-12-15, 3:59 pm |
| In article <pan.2004.12.15.08.44.43.362497@consulting.net.nz>,
Adam Warner <usenet@consulting.net.nz> wrote:
> Hi Peter Seibel,
>
>
> It's the translation from a defined external encoding to the implementation's
> internal encoding which determines the internal length of the string. The
> internal length may differ between implementations because the size of a
> "character" unit differs between implementations.
But the internal representation is not supposed to be visible to the
user.
Consider the following:
(defvar *array*)
(setq *array* (make-array 1))
(setf (aref *array* 0) (expt 2 255))
(print (length *array*))
Even though the array contains a bignum whose representation takes at
least 32 bytes, this should print 1.
Lisp is a high-level language, and LENGTH is supposed to deal in the
high-level concept of characters, not bytes.
--
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
| |
| Harald Hanche-Olsen 2004-12-15, 3:59 pm |
| + Barry Margolin <barmar@alum.mit.edu>:
| Lisp is a high-level language, and LENGTH is supposed to deal in the
| high-level concept of characters, not bytes.
But if I have understood Adam Warner's point correctly, it is that
Unicode has /two different/ high-level concepts of characters, one at
a higher level than the other: The higher level concept is that of
grapheme clusters, each of which contains one or more "ordinary"
unicode characters.
As far as I understand, end users should only ever see grapheme
clusters. But Common Lisp programmers are hardly end users, and it is
not at all obvious (to me) if Lisp characters should correspond to
grapheme clusters or simple characters.
--
* Harald Hanche-Olsen <URL:http://www.math.ntnu.no/~hanche/>
- Debating gives most of us much more psychological satisfaction
than thinking does: but it deprives us of whatever chance there is
of getting closer to the truth. -- C.P. Snow
| |
| Ingvar 2004-12-15, 3:59 pm |
| Harald Hanche-Olsen <hanche@math.ntnu.no> writes:
> + Barry Margolin <barmar@alum.mit.edu>:
>
> | Lisp is a high-level language, and LENGTH is supposed to deal in the
> | high-level concept of characters, not bytes.
>
> But if I have understood Adam Warner's point correctly, it is that
> Unicode has /two different/ high-level concepts of characters, one at
> a higher level than the other: The higher level concept is that of
> grapheme clusters, each of which contains one or more "ordinary"
> unicode characters.
One is "underlying encoding" and one is "character" (well, possibly
code-point, since I'm not entirely clear on how to deal with the
combining characters). If I remember Adam's discussion on this on the
SBCL list a week or three ago, it seems as if he actually wants the
"external format" exposed (UTF-8, from what I recall) and that is "the
obviously incorrect" implementation, since that combined with SETF can
cause non-compliant UTF-8 encoding.
> As far as I understand, end users should only ever see grapheme
> clusters. But Common Lisp programmers are hardly end users, and it is
> not at all obvious (to me) if Lisp characters should correspond to
> grapheme clusters or simple characters.
I don't think exposing either grapheme clusters or combining
characters can cause an "invalid" (though it can probably cause a
nonsensical) string, whereas exposing an underlying UTF-8 definitely
can cause "invalid" strings (it's simple, just look at the
wide-character exploits used to circumvent path-checking constraints
for earlier versions of web servers). This, in my not so humble
opinion, is *not* something you want any old lisp program to do.
//Ingvar
--
(defun p(i d)(cond((not i)(terpri))((car i)(let((l(cadr i))(d(nthcdr(car i)d
)))(princ(elt(string(car d))l))(p(cddr i)d)))(t(princ #\space)(p(cdr i)d))))
(p'(76 2 1 3 1 4 1 6()0 5()16 10 0 7 0 8 0 9()2 6 0 0 12 4 23 4 1 4 8 8)(sort
(loop for x being the external-symbols in :cl collect (string x)) #'string< ))
| |
| Peter Seibel 2004-12-15, 3:59 pm |
| Adam Warner <usenet@consulting.net.nz> writes:
> Hi Peter Seibel,
>
>
> It's the translation from a defined external encoding to the
> implementation's internal encoding which determines the internal
> length of the string. The internal length may differ between
> implementations because the size of a "character" unit differs
> between implementations.
But the "internal length" of the string has nothing to do with the
value of LENGTH, or should not. The LENGTH of a string is the number
of characters it contains.
>
> Let's take a common example: Java/Windows/.NET/any implementation
> with 16-bit strings: Strings are stored in UTF-16, with code points
> >= 2^16 stored as high and low surrogates:
> <http://www.unicode.org/glossary/#UTF_16>
Yes, I'm with you so far. That just means that LENGTH has to be
implemented in a smarter way--it has to scan the array of code-points
looking for surrogate pairs in order to determine how many characters
are in the string. (That Java's String.length() method doesn't do this
will no doubt cause no end of problems down the line.)
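A "smarter" length of that kind might look like the sketch below, which works on a plain sequence of 16-bit integers rather than on any implementation's real string representation. The name count-utf-16-characters is invented for the example, and counting an unpaired surrogate as one character is an assumption of the sketch.

```lisp
;; Hypothetical helper: count code points in CODE-UNITS, a sequence
;; of 16-bit integers. A high surrogate immediately followed by a
;; low surrogate counts as one character.
(defun count-utf-16-characters (code-units)
  (let ((count 0)
        (i 0)
        (n (length code-units)))
    (loop while (< i n)
          do (let ((unit (elt code-units i)))
               (if (and (<= #xD800 unit #xDBFF)
                        (< (1+ i) n)
                        (<= #xDC00 (elt code-units (1+ i)) #xDFFF))
                   (incf i 2)   ; surrogate pair: skip both units
                   (incf i 1))  ; ordinary unit (or unpaired surrogate)
               (incf count)))
    count))

(count-utf-16-characters #(#x41 #xD800 #xDC00))  ; => 2
```

The scan is linear in the number of code units, which is the cost Adam alludes to when LENGTH can no longer be read off the array size.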
> In such implementations a code point in the range #x10000 to
> #x10FFFF has a length of two.
A representational length. But it's still one character. Or ought to
be. Java blew this one and is now suffering the consequences. Some
Common Lisps may have taken the same approach but that seems wrong.
> Here are some tables setting out the translation:
> <http://www.i18nguy.com/unicode/surrogatetable.html>
>
> The 16-bit sequence #xD800 #xDC00 corresponds with the code point
> #x10000. That's your (char s 0) and (char s 1) respectively.
But that can't be because CHAR returns a character and (assuming a
Unicode capable Lisp) there is no char with the char-code #xD800,
right? Now if you're trying to process Unicode strings in a Lisp that
doesn't actually support Unicode, I'm not entirely surprised that it
doesn't work. But you've got all kinds of problems there; you can't,
for instance, say (code-char #x10000).
-Peter
--
Peter Seibel peter@javamonkey.com
Lisp is the red pill. -- John Fraser, comp.lang.lisp
| |
| jayessay 2004-12-15, 3:59 pm |
| Cameron MacKinnon <cmackin+nn@clearspot.net> writes:
> Julian Stecklina wrote:
>
> Of course. Who wouldn't?
A lot of people. Myself included. For symbols with such string-equal
names as the one indicated, case sensitivity is a broken mode.
> Are there programmers who would like to aesthetically improve their
> code (by their standards, not mine) or encode more information into
> their symbols via selective capitalization? Yes
But this is irrelevant (as you should understand).
The mode mechanism as provided in CLISP is much more on track.
/Jon
--
'j' - a n t h o n y at romeo/charley/november com
| |
| jayessay 2004-12-15, 3:59 pm |
| Bruno Haible <bruno@clisp.org> writes:
> Carl Shapiro wrote:
>
> clisp is not using Allegro CL's algorithm, but a new one.
> In Allegro, the case-sensitivity bit is in the readtable. In clisp, it
> is per package.
This smells pretty close to the right way to do it. Or have a
readtable per package and leave it in the readtable (or maybe that is
what you meant, and that Allegro only has global tables).
/Jon
--
'j' - a n t h o n y at romeo/charley/november com
| |
| Alexander Schmolck 2004-12-15, 3:59 pm |
| Bruno Haible <bruno@clisp.org> writes:
> Carl Shapiro wrote:
>
> clisp is not using Allegro CL's algorithm, but a new one.
> In Allegro, the case-sensitivity bit is in the readtable. In clisp, it
> is per package.
Do keywords work properly?
'as
| |
| Cameron MacKinnon 2004-12-15, 3:59 pm |
| jayessay wrote:
> Cameron MacKinnon <cmackin+nn@clearspot.net> writes:
>
>
>
>
> A lot of people. Myself included. For symbols with such string-equal
> names as the one indicated, case sensitivity is a broken mode.
Yes, it is. Do you have a body of code for which this is a problem? I
don't think there are any such codebases out there, whose owners
wouldn't a) admit that the random capitalization is unintentional cruft
which ought, ideally, to be more uniform and b) be able to fix it in
minutes with a quickie script written for the purpose.
| |
| Duane Rettig 2004-12-15, 8:57 pm |
| Adam Warner <usenet@consulting.net.nz> writes:
> Hi Barry Margolin,
> You told me in your previous reply that length "returns the number of
> array elements, regardless of their size." As ANSI Common Lisp doesn't
> define characters to be of a particular size please tell me what the
> correct internal encoding should be. Then I'll tell you that every other
> implementation now has to traverse each string to determine its length,
This is exactly true. And an implementation has a choice as to whether
to implement strings with a constant-width encoding, to make LENGTH
work efficiently, or to sacrifice LENGTH efficiency in order to use a
variable-width encoding. Either way, LENGTH must work correctly, and
it is very simply defined on character count, independent of
its internal representation for strings. Note that in a string where
the characters are of varying width, CHAR, AREF, and their setf inverses
also are no longer constant-time operations.
> and that length will not necessarily equal the number of array elements in
> the string.
This cannot be true, by definition. A string is a vector of characters,
period. If the characters are implemented in a variable-width manner, then
the elements themselves are of varying width, but still have the same count.
One could get around the need for more than 8, 16, or 32 bits to represent
some characters by defining all characters to be boxed values, instead of
immediates. This wouldn't be efficient, but it would allow strings to be
implemented as lispval pointers to character boxed-objects. But then that
would make the strings have fixed-width elements, wouldn't it? :-)
> And if you define the official character size as 32-bit/enough to hold a
> Unicode code point then you'll have to explain why an implementation that
> implements characters as grapheme clusters is non-conforming with respect
> to LENGTH and CHAR.
As has been stated elsewhere in this thread, Allegro CL implements strings
internally as fixed-width arrays of characters. We provide versions of
the lisp that have 8-bit characters and 16-bit characters, but only one
in each lisp (it reduces type complexity and runtime discrimination
requirements). All other encodings of strings are treated as external-formats,
and are handled by streams. Since we use simple-streams to encode between
arrays of octets and arrays of characters, the translation from external
("native") to internal ("character", "string") is a simple matter of
using the external-format for the conversion. I've lost track of how
many external-formats we provide, but representing grapheme clusters,
etc., could simply be a matter of writing an external format for them,
if one is not already available.
The length of an external ("native") string is given in terms of
octets; it is calculable via excl:native-string-sizeof, which
returns the number of octets in the string argument (which is
not a Lisp string, but an external representation of a string;
we call it "native" because it is presumably native to the
operating system hosting our lisp). This function must indeed
traverse the string to figure out how many characters are in it.
But it is not the LENGTH function; it would be a nonconformance
to replace LENGTH with this function.
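The distinction between an octet count and LENGTH can be sketched in portable code. The function utf-8-octet-count below is a hypothetical illustration, not Allegro's excl:native-string-sizeof, and it assumes that CHAR-CODE returns Unicode code points, which ANSI does not guarantee. In a Lisp whose strings hold 16-bit code units, it would give wrong answers for surrogate pairs.

```lisp
;; Hypothetical helper; assumes CHAR-CODE yields Unicode code points.
(defun utf-8-octet-count (string)
  "Number of octets STRING would occupy when encoded as UTF-8."
  (loop for ch across string
        for code = (char-code ch)
        sum (cond ((< code #x80) 1)      ; ASCII range
                  ((< code #x800) 2)
                  ((< code #x10000) 3)
                  (t 4))))               ; supplementary planes
```

Under that assumption, a one-character string holding the character with code #xE9 has a LENGTH of 1 but an octet count of 2. LENGTH counts elements; the octet count is a property of one particular external encoding.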
>
> It all depends upon what an operating system or language defines a
> character to be. A character in Java is a 16-bit unsigned value because it
> is defined that way, not because it can hold all Unicode code points.
> Because of this definition the length of an arbitrary string is the same
> across implementations. ANSI Common Lisp doesn't define the size of a
> character. Allegro for example corresponds with Java:
> <http://www.franz.com/support/docume...#memory-usage-2>
Yes, internally, Allegro CL uses 16-bit characters for strings. An
8-bit version exists as well, and a 32-bit version could conceivably
be made available if demand were high (so far, it is not). The link
you quote is an explanation of the size increase when moving from an
8-bit representation to a 16-bit representation. It has nothing to do
with the interaction we provide for the external world.
> Do you claim that the decision to store characters internally as 16-bit
> values is non-conforming? Length and array references will differ from an
> implementation with 32-bit characters. So who's right? Are they all wrong
> and the only implementation capable of returning the correct answer is one
> which implements strings as sequences of grapheme clusters (like the
> Parrot virtual machine)?
I think you misunderstand Barry, because you are not allowing for a
split between internal representations and external formats. I didn't
see any such claim to nonconformance in his response.
> I made a simple claim Barry: Since ANSI Common Lisp doesn't define the
> size of a character the length of an arbitrary string will be
> implementation specific.
This claim is false, by definition, since length is specified in
terms of a count, and not in terms of widths in some other units
of measure.
> I am sure of this claim because no one has put
> their foot down and told implementors, for better or worse, that
> characters are a fixed size of n-bits or that characters must be handled
> as grapheme clusters of variable size.
The implementation decision is a choice, but the requirement to count
elements (i.e. characters) is not. That seals the tradeoff consideration.
--
Duane Rettig duane@franz.com Franz Inc. http://www.franz.com/
555 12th St., Suite 1450 http://www.555citycenter.com/
Oakland, Ca. 94607 Phone: (510) 452-2000; Fax: (510) 452-0182
| |
| Carl Shapiro 2004-12-15, 8:57 pm |
| jayessay <nospam@foo.com> writes:
> Bruno Haible <bruno@clisp.org> writes:
>
>
> This smells pretty close to the right way to do it. Or have a
> readtable per package and leave it in the readtable (or maybe that is
> what you meant, and that Allegro only has global tables).
The right way to handle different reader modes is to have a means to
declaratively specify a syntax. A syntax definition would include,
among other things, a readtable, symbol-lookup mechanism and package
mappings. Without at least these three features you are going to end
up with a zombie environment which is not self-consistent, and users
will have to resort to offbeat idioms to write code which works
everywhere. Syntaxes can be invoked on a per-module basis, or switched
in and out of at the top-level. (In summary, associating this
behavior with packages is far from sufficient.)
| |
| Adam Warner 2004-12-15, 8:57 pm |
| Hi Duane Rettig,
>
> This claim is false, by definition, since length is specified in terms
> of a count, and not in terms of widths in some other units of measure.
Here is an arbitrary string encoded in UTF-8: "𐀀" [You may generate it
in CLISP using (string (code-char #x10000))]. It consists of a single code
point.
I expect (cl:length "𐀀") will NOT return 1 in a 16-bit character Allegro
yet it will return 1 in CLISP and SBCL. I expect:
  (let ((s (copy-seq "𐀀")))
    (setf (char s 0) #\A)
    s)
will return additional garbage because it replaces the first half of a
surrogate character. I expect this will also be legal in Allegro but
not CLISP and SBCL:
  (let ((s (copy-seq "𐀀")))
    (setf (char s 0) #\A)
    (setf (char s 1) #\B)
    s)
[I can only expect these things because I haven't licensed Allegro (and
telnetting into prompt.franz.com appears to be the 8-bit version)]
LENGTH currently returns implementation specific values since the values
it returns differ between some implementations for some identical external
strings in the same encoding. If the Lisp community can't accept the
present situation then it has to agree upon an internal encoding format
for characters, which is likely to be Unicode code points.
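One way to see what a given implementation does, without relying on how source files or the terminal are decoded, is to ask it directly. The sketch below uses only standard operators; the function names are hypothetical, and the results are whatever the implementation at hand reports rather than anything asserted here about a particular Lisp.

```lisp
;; Hypothetical probes built on CHAR-CODE-LIMIT and CODE-CHAR.
(defun supplementary-char-p ()
  "True if U+10000 can be represented as a single character here."
  (and (> char-code-limit #x10000)        ; guard before calling CODE-CHAR
       (characterp (code-char #x10000))))

(defun probe-supplementary-char ()
  "Report LENGTH and CHAR-CODE for a one-character string at U+10000."
  (if (supplementary-char-p)
      (let ((s (string (code-char #x10000))))
        (list :length (length s)
              :code (char-code (char s 0))))
      :not-representable))
```

If the probe returns a list, the :length entry shows how many elements LENGTH sees for that character; if it returns :not-representable, the character cannot be built with CODE-CHAR at all in that Lisp.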
Regards,
Adam
| |
| Pascal Bourguignon 2004-12-16, 3:57 am |
| Adam Warner <usenet@consulting.net.nz> writes:
> Hi Duane Rettig,
>
>
> Here is an arbitrary string encoded in UTF-8: "𐀀" [You
> may generate it in CLISP using (string (code-char #x10000))]. It
> consists of a single code point.
No. You have to specify an external format, you cannot generate it just
with (string (code-char #x10000)). For example in my case, it gives
this error:
Break 2 [2]>
| |