Tcl archive, February 2007: glob with accents
| moumou@igbmc.u-strasbg.fr 2007-02-27, 4:22 am |
| Hello world!
We encounter strange behaviour from "glob" when dealing with accents.
On linux box 1:
encoding system --> UTF-8
On linux box 2:
encoding system --> ISO8859-1
When issuing a glob command to retrieve a directory called
"rétinite" (e with acute accent, French), set dir [glob -type d *]
gives:
linux box 1 --> rÃ©tinite (encoded in utf-8)
linux box 2 --> rétinite (correct word)
On linux box 1, glob $dir finds nothing,
on linux box 2, glob $dir finds the files in $dir.
It seems that encoding convertto ISO8859-1 doesn't work.
Any clue?
Thanks in advance !!
Raymond Ripp
IGBMC Strasbourg
| |
| suchenwi 2007-02-27, 4:22 am |
| I see no problems on Win XP, Tcl 8.4.1:
% file mkdir rétinite
% glob r*
rétinite
% encoding system
cp1252
% info pa
8.4.1
| |
| Andreas Leitgeb 2007-02-27, 7:14 pm |
| moumou@igbmc.u-strasbg.fr <moumou@igbmc.u-strasbg.fr> wrote:
> We encounter strange behaviour from "glob" when dealing with accents.
> On linux box 1:
> encoding system --> UTF-8
> On linux box 2:
> encoding system --> ISO8859-1
I know the problem (also on linux), but I only see it when I'm using
a file that was created (or renamed) while the iso8859-1
encoding was in effect, and is then used in Tcl while utf-8 is set.
So I guess that on box 1 the directory is actually named
in ISO8859-1 encoding, and is thus inconsistent with the current
locale.
Tcl recognizes the actually invalid utf-8 encoding in the
result of [glob] and converts it to valid utf-8 encoding, and
when it then tries to access the file, it will ask the OS for
the utf-8 named item, and the system answers: ENOENT.
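
To make the mismatch concrete, here is a minimal sketch that only looks at the octets involved. It does not touch the filesystem, and the variable names are just for illustration. It shows that the same name is two different octet strings depending on the encoding, so a lookup with one form cannot match a directory entry stored in the other form:

```tcl
# The same directory name, encoded two ways.
set name "r\u00e9tinite"   ;# 8 characters, e-acute is U+00E9

# Octets as an ISO8859-1 system would store the name on disk.
set latin1Bytes [encoding convertto iso8859-1 $name]
# Octets that Tcl sends to the OS when the system encoding is utf-8.
set utf8Bytes [encoding convertto utf-8 $name]

# Render both octet strings as hex so they can be compared by eye.
binary scan $latin1Bytes H* latin1Hex
binary scan $utf8Bytes H* utf8Hex

puts "iso8859-1: $latin1Hex"
puts "utf-8:     $utf8Hex"
puts "same octets? [expr {$latin1Hex eq $utf8Hex}]"
```

The e-acute is the single octet e9 in ISO8859-1 but the pair c3 a9 in utf-8, which is why the OS reports ENOENT for the re-encoded name.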
To avoid this situation, either
- use an encoding for file naming that is consistent with the
  system's locale, or
- set Tcl's view of the system encoding to "binary"(*)
  (using: [encoding system binary]), which will make
  it take any octet-string it receives from the OS
  without any interpretation/conversion of encoding.
This means that it may still fail to display
the wrongly-coded character properly (e.g. in a tk widget
or with puts), but at least it can call back the OS with the
same octet-string that it got from it before, and so can
successfully access previously [glob]ed items.
Just recently I started writing a script that would convert
ISO-8859-1 named directory trees recursively to the equivalent utf-8.
I also have two machines (again, one iso8859, the other utf-8)
with a tree that I synchronize between them using unison. I'm going to
rename both sides to utf-8 (leaving the iso8859 machine
with hardly legible filenames) and sooner or later upgrade the
iso8859 box to use utf-8 as well.
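
The script itself is not shown in this thread. As a rough sketch of only the name-transcoding step such a script would need, the hypothetical helper latin1NameToUtf8 below (name and structure are assumptions, not the author's code) reinterprets a name's octets as ISO-8859-1 and re-encodes them as utf-8. The directory walk and the actual [file rename] are deliberately left out, since how Tcl hands names to the OS depends on the system encoding in effect:

```tcl
# Hypothetical helper: treat rawName as ISO-8859-1 octets and
# return the utf-8 octets for the same characters.
proc latin1NameToUtf8 {rawName} {
    set chars [encoding convertfrom iso8859-1 $rawName]
    return [encoding convertto utf-8 $chars]
}

# Octets of the directory name as stored on the ISO-8859-1 side.
set oldName "r\xe9tinite"
set newName [latin1NameToUtf8 $oldName]

# Show the result as hex; the single octet e9 becomes the pair c3 a9.
binary scan $newName H* newHex
puts $newHex
```

Pure-ASCII names pass through unchanged, so such a helper could be applied to every entry in a tree without first checking whether it contains accented characters.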
(*): The difference between iso8859-1 and binary is that
converting to the former may cause some chars to be replaced
by a "?" question mark, while this will not happen with the latter.