Author glob with accents
moumou@igbmc.u-strasbg.fr

2007-02-27, 4:22 am

Hello world !

We encounter strange behaviour of "glob" when dealing with accents.

On linux box 1:
encoding system --> UTF-8
On linux box 2:
encoding system --> ISO8859-1

When issuing a glob command to retrieve a directory called
"rétinite" (e with acute accent, French), set dir [glob -type d *]
gives:
linux box 1 ---> rÃ©tinite (encoded in utf-8)
linux box 2 ---> rétinite (correct word)

On linux box 1, glob $dir finds nothing,
on linux box 2, glob $dir finds the files in $dir.

It seems that encoding convertto ISO8859-1 doesn't work.

Any clue ?

Thanks in advance !!

Raymond Ripp
IGBMC Strasbourg

suchenwi

2007-02-27, 4:22 am

I see no problems on Win XP, Tcl 8.4.1:

% file mkdir rétinite
% glob r*
rétinite
% encoding system
cp1252
% info pa
8.4.1

Andreas Leitgeb

2007-02-27, 7:14 pm

moumou@igbmc.u-strasbg.fr <moumou@igbmc.u-strasbg.fr> wrote:
> We encounter strange behaviour of "glob" when dealing with accents.
> On linux box 1:
> encoding system --> UTF-8
> On linux box 2:
> encoding system --> ISO8859-1


I know the problem (also on linux), but only when I'm using
a file that has been created (or renamed) while in iso8859-1
encoding, and then is used in tcl while utf-8 is set.

So, I guess, that on box 1, the directory is actually named
in ISO8859-1 encoding, and thus inconsistent with current
locale.

Tcl recognizes the actually invalid utf-8 encoding in the
result of [glob] and converts it to valid utf-8 encoding, and
when it then tries to access the file, it will ask the OS for
the utf-8 named item, and the system answers: ENOENT.
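
The round trip described above can be reproduced without touching the file system. The following is only a sketch, and it assumes Tcl 8.x behaviour, where an invalid UTF-8 byte is read as the Latin-1 character with the same code; later releases may reject the invalid byte with an error instead. The variable names are made up for illustration.

```tcl
# Assumption: Tcl 8.x semantics, where an invalid UTF-8 byte is
# read as the Latin-1 character with the same code.

# Name as stored on disk in ISO8859-1: "r", byte 0xE9, "tinite".
set onDisk "r\xE9tinite"

# What Tcl makes of those bytes when the system encoding is utf-8.
set decoded [encoding convertfrom utf-8 $onDisk]

# What Tcl hands back to the OS when the name is used again.
set sentBack [encoding convertto utf-8 $decoded]

# Compare the two byte sequences as hex strings.
binary scan $onDisk   H* hexOnDisk
binary scan $sentBack H* hexSentBack
puts "on disk:   $hexOnDisk"
puts "sent back: $hexSentBack"
puts "same name: [string equal $hexOnDisk $hexSentBack]"
```

If the two hex strings differ, the OS is being asked for a name that does not exist on disk, which would explain the ENOENT.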

To avoid this situation, either

 - use an encoding for file naming that is consistent with the
   system's locale, or
 - set Tcl's view of the system encoding to "binary"(*)
   (using: [encoding system binary]), which will make
   it take any octet-string it receives from the OS
   without any interpretation/conversion of encoding.
   This means that it will probably still fail to present
   the wrongly-coded character properly (e.g. in a tk widget
   or puts), but at least it can call back the OS with the
   same octet-string that it got from it before, and so can
   successfully access previously [glob]ed items.
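
A minimal sketch of the second option, assuming Tcl 8.x as used in this thread. The helper name withBinaryFilenames is hypothetical; it only wraps [encoding system binary] so that the previous system encoding is restored afterwards.

```tcl
# Hypothetical helper: evaluate a script with the system encoding set to
# "binary", then restore the previous system encoding even if the script
# raises an error.
proc withBinaryFilenames {script} {
    set saved [encoding system]
    encoding system binary
    set code [catch {uplevel 1 $script} result]
    encoding system $saved
    return -code $code $result
}

# Usage: names returned by glob are passed back to the OS unchanged,
# so a second glob inside each directory can find its entries.
withBinaryFilenames {
    foreach dir [glob -nocomplain -type d *] {
        set entries [glob -nocomplain -directory $dir *]
        puts "[llength $entries] entries found"
    }
}
```

The directory name itself is deliberately not printed here, since under "binary" it may not display properly.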

Just recently I started writing a script, that would convert
ISO-8859-1 named directory-trees recursively to equivalent utf-8.
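
Such a conversion could look roughly like the sketch below. It is not the script mentioned above; both proc names are hypothetical. It assumes it runs under [encoding system binary], so that glob returns raw octets, and that every non-ASCII name in the tree really is ISO-8859-1 (a name that is already UTF-8 would be encoded twice). It only builds a list of renames and does not touch any file.

```tcl
# Hypothetical helper: re-encode one raw ISO-8859-1 file name as UTF-8 bytes.
proc latin1NameToUtf8 {rawName} {
    return [encoding convertto utf-8 [encoding convertfrom iso8859-1 $rawName]]
}

# Hypothetical helper: return a list of {old new} path pairs below $dir.
# Entries inside a directory are listed before the directory itself, so
# applying the renames in list order never invalidates a later path.
proc planRenames {dir} {
    set plan {}
    foreach path [glob -nocomplain -directory $dir *] {
        # Recurse into real directories only, not into symbolic links.
        if {[file isdirectory $path] && [file type $path] ne "link"} {
            foreach pair [planRenames $path] { lappend plan $pair }
        }
        set old [file tail $path]
        set new [latin1NameToUtf8 $old]
        if {$new ne $old} {
            lappend plan [list $path [file join $dir $new]]
        }
    }
    return $plan
}

# Usage (dry run), assuming the binary system encoding is active:
#   encoding system binary
#   foreach pair [planRenames .] { puts [llength $pair] }
```

Each pair could then be passed to [file rename] once the plan has been checked.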

I also have two machines, (also, one iso8859, the other utf-8)
some tree of which I synchronize with unison. I'm going to
rename both sides to utf-8, (leaving the iso8859-machine
with hardly legible filenames) and sooner or later upgrade the
iso8859-box to use utf-8 as well.

(*): The difference between iso8859-1 and binary is, that while
converting to the former may cause some chars to be replaced
by a "?" question mark, this will not happen with the latter.
