Unicode compliant gzip?
vikascoder@gmail.com

2007-07-30, 3:55 am

Hi,
Just a weird question. I think gzip is the most widely known
compression utility on the net. I was just wondering why it isn't
Unicode compliant?

Moreover, I can't find an open source / freeware utility that is Unicode
compliant and available on multiple platforms.

Kindly suggest.

Thanks,
Vikas

John Reiser

2007-07-30, 6:56 pm

vikascoder@gmail.com wrote:
> Just a weird question. I think gzip is the most widely known
> compression utility on the net. I was just wondering why it isn't
> Unicode compliant?


Please give an example of non-compliance, and explain.

gzip works for any content, and for filenames in UTF-8.
Perhaps you or your operating system cannot handle UTF-8.

--

Thomas Pornin

2007-07-30, 6:56 pm

According to John Reiser <jreiser@BitWagon.com>:
> gzip works for any content, and for filenames in UTF-8.


Actually, gzip may copy the source file name (when compressing a named
file) into the compressed stream. RFC 1952 specifies that this name, if
present, consists of ISO 8859-1 characters (see section 2.3.1). If the
source file name contains Unicode characters, chances are that the host
operating system handles them as UTF-8(*), and gzip will gleefully copy
that UTF-8 encoded file name as is into the compressed file. The problem
is that UTF-8 encoded strings can also be valid 8859-1 strings, hence
the decompressor will not be able to unambiguously recover the source
file name.
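
To make the ambiguity concrete, here is a minimal C sketch that pulls the
stored name out of a gzip header laid out as in RFC 1952. The function name
gz_header_fname is hypothetical (it is not part of gzip or zlib). It hands
back raw bytes only, because the header records nothing about their encoding.

```c
#include <assert.h>
#include <stddef.h>

/* FLG bits from RFC 1952, section 2.3.1 (only the two needed here). */
#define GZ_FEXTRA 0x04
#define GZ_FNAME  0x08

/*
 * Locate the original file name in a gzip member header.
 * On success, returns a pointer to the zero-terminated name inside buf and
 * stores its length (terminator excluded) in *len. Returns NULL if the
 * header is malformed, truncated, or has no FNAME field.
 * Hypothetical helper for illustration only.
 */
const unsigned char *gz_header_fname(const unsigned char *buf, size_t size,
                                     size_t *len)
{
    assert(buf != NULL && len != NULL);

    /* Fixed part: ID1 ID2 CM FLG MTIME(4) XFL OS = 10 bytes. */
    if (size < 10 || buf[0] != 0x1f || buf[1] != 0x8b || buf[2] != 8)
        return NULL;

    unsigned flg = buf[3];
    size_t pos = 10;

    if (!(flg & GZ_FNAME))
        return NULL;

    if (flg & GZ_FEXTRA) {
        /* XLEN is a little-endian 16-bit count, followed by XLEN bytes. */
        if (size - pos < 2)
            return NULL;
        size_t xlen = (size_t)buf[pos] | ((size_t)buf[pos + 1] << 8);
        pos += 2;
        if (size - pos < xlen)
            return NULL;
        pos += xlen;
    }

    /* The name runs up to the first zero byte. */
    for (size_t end = pos; end < size; end++) {
        if (buf[end] == 0) {
            *len = end - pos;
            return buf + pos;
        }
    }
    return NULL; /* no terminator found: truncated header */
}
```

A name that starts with the bytes 0xC3 0xA9 comes back exactly like that. A
reader assuming UTF-8 sees one character (e with acute accent), while a
reader following RFC 1952 to the letter sees two ISO 8859-1 characters.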

This is hardly a serious problem; the source file name is mostly a hint,
and is optional. Of course, the file data is completely unaffected by
this issue. Moreover, even if the source file name is not clearly
defined as being UTF-8 or 8859-1, it still is a bunch of bytes which
gzip got from the operating system upon compression. For most purposes,
that bunch of bytes is the right thing to have, because the decompressor
may send it back to the operating system as it found it.

Still, for completeness (this is a matter of aesthetics), it could be
worth including in the gzip format some provision of marking the
optional file name (and, possibly, the optional comment string) as being
UTF-8 encoded. There are some free flags in the header (the bits 5 to 7
in the FLG byte).
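
A rough sketch of what such a marker could look like in C follows. The choice
of bit 5 and every name below are assumptions made up for illustration; no
specification defines them. One caveat: RFC 1952 requires reserved FLG bits
to be zero and tells compliant decompressors to report an error when one is
set, so existing tools would likely reject a file marked this way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GZ_FNAME          0x08 /* defined by RFC 1952 */
#define GZ_FLG_RESERVED   0xE0 /* bits 5, 6 and 7: reserved by RFC 1952 */
#define GZ_FNAME_UTF8_HYP 0x20 /* bit 5: HYPOTHETICAL, not in any spec */

/* True if the FLG byte sets none of the bits RFC 1952 reserves. */
bool gz_flg_reserved_clear(uint8_t flg)
{
    return (flg & GZ_FLG_RESERVED) == 0;
}

/* Mark an existing FNAME field as UTF-8 under the hypothetical scheme. */
uint8_t gz_flg_mark_name_utf8(uint8_t flg)
{
    assert(flg & GZ_FNAME); /* the marker makes no sense without a name */
    return (uint8_t)(flg | GZ_FNAME_UTF8_HYP);
}

/* True if a name is present and carries the hypothetical UTF-8 marker. */
bool gz_flg_name_is_utf8(uint8_t flg)
{
    return (flg & GZ_FNAME) != 0 && (flg & GZ_FNAME_UTF8_HYP) != 0;
}
```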


(I am discussing RFC 1952 here. It is entirely possible that a new gzip
file format specification has been issued since, and possibly this new
hypothetical specification could address the issue.)


--Thomas Pornin

(*) Well, it depends. UTF-8 is likely for Unix-like systems such as
Linux, because UTF-8 is compatible with ASCII, and the kernel handles
file names as a bunch of bytes which contain neither the null character
(ASCII 0x00) nor the slash character (ASCII 0x2F). I am not quite sure
about Windows systems: Windows tends to prefer UTF-16 and I do not know
what happens when a file was created with a UTF-16 API and then its name
is read back with the byte-oriented API.

vikascoder@gmail.com

2007-07-31, 3:56 am

On Jul 30, 7:03 pm, John Reiser <jrei...@BitWagon.com> wrote:
> vikasco...@gmail.com wrote:
>
> Please give an example of non-compliance, and explain.
>
> gzip works for any content, and for filenames in UTF-8.
> Perhaps you or your operating system cannot handle UTF-8.
>
> --

I deeply apologize for the incomplete information provided earlier.

Well, this is what I did.
I have a Windows XP Professional system with the Asian
Language Pack installed.

Now I configured the language options inside the Control Panel, added
the Japanese Keyboard and enabled the language bar.
Now I chose the language as 'Japanese' and the input mode as
'Hiragana'. Next I made a text file, pressed 'F2' and typed in
a Japanese name for the file, like this -> ' '.

Next I fired up the command line and after writing gzip, dragged the
file name to the terminal window and pressed Enter.
Result: Gzip says 'No such file or directory'. If I use an alternative
tool like 7-Zip (http://www.7-zip.org), it works perfectly fine.
Thus I have a bug logged against my application which says
that Gzip doesn't work on files with Unicode file names. Gzip has
no problems compressing files with UTF-8 encoded text, but it doesn't
seem to support Unicode filenames.

Thus I was searching for a utility which would and which is available
for other platforms too (and of course open source).

thanks..

Ben Rudiak-Gould

2007-08-02, 6:56 pm

vikascoder@gmail.com wrote:
> Though Gzip has
> no problems in zipping files with UTF-8 encoded text. But Unicode
> filenames it doesn't seem to support.


You can try using UTF-8 Cygwin and the Cygwin version of gzip:

http://www.okisoft.co.jp/esc/utf8-cygwin/

You will find that a lot of other Windows software can't access your files
either. The problem is that a ton of Windows software is compiled to use
byte-oriented file access APIs, which were the only ones available on
Windows 95 and its descendants. There's no way to communicate the Japanese
filenames to these programs unless you set your system code page to
Shift-JIS. Annoyingly, UTF-8 is not available as a system code page.

-- Ben

Mark Adler

2007-08-02, 6:56 pm

On Aug 2, 7:48 am, Ben Rudiak-Gould <br276delet...@cam.ac.uk> wrote:
> You can try using UTF-8 Cygwin and the Cygwin version of gzip:


What is different about this version of gzip? Is this something that
should be incorporated into the gzip source code?

Mark

Ben Rudiak-Gould

2007-08-02, 6:56 pm

Mark Adler wrote:
> On Aug 2, 7:48 am, Ben Rudiak-Gould <br276delet...@cam.ac.uk> wrote:
>
> What is different about this version of gzip? Is this something that
> should be incorporated into the gzip source code?


There must be a web page somewhere that explains this well, but I can't find
one, so here goes. Win32 functions that deal with strings come in two
versions, one taking locale-independent UTF-16 strings and one taking C
strings in a locale-dependent encoding. On NT, the C-string functions are
wrappers for the UTF-16 functions; on Win9x, most of the UTF-16 functions
aren't available at all.

Every Win32-based C standard library uses the Win32 locale encoding, so
standards-compliant C programs can't access files with names outside the
current locale, which is a constant hassle for people who have filenames in
multiple languages. UTF-8 isn't available as a locale because Win32 assumes
that every code point encodes to at most two bytes.

Cygwin C libraries dynamically link with Cygwin instead of Win32, and the
Cygwin DLLs can be replaced by versions that use UTF-8 and call the UTF-16
Win32 functions.

To make a non-Cygwin gzip that works well on Windows, you need to write a
Windows-specific frontend that uses _TCHAR instead of char and _tmain()
instead of main() and _tfopen() instead of fopen() and so forth, for every
part of the program that deals with filenames. This can then be compiled
into both UTF-16 and C-string versions depending on the setting of the
_UNICODE preprocessor macro. Since the UTF-16 version won't work at all on
Win9x, you probably have to distribute separate Unicode and non-Unicode
versions.

Another, perhaps less invasive, option would be to ignore the TCHAR
functions and instead conditionally #define main wmain, #define fopen
_wfopen, and so forth, using the UTF-16 versions directly. This is basically
what the TCHAR wrapper does, but this way you don't have to use the weird
names with t in them. You still have to wrap all your string literals, since
"foo" always gets you a C string; for a UTF-16 string you need L"foo".
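
A minimal sketch of that second option in C, for illustration. The names
path_char, PATH_LIT, path_fopen and path_file_size are hypothetical. Only
_wfopen and fopen are real library functions, and the wide branch assumes a
Microsoft-style runtime that provides _wfopen.

```c
#include <assert.h>
#include <stdio.h>

#if defined(_WIN32) && defined(_UNICODE)
#include <wchar.h>
typedef wchar_t path_char;     /* UTF-16 code units on Windows */
#define PATH_LIT(s) L##s       /* "foo" becomes L"foo" */
#define path_fopen  _wfopen
#else
typedef char path_char;        /* bytes in whatever encoding the system uses */
#define PATH_LIT(s) s
#define path_fopen  fopen
#endif

/* Return the size of a file in bytes, or -1 if it cannot be opened. */
long path_file_size(const path_char *name)
{
    assert(name != NULL);

    /* The mode string must be wrapped too, since it is also a literal. */
    FILE *f = path_fopen(name, PATH_LIT("rb"));
    if (f == NULL)
        return -1;

    long size = -1;
    if (fseek(f, 0, SEEK_END) == 0)
        size = ftell(f);
    fclose(f);
    return size;
}
```

The cost shows up at every call site: each literal needs PATH_LIT and each
filename variable has to be declared as path_char rather than char.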

-- Ben

Hans-Peter Diettrich

2007-08-03, 7:55 am

Ben Rudiak-Gould wrote:

> Annoyingly, UTF-8 is not available as a
> system code page.


Windows has no notion of a system code page. The active code page is
used for conversion between MBCS and Unicode strings in API functions,
where for display purposes, in a basically MBCS (AKA Ansi) system, the
code page must be a "real" code page, with an associated character set.
UTF-8 is not a codepage in this sense, instead it's a special Unicode
encoding, isn't it?

Filenames are stored in OEM and, on VFAT or NTFS volumes, in UTF-16.
Their handling depends on the file API configuration, which can be
retrieved with AreFileApisAnsi(). This value might be settable on a
per-process basis; please look up the details yourself. Even with a Unicode
file system API, UTF-8 has to be converted into UTF-16, otherwise the
default string handling would use the active display code page in the
conversion into Unicode.

Since MBCS filenames in archives can have any encoding, every
application must guess which encoding is appropriate for every single
archive or file - the OS cannot guess better. If the bet is UTF-8, the
application should convert to UTF-16 accordingly, use the ...W ('W'ide =
Unicode) API functions, and everything should be fine.
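
One common way to place that bet is to test whether the bytes form
well-formed UTF-8 at all. The C sketch below does this; looks_like_utf8 is a
hypothetical name. A positive answer is only a hint: pure ASCII passes under
every encoding, and a short string in another encoding can pass by accident.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if the n bytes at s are well-formed UTF-8: correct lead and
 * continuation bytes, no overlong forms, no surrogates, nothing above
 * U+10FFFF. Hypothetical helper for illustration only.
 */
bool looks_like_utf8(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);

    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        if (c < 0x80) {            /* plain ASCII byte */
            i++;
            continue;
        }

        size_t extra;              /* continuation bytes expected */
        unsigned long cp, min;     /* code point, smallest legal value */
        if ((c & 0xE0) == 0xC0)      { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else
            return false;          /* stray continuation or invalid lead */

        if (n - i - 1 < extra)
            return false;          /* sequence cut off at end of input */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char t = s[i + k];
            if ((t & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (t & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;          /* overlong, out of range, or surrogate */

        i += extra + 1;
    }
    return true;
}
```

A name stored as ISO 8859-1 or Shift-JIS with non-ASCII characters will
usually fail this test, which is what makes the guess workable in practice.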

DoDi

Hans-Peter Diettrich

2007-08-03, 7:55 am

Mark Adler wrote:

>
>
> What is different about this version of gzip? Is this something that
> should be incorporated into the gzip source code?


Cygwin is not a compiler, it's a runtime environment for program
development and execution, sitting on top of Windows. Provided that the
Cygwin runtime environment allows one to specify a system code page (dunno,
sorry), it would convert byte-oriented character strings according to
that codepage, before calling the Unicode (W) versions of the Windows
API functions.

DoDi

Ben Rudiak-Gould

2007-08-04, 6:56 pm

Hans-Peter Diettrich wrote:
> Windows has no notion of a system code page.


I meant CP_ACP, which Microsoft describes as "the system default ANSI code
page". I can call it something other than the "system code page" if you
want. Of course there's also the OEM code page, which is a complication I
ignored.

> UTF-8 is not a codepage in this sense, instead it's a special Unicode
> encoding, isn't it?


It has a code page number, CP_UTF8, but that can't be the number of the
system code page. In some sense it's not a code page.

> Their handling depends on the file API configuration, which can be
> retrieved with AreFileApisAnsi(). This value might be settable on a
> per-process basis; please look up the details yourself.


It is, but that isn't helpful in this case.

> Since MBCS filenames in archives can have any encoding, every
> application must guess which encoding is appropriate for every single
> archive or file - the OS cannot guess better.


Yes, but I think the important issue here is not the filename in the archive
(which is rarely used anyway) but the fact that gzip can't access some files
at all on Windows systems. If a file name contains characters outside the
current code page, gzip can't even refer to the file to open it for
compression. There is no Windows code page, or even ANSI/OEM pair, that
covers all the code points used in the file names on my system. (Unless you
count UTF-8, which can't be made the default.)

-- Ben

Hans-Peter Diettrich

2007-08-05, 6:56 pm

Ben Rudiak-Gould wrote:

>
>
> Yes, but I think the important issue here is not the filename in the
> archive (which is rarely used anyway) but the fact that gzip can't
> access some files at all on Windows systems.


It's a matter of (in)appropriate coding. E.g. Abbrevia can handle
filenames without problems, regardless of the archive type.

> If a file name contains
> characters outside the current code page, gzip can't even refer to the
> file to open it for compression.


Right, for everything beyond ASCII, the Unicode API functions should be
used. If gzip fails to do so, it's not properly ported.

Perhaps you have already noticed that adapting software to various target
platforms requires appropriate (conditional) platform specific code, and
the correct use of automake&friends. Windows often is poorly supported
in automake projects, if ever, depending on how much a coder likes that
platform, or is familiar with it at all.

Since automake must be available on every host system, an appropriate
environment (e.g. Cygwin, Interix...) must be used on a Windows host.
Then the result is usable to the degree of preparation *for* the target
platform, and limited by the implementation of the libc and the runtime
environment *on* the target platform. That's not different from writing
and using portable software on any other target platform. You can search
or wait for a better Windows port of gzip, or port it yourself.

DoDi

Mark Adler

2007-08-06, 3:55 am

On Aug 5, 1:40 pm, Hans-Peter Diettrich <DrDiettri...@aol.com> wrote:
> Windows often is poorly supported
> in automake projects, if ever, depending on how much a coder likes that
> platform, or is familiar with it at all.


I can see why. Holy cow. I am unfamiliar with the Windows platform
(never had one). I started to attempt to learn, through our friend
google, how to use the Windows API to transfer Unicode properly from
the command line argument to the file name. I was completely lost
within minutes. (And I am a bona fide rocket scientist.) I give up.
I again appreciate how the Unix on my Mac "just works" -- in this case
by using UTF-8.

If anyone can provide some sample code for doing this and is willing
to test it, I can try to put it into pigz (parallel gzip).

Mark

Hans-Peter Diettrich

2007-08-06, 3:55 am

Mark Adler wrote:

>
>
> I can see why. Holy cow. I am unfamiliar with the Windows platform
> (never had one). I started to attempt to learn, through our friend
> google, how to use the Windows API to transfer Unicode properly from
> the command line argument to the file name. I was completely lost
> within minutes.


That's the culture shock, which I currently feel myself, in playing
around with Linux ;-)

The Windows shell is much simpler than a typical Unix shell, but different
in a few points. Long (Unicode) filenames should be enclosed in double
quotes, which may be removed by the shell (depends). Due to the
rudimentary shell, applications may have to do some processing of the
commandline arguments themselves.

An application can use the MultiByteToWideChar and WideCharToMultiByte
functions to map multibyte character strings (including UTF-8) to
Unicode (UTF-16) strings and back.
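
For readers who have never seen the conversion itself, here is a portable C
sketch of the UTF-8 to UTF-16 direction. It is not the Windows
implementation, just an illustration of the mapping such a call performs;
utf8_to_utf16 is a hypothetical name and uint16_t stands in for the Windows
wide character type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Convert the zero-terminated UTF-8 string src to zero-terminated UTF-16
 * in dst, which has room for cap code units. Returns the number of units
 * written (terminator excluded), or -1 if the input is not well-formed
 * UTF-8 or dst is too small. Hypothetical helper for illustration only.
 */
long utf8_to_utf16(const char *src, uint16_t *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);

    const unsigned char *s = (const unsigned char *)src;
    size_t out = 0;

    while (*s != 0) {
        uint32_t cp, min;
        int extra;

        if (*s < 0x80)                { cp = *s;        extra = 0; min = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; min = 0x80; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; min = 0x800; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; min = 0x10000; }
        else
            return -1;
        s++;

        for (; extra > 0; extra--, s++) {
            if ((*s & 0xC0) != 0x80)   /* also stops at the terminator */
                return -1;
            cp = (cp << 6) | (uint32_t)(*s & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;

        if (cp < 0x10000) {
            if (out + 1 >= cap)        /* keep one slot for the terminator */
                return -1;
            dst[out++] = (uint16_t)cp;
        } else {
            /* Above the BMP: split into a high and a low surrogate. */
            if (out + 2 >= cap)
                return -1;
            cp -= 0x10000;
            dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }

    if (out >= cap)
        return -1;
    dst[out] = 0;
    return (long)out;
}
```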

And, of course, the use of the Ansi/Wide string API functions must be
selected in the source code. The Windows headers usually append an "A" or
"W" to the API function names for this purpose, selected globally by a
#define UNICODE. I don't know the corresponding behaviour of gcc. The
descriptions only list the base names of the functions, and one should know
that every function with LPTSTR parameters (in contrast to LPSTR) comes in
the A/W flavours.

> (And I am a bona fide rocket scientist.) I give up.


Is it you, who invented the Adler checksum?

I hope that your next try will be less frustrating, the keywords in the
preceding paragraphs may give you better starting points.


> I again appreciate how the Unix on my Mac "just works" -- in this case
> by using UTF-8.
>
> If anyone can provide some sample code for doing this and is willing
> to test it, I can try to put it into pigz (parallel gzip).


I'm not familiar with C, gcc, or autobloat, and can contribute only
hints about what the resulting code should look like. If you
understand Pascal, the Abbrevia code contains the handling of filenames,
extracted from or written into compressed files.

DoDi

Mark Adler

2007-08-06, 6:56 pm

On Aug 6, 12:57 am, Hans-Peter Diettrich <DrDiettri...@aol.com> wrote:
> Is it you, who invented the Adler checksum?


That would be me. More of an extrapolation really than an invention.

> I hope that your next try will be less frustrating, the keywords in the
> preceding paragraphs may give you better starting points.


Thanks.

Mark

Ross Ridge

2007-08-07, 6:56 pm

Mark Adler <madler@alumni.caltech.edu> wrote:
>I can see why. Holy cow. I am unfamiliar with the Windows platform
>(never had one). I started to attempt to learn, through our friend
>google, how to use the Windows API to transfer Unicode properly from
>the command line argument to the file name. I was completely lost
>within minutes. (And I am a bona fide rocket scientist.) I give up.


Well, a lot depends on the context. Just using the Windows API, and none
of the Standard C or Unix-like functions, all you would do is define
the UNICODE macro and all the strings you pass to and accept from the
API will be in UTF-16. If you don't want to go native, you can use
functions like wmain() and _wfopen() which use UTF-16 strings.

>I again appreciate how the Unix on my Mac "just works" -- in this case
>by using UTF-8.


Apple doesn't care about backwards compatibility, so they were able to
ignore the fact their users were using other character sets. Microsoft
and the other Unix and Unix-like vendors don't have that option.

Ross Ridge

--
l/ // Ross Ridge -- The Great HTMU
[oo][oo] rridge@csclub.uwaterloo.ca
-()-/()/ http://www.csclub.uwaterloo.ca/~rridge/
db //

Mark Adler

2007-08-09, 7:56 am

On Aug 7, 11:42 am, Ross Ridge <rri...@caffeine.csclub.uwaterloo.ca>
wrote:
> Just using the Windows API, and none of the Standard C or Unix-like functions,


Well, what I want is to use the standard C command line and Unix file
functions, passing the unicode file name properly from the command
line to (for example) the fopen() function.

>
> Apple doesn't care about backwards compatibility, so they were able to
> ignore the fact their users were using other character sets. Microsoft
> and the other Unix and Unix-like vendors don't have that option.


Umm, ok, I'm not sure what that all means. As far as I can tell, an
enormous number of international character sets are supported on the
Mac for both input and display in all applications. In fact, it's
been that way for more than 20 years. (The Mac's big market for a
long time was desktop publishing, all over the world.) The Mac APIs
all consistently use unicode for character strings.

Anyway, what I was talking about is (and I just did this): in Mac OS
X, I can create a file whose name is in unicode, Cyrillic in this
case. It displays the cyrillic characters quite nicely in the
finder. I can then run gzip from the command line with that file
name, and gzip compresses it. There were no modifications whatsoever
to gzip to be able to do this. gzip just takes the usual zero-
terminated string from the command line, processes it, and gives it to
the file functions. No fuss, no muss.

That's what I mean by "it just works".

Apparently this does not work at all on Windows. Windows appears to
require special functions to get the unicode from the command line,
and special file functions to accept the unicode file name, and
special intermediate data types. Oh, and special processing if
there's a yen character since that has something to do with the path
separator. And probably other stuff that I did not discover in my few
minutes of research.

So to port gzip properly to Windows to support unicode file names, as
best as I can tell, there will need to be significant modifications to
the code. I would love to be proven wrong.
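
For the command-line half of the problem, a sketch along the following lines
might keep the changes local. It assumes the Win32 calls GetCommandLineW and
CommandLineToArgvW (declared in shellapi.h, implemented in shell32) and
WideCharToMultiByte; the names utf8_argv and free_utf8_argv are hypothetical.
The Windows branch is untested here, so treat it as a starting point only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>   /* CommandLineToArgvW; may need linking with shell32 */
#endif

/* Free an argument vector returned by utf8_argv(). */
void free_utf8_argv(char **args)
{
    if (args == NULL)
        return;
    for (char **p = args; *p != NULL; p++)
        free(*p);
    free(args);
}

/*
 * Return a NULL-terminated, malloc'ed copy of the arguments as UTF-8.
 * On Windows the copy is rebuilt from the UTF-16 command line and *argc is
 * updated; elsewhere the bytes in argv are duplicated unchanged.
 * Returns NULL on failure.
 */
char **utf8_argv(int *argc, char **argv)
{
    assert(argc != NULL && argv != NULL);
#ifdef _WIN32
    (void)argv;   /* the byte-oriented argv may already have lost characters */
    int n = 0;
    wchar_t **wargv = CommandLineToArgvW(GetCommandLineW(), &n);
    if (wargv == NULL)
        return NULL;

    char **out = calloc((size_t)n + 1, sizeof *out);
    for (int i = 0; out != NULL && i < n; i++) {
        /* First call asks for the size in bytes, terminator included. */
        int len = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                      NULL, 0, NULL, NULL);
        out[i] = len > 0 ? malloc((size_t)len) : NULL;
        if (out[i] == NULL ||
            WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                out[i], len, NULL, NULL) <= 0) {
            free_utf8_argv(out);
            out = NULL;
        }
    }
    LocalFree(wargv);
    if (out != NULL)
        *argc = n;
    return out;
#else
    char **out = calloc((size_t)*argc + 1, sizeof *out);
    for (int i = 0; out != NULL && i < *argc; i++) {
        size_t len = strlen(argv[i]) + 1;
        out[i] = malloc(len);
        if (out[i] == NULL) {
            free_utf8_argv(out);
            out = NULL;
        } else {
            memcpy(out[i], argv[i], len);
        }
    }
    return out;
#endif
}
```

The idea would be to call utf8_argv once at the top of main() and use its
result in place of argv, leaving the rest of the program working on
zero-terminated UTF-8 strings as it does on Unix.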

Mark

Ben Rudiak-Gould

2007-08-09, 6:56 pm

Mark Adler wrote:
> So to port gzip properly to Windows to support unicode file names, as
> best as I can tell, there will need to be significant modifications to
> the code. I would love to be proven wrong.


I think it's unavoidable. Yes, it's stupid. Microsoft was a Unicode early
adopter, back when Unicode was a 16-bit character set; they thought everyone
would migrate to 16-bit characters, and they were just ahead of the curve.
Instead, much of the world went to UTF-8, creating a portability nightmare.

The cleanest way to do it is to use impedance-matching wrappers for all
functions that communicate filenames to or from the system. On Unix, they
just call the usual C library functions with their arguments. On Windows,
they do UTF-8 conversion. This way you don't have to rewrite everything to
use wide characters. I'll volunteer to write the Windows wrappers (and even
the Unix wrappers) if you want to go this route.
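
As a rough illustration of one such wrapper in C: fopen_utf8 and utf8_to_wide
are hypothetical names, and the Windows branch relies on MultiByteToWideChar
and _wfopen. It is a sketch under those assumptions, not tested on Windows.

```c
#include <assert.h>
#include <stdio.h>

#ifdef _WIN32
#include <stdlib.h>
#include <windows.h>

/* Convert UTF-8 to a freshly malloc'ed UTF-16 string, or NULL on failure. */
static wchar_t *utf8_to_wide(const char *s)
{
    /* With a length of -1 the returned count includes the terminator. */
    int n = MultiByteToWideChar(CP_UTF8, 0, s, -1, NULL, 0);
    if (n <= 0)
        return NULL;
    wchar_t *w = malloc((size_t)n * sizeof *w);
    if (w != NULL && MultiByteToWideChar(CP_UTF8, 0, s, -1, w, n) <= 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/*
 * Open a file whose name is given in UTF-8.
 * On Windows the name is converted to UTF-16 and passed to _wfopen();
 * elsewhere the bytes go straight to fopen().
 */
FILE *fopen_utf8(const char *path, const char *mode)
{
    assert(path != NULL && mode != NULL);
#ifdef _WIN32
    wchar_t *wpath = utf8_to_wide(path);
    wchar_t *wmode = utf8_to_wide(mode);
    FILE *f = NULL;
    if (wpath != NULL && wmode != NULL)
        f = _wfopen(wpath, wmode);
    free(wpath);
    free(wmode);
    return f;
#else
    return fopen(path, mode);
#endif
}
```

The same pattern would have to be repeated for every other call that takes
or returns a filename (stat, unlink, rename, directory listing and so on).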

It would be really, really nice to have a version of the MinGW standard
library that used UTF-8.

-- Ben

Ross Ridge

2007-08-09, 6:56 pm

Ross Ridge wrote:
> Just using the Windows API, and none of the Standard C or Unix-like
>functions,


Mark Adler <madler@alumni.caltech.edu> wrote:
>Well, what I want is to use the standard C command line and Unix file
>functions, passing the unicode file name properly from the command
>line to (for example) the fopen() function.


If that were possible, then your program would "just work" already,
but that's obviously not the case.

>
>Umm, ok, I'm not sure what that all means. As far as I can tell, an
>enormous number of international character sets are supported on the
>Mac for both input and display in all applications.


Only the Unicode character set is supported, and the APIs only accept
the UTF-8 encoding of Unicode.

> In fact, it's been that way for more than 20 years. (The Mac's big
>market for a long time was desktop publishing, all over the world.)
>The Mac APIs all consistently use unicode for character strings.

The APIs consistently use UTF-8, now. In the past they used different
character sets and encodings.

>Anyway, what I was talking about is (and I just did this): in Mac OS
>X, I can create a file whose name is in unicode, Cyrillic in this
>case. It displays the cyrillic characters quite nicely in the
>finder. I can then run gzip from the command line with that file
>name, and gzip compresses it. There were no modifications whatsoever
>to gzip to be able to do this. gzip just takes the usual zero-
>terminated string from the command line, processes it, and gives it to
>the file functions. No fuss, no muss.


Take a zip file with a Cyrillic filename that was created on an older version
of Mac OS whose APIs used an encoding other than UTF-8 and unzip it using
a Mac OS X version of zip. You'll either get an encoding error or a
garbage filename because the file APIs now require UTF-8.

>Apparently this does not work at all on Windows. Windows appears to
>require special functions to get the unicode from the command line,
>and special file functions to accept the unicode file name, and
>special intermediate data types.


*shrug* That's the cost of backwards compatibility. I dunno, but I
kinda like the fact that the 14 year old version of gzip installed on
my computer still works.

>Oh, and special processing if there's a yen character since that has
>something to do with the path separator.


It doesn't affect your application. All it means is that the backslash
character looks different when viewed using Japanese character sets
and encodings.

Ross Ridge

Ross Ridge

2007-08-09, 6:56 pm

Ben Rudiak-Gould <br276deleteme@cam.ac.uk> wrote:
>It would be really, really nice to have a version of the MinGW standard
>library that used UTF-8.


Never going to happen. There is no MinGW standard library, it just
uses Microsoft's runtime library, and MinGW is not an emulation layer.
Try cygwin, apparently someone made a UTF-8 version of that.

Ross Ridge

Copyright 2008 codecomments.com