newbie: awk not working for multi-byte charsets?
| Zhang Weiwu 2005-01-08, 3:55 am |
| Hello? I just wanna know:
1) Does awk (I am using gawk) support string operations on multi-byte
charsets? Say I have a string containing 4 ideographs, but length(string)
currently gives 12. According to unicode.org, gawk is
Unicode-compatible, so how can I turn on multi-byte character recognition?
2) Why does this look like such a common question (all Asian users could face it),
yet it is so difficult to google an answer? I used keywords like "awk
Unicode" "awk utf" "gawk Unicode" "gawk utf" "awk multibyte string" "awk
length utf-8" ... All the keyword combinations lead to unrelated
information. I am pretty new to awk, so perhaps I am looking for the
answer in the wrong way? The only related information is a
professional-looking message posted by Olaf Dabrunz, which in the end received no reply:
http://mail.nl.linux.org/linux-utf8...6/msg00005.html
| |
| Zhang Weiwu 2005-01-08, 3:55 am |
| Zhang Weiwu wrote:
> Hello? I just wanna know:
>
> 1) does awk (I am using gawk) support string operation to multi-byte
> charsets? Say I have a string contain 4 ideographs, but length(string)
> gives 12 currently. According to unicode.org, gawk is
> unicode-compatible, so how can I turn on multi-byte character recognition?
I think I forgot to mention that I am on LANG=zh_CN.UTF-8
And README_d/README.multibyte does not give any information on my topic (weird).
| |
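The count of 12 reported above is what one would expect if length() is counting bytes rather than characters: in UTF-8, each ideograph in the Basic Multilingual Plane takes three bytes, so four of them take twelve. A minimal Python sketch of that arithmetic follows. The sample string is an assumption chosen for illustration (it is not from the thread), and the snippet only shows the byte/character distinction, not how gawk itself behaves or is configured.

```python
# Four CJK ideographs, chosen only for illustration (hypothetical sample,
# not the string used by the original poster).
text = "中文字符"

# Encode the string the way a UTF-8 locale would store it on disk.
utf8_bytes = text.encode("utf-8")

char_count = len(text)        # number of Unicode code points
byte_count = len(utf8_bytes)  # number of bytes in the UTF-8 encoding

print(char_count, byte_count)
```

A tool that is unaware of the encoding sees only the second number; a tool that decodes the input according to the locale can report the first.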
| Jürgen Kahrs 2005-01-08, 3:55 am |
| Zhang Weiwu wrote:
> 1) does awk (I am using gawk) support string operation to multi-byte
> charsets? Say I have a string contain 4 ideographs, but length(string)
> gives 12 currently. According to unicode.org, gawk is
> unicode-compatible, so how can I turn on multi-byte character recognition?
What requirements must a language meet to
be called "Unicode Enabled"? String operations
are a good example, but is there any formal
requirement list telling me what else is
needed? Do you know of one?
> information. I am pretty new to awk, so perhaps I am looking for the
> answer in a wrong way? The only related information is a (looking)
> professional message post from Olaf Dabrunz which finally received no
> reply:
> http://mail.nl.linux.org/linux-utf8...6/msg00005.html
Thanks for posting this link.
I think the questions raised by Olaf are mostly
not addressed by the specifications of AWK and C.
If I am wrong, please tell me.
When I started working on the XML extension for
GNU Awk, I soon ran into the same trouble with
XML's ability to cope with Unicode. For example,
the byte sequence representing umlaut characters
varies depending on the encoding used. Imagine we
want to compare strings in AWK:
if (s == "Müll") {
    print "found some trash"
}
The umlaut character may be encoded as a multi-byte
character in the XML data and as a single-byte character
in the AWK program (Latin-1 encoding). Should the different
encodings of a character be reported as identical or not?
The pain of comparing umlauts is eased a bit by the fact
that most XML parsers produce characters as UTF-8, no matter
what the original encoding of the XML file was. But the
general problem remains the same as with your Chinese
ideographs.
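
The comparison problem described above can be reproduced outside AWK. The following is a minimal Python sketch, assuming the program text is stored as Latin-1 and the data arrives as UTF-8, as in the scenario in the post. The variable names are hypothetical, and this illustrates the general encoding mismatch rather than anything the gawk XML extension actually does.

```python
# The same word, encoded two ways, as in the scenario above.
word = "Müll"

latin1_bytes = word.encode("latin-1")  # ü is the single byte 0xFC
utf8_bytes = word.encode("utf-8")      # ü is the two bytes 0xC3 0xBC

# A byte-wise comparison treats the two encodings as different strings.
bytes_equal = latin1_bytes == utf8_bytes

# Decoding each side with its own encoding first makes them compare equal.
decoded_equal = latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8")

print(len(latin1_bytes), len(utf8_bytes), bytes_equal, decoded_equal)
```

Whether the two forms should count as identical is therefore a question of where decoding happens: if both sides are brought to one common encoding before the comparison, the test behaves as a reader would expect.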