Code Comments

newbie: awk not working for multi-byte charsets?
Hello, I just want to know:

1) Does awk (I am using gawk) support string operations on multi-byte
charsets? Say I have a string containing 4 ideographs, but length(string)
currently gives 12. According to unicode.org, gawk is
Unicode-compatible, so how can I turn on multi-byte character recognition?

2) Why is this seemingly common question (all Asian users could face it)
so difficult to find an answer to on Google? I used keywords like "awk
Unicode" "awk utf" "gawk Unicode" "gawk utf" "awk multibyte string" "awk
length utf-8" ... All the keyword combinations lead to unrelated
information. I am pretty new to awk, so perhaps I am looking for the
answer in the wrong way? The only related information is a
professional-looking post from Olaf Dabrunz, which in the end received no reply:
http://mail.nl.linux.org/linux-utf8...6/msg00005.html

Zhang Weiwu
01-03-05 08:55 PM


Re: newbie: awk not working for multi-byte charsets?
Zhang Weiwu wrote:
> Hello, I just want to know:
>
> 1) Does awk (I am using gawk) support string operations on multi-byte
> charsets? Say I have a string containing 4 ideographs, but length(string)
> currently gives 12. According to unicode.org, gawk is
> Unicode-compatible, so how can I turn on multi-byte character recognition?

I think I forgot to mention that I am on LANG=zh_CN.UTF-8.
Also, README_d/README.multibyte does not give any information on my topic (weird).
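
For reference, the numbers in the question are what UTF-8 itself predicts: each of these ideographs occupies three bytes, so four of them make 12 bytes. The sketch below uses Python rather than awk only because its result does not depend on the awk version or the installed locales. The sample string is an arbitrary assumption and does not come from the original post.

```python
# Four CJK ideographs; the sample text is an arbitrary illustration.
text = "中文字符"

# Encode the string the way a zh_CN.UTF-8 locale would store it.
utf8_bytes = text.encode("utf-8")

char_count = len(text)        # counts characters (code points)
byte_count = len(utf8_bytes)  # counts bytes, as a byte-oriented length() does

print(char_count, byte_count)
```

A length() that reports 12 for such a string is counting bytes, while a character-aware length() would report 4.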

Zhang Weiwu
01-03-05 08:55 PM


Re: newbie: awk not working for multi-byte charsets?
Zhang Weiwu wrote:

> 1) Does awk (I am using gawk) support string operations on multi-byte
> charsets? Say I have a string containing 4 ideographs, but length(string)
> currently gives 12. According to unicode.org, gawk is
> Unicode-compatible, so how can I turn on multi-byte character recognition?

What requirements must a language meet to
be called "Unicode Enabled"? String operations
are a good example, but is there any formal
list of requirements telling me what else is
needed? Do you know of one?

> information. I am pretty new to awk, so perhaps I am looking for the
> answer in the wrong way? The only related information is a
> professional-looking post from Olaf Dabrunz, which in the end received
> no reply:
> http://mail.nl.linux.org/linux-utf8...6/msg00005.html

Thanks for posting this link.
I think the questions raised by Olaf are mostly
not addressed by the specifications of AWK and C.
If I am wrong, please tell me.

When I started working on the XML extension for
GNU Awk, I soon ran into the same trouble with
XML's ability to cope with Unicode. For example,
the byte sequence representing umlaut characters
varies depending on the encoding used. Imagine we
want to compare strings in AWK:

if (s == "Müll") {
    print "found some trash"
}

The umlaut character may be encoded as a multi-byte
character in the XML data and as a single-byte character
in the AWK program (Latin-1 encoding). Should the different
encodings of a character be reported as identical or not?
The pain of comparing umlauts is eased a bit by the fact
that most XML parsers produce characters as UTF-8, no matter
what the original encoding of the XML file was. But the
general problem remains the same as with your Chinese
ideographs.
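
The comparison problem described above can be made concrete with a minimal sketch. It is written in Python rather than AWK so that the result does not depend on a particular awk or locale, and it is only an illustration, not how GNU Awk or its XML extension works internally. The variable names are hypothetical.

```python
# "Müll", written with an escape so the source encoding does not matter.
word = "M\u00fcll"

latin1_bytes = word.encode("latin-1")  # ü is the single byte 0xFC
utf8_bytes = word.encode("utf-8")      # ü is the two bytes 0xC3 0xBC

# A byte-wise comparison sees two different strings.
bytes_equal = latin1_bytes == utf8_bytes

# Decoding each side with its own encoding recovers identical characters.
chars_equal = latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8")

print(latin1_bytes, utf8_bytes, bytes_equal, chars_equal)
```

The two byte strings differ, yet they denote the same text once each is decoded with the encoding it was written in. A byte-oriented comparison can only answer "identical" if both sides have first been brought to one common encoding.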

Jürgen Kahrs
01-03-05 08:55 PM


Re: newbie: awk not working for multi-byte charsets?
In article <33tef7F43uqv4U1@individual.net>,
Zhang Weiwu  <zhangweiwu@realss.com> wrote:
>Hello, I just want to know:
>
>1) Does awk (I am using gawk) support string operations on multi-byte
>charsets? Say I have a string containing 4 ideographs, but length(string)
>currently gives 12. According to unicode.org, gawk is
>Unicode-compatible, so how can I turn on multi-byte character recognition?

Gawk is partially Unicode compatible. In particular, regular expression
matching understands Unicode input.

>2) Why is this seemingly common question (all Asian users could face it)
>so difficult to find an answer to on Google? I used keywords like "awk
>Unicode" "awk utf" "gawk Unicode" "gawk utf" "awk multibyte string" "awk
>length utf-8" ... All the keyword combinations lead to unrelated
>information. I am pretty new to awk, so perhaps I am looking for the
>answer in the wrong way?

POSIX does indeed state that awk uses "characters".  In particular,
this affects the index(), length(), substr(), and match() functions
(and RSTART and RLENGTH variables).

Gawk's internals were designed way, Way, WAY before multibyte character
sets were an issue for any GNU developer.  Thus, everything currently
works in terms of bytes.  Also, as a native English speaker who grew
up in the USA, I have had neither need for, nor a real understanding
of, the multibyte issues and APIs.  (This is neither "right" nor
"wrong", just the situation.)

This has led me to be hesitant to attempt to fix the problem, in the
(apparently rather vain) hope that someone better qualified to make
changes would step forward and do so.  (This is in the time-honored Free
Software / Open Source tradition, where the person who has the need adds
the feature.)  Unfortunately, I'm still waiting.

I will also point out that if gawk is going to become multibyte-aware,
it has to be able to handle *any* multibyte locale, not just Unicode.
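
Why "any multibyte locale" is a stronger requirement than "UTF-8" can be seen in a small sketch. This is Python, not gawk code, and the sample strings are assumptions chosen for illustration. The same text has a different byte length in each encoding, and in Shift-JIS the second byte of some characters is an ordinary ASCII value, so a byte-wise search can report matches that do not exist at the character level.

```python
text = "日本語"

# Byte length of the same three characters under different encodings.
byte_lengths = {
    encoding: len(text.encode(encoding))
    for encoding in ("utf-8", "shift_jis", "euc_jp")
}

# In Shift-JIS, katakana "ソ" is encoded as 0x83 0x5C,
# and 0x5C on its own is the backslash character.
so_bytes = "ソ".encode("shift_jis")
false_match = b"\\" in so_bytes                    # byte search finds a backslash
true_match = "\\" in so_bytes.decode("shift_jis")  # character search does not

print(byte_lengths, so_bytes, false_match, true_match)
```

A multibyte-aware implementation therefore cannot hard-code the rules of one encoding. It has to interpret the bytes according to whatever encoding the current locale declares.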

All that said, I have begun looking at the issue.  I have a private
version that I believe correctly handles index(), length(), and substr(),
although I have no data with which to test it.  I still have to find
two hours or so to tackle match(), which will be less straightforward
than the others.

My changes increase the size of a NODE from 32 bytes to 40 bytes, about
which I am not happy at all; I'm not sure that that problem is solvable,
though.

When they're ready for review, I'll post a note in this group.  Or
maybe even the patch.

> The only related information is a professional-looking
> post from Olaf Dabrunz, which in the end received no reply:
> http://mail.nl.linux.org/linux-utf8...6/msg00005.html

I did reply to him privately, with much the same answer.  That email is
still in my inbox, pending my doing something about it.

There is a separate issue: RS = "multibyte string".  I believe that this
currently works, with gawk treating RS as a regular expression. Since
the R.E. code understands multibyte stuff, it "just works", although
apparently this is a serendipitous accident, and not by design.

I am curious if any commercial version of awk correctly handles these
issues.  If anyone can report to me (privately) which ones do, I'd
appreciate it.  Of the free ones, gawk will probably eventually support
multibyte characters correctly; I can't say anything about the others.

Finally, I will remind the readers of this group that a very large
percentage of GNU software, and gawk in particular, is maintained
ON VOLUNTEER TIME.  I have a family to support and a mortgage to pay, and
I have yet to see one dime come directly from the work I've done on gawk.

This has two implications: 1. if you think a feature is missing, be
POLITE in how you ask for it, and 2. if you need it so badly that
you're willing to pay for it, please email me off line.

Arnold
--
Aharon (Arnold) Robbins --- Pioneer Consulting Ltd.	arnold AT skeeve DOT com
P.O. Box 354		Home Phone: +972  8 979-0381	Fax: +1 206 350 8765
Nof Ayalon		Cell Phone: +972 50  729-7545
D.N. Shimshon 99785	ISRAEL

Aharon Robbins
01-04-05 08:55 PM


Re: newbie: awk not working for multi-byte charsets?
Aharon Robbins wrote:
> In article <33tef7F43uqv4U1@individual.net>,
> Zhang Weiwu  <zhangweiwu@realss.com> wrote: 
>
>
> POSIX does indeed state that awk uses "characters".  In particular,
> this affects the index(), length(), substr(), and match() functions
> (and RSTART and RLENGTH variables).

> Gawk's internals were designed way, Way, WAY before multibyte character
> sets were an issue for any GNU developer.  Thus, everything currently
> works in terms of bytes.  Also, as a native English speaker who grew
> up in the USA, I have had neither need for, nor a real understanding
> of, the multibyte issues and APIs.  (This is neither "right" nor
> "wrong", just the situation.)

> This has led me to be hesitant to attempt to fix the problem, in the
> (apparently rather vain) hope that someone better qualified to make
> changes would step forward and do so.  (This is in the time-honored Free
> Software / Open Source tradition, where the person who has the need adds
> the feature.)  Unfortunately, I'm still waiting.

> I will also point out that if gawk is going to become multibyte-aware,
> it has to be able to handle *any* multibyte locale, not just Unicode.
>
> All that said, I have begun looking at the issue.  I have a private
> version that I believe correctly handles index(), length(), and substr(),
> although I have no data with which to test it.  I still have to find
> two hours or so to tackle match(), which will be less straightforward
> than the others.

Really? Then you've done a great job ^_^ Thank you.
May I have your version of gawk? I think I could test it, and index(),
length() and substr() happen to be the only string functions used in my
script. Could I have the source so I can run a test? I am sorry that I
am not a C developer and cannot help directly with the development.

Please do not send me an x86 binary, because I am on sparc64 now.

> My changes increase the size of a NODE from 32 bytes to 40 bytes, about
> which I am not happy at all; I'm not sure that that problem is solvable,
> though.
>
> When they're ready for review, I'll post a note in this group.  Or
> maybe even the patch.
>
> I am curious if any commercial version of awk correctly handles these
> issues.  If anyone can report to me (privately) which ones do, I'd
> appreciate it.  Of the free ones, gawk will probably eventually support
> multibyte characters correctly; I can't say anything about the others.

I was told that commercial awk versions are generally not as good as
gawk. If there is a multi-byte compatible awk (even a commercial one),
please let me know too ;)

> Finally, I will remind the readers of this group that a very large
> percentage of GNU software, and gawk in particular, is maintained
> ON VOLUNTEER TIME.  I have a family to support and a mortgage to pay, and
> I have yet to see one dime come directly from the work I've done on gawk.
>
> This has two implications: 1. if you think a feature is missing, be
> POLITE in how you ask for it, and 2. if you need it so badly that
> you're willing to pay for it, please email me off line.

Thank you for the kind notice. Actually, my complaint was "are there so
few Asian users?" I do understand that free software is supported by its
users. I just thought that if there were Asian people (Chinese, Japanese...)
on the development team, this problem would already have been solved, as it
is so necessary for them. It's a pity that Asian users participate less than
American and European people :(

Zhang Weiwu
01-04-05 08:55 PM

