Code Comments
Hello, I just want to know:

1) Does awk (I am using gawk) support string operations on multi-byte character sets? Say I have a string containing 4 ideographs, but length(string) currently gives 12. According to unicode.org, gawk is Unicode-compatible, so how can I turn on multi-byte character recognition?

2) Why does this look like such a common question (all Asian users could face it), yet it is so difficult to find an answer on Google? I used keywords like "awk Unicode" "awk utf" "gawk Unicode" "gawk utf" "awk multibyte string" "awk length utf-8" ... Every keyword combination leads to unrelated information. I am pretty new to awk, so perhaps I am looking for the answer in the wrong way? The only related information is a professional-looking message posted by Olaf Dabrunz, which in the end received no reply: http://mail.nl.linux.org/linux-utf8...6/msg00005.html
Zhang Weiwu wrote:
> Hello? I just wanna know:
>
> 1) does awk (I am using gawk) support string operation to multi-byte
> charsets? Say I have a string contain 4 ideographs, but length(string)
> gives 12 currently. According to unicode.org, gawk is
> unicode-compatible, so how can I turn on multi-byte character recognition?

I think I forgot to mention that I am on LANG=zh_CN.UTF-8, and README_d/README.multibyte does not give any information on my topic (weird).
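The 12-versus-4 discrepancy in the question is what a byte count of UTF-8 text looks like. Below is a minimal sketch in Python (not awk, chosen only because its string and bytes types keep the two counts apart). The four-ideograph sample string is an assumption standing in for the poster's actual data.

```python
# Hypothetical sample: four CJK ideographs, standing in for the poster's string.
text = "中文字符"

# Encode to UTF-8, the encoding implied by LANG=zh_CN.UTF-8.
raw = text.encode("utf-8")

char_count = len(text)  # counts characters (code points)
byte_count = len(raw)   # counts bytes, which is what a byte-oriented length() reports

print(char_count, byte_count)  # prints "4 12"

# Each of these ideographs occupies three bytes in UTF-8.
bytes_per_char = [len(ch.encode("utf-8")) for ch in text]
print(bytes_per_char)  # prints "[3, 3, 3, 3]"
```

So a result of 12 suggests that length() is counting bytes rather than characters, which matches the behavior discussed in the replies.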
Zhang Weiwu wrote:
> 1) does awk (I am using gawk) support string operation to multi-byte
> charsets? Say I have a string contain 4 ideographs, but length(string)
> gives 12 currently. According to unicode.org, gawk is
> unicode-compatible, so how can I turn on multi-byte character recognition?

What requirements must a language meet to be called "Unicode enabled"? String operations are a good example, but is there a formal list of requirements telling me what else is needed? Do you know of any?

> information. I am pretty new to awk, so perhaps I am looking for the
> answer in a wrong way? The only related information is a (looking)
> professional message post from Olaf Dabrunz which finally received no
> reply:
> http://mail.nl.linux.org/linux-utf8...6/msg00005.html

Thanks for posting this link. I think the questions raised by Olaf are mostly not addressed by the specifications of AWK and C. If I am wrong, please tell me.

When I started working on the XML extension for GNU Awk, I soon ran into the same trouble with XML's ability to cope with Unicode. For example, the byte sequence representing Umlaut characters varies depending on the encoding used. Imagine we want to compare strings in AWK:

    if (s == "Müll") { print "found some trash" }

The Umlaut character may be encoded as a multi-byte character in the XML data and as a single-byte character in the AWK program (Latin-1 encoding). Should the different encodings of a character be reported as identical or not? The pain of comparing Umlauts is eased a bit by the fact that most XML parsers produce characters as UTF-8, no matter what the original encoding of the XML file was. But the general problem remains the same as with your Chinese ideographs.
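The Umlaut comparison problem described above can be made concrete. Here is a rough sketch in Python (used only for illustration, since the point is about bytes rather than awk itself): the same word yields different byte sequences in Latin-1 and UTF-8, so a byte-wise comparison fails unless both sides are first brought to a common representation.

```python
word = "Müll"

latin1_bytes = word.encode("latin-1")  # b'M\xfcll'      (ü is one byte)
utf8_bytes = word.encode("utf-8")      # b'M\xc3\xbcll'  (ü is two bytes)

# Comparing the raw bytes, as a byte-oriented tool would, reports a mismatch.
same_bytes = latin1_bytes == utf8_bytes

# Decoding each side with its own encoding first makes the strings compare equal.
same_text = latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8")

print(same_bytes, same_text)  # prints "False True"
```

This is one reason it helps that most XML parsers hand back UTF-8: with a single encoding on the data side, only the program text still needs to agree with it.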
In article <33tef7F43uqv4U1@individual.net>,
Zhang Weiwu <zhangweiwu@realss.com> wrote:
>Hello? I just wanna know:
>
>1) does awk (I am using gawk) support string operation to multi-byte
>charsets? Say I have a string contain 4 ideographs, but length(string)
>gives 12 currently. According to unicode.org, gawk is
>unicode-compatible, so how can I turn on multi-byte character recognition?

Gawk is partially Unicode compatible. In particular, regular expression matching understands Unicode input.

>2) why this looks so common a question (all asian users could face it),
>but very difficult to google out an answer? I used keywords like "awk
>Unicode" "awk utf" "gawk Unicode" "gawk utf" "awk multibyte string" "awk
>length utf-8" ... All the keywords combination leads to unrelated
>information. I am pretty new to awk, so perhaps I am looking for the
>answer in a wrong way?

POSIX does indeed state that awk uses "characters". In particular, this affects the index(), length(), substr(), and match() functions (and the RSTART and RLENGTH variables).

Gawk's internals were designed way, Way, WAY before multibyte character sets were an issue for any GNU developer. Thus, everything currently works in terms of bytes. Also, as a native English speaker who grew up in the USA, I have had neither need for, nor a real understanding of, the multibyte issues and APIs. (This is neither "right" nor "wrong", just the situation.)

This has led me to be hesitant to attempt to fix the problem, in the (apparently rather vain) hope that someone better qualified to make changes would step forward and do so. (This is in the time-honored Free Software / Open Source tradition, where the person who has the need adds the feature.) Unfortunately, I'm still waiting.

I will also point out that if gawk is going to become multibyte-aware, it has to be able to handle *any* multibyte locale, not just Unicode.

All that said, I have begun looking at the issue. I have a private version that I believe correctly handles index(), length(), and substr(), although I have no data with which to test it. I still have to find two hours or so to tackle match(), which will be less straightforward than the others.

My changes increase the size of a NODE from 32 bytes to 40 bytes, about which I am not happy at all; I'm not sure that that problem is solvable, though.

When they're ready for review, I'll post a note in this group. Or maybe even the patch.

> The only related information is a (looking) professional
> message post from Olaf Dabrunz which finally received no reply:
> http://mail.nl.linux.org/linux-utf8...6/msg00005.html

I did reply to him privately, with much the same answer. That email is still in my inbox, pending my doing something about it.

There is a separate issue: RS = "multibyte string". I believe that this currently works, with gawk treating RS as a regular expression. Since the R.E. code understands multibyte stuff, it "just works", although apparently this is a serendipitous accident, and not by design.

I am curious if any commercial version of awk correctly handles these issues. If anyone can report to me (privately) which ones do, I'd appreciate it. Of the free ones, gawk will probably eventually support multibyte characters correctly; I can't say anything about the others.

Finally, I will remind the readers of this group that a very large percentage of GNU software, and gawk in particular, is maintained ON VOLUNTEER TIME. I have a family to support and a mortgage to pay, and I have yet to see one dime come directly from the work I've done on gawk.

This has two implications: 1. if you think a feature is missing, be POLITE in how you ask for it, and 2. if you need it so badly that you're willing to pay for it, please email me off line.

Arnold
--
Aharon (Arnold) Robbins --- Pioneer Consulting Ltd.    arnold AT skeeve DOT com
P.O. Box 354          Home Phone: +972 8 979-0381      Fax: +1 206 350 8765
Nof Ayalon            Cell Phone: +972 50 729-7545
D.N. Shimshon 99785   ISRAEL
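To illustrate the difference between the character semantics POSIX describes and the byte semantics described in the post above, here is a hedged sketch in Python. The helpers `awk_index` and `awk_substr` are hypothetical names that only imitate awk's 1-based conventions; they are not gawk's implementation, and the sample string is an assumption.

```python
def awk_index(s, t):
    """1-based position of t in s, or 0 if absent (awk's index() convention)."""
    return s.find(t) + 1


def awk_substr(s, start, length):
    """Simplified awk-style substr(): 1-based start, assumed to be >= 1."""
    return s[start - 1:start - 1 + length]


text = "中文字符"           # hypothetical sample: four ideographs
raw = text.encode("utf-8")  # the same data as a byte-oriented awk sees it

# Character semantics, as POSIX describes.
char_pos = awk_index(text, "字")   # 3
char_sub = awk_substr(text, 2, 2)  # "文字"

# Byte semantics: positions are byte offsets, and a slice can cut a character apart.
byte_pos = awk_index(raw, "字".encode("utf-8"))  # 7
byte_sub = awk_substr(raw, 2, 2)                 # b'\xb8\xad', not valid UTF-8 on its own

print(char_pos, char_sub)
print(byte_pos, byte_sub)

# The number of bytes per character depends on the encoding, not only on Unicode.
print(len(text.encode("utf-8")), len(text.encode("gbk")))  # prints "12 8"
```

The last line hints at why support for "any multibyte locale" is more work than support for UTF-8 alone: the same four characters take 12 bytes in UTF-8 but 8 in GBK, so byte offsets cannot be converted to character positions without knowing the locale's encoding.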
Aharon Robbins wrote:
> In article <33tef7F43uqv4U1@individual.net>,
> Zhang Weiwu <zhangweiwu@realss.com> wrote:
>
> POSIX does indeed state that awk uses "characters". In particular,
> this affects the index(), length(), substr(), and match() functions
> (and RSTART and RLENGTH variables).
>
> Gawk's internals were designed way, Way, WAY before multibyte character
> sets were an issue for any GNU developer. Thus, everything currently
> works in terms of bytes. Also, as a native English speaker who grew
> up in the USA, I have had neither need for, nor a real understanding
> of, the multibyte issues and APIs. (This is neither "right" nor
> "wrong", just the situation.)
>
> This has led me to be hesitant to attempt to fix the problem, in the
> (apparently rather vain) hope that someone better qualified to make
> changes would step forward and do so. (This is in the time-honored Free
> Software / Open Source tradition, where the person who has the need adds
> the feature.) Unfortunately, I'm still waiting.
>
> I will also point out that if gawk is going to become multibyte-aware,
> it has to be able to handle *any* multibyte locale, not just Unicode.
>
> All that said, I have begun looking at the issue. I have a private
> version that I believe correctly handles index(), length(), and substr(),
> although I have no data with which to test it. I still have to find
> two hours or so to tackle match(), which will be less straightforward
> than the others.

Really? Then you've done a great job ^_^ Thank you. Can I have your version of gawk? I think I could test it. index(), length() and substr() happen to be the only string functions used in my script. Could I have your source so I can run a test? I am sorry that I am not a C developer and so cannot help directly with the development. Please do not send me an x86 binary, because I am on sparc64 now.

> My changes increase the size of a NODE from 32 bytes to 40 bytes, about
> which I am not happy at all; I'm not sure that that problem is solvable,
> though.
>
> When they're ready for review, I'll post a note in this group. Or
> maybe even the patch.
>
> I am curious if any commercial version of awk correctly handles these
> issues. If anyone can report to me (privately) which ones do, I'd
> appreciate it. Of the free ones, gawk will probably eventually support
> multibyte characters correctly; I can't say anything about the others.

I was told that commercial awk versions are, generally speaking, not as good as gawk. If there is a multi-byte compatible awk (even a commercial one), please let me know too ;)

> Finally, I will remind the readers of this group that a very large
> percentage of GNU software, and gawk in particular, is maintained
> ON VOLUNTEER TIME. I have a family to support and a mortgage to pay, and
> I have yet to see one dime come directly from the work I've done on gawk.
>
> This has two implications: 1. if you think a feature is missing, be
> POLITE in how you ask for it, and 2. if you need it so badly that
> you're willing to pay for it, please email me off line.

Thank you for the kind notice. Actually, my complaint was "are there so few Asian users?", so to be clear, I do know that free software is supported by its users. I just wonder: if there were Asian people (Chinese, Japanese...) on the development team, this problem would already be solved, as it is so necessary for them. It's a pity that Asian users participate less than American and European users :(