Code Comments
Newbie learning Gentoo. I just spent 5 hours googling and testing, only to find that gawk fails to support multibyte characters in length() and substr(). This is really a shock to me: I think this is a BASIC function that an Asian user might need, and gawk has gone ages without supporting it. Are there so few Asian users?!

Now I am happy to find that nawk supports multibyte characters: http://www.cryst.bbk.ac.uk/CCSG/uni.../awk.html#sect3

> the *awk* command differs from the commands *oawk* and *gawk* in that
> *awk* conforms to the x/open portability guide, issue 4 (xpg4). the
> *awk* command is therefore capable of handling multibyte characters
> that occur in coded character sets defined for some native languages.

Now to the question: _esearch nawk_ returns no result. Has nawk not been ported to Gentoo, or is it in Gentoo under a different name?

Furthermore, it seems nawk is from Bell Labs. I happened to find another version of awk, bwk, which is also from Brian Kernighan at Bell Labs (and isn't in portage either). I wish to install the awk that the CCSG page (mentioned above) claims can support multibyte characters; which one is it?

OT: Truly, the migration to awk is very painful. Four years ago I used JScript (via Windows Script Files) for my text processing (mostly my historical research data); JScript supports Unicode and (perhaps not as well as awk) regular expressions. Then I switched to Linux, and all my old Chinese text files have been a headache for 4 years. Awk could not process them, and even vim could destroy several text files simply by opening and saving them (I guess because there are rare Chinese ideographs that vim might not cover, as I do Chinese history research in my spare time).
Today, during the New Year holiday, I managed to set aside a whole 5-hour block of time, determined to start processing those old files on Linux, and ended up with hours of debugging and googling, only to discover more gawk-multi-byte-incompatible problems. Sorry; I will go on using Linux despite these problems, so this is just a useless complaint to make myself feel less bad ...

Zhang Weiwu wrote:
> Newbie learning gentoo. I just spend 5 hours to google around and test

Sorry, this message was supposed to go to the gentoo-user list. But it's not very OT for this group, right? And I am a newbie learning awk, not Gentoo.

Zhang Weiwu wrote:
> Newbie learning gentoo. I just spend 5 hours to google around and test
> and only to find gawk failed to support multibyte character in length()
> and substr(). This is really a shock to me, as I think this is the BASIC
> function that an Asian user might need, and gawk has been ages old not
> supporting it. Are there so few Asian users!!

I am quite happy that someone finally dares to ask the question. Go on, I am eagerly awaiting comments.

> Now just happy to find nawk support multibytes:
> http://www.cryst.bbk.ac.uk/CCSG/uni.../awk.html#sect3
>

Hmm, really? I don't trust this source. For example, it writes "begin" instead of "BEGIN". Its description is incomplete and partly wrong. What really counts is this one: http://www.opengroup.org/onlinepubs...99/xcu/awk.html

> Now come to the question: _esearch nawk_ returns no result. Isn't nawk
> ported into gentoo, or it is ported to gentoo under a different name?

nawk is not part of the POSIX standard. nawk is traditionally supported by many Linux systems and all SunOS derivatives.

> around only to discover more gawk-multi-byte-incompatible problems.

What is "multi-byte-incompatible"? You expect AWK to behave like JScript (which is a Microsoft variant of JavaScript, as far as I know).

Zhang Weiwu wrote:
> Sorry this message is supposed to go to gentoo-user list. But it's not
> very OT for this group, right? And I am newbie learning awk not gentoo.

This is definitely on-topic. Go on asking, otherwise we would never start solving these problems.

Jürgen Kahrs wrote:
> Zhang Weiwu wrote:
> >
>
> I am quite happy that finally someone dares
> to ask the question. Go on, I am eagerly
> awaiting comments.

I hope the GNU people don't eat me for this question ;) But I am not a developer who can contribute on this topic. I can only ask questions :(

> >
> What is "multi-byte-incompatible" ?
> You expect AWK to behave like JScript (which is a
> Microsoft-variant of JavaScript as far as I know).

At least all JScript functions distinguish multibyte and single-byte characters correctly, and there is always an option in substr(), indexOf(), length() ... to specify whether or not the string should be treated as Unicode (although Microsoft understands Unicode as UTF-16LE). I dislike JScript itself, but it just did what I wished. And it deals with rare Chinese ideographs as well. On Windows, JScript can be put into a .wsf file and called from the CMD command line to process text files.
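The UTF-16LE remark matters for rare ideographs in particular: characters outside the Basic Multilingual Plane take two 16-bit code units in UTF-16, so a length counted in UTF-16 units can differ from a length counted in characters. The sketch below illustrates this in Python (not a language used in this thread). The sample character U+20000 and the helper name utf16_units are assumptions chosen for illustration; the thread does not say which ideographs the poster's files contain.

```python
# Sketch: one character, three different "lengths".
# Assumption: U+20000 (a CJK Extension B ideograph) stands in for the
# "rare Chinese ideographs" mentioned above; "中" is a common one.

common = "中"           # U+4E2D, inside the Basic Multilingual Plane
rare = "\U00020000"     # U+20000, outside the Basic Multilingual Plane


def utf16_units(s):
    """Count the 16-bit code units needed to store s as UTF-16LE."""
    return len(s.encode("utf-16-le")) // 2


for name, ch in (("common", common), ("rare", rare)):
    print(
        name,
        "code points:", len(ch),
        "UTF-16 units:", utf16_units(ch),
        "UTF-8 bytes:", len(ch.encode("utf-8")),
    )
```

Under these assumptions the common ideograph is 1 code point, 1 UTF-16 unit, and 3 UTF-8 bytes, while the rare one is 1 code point, 2 UTF-16 units, and 4 UTF-8 bytes, so a tool has to state which unit its length() and substr() count in.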

Jürgen Kahrs wrote:
> Zhang Weiwu wrote:
> >
>
> I am quite happy that finally someone dares
> to ask the question. Go on, I am eagerly
> awaiting comments.

One more question: can I avoid this problem by using another language (in my case, Perl)? I am not sure whether Perl can deal with multibyte characters, but I prefer to tap the knowledge of this group rather than spend another 5 hours finding out :( I have lots of files to process, and substr/index/length will be used many times.

On Tue, 04 Jan 2005 02:27:12 +0800 Zhang Weiwu <zhangweiwu@realss.com> wrote:
> Newbie learning gentoo. I just spend 5 hours to google around and test
> and only to find gawk failed to support multibyte character in length()
> and substr(). This is really a shock to me, as I think this is the BASIC
> function that an Asian user might need, and gawk has been ages old not
> supporting it. Are there so few Asian users!!
>

You could try Tcl, which supports Unicode nicely and has strong text-processing features, even if they differ from awk's.

--Marc