Re: Gawk and multi-byte characters
| Steffen Schuler 2008-03-20, 3:59 am |
| On Wed, 16 Jan 2008 21:18:09 +0100, Jürgen Kahrs wrote:
> Hermann Peifer wrote:
>
>
> I found this in the man page:
>
> As of version 3.1.5, gawk is multibyte aware. This means that
> index(), length(), substr() and match() all work in terms of
> characters, not bytes.
>
>
> I guess it's the POSIX standard that defines printf to count in terms of
> characters and not bytes.
Hi,
Hermann, you can find the ChangeLog here: gawk-stable/ChangeLog
(CVS version: http://cvs.savannah.gnu.org/viewvc/gawk-stable/)
SUSv3 (http://www.unix.org) specifies in "5. File Format Notation" that
printf counts the field width and precision in bytes, not in (multibyte)
characters.
--
Steffen
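The byte-versus-character distinction discussed here can be seen with a single two-byte UTF-8 character. This is only a minimal sketch: the function names are made up for illustration, and it assumes a POSIX shell with some awk on the PATH. What length() reports depends on the awk implementation and the active locale, so no output is claimed for it.

```shell
# "ü" encoded in UTF-8 is the two bytes 0xC3 0xBC (octal 303 274).
u_umlaut() { printf '\303\274'; }

# Number of bytes in the string: wc -c always counts bytes.
byte_count() { u_umlaut | wc -c | tr -d ' '; }

# Number of characters as seen by awk's length().
# Depends on the awk implementation and on the locale (e.g. LC_ALL).
awk_length() { awk -v s="$(u_umlaut)" 'BEGIN { print length(s) }'; }

echo "bytes:        $(byte_count)"
echo "awk length(): $(awk_length)"
```

A multibyte-aware awk running in a UTF-8 locale should report one character where wc -c reports two bytes; a byte-oriented awk, or a run under LC_ALL=C, would be expected to report two.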
| |
| Hermann Peifer 2008-03-20, 7:58 am |
| Steffen Schuler wrote:
> On Wed, 16 Jan 2008 21:18:09 +0100, Jürgen Kahrs wrote:
>
>
> Hi,
>
> Hermann, you can find the ChangeLog here: gawk-stable/ChangeLog
> (CVS version: http://cvs.savannah.gnu.org/viewvc/gawk-stable/)
> SUSv3 (http://www.unix.org) specifies in "5. File Format Notation" that
> printf counts the field width and precision in bytes, not in (multibyte)
> characters.
>
Didn't Arnold provide a "fix" for this "bug" which now seems to be POSIX
compliance rather than a bug? From the ChangeLog:
Tue Mar 4 21:02:25 2008 Arnold D. Robbins <arnold@skeeve.com>
* builtin.c (mbc_char_count, mbc_byte_count): New functions to return
the number of m.b. chars there are and the number of bytes needed to
copy them.
(format_tree): Use them for %s and %c cases to adjust precision and
for copying characters at pr_tail label.
?
Hermann
PS
Your link to savannah seems to be broken
| |
| Jürgen Kahrs 2008-03-20, 6:59 pm |
| Hermann Peifer wrote:
I looked it up in SUSv3 (Issue 6) and it says:
c The int argument shall be converted to an unsigned char, and the resulting byte shall be written.
> Didn't Arnold provide a "fix" for this "bug" which now seems to be POSIX
> compliance rather than a bug? From the ChangeLog:
Yes, Arnold fixed it. He thought about it for a long time
before applying the fix. I don't know what convinced
him that the fix was necessary (i.e. that printf should
count characters, not bytes).
| |
| Hermann Peifer 2008-03-20, 6:59 pm |
| Jürgen Kahrs wrote:
> Hermann Peifer wrote:
>
>
> I looked it up in SUSv3 (Issue 6) and it says:
>
> c The int argument shall be converted to an unsigned char, and the resulting byte shall be written.
>
>
> Yes, Arnold fixed it. He thought about it for a long time
> before applying the fix. I don't know what convinced
> him that the fix was necessary (i.e. that printf should
> count characters, not bytes).
That printf counts characters rather than bytes makes more sense to me,
but I would feel bad if my posting resulted in making gawk POSIX
non-compliant.
There is obviously a workaround as length() returns the number of
characters. Instead of
printf "%-30s", $1
I use something like:
printf "%s%*s", $1,30-length($1),""
Hermann
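The workaround above can be wrapped in a small helper. This is a sketch only: pad() and pad_left_justified are hypothetical names, not part of gawk or of this thread. It builds the padding with a computed format string instead of %*s and skips padding when the field is already wider than the target, because a negative width passed through %*s is, as far as I know, treated as left-justification with the absolute width.

```shell
# pad_left_justified WIDTH: read lines on stdin and print the first field
# padded with spaces to WIDTH characters (as counted by awk's length()),
# followed by "|" so that the padding is visible.
pad_left_justified() {
    awk -v width="$1" '
    # Hypothetical helper: append spaces until s is w characters long.
    function pad(s, w,    n) {
        n = w - length(s)
        if (n <= 0)          # already wide enough: do not pad
            return s
        return s sprintf("%" n "s", "")
    }
    { print pad($1, width) "|" }
    '
}

printf 'abc\nabcdefgh\n' | pad_left_justified 5
```

For this ASCII input the two lines come out as "abc  |" and "abcdefgh|". With multibyte input the alignment is only as good as length(), so it relies on a multibyte-aware awk and a matching locale.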