translate characters in gawk
All,
How do I code this TAWK program in gawk? I took up the challenge to count
the number of unique words in the Bible, which originated on comp.lang.pl1
(see description below). My TAWK code doesn't work with gawk, as gawk has no
translate function. I also realize I have to code up a sort routine or
figure out asort(). I read that someone had a solution to this challenge in
awk, but I could not find it.
REX
The Bible text is at: http://patriot.net/~bmcgin/kjvpage.html. Using a
text editor, remove the text before "Book 01 Genesis" and the text after the
last word in Revelations (amen), producing the bible.txt file. Then convert
all punctuation and numbers to blanks, and uppercase to lowercase. The one
punctuation exception is ' (within a word), which is deleted, leaving wife's
as wifes. Then produce a sorted list of all the words and their counts.
--------------bible.txt
Book 01 Genesis
001:001 In the beginning God created the heaven and the earth.
001:002 And the earth was without form, and void; and darkness was
upon the face of the deep. And the Spirit of God moved upon
the face of the waters.
001:003 And God said, Let there be light: and there was light.
-----------snip
(1){
    $0 = tolower($0)
    $0 = translate($0,"'","")    #remove ' from wife's
    $0 = translate($0,"\r\n\t!?\\-(),.;:0-9"," ")    #remove numbers and punctuation
    w = w + NF    #count words
    for (i=1;i<=NF;i++) {c=c+length($i);++x[$i] }
}
END{
    for (i in x) {
        t++
        print x[i], i
    }
    print "unique words : ",t
    print "total words : ",w
    print "total chars (excluding spaces): ",c
}
---------------output
8177 a
319 aaron
2 aaronites
31 aarons
1 abaddon
1 abagtha
1 abana
4 abarim
.. snip
..
3 zorobabel
5 zuar
3 zuph
5 zur
1 zuriel
5 zurishaddai
1 zuzims
unique words : 12691
total words : 789781
total chars (excluding spaces): 3223126
| |
| Ed Morton 2005-02-20, 3:55 am |
Try this:
{
    $0 = tolower($0)
    gsub("'","")                          # remove ' from wife's
    gsub("[\r\n\t!?\\-(),.;:0-9]"," ")    # replace numbers and punctuation with blanks
    w = w + NF                            # count words
    for (i=1;i<=NF;i++) {c=c+length($i);++x[$i] }    # count chars, tally each word
}
END{
    n = asorti(x,y)                       # y[1..n] = the words (indices of x), sorted
    for (i=1; i<=n; i++) {
        t++
        print x[y[i]], y[i]
    }
    print "unique words : ",t
    print "total words : ",w
    print "total chars (excluding spaces): ",c
}
I wasn't willing to click on an unknown posted web link and your sample
output doesn't match your sample input so I don't know for sure, but I
just converted the "translate"s to "gsub"s and sorted the output
alphabetically using "asorti()" so it seems like it should work.
Regards,
Ed.
| |
| Kenny McCormack 2005-02-20, 3:55 pm |
In article <gfednessufcnhYXfRVn-3g@comcast.com>, trexx <foo@foo.com> wrote:
>How do I code this TAWK program in gawk?
The obvious question is: Why?
As Ed notes, your example uses of translate() are simple enough that they
can be trivially replaced with gsubs.
Where translate() comes into its own is when you want:
1) to actually translate a set of characters (like sed's y command)
or
2) to examine the translated value without assigning it to
a variable. For this, you might want to look at GAWK's gensub() function.
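To make the two cases above concrete, here is a rough sketch rather than a definitive implementation. The translate() helper below is hypothetical: it imitates sed's y command (one-to-one character mapping, written in portable awk) and is not claimed to match TAWK's translate() semantics. The wrapper name translate_demo and the sample strings are likewise made up for illustration. The second part only runs if gawk is installed, and shows gensub() returning the changed text while leaving $0 alone.

```shell
# Case 1: a sed-y style character translation in portable awk.
translate_demo() {
  awk '
    # Hypothetical helper: each character of s that occurs in "from" is
    # replaced by the character at the same position in "to". If "to" is
    # shorter than "from", the unmatched characters are deleted.
    function translate(s, from, to,    i, c, p, out) {
      out = ""
      for (i = 1; i <= length(s); i++) {
        c = substr(s, i, 1)
        p = index(from, c)
        out = out (p ? substr(to, p, 1) : c)
      }
      return out
    }
    { print translate($0, "abc", "xyz") }
  '
}

translated=$(printf 'a cab\n' | translate_demo)
echo "$translated"

# Case 2 (gawk only): test the modified text without assigning it anywhere.
# "\047" is the octal escape for the apostrophe.
if command -v gawk >/dev/null 2>&1; then
  printf '%s\n' "wife's" |
    gawk '{ if (gensub("\047", "", "g") == "wifes") print "match, $0 is still: " $0 }'
fi
```

For the word-count job in this thread the gsub() version is still the simpler choice; a helper like this only pays off when the mapping really is character-for-character.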
| |
Ed,
Thanks. Your code works perfectly (after I got the most current gawk.exe
[10-02-03] I could find at http://unxutils.sourceforge.net/). asorti()
appears to be a very recent addition to gawk. I used systime() to benchmark
the run: gawk gets through the task in 9 seconds, while TAWK ran in 7. Not a
whole lot of difference for this task. The answer is in whole seconds... Do
you know of a way to get run times in fractions of a second? I'm running
Win2000 on a 733 MHz machine here at home. When I go into the office, the
1.7 GHz machine will obviously run these tests faster... full-second
increments are too rough for comparisons. Also, gawk doesn't appear to have
INIT blocks... therefore I used this if statement at the beginning of the
program
if (NR==1) {start = systime()}
instead of TAWK's
INIT{
start = time()
}
When you use the automatic loop, how would you recommend defining any
initial conditions for a gawk program?
rex
| |
| Kenny McCormack 2005-02-20, 3:55 pm |
In article <Uo6dnfG_-MfpLYXfRVn-gg@comcast.com>, trexx <foo@foo.com> wrote:
>Also, Gawk doesn't appear to have
>INIT blocks...therefore I used this if statement at the beginning of the
>program
Just use BEGIN. I didn't understand why you were using INIT in the TAWK
version anyway.
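A small sketch of why BEGIN is the natural place for setup, under the assumption that plain awk behaviour is enough to show it (the function name run_init_demo and the variable names are invented for this example). BEGIN runs once before any input is read, so a line such as start = systime() (systime() being a gawk extension) would sit there. A rule guarded by NR == 1, by contrast, only fires once a first record actually arrives.

```shell
run_init_demo() {
  awk '
    BEGIN   { began = "yes"; first = "no" }   # setup: runs before any input
    NR == 1 { first = "yes" }                 # ordinary rule: needs a first record
    END     { print "BEGIN ran: " began ", NR==1 rule ran: " first }
  '
}

with_input=$(printf 'one\ntwo\n' | run_init_demo)
no_input=$(run_init_demo </dev/null)
echo "$with_input"
echo "$no_input"
```

With empty input the NR == 1 rule never runs, which is one reason not to hide initialization behind it.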
| |
Kenny,
Good point. The tawk manual says that the INIT block is like the BEGIN
block, but all INIT blocks are executed before the BEGIN blocks. I just got
in the habit of putting all assignment vars there. Question: Why would
anyone use more than one BEGIN block in a program?
REX
| |
| Kenny McCormack 2005-02-20, 3:55 pm |
In article <d5OdndVH_8DFQYXfRVn-gQ@comcast.com>, trexx <foo@foo.com> wrote:
>Kenny
>Good point. The tawk manual says that the INIT block is like the BEGIN
>block, but all INIT blocks are executed before the BEGIN blocks. I just
>got in the habit of putting all assignment vars there. Question: Why would
>anyone use more than one BEGIN block in a program?
The purpose of INIT & TERM is to do "system-y" type things - like, for
example, stuff involving the Windows API. IIRC, TERM gets executed even if
you exit via abort(). I've never used them.
You might get multiple BEGINs or ENDs if you are including source code
written elsewhere (or a drop-in package you've developed yourself).
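As an illustration of the included-source case, here is a minimal sketch with two made-up files, lib.awk and main.awk, each carrying its own BEGIN block; the file names, the join2() function and the sep variable are all hypothetical. awk accepts several -f options and runs the BEGIN blocks in the order the program text is read, so the library's setup happens before the main program's.

```shell
tmpdir=$(mktemp -d)

# A drop-in "library" that needs its own initialization.
cat > "$tmpdir/lib.awk" <<'EOF'
BEGIN { sep = "-" }
function join2(a, b) { return a sep b }
EOF

# The main program, with a BEGIN block of its own that relies on the library.
cat > "$tmpdir/main.awk" <<'EOF'
BEGIN { print join2("two", "BEGINs") }
EOF

multi_begin=$(awk -f "$tmpdir/lib.awk" -f "$tmpdir/main.awk")
echo "$multi_begin"

rm -rf "$tmpdir"
```

Because the program consists only of BEGIN blocks, awk does not wait for any input here.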