Home > Archive > Fortran > February 2005 > Word-processing benchmark anyone?










Author Word-processing benchmark anyone?
David Frank

2005-02-13, 4:01 pm

Want to give this a shot?
Get the text file at: http://patriot.net/~bmcgin/kjvpage.html
Using a text editor (notepad) remove text before
Book 01 Genesis and text after last word in Revelations, (amen)
producing bible.txt file containing
4,947,047 chars.

The challenge is to process bible.txt file into words array and count unique
word occurrences in count array.
(this challenge was initiated by LR, whose C++ source/results will be posted
later, along with my own source/results)..

In this processing, convert all punctuation and numbers to blanks, and
uppercase to lower.
The one punctuation exception is ' (within a word), which is deleted, leaving wife's
as wifes.

When bible.txt is "thusly" processed I expect you should get the following outputs
total words = 789781
unique words = 12691
xx.xx Sec ?.?? Ghz PC

Pls post your time and PC speed..

! A few template statements in this word-processing benchmark challenge to
get started are:
! -----------------------------
program count_word_occurances ! in bible.txt
implicit none
integer,parameter :: maxw = 65536 ! or lower if possible
character(24) :: words(maxw)
integer :: i, n, counts(maxw)=0, t1(8), t2(8)

call date_and_time(values=t1) ! get benchmark start time

! open file='bible.txt' ...

! process word occurances into the 2 arrays words,counts
! until EOF as per your word-processing algorithm

call date_and_time(values=t2) ! get benchmark stop time

n = 0 ! count unique words found
do i = 1,maxw
if (counts(i) /= 0) n = n+1
end do

write (*,*) 'total words =',sum(counts)
write (*,*) 'unique words =',n

write (*,'(f0.2,a)') (t2(5)*3600.+t2(6)*60.+t2(7) +t2(8)/1000.) &
-(t1(5)*3600.+t1(6)*60.+t1(7) +t1(8)/1000.), ' Sec <- 2.8 Ghz PC'
end program
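
For readers who want a reference result to compare against, the processing rules above can be sketched in a few lines of Python (one of the languages other respondents used). This is an illustration only, not anyone's posted entry: the name `count_words` is made up here, and "within a word" is read as "between two letters".

```python
import re
from collections import Counter

def count_words(text):
    """Count words using the rules in the challenge (illustrative sketch)."""
    # Delete an apostrophe only when it sits between two letters: wife's -> wifes.
    text = re.sub(r"(?<=[A-Za-z])'(?=[A-Za-z])", "", text)
    # Every remaining non-letter (punctuation, digits, newlines) becomes a blank.
    text = re.sub(r"[^A-Za-z]", " ", text)
    # Lower-case, then split on runs of blanks.
    return Counter(text.lower().split())

sample = "The wife's tent, THE Wife's tent: 2 tents."
counts = count_words(sample)
total_words = sum(counts.values())
unique_words = len(counts)
print("total words =", total_words)
print("unique words =", unique_words)
```

To try it on the real data, pass the contents of bible.txt to `count_words` instead of `sample`. Whether it reproduces the expected totals depends on how the file was trimmed.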


glen herrmannsfeldt

2005-02-13, 8:59 pm

David Frank wrote:
> Want to give this a shot?
> Get the text file at: http://patriot.net/~bmcgin/kjvpage.html
> Using a text editor (notepad) remove text before
> Book 01 Genesis and text after last word in Revelations, (amen)
> producing bible.txt file containing
> 4,947,047 chars.
>
> The challenge is to process bible.txt file into words array and count unique
> word occurrences in count array.
> (this challenge was initiated by LR, whose C++ source/results will be posted
> later, along with my own source/results)..


How about if we divide the times by the number of statements to
make it more fair?

-- glen

epc8@juno.com

2005-02-14, 4:00 am


glen herrmannsfeldt wrote:
> David Frank wrote:
>
> How about if we divide the times by the number of statements to
> make it more fair?
>
> -- glen


How about we take programmer time and effort into consideration?

From the time I downloaded the text file, translated it to cr/lf
delimited, removed leading and trailing stuff, wrote a few programs to
explore the data in the file, wrote some scratch programs, and wrote a
working program _in AWK, not Fortran_ it took 1 hour and 20 minutes.
gawk 3.1.3 took 2.9 seconds on a 2.8 GHz P4HT under Windows XP-SP2.

I did take a shortcut. I ignored all text in the first 8 columns except
to count the number of books and add it to the total number of words.

glen herrmannsfeldt

2005-02-14, 4:00 am

David Frank wrote:

> Want to give this a shot?
> Get the text file at: http://patriot.net/~bmcgin/kjvpage.html
> Using a text editor (notepad) remove text before
> Book 01 Genesis and text after last word in Revelations, (amen)
> producing bible.txt file containing
> 4,947,047 chars.


> The challenge is to process bible.txt file into words array and count unique
> word occurrences in count array.
> (this challenge was initiated by LR, whose C++ source/results will be posted
> later, along with my own source/results)..
>
> In this processing, convert all punctuation and numbers to blanks, and
> uppercase to lower.
> The one punctuation exception is ' (within a word), which is deleted, leaving wife's
> as wifes.
>
> When bible.txt is "thusly" processed I expect you should get the following outputs
> total words = 789781
> unique words = 12691


So far, I get:

total words = 789797
unique words = 12700

19 seconds on a 350MHz P-II.

Also, my program doesn't count words before Genesis, instead of
requiring another program to remove them, and took less than 10
minutes to write.

/Genesis/ { flag=1;$0="";}
{
if(!flag) next;
gsub("'","");
gsub("[^A-Za-z]"," ");
t += NF;
for(i=1;i<=NF;i++) words[tolower($i)]++;
}
END {
for(i in words) w++;
print "total words",t,"unique words",w;
}
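
For readers who do not know AWK, the script above can be followed line by line in this Python port. It is a sketch for reading, not a timed entry; the name `count_after_genesis` and the demo lines are invented for illustration. Like the AWK version, it deletes every apostrophe (not only those inside a word) and blanks any line containing "Genesis".

```python
def count_after_genesis(lines):
    """Line-by-line port of the AWK script above (illustrative only)."""
    started = False
    total = 0
    words = {}
    for line in lines:
        if "Genesis" in line:
            # Like the /Genesis/ rule: switch counting on and blank this line.
            started = True
            line = ""
        if not started:
            continue
        line = line.replace("'", "")
        # Keep ASCII letters, turn everything else into a blank.
        line = "".join(ch if ch.isascii() and ch.isalpha() else " " for ch in line)
        fields = line.split()
        total += len(fields)
        for field in fields:
            key = field.lower()
            words[key] = words.get(key, 0) + 1
    return total, len(words)

demo = [
    "Preface text to skip",
    "Book 01 Genesis",
    "In the beginning God",
    "created the heaven.",
]
total, unique = count_after_genesis(demo)
print("total words", total, "unique words", unique)
```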

epc8@juno.com

2005-02-14, 4:00 am


epc8@juno.com wrote:
> glen herrmannsfeldt wrote:
>
> How about we take programmer time and effort into consideration?
>
> delimited, removed leading and trailing stuff, wrote a few programs to
> explore the data in the file, wrote some scratch programs, and wrote a
> working program _in AWK, not Fortran_ it took 1 hour and 20 minutes.
> gawk 3.1.3 took 2.9 seconds on a 2.8 GHz P4HT under Windows XP-SP2.
>
> I did take a shortcut. I ignored all text in the first 8 columns except
> to count the number of books and add it to the total number of words.


Please excuse me replying to a previous post. A brute force character
by character + linear search program in Fortran 77 took about the same
time to write. It ran in 18.8 seconds using g77 (gcc 2.95) MinGW on the
same equipment as before. Just reading the data file without any
processing took 2.5 seconds. Note that the run time of my AWK program
is not much more than that. This supports the idea that the algorithms
and data structures used to implement the search for duplicate words
dominate the run time far more than the choice of language.
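
The point about data structures can be made concrete with a small sketch. It is in Python rather than Fortran 77, and the function names are invented for illustration. It counts the same words twice, once with a brute-force linear search that tallies its key comparisons and once with a hash table.

```python
def count_linear(words):
    """Brute force: scan the list of known words for every input word."""
    keys, counts, comparisons = [], [], 0
    for word in words:
        for i, key in enumerate(keys):
            comparisons += 1
            if key == word:
                counts[i] += 1
                break
        else:
            # Not found after scanning everything: add a new entry.
            keys.append(word)
            counts.append(1)
    return dict(zip(keys, counts)), comparisons

def count_hashed(words):
    """Hash table: each word is located in roughly constant time."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

words = "in the beginning god created the heaven and the earth".split()
linear_counts, comparisons = count_linear(words)
hashed_counts = count_hashed(words)
print(len(words), "words needed", comparisons, "comparisons with linear search")
```

With thousands of unique words, each new word in the linear version is compared against every word stored so far, which is where a brute-force program would be expected to spend most of its time.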

glen herrmannsfeldt

2005-02-14, 4:00 am

epc8@juno.com wrote:

> glen herrmannsfeldt wrote:


(snip)

> How about we take programmer time and effort into consideration?


> delimited, removed leading and trailing stuff, wrote a few programs to
> explore the data in the file, wrote some scratch programs, and wrote a
> working program _in AWK, not Fortran_ it took 1 hour and 20 minutes.
> gawk 3.1.3 took 2.9 seconds on a 2.8 GHz P4HT under Windows XP-SP2.


It was exactly AWK that I was thinking of. The associative arrays
of awk make many problems like this much easier. How many
lines long is your AWK program?

Studies show that programmer productivity in lines per (productivity
unit such as days or dollars) is about the same independent of language.
Assembler or high-level are still about the same.

-- glen

epc8@juno.com

2005-02-14, 4:01 pm

>How many lines long is your AWK program?

17 lines - of course this depends on style and some optimizations I
missed. :-).

Greg Lindahl

2005-02-14, 8:58 pm

In article <420f46cf$0$38864$ec3e2dad@news.usenetmonster.com>,
David Frank <dave_frank@hotmail.com> wrote:

>Want to give this a shot?


Please don't feed the Troll. This benchmark is as meaningless as his
Fortran benchmark.

-- greg

David Frank

2005-02-15, 9:00 am


"David Frank" <dave_frank@hotmail.com> wrote in message
news:420f46cf$0$38864$ec3e2dad@news.usenetmonster.com...
>
> The challenge is to process bible.txt file into words array and count
> unique word occurrences in count array.
> (this challenge was initiated by LR, whose C++ source/results will be
> posted later, along with my own source/results)..
>



Here is my Fortran source/results:
http://home.cfl.rr.com/davegemini/wp_bible.f90


David Frank

2005-02-17, 9:02 am


"David Frank" <dave_frank@hotmail.com> wrote in message
news:4211af61$0$38830$ec3e2dad@news.usenetmonster.com...
>
> "David Frank" <dave_frank@hotmail.com> wrote in message
> news:420f46cf$0$38864$ec3e2dad@news.usenetmonster.com...
>
>
> Here is my Fortran source/results:
> http://home.cfl.rr.com/davegemini/wp_bible.f90
>


Above runs in 2.14 sec
Below runs in 1.14 sec which I believe is fastest solution yet posted for
ANY language.

http://home.cfl.rr.com/davegemini/wc_file.f90


beliavsky@aol.com

2005-02-17, 9:02 am

David Frank wrote:

>
> Above runs in 2.14 sec
> Below runs in 1.14 sec which I believe is fastest solution yet posted for
> ANY language.
>
> http://home.cfl.rr.com/davegemini/wc_file.f90


I do some text analysis in Fortran, so I tried out your program.
Compiling the code with CVF 6.1 with the -optimize:5 option on a 3.2
GHz Pentium 4, the end of the output (removing the 2.8 GHz you
hard-coded) is

21 corruption
1 unrebukable
1 dalphon
c:\data\bible.txt
total words = 789953
unique words = 12727
collisions = 15704
0.56 Sec

However, the timethis utility reports an elapsed time of 2.2 s. Btw,
you can use the cpu_time function of Fortran 95 instead of manipulating
the output of date_and_time .
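
As background on why a program's own timer and an external tool can disagree, here is a small Python sketch of the difference between CPU time and elapsed (wall-clock) time. It does not claim to explain the 0.56 s versus 2.2 s gap above; `time.process_time` and `time.perf_counter` merely stand in for a CPU timer and a wall-clock timer.

```python
import time

wall_start = time.perf_counter()     # elapsed (wall-clock) timer
cpu_start = time.process_time()      # CPU time used by this process

time.sleep(0.2)                      # waiting: elapsed time passes, almost no CPU is used
checksum = sum(i * i for i in range(200_000))   # computing: uses both

cpu_used = time.process_time() - cpu_start
wall_used = time.perf_counter() - wall_start
print(f"cpu {cpu_used:.2f} s, elapsed {wall_used:.2f} s")
```

An external timer such as timethis also sees process start-up, file I/O waits and console output, none of which a timer started inside the program necessarily covers.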

When I use Intel Visual Fortran 8.0.047 I get the results

1 dictin
9 lix
5 cill
c:\data\bible.txt
total words = 847920
unique words = 397
collisions = 546
0.17 Sec

The results are different, so something is fishy.

One respect in which your program is nonstandard is that an initialized
character string also appears in a DATA statement. The following code

program xx
implicit none
character(3) :: c = ' '
data c(1:2) /'ab'/
print*,"c=",c
end program xx

produces an error message

In file xinit_char.f90:4

data c(1:2) /'ab'/
1
Error: Variable 'c' at (1) already has an initialization

from g95 and gfortran and a similar warning from Lahey/Fujitsu Fortran
95.

Intel Fortran compiles without warnings, even with -stand:f95
specified, and produces output

c=

CVF 6.1 also compiles without warnings and gives output

c=ab

There appears to be a bug in Intel Fortran.

David Frank

2005-02-17, 4:05 pm


<beliavsky@aol.com> wrote in message
news:1108641681.734347.117660@z14g2000cwz.googlegroups.com...
> David Frank wrote:
>
>
> I do some text analysis in Fortran, so I tried out your program.
> Compiling the code with CVF 6.1 with the -optimize:5 option on a 3.2
> GHz Pentium 4, the end of the output (removing the 2.8 GHz you
> hard-coded) is
>
> 21 corruption
> 1 unrebukable
> 1 dalphon
> c:\data\bible.txt
> total words = 789953
> unique words = 12727
> collisions = 15704
> 0.56 Sec
>


It's well established (from AWK, Perl, Python, Cobol runs) that bible.txt has
total words = 789781
unique words = 12691

I am using CVF 6.6C, the last version ever of this compiler?
Remove the data xlat statements and just add the lower case 'a..z' into a
long xlat init string,
and see if that's why you don't get the expected word counts.

<snip testing results with other compilers>


David Frank

2005-02-17, 4:05 pm


<beliavsky@aol.com> wrote in message
news:1108641681.734347.117660@z14g2000cwz.googlegroups.com...
> David Frank wrote:
>
>


<snip already replied to>

>
> you can use the cpu_time function of Fortran 95 instead of manipulating
> the output of date_and_time .
>


I confirmed that cpu_time produces approximately the same timing results x.xx as
date_and_time with less processing syntax, so I have updated my source at
the above link to use cpu_time, plus my previous suggestion to use a long xlat
init string.

It would appear that with cpu_time, subtracting 2 default real
results near
86399.999 sec means the difference isn't going to be accurate to 0.0x sec,
whereas using date_and_time is probably more accurate.
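
The rounding concern can be checked with a short Python sketch. It assumes the default real is IEEE single precision, and that the timer really does return values as large as 86399.999; whether cpu_time ever does so depends on the compiler. The helper name `to_single` is made up for illustration.

```python
import struct

def to_single(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Two hypothetical timer readings 0.009 s apart, stored as single precision.
start = to_single(86399.990)
stop = to_single(86399.999)

# Adjacent single-precision values between 65536 and 131072 are 2**-7 apart.
spacing = 2.0 ** -7
print("measured difference:", stop - start)
print("spacing near 86400 :", spacing)
```

Under these assumptions each reading can only land on a grid of about 0.008 s, so a difference of two such readings cannot resolve hundredths of a second reliably. Readings near zero do not have this problem.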



David Frank

2005-02-17, 4:05 pm


<beliavsky@aol.com> wrote in message
news:1108641681.734347.117660@z14g2000cwz.googlegroups.com...
>
> 21 corruption
> 1 unrebukable
> 1 dalphon
> c:\data\bible.txt
> total words = 789953
> unique words = 12727
> collisions = 15704
> 0.56 Sec
>


You have added words to the Holy Bible, and thats a sin <g>
You are supposed to edit out text before
Book 01 Genesis
and after Revelations final "amen"

This was spelled out in my 1st message in this topic..



Rich Townsend

2005-02-17, 4:05 pm

David Frank wrote:
> <beliavsky@aol.com> wrote in message
> news:1108641681.734347.117660@z14g2000cwz.googlegroups.com...
>
>
>
> You have added words to the Holy Bible, and thats a sin <g>


Also, the Holy Bible makes a poor benchmark, on account of its
intrinsically-serial text. Much better results can be achieved using the
Necronomicon.

beliavsky@aol.com

2005-02-18, 3:59 am

Removing the extra lines from the data file, and using the latest
version of your program, the results are now the same as yours, and on
my 3.2 GHz PC the times in seconds, using -optimize:5, are now

            cpu_time   timethis
CVF 6.1       0.58       2.14
ifort 8.0     0.31       1.78

At run-time, LF95 stops with the message

FTELL function cannot be executed for a unit connected for BINARY I/O
(unit= 1).

The ishc function is not standard F95 but is supported by all the
commercial compilers I tried (CVF, IFORT, LF95, Absoft, Salford) but
not g95 and gfortran. Maybe g95 and gfortran should support it.

Greg Lindahl

2005-02-18, 3:59 am

In article <1108695618.625577.245670@f14g2000cwb.googlegroups.com>,
<beliavsky@aol.com> wrote:

>The ishc function is not standard F95 but is supported by all the
>commercial compilers I tried (CVF, IFORT, LF95, Absoft, Salford) but
>not g95 and gfortran. Maybe g95 and gfortran should support it.


Namespace pollution is a bad thing, especially when you're talking
about a function which exists in the standard with a different name,
ISHFTC(). Once you get away from front-ends that came from VMS or
tried hard to be like VMS, you won't find ISHC() to be common.

-- greg



Richard E Maine

2005-02-18, 3:59 pm

In article <42155d7e$1@news.meer.net>, lindahl@pbm.com (Greg Lindahl)
wrote:

> In article <1108695618.625577.245670@f14g2000cwb.googlegroups.com>,
> <beliavsky@aol.com> wrote:
>
>
> Namespace pollution is a bad thing, especially when you're talking
> about a function which exists in the standard with a different name,
> ISHFTC(). Once you get away from front-ends that came from VMS or
> tried hard to be like VMS, you won't find ISHC() to be common.


I agree. Aside from the namespace pollution, which could be solved,
better to spend effort in extensions that actually add functionality.

Code that uses ishc() isn't particularly portable anyway. Yes, there are
multiple compilers that support it, but as Greg says, the support base
isn't particularly wide.

--
Richard Maine | Good judgment comes from experience;
email: my first.last at org.domain | experience comes from bad judgment.
org: nasa, domain: gov | -- Mark Twain

David Frank

2005-02-19, 8:56 am


"David Frank" <dave_frank@hotmail.com> wrote in message
news:4214594a$0$39277$ec3e2dad@news.usenetmonster.com...

ATTN: Awk, C++, Cobol, Fortran, Perl, Python, Rexx, Etc respondees to this
challenge, thanks..
If you update your source/response to create following outputs, I will
enter it in a table of results.

---
The requirements are to time the execution of reading the bible.txt file and
producing a sorted list of unique words and their counts. ALL non-alpha
chars are to be treated as blanks, except a quote within a word, e.g.
Wife's, in which case it's deleted and
the word becomes wifes. Upper-case is lower-cased.
Document your results by posting the following info extracted from your
output file.
8177 a
319 aaron
..........
5 zurishaddai
1 zuzims
bible.txt
total words = 789781
unique words = 12691
xx.xx Sec ?.?? Ghz CPU ID
+ any further distinguishing info , e.g.
language/compiler/version/programmer's name
---
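
The requested report layout can be sketched as follows, in Python for brevity. The counts here are invented sample data, not results from bible.txt, and the name `format_report` is hypothetical.

```python
def format_report(counts):
    """Return 'count word' lines in alphabetical word order."""
    return [f"{counts[word]} {word}" for word in sorted(counts)]

# Made-up sample counts; a real run would fill this from bible.txt.
counts = {"zuzims": 1, "a": 3, "aaron": 2}
report = format_report(counts)
print("\n".join(report))
print("total words =", sum(counts.values()))
print("unique words =", len(counts))
```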

My Fortran source/response that meets above requirements can be viewed:
http://home.cfl.rr.com/davegemini/wc_file.f90

DF









David Frank

2005-02-20, 8:57 am


<beliavsky@aol.com> wrote in message
news:1108695618.625577.245670@f14g2000cwb.googlegroups.com...
> Removing the extra lines from the data file, and using the latest
> version of your program, the results are now the same as yours, and on
> my 3.2 GHz PC the times in seconds, using -optimize:5, are now
>
>             cpu_time   timethis
> CVF 6.1       0.58       2.14
> ifort 8.0     0.31       1.78
>
> At run-time, LF95 stops with the message
>
> FTELL function cannot be executed for a unit connected for BINARY I/O
> (unit= 1).
>
> The ishc function is not standard F95 but is supported by all the
> commercial compilers I tried (CVF, IFORT, LF95, Absoft, Salford) but
> not g95 and gfortran. Maybe g95 and gfortran should support it.
>


I now use ISHFTC(odd,5,hashbits) with hashbits = 17 (1st time I have
ever used the optional 3rd arg)
and collisions are now reduced to 4k from 15k. That, along with adding a
sorted output, reduces the 2.8 GHz Pentium runtime to 0.88 sec (don't ask why sorting
reduces time)..

See if your time is reduced with the code below on your 3.2 GHz Pentium; a 2.5 GHz
G5 is pushing your result using Rexx.
I would like Fortran to have the fastest reported time for this benchmark.

http://home.cfl.rr.com/davegemini/wc_file.f90
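
For readers unfamiliar with the three-argument circular shift, here is a Python sketch of what the standard ISHFTC(i, shift, size) does for non-negative values: it rotates only the rightmost `size` bits. The `toy_hash` function is a hypothetical rotate-and-xor hash added for illustration; it is not the hash used in wc_file.f90, which is not shown in this thread.

```python
def ishftc(i, shift, size):
    """Circularly shift the rightmost `size` bits of non-negative `i`.

    Positive `shift` rotates left, negative rotates right; higher bits are kept.
    """
    mask = (1 << size) - 1
    low = i & mask
    shift %= size                      # also maps a right rotate onto a left one
    rotated = ((low << shift) | (low >> (size - shift))) & mask
    return (i & ~mask) | rotated

HASHBITS = 17                          # value quoted in the post

def toy_hash(word):
    """Hypothetical rotate-and-xor hash for ASCII words; NOT the one in wc_file.f90."""
    h = 0
    for ch in word:
        h = ishftc(h, 5, HASHBITS) ^ ord(ch)
    return h                           # stays within 0 .. 2**17 - 1

print(ishftc(1, 5, HASHBITS))          # bit 0 moves to bit 5
print(ishftc(1 << 12, 5, HASHBITS))    # bit 12 wraps around to bit 0
print(toy_hash("zurishaddai"))
```

Restricting the rotation to 17 bits keeps every intermediate value inside a table of 2**17 = 131072 slots, so no separate masking step is needed to turn the hash into an index.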


David Frank

2005-02-20, 8:57 am


"David Frank" <dave_frank@hotmail.com> wrote in message
news:421855dd$0$38884$ec3e2dad@news.usenetmonster.com...
>


I just re-confirmed that doing a quicksort of the output word list reduces
my runtime from
1.20 sec to 0.88 sec Can anyone look at below code and give a rational
answer?

The quicksort is fast at 0.016 sec but thats a positive runtime not a
twilight zone negative runtime..

http://home.cfl.rr.com/davegemini/wc_file.f90



glen herrmannsfeldt

2005-02-20, 8:57 pm

David Frank wrote:
> "David Frank" <dave_frank@hotmail.com> wrote in message
> news:421855dd$0$38884$ec3e2dad@news.usenetmonster.com...


> I just re-confirmed that doing a quicksort of the output word list reduces
> my runtime from
> 1.20 sec to 0.88 sec Can anyone look at below code and give a rational
> answer?



This file is way too small to make a good benchmark.

I once did some timing reading a 600MB file on a network mounted
disk and realized that the times were faster than the network transfer
rate. Then I considered that on a 4GB machine the disk cache could
easily hold an entire 600MB file.

Most likely your file is in the cache by the second time, so it
is faster.
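
One way to make such caching effects visible is to time several passes over the same file and report each pass separately. The Python sketch below shows only the measuring method: it uses a temporary stand-in file (which is probably already cached, having just been written), and the name `time_reads` is invented for illustration.

```python
import os
import tempfile
import time

def time_reads(path, repeats=3):
    """Read the whole file several times; return (seconds per pass, byte count)."""
    timings = []
    size = 0
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, "rb") as handle:
            size = len(handle.read())
        timings.append(time.perf_counter() - start)
    return timings, size

# Stand-in data file; point `path` at bible.txt to try the real thing.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"in the beginning " * 100_000)
    path = tmp.name

timings, size = time_reads(path)
os.remove(path)
print(size, ["%.4f" % t for t in timings])
```

If the first pass is much slower than the later ones, the later passes are most likely being served from the operating system's cache rather than from disk.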

-- glen

David Frank

2005-02-26, 3:59 pm

"David Frank" <dave_frank@hotmail.com> wrote in message
news:420f46cf$0$38864$ec3e2dad@news.usenetmonster.com...

> Want to give this a shot?
> Get the text file at: http://patriot.net/~bmcgin/kjvpage.html
> Using a text editor (notepad) remove text before
> Book 01 Genesis and text after last word in Revelations, (amen)
> producing bible.txt file containing
> 4,947,047 chars.
>


Some cross-language reporting from comp.lang.cobol, comp.lang.rexx


"Rick Smith" <ricksmith@mfi.net> wrote in message
news:111sbrck2cbim23@corp.supernews.com...
>
>
> My latest version runs in 0.44s.
>


Can you send me an executable?

re: files archived at:

http://home.cfl.rr.com/davegemini/wc_file.c <- with timing calls
inserted by LR for source posted in comp.lang.rexx by Ian Collier (did'ya
follow that?)

I have compiled the above with MSVC 6.0 and it runs on my 2.8 GHz PC in 0.265 sec,
which makes his C
solution the current fastest.

Otoh,
I have just lowered my CVF6.6c Fortran runtime from 0.89 sec to 0.29 sec
challenging again for the lead,
and I suspect a significantly faster time will result if someone compiles my
source below with Intel Fortran and sends me their executable.

http://home.cfl.rr.com/davegemini/wc_file.f90

AND as I told LR in the beginning, his C++ challenge using MAP syntax
likely has lots of overhead,
and now my guess is confirmed as his version compiled and run on my 2.8ghz
= 2.9 sec
which is 10x slower on my computer than my 0.29sec




David Frank

2005-02-27, 4:00 pm


"David Frank" <dave_frank@hotmail.com> wrote in message
news:42209193$0$39265$ec3e2dad@news.usenetmonster.com...
> "David Frank" <dave_frank@hotmail.com> wrote in message
> news:420f46cf$0$38864$ec3e2dad@news.usenetmonster.com...
>
>

http://home.cfl.rr.com/davegemini/wc_file.c

I have compiled above with MSVC 6.0 and it runs on my 2.8ghz = 0.265 sec

This morning's incarnation of my CVF Fortran solution (with the help of a
kinda obscure extension UNION MAP)
now matches the C solution's 0.265 sec
Still waiting for a reader with latest Intel Fortran to try below.

http://home.cfl.rr.com/davegemini/wc_file.f90



LR

2005-02-27, 4:00 pm

David Frank wrote:


> AND as I told LR in the beginning, his C++ challenge using MAP syntax
> likely has lots of overhead,
> and now my guess is confirmed as his version compiled and run on my 2.8ghz
> = 2.9 sec
> which is 10x slower on my computer than my 0.29sec


But that was never the point. The point was to find out if you could
duplicate the _functionality_ of std::cin >> line and std::map.

Can you?

LR