
Expression as intrinsic argument question.
Paul Van Delst

2004-03-27, 12:18 am

Hello,

I have a line of code like so:

Convolved_Tau = SUM( Tau * Response ) * FREQUENCY_INTERVAL

where both "Tau" and "Response" are arrays of size that can range from 10000 to 260000.
Total processing time for the application with the above line is about 1.05 hours for a
single run (I need to do 2352 separate runs).

Does anyone have a feel if it would be faster to do something like

TauTmp = Tau * Response
Convolved_Tau = SUM( TauTmp ) * FREQUENCY_INTERVAL

??

(assuming TauTmp is dimensioned correctly of course)

I figured either is producing a copy - one implicitly, the other explicitly - so the time
should be the same. Previously buggy code did the following:

Tau = Tau * Response
Convolved_Tau = SUM( Tau ) * FREQUENCY_INTERVAL

and the total time for a single run was 0.8 hours. The ~15 minute difference compared to
the first example is significant given the number of runs. (BTW, the code is buggy because
I need to reuse the original "Tau" data.)

Is this something that compilers (on same and different platforms) will treat totally
differently or are there things that one can expect compilers to do similarly?

Testing of this is a bit laborious, hence my initial lazy request for info here. :o)

Thanks,

paulv

Michel OLAGNON

2004-03-27, 12:18 am


In article <c3sblc$e2q$1@news.nems.noaa.gov>, Paul Van Delst <paul.vandelst@noaa.gov> writes:
>Hello,
>
>I have a line of code like so:
>
> Convolved_Tau = SUM( Tau * Response ) * FREQUENCY_INTERVAL
>
>where both "Tau" and "Response" are arrays of size that can range from 10000 to 260000.
>Total processing time for the application with the above line is about 1.05 hours for a
>single run (I need to do 2352 separate runs).



Did you try
Convolved_Tau = DOT_PRODUCT( Tau, Response ) * FREQUENCY_INTERVAL

By the way, are you sure that that line is the culprit?
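
A minimal sketch of that suggestion, assuming default-real arrays and a made-up value for FREQUENCY_INTERVAL (the program name, the array size and the constant are illustrative, not taken from the original application). For real arrays, SUM(a*b) and DOT_PRODUCT(a,b) compute the same quantity, though rounding may differ slightly depending on how the compiler orders the additions:

```fortran
program dot_vs_sum
  ! Compare SUM(Tau*Response) with DOT_PRODUCT(Tau,Response).
  implicit none
  integer, parameter :: n = 10000                   ! lower end of the sizes in the question
  real, parameter    :: frequency_interval = 0.5    ! hypothetical value
  real :: tau(n), response(n)
  real :: via_sum, via_dot

  call random_number(tau)
  call random_number(response)

  via_sum = sum(tau * response) * frequency_interval
  via_dot = dot_product(tau, response) * frequency_interval

  print *, "SUM         :", via_sum
  print *, "DOT_PRODUCT :", via_dot
  print *, "relative difference:", abs(via_sum - via_dot) / abs(via_sum)
end program dot_vs_sum
```

Neither form modifies Tau, so the original data stays available for reuse.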






Richard Maine

2004-03-27, 12:18 am

Paul Van Delst <paul.vandelst@noaa.gov> writes:

> Does anyone have a feel if it would be faster to do something like
>
> TauTmp = Tau * Response
> Convolved_Tau = SUM( TauTmp ) * FREQUENCY_INTERVAL

....

> Is this something that compilers (on same and different platforms)
> will treat totally differently or are there things that one can expect
> compilers to do similarly?

.....

(I assume this happens somewhere inside a loop; that statement alone
couldn't plausibly take within orders of magnitude of the time in
question).

This is highly compiler dependent. It is even highly version dependent
for a particular compiler vendor. You are basically asking about how
smart the compiler's optimizer is at eliminating array temporaries.

In the abstract, one can guess that if the compiler allocates a
temporary array for each execution of this statement, then that will be
slower than using your preallocated one. But whether the compiler
does that or not probably varies.

> Testing of this is a bit laborious,...


Alas, it's close to the only way to get a good answer. Well, perhaps
other than bugging the vendors to tell you whether their compilers are
that good.

You could do a simplified test that doesn't take an hour to run.
For that matter, if you want to test for the generation of a temporary
array (assuming that to be the main determiner of the time question,
which is probably a reasonable assumption, though I wouldn't
absolutely guarantee it), there are ways to "cheat" and test that.
Make a short program that will exceed known memory limits if an
array temp is generated, but won't otherwise.

Oh, and I notice you left out an important coding possibility.
The statement shown doesn't in principle need an array temporary
at all. Using an array temporary increases the required memory
bandwidth, independent of the question of how the temporary is
allocated. By making the temporary explicit, you are making it
harder for an optimizer to recognize that maybe it could ignore
your explicit array temporary. I'd be modestly surprised if any
current optimizers were up to that job (but only moderately,
because I'm not an expert in optimization technology). The
original code doesn't have an explicit array temporary, but might
plausibly cause some compilers to make one (I really don't know
the odds).

But there is a way to write this that pretty much guarantees
that there will be no array temporary. Do the summation in a loop
instead of with the SUM intrinsic. Some people get so hung up on
whole array syntax that they forget that loops still exist. In
abstract, I think the SUM intrinsic is simpler to read in this
case and it might well be as efficient, depending on the compiler.
So my initial inclination would be to write it as in your original.
But if performance is enough of an issue that compiler-specific
tuning is justified, then don't forget the DO loop option.
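
A minimal sketch of the DO loop option, assuming default-real arrays and a made-up value for FREQUENCY_INTERVAL (the program name, array size and constant are illustrative only). The product is accumulated element by element into a scalar, so no array-sized temporary is needed and Tau is left untouched:

```fortran
program loop_sum
  ! Accumulate Tau*Response in a scalar instead of using SUM on an array expression.
  implicit none
  integer, parameter :: n = 10000                   ! illustrative size
  real, parameter    :: frequency_interval = 0.5    ! hypothetical value
  real    :: tau(n), response(n)
  real    :: acc, convolved_tau
  integer :: i

  call random_number(tau)
  call random_number(response)

  acc = 0.0
  do i = 1, n
     acc = acc + tau(i) * response(i)
  end do
  convolved_tau = acc * frequency_interval

  print *, "loop result:", convolved_tau
  print *, "SUM result :", sum(tau * response) * frequency_interval
end program loop_sum
```

The two printed values may differ in the last digits, since the order of the additions is not necessarily the same.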

Again, all my generalizations are worth very little compared to
testing. I've been surprised by test results many times before,
so I'm not about to tell you that I couldn't be surprised again.

--
Richard Maine | Good judgment comes from experience;
email: my first.last at org.domain | experience comes from bad judgment.
org: nasa, domain: gov | -- Mark Twain

Jan C. Vorbrüggen

2004-03-27, 12:18 am

First, let me second all that Richard says, in the abstract. In addition
and in particular:

> Oh, and I notice you left out an important coding possibility.
> The statement shown doesn't in principle need an array temporary
> at all. Using an array temporary increases the required memory
> bandwidth, independent of the question of how the temporary is
> allocated. By making the temporary explicit, you are making it
> harder for an optimizer to recognize that maybe it could ignore
> your explicit array temporary. I'd be modestly surprised if any
> current optimizers were up to that job (but only moderately,
> because I'm not an expert in optimization technology).


A substantial part of 187.facerec (yes, again) is exactly that piece of
code. I tested this on several platforms - IIRC, on Sun with Sun's V2
compiler and on Alpha/Linux with the DEC compiler, as well as with DVF.
I tested DOT_PRODUCT, SUM, explicit loops (I was using two-dimensional
arrays, with and without casting to one-dimensional arrays) and SDOT
from the vendor's performance library. Surprise: in all cases, SUM was
the fastest approach. The runner-up varied among the platforms.

Summary:
- there are optimizers that produce good code for SUM in this case, in
all likelihood _not_ producing temporaries
- there's no substitute for testing.

I don't think I still have my original testing routines, but it should be
fairly easy to whip up some code for that.

I also second another poster's suggestion to make sure this is really the
place where your program is spending its time.

Jan

John Harper

2004-03-27, 12:18 am

In article <c3sblc$e2q$1@news.nems.noaa.gov>,
Paul Van Delst <paul.vandelst@noaa.gov> wrote:
>Hello,
>
>I have a line of code like so:
>
> Convolved_Tau = SUM( Tau * Response ) * FREQUENCY_INTERVAL
>
>where both "Tau" and "Response" are arrays


The NAG f95 version 4.2 compiler has a bug in this area (which NAG have
fixed in version 5.0, but that's not yet available on some platforms).
The following simple but sub-optimal way to evaluate a Taylor series
caused a compile-time crash (DP was a parameter = kind(1.0d0)):

REAL(DP) FUNCTION test(s)
  REAL(DP),INTENT(IN)::s
  REAL(DP),PARAMETER ::coefs(8)=1
  IF(s>0.3_DP) STOP 'Use another method for moderate or large s!'
  test = sum(coefs*s**(/(2*i-1,i=1,size(coefs))/))
END FUNCTION test
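
For comparison, a sketch of one common alternative for evaluating such a series, Horner's scheme in s**2. This is an assumption about what a less "sub-optimal" version might look like, not something taken from the post; the names series_sum and series_horner are hypothetical:

```fortran
program horner_demo
  ! Evaluate sum_{i=1..8} coefs(i) * s**(2*i-1) in two ways.
  implicit none
  integer, parameter  :: dp = kind(1.0d0)
  real(dp), parameter :: coefs(8) = 1.0_dp
  real(dp) :: s

  s = 0.2_dp
  print *, "array expression:", series_sum(s)
  print *, "Horner form     :", series_horner(s)

contains

  function series_sum(s) result(val)
    ! Same expression as in the post: powers computed independently, then summed.
    real(dp), intent(in) :: s
    real(dp) :: val
    integer  :: i
    val = sum(coefs * s**(/ (2*i-1, i=1,size(coefs)) /))
  end function series_sum

  function series_horner(s) result(val)
    ! s * (c1 + s2*(c2 + s2*(c3 + ...))) with s2 = s*s; one multiply-add per term.
    real(dp), intent(in) :: s
    real(dp) :: val, s2
    integer  :: i
    s2  = s * s
    val = coefs(size(coefs))
    do i = size(coefs) - 1, 1, -1
       val = coefs(i) + s2 * val
    end do
    val = s * val
  end function series_horner

end program horner_demo
```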

John Harper, School of Mathematical and Computing Sciences,
Victoria University, PO Box 600, Wellington, New Zealand
e-mail john.harper@vuw.ac.nz phone (+64)(4)463 5341 fax (+64)(4)463 5045

Richard Maine

2004-03-27, 12:18 am

harper@mcs.vuw.ac.nz (John Harper) writes:

>... s**(/(2 ...


Not related to the original topic, but....

Looking forward to the square brackets (by whatever name :-))
in f2003. I swear my eyes kept wanting to parse the "/" above
as division with the numerator missing. I'll find it so much
less jarring to read as

... s**[(2 ...
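
A small sketch of the two array constructor spellings side by side, assuming a compiler that accepts the Fortran 2003 square brackets (the program and variable names are illustrative):

```fortran
program bracket_constructor
  ! The same implied-DO array constructor in Fortran 90/95 and Fortran 2003 syntax.
  implicit none
  integer :: i
  integer :: old_style(4), new_style(4)

  old_style = (/ (2*i-1, i=1,4) /)   ! Fortran 90/95 delimiters
  new_style = [ (2*i-1, i=1,4) ]     ! Fortran 2003 delimiters

  ! Both arrays hold the odd numbers 1, 3, 5, 7.
  print *, old_style
  print *, new_style
  print *, "identical:", all(old_style == new_style)
end program bracket_constructor
```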

--
Richard Maine | Good judgment comes from experience;
email: my first.last at org.domain | experience comes from bad judgment.
org: nasa, domain: gov | -- Mark Twain

beliavsky@aol.com

2004-03-27, 12:18 am

I wrote a simple program to compute the dot product of two vectors
using various methods. For Compaq Visual Fortran 6.6, the execution
time is basically the same for all methods. For Lahey/Fujitsu Fortran
95 5.70c, the same is true, except that storing the element-by-element
product of the two vectors and then summing takes more than twice as
long. I ought to increase niter to do a more precise test, but I doubt
the results would change.

Times in seconds:

ich  method             Lahey    CVF
1    SUM                 5.15   5.72
2    DOT_PRODUCT         5.20   5.60
3    loop                5.28   5.58
4    product, then SUM  11.56   5.47
5    BLAS                5.05   5.59


program xdot
! compare speed of various ways of computing the dot product
implicit none
integer :: i,irate,t0,t1,t01,ich,iter
integer, parameter :: n = 10000000, inc = 1, niter = 100
real :: xx(n),yy(n),xy(n),aa(n),zz
print*,"n, niter =",n,niter
call random_seed()
do ich=1,5
   t01 = 0
   do iter=1,niter
      zz = 0.0
      call random_number(xx)
      call random_number(yy)
      call system_clock(count=t0)
      if (ich == 1) then ! use SUM
         zz = sum(xx*yy)
      else if (ich == 2) then ! use DOT_PRODUCT
         zz = dot_product(xx,yy)
      else if (ich == 3) then ! use loop
         do i=1,n
            zz = zz + xx(i)*yy(i)
         end do
      else if (ich == 4) then ! compute product, then SUM
         xy = xx*yy
         zz = sum(xy)
      else if (ich == 5) then ! call BLAS
         zz = sdot(n,xx,inc,yy,inc)
      end if
      call system_clock(count=t1,count_rate=irate)
      t01 = t01 + t1 - t0
   end do
   write (*,*) "ich, time =",ich,real(t01)/irate
   print*,"zz =",zz
end do
contains
function sdot(n,sx,incx,sy,incy) result(value)
! Fortran 77 BLAS converted to free format
! forms the dot product of two vectors.
real , intent(in) :: sx(:),sy(:)
integer, intent(in) :: n,incx,incy
real :: value
real :: stemp
integer :: i,ix,iy,m,mp1
stemp = 0.0e0
value = 0.0e0
if (n <= 0) return
if(incx == 1 .and. incy == 1) go to 20
! code for unequal increments or equal increments
! not equal to 1
ix = 1
iy = 1
if (incx < 0) ix = (-n+1)*incx + 1
if (incy < 0) iy = (-n+1)*incy + 1
do i = 1,n
   stemp = stemp + sx(ix)*sy(iy)
   ix = ix + incx
   iy = iy + incy
end do
value = stemp
return
! code for both increments equal to 1
! clean-up loop
20 m = mod(n,5)
if (m == 0) go to 40
do i = 1,m
   stemp = stemp + sx(i)*sy(i)
end do
if (n < 5) go to 60
40 mp1 = m + 1
do i = mp1,n,5
   stemp = stemp + sx(i)*sy(i) + sx(i+1)*sy(i+1) + &
           sx(i+2)*sy(i+2) + sx(i+3)*sy(i+3) + sx(i+4)*sy(i+4)
end do
60 value = stemp
return
end function sdot
end program xdot

Joost VandeVondele

2004-03-27, 12:18 am

> Is this something that compilers (on same and different platforms) will treat totally
> differently or are there things that one can expect compilers to do similarly?
>
> Testing of this is a bit laborious, hence my initial lazy request for info


Some testing on my code (where one needs to copy a real array into
a complex array) shows interesting behavior wrt. ALLOCATABLE/POINTER
arrays, and also some effect of array syntax vs do loops.

Each result line below gives five numbers, one for each of five
different ways of doing the 'copy'. In principle the 10 methods (five
ways, each with allocatable or pointer arrays) should be very similar
in timings; in reality, one method can be up to 10 times slower for
the same operation. Only a few months ago, array syntax could be much
slower than do loops, but this seems to have changed.

The operation is: Z(1:N)=CMPLX(R(1:N),0.0_dbl,dbl)

f77-like       ! subroutine + DO & z(1,i)=r1(i) ; z(2,i)=0._dbl
inline_do      ! inline DO & Z(I3)=CMPLX(R(I),KIND=dbl)
function_do    ! behind a subroutine call
function_array ! behind a subroutine Z(1:N)=CMPLX(R(1:N),0.0_dbl,dbl)
inline_array   ! inline Z(1:N)=CMPLX(R(1:N),0.0_dbl,dbl)

Z and R can be either allocatable or pointer.
(send email for the 190 line code)
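
A minimal sketch of two of those variants (an inline DO loop and inline array syntax) with allocatable arrays. It is not the 190 line test code, only an illustration that both forms perform the same copy; the program name, array size and variable names are assumptions:

```fortran
program real_to_complex
  ! Copy a real array into a complex array with a DO loop and with array syntax.
  implicit none
  integer, parameter :: dbl = kind(1.0d0)
  integer, parameter :: n = 1000                 ! illustrative size
  real(dbl),    allocatable :: r(:)
  complex(dbl), allocatable :: z_loop(:), z_array(:)
  integer :: i

  allocate(r(n), z_loop(n), z_array(n))
  call random_number(r)

  ! inline DO loop
  do i = 1, n
     z_loop(i) = cmplx(r(i), kind=dbl)
  end do

  ! inline array syntax
  z_array(1:n) = cmplx(r(1:n), 0.0_dbl, dbl)

  print *, "same result:", all(z_loop == z_array)
end program real_to_complex
```

Timing each form separately, and repeating with POINTER instead of ALLOCATABLE, would be the next step toward reproducing the kind of comparison reported below.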

E.g. (five timings per line, in the order of the methods listed above):

compiler     arrays       f77-like  inline_do  function_do  function_array  inline_array
ifc -O3      allocatable     3.019      2.709        2.577           2.546         3.768
ifc -O3      pointer         2.799     13.741        2.761           2.854         3.457
ifort -O3    allocatable     2.381      2.891        2.784           2.433         2.536
ifort -O3    pointer         2.871     18.071        2.536           2.533        19.095
pgf90 -fast  allocatable     2.249      3.133        2.278           3.624         3.060
pgf90 -fast  pointer         2.431     36.617        2.263           3.779        39.014
nag f95 -O3  allocatable     2.758      5.381        2.916           3.242         3.582
nag f95 -O3  pointer         3.237      9.641        2.347           3.235         3.062
xlf90 -O3    allocatable     2.617      2.879        2.559           2.350         2.602
xlf90 -O3    pointer         2.494      2.480        2.681           2.679         2.714
sun f90 -O3  allocatable    12.313     12.279       12.300          12.373        12.279
sun f90 -O3  pointer         9.788      9.494        9.787           9.856         9.495

Tim Prince

2004-03-27, 12:18 am


"Jan C. Vorbrüggen" <jvorbrueggen@mediasec.de> wrote in message
news:4061C43E.508256A8@mediasec.de...
> A substantial part of 187.facerec (yes, again) is exactly that piece of
> code. I tested this on several platforms - IIRC, on Sun with Sun's V2
> compiler and on Alpha/Linux with the DEC compiler, as well as with DVF.
> I tested DOT_PRODUCT, SUM, explicit loops (I was using two-dimensional
> arrays, with and without casting to one-dimensional arrays) and SDOT
> from the vendor's performance library. Surprise: in all cases, SUM was
> the fastest approach. The runner-up varied among the platforms.
>

Those compilers may have been tuned for facerec. I'd be disappointed, if
dot_product didn't provide a good combination of speed and accuracy. With
explicit loops, you may have to deal with batching the sums according to the
characteristics of the hardware, or you may get lucky and write a pattern
which the optimizer recognizes.
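
One way to read "batching the sums" is to keep several independent partial sums so that consecutive additions do not depend on each other. The sketch below uses four accumulators; that number, the array size and the program name are arbitrary assumptions, and the best choice depends on the hardware and compiler:

```fortran
program batched_dot
  ! Dot product with four independent partial sums plus a clean-up loop.
  implicit none
  integer, parameter :: n = 10003      ! deliberately not a multiple of 4
  real    :: x(n), y(n)
  real    :: s1, s2, s3, s4, total
  integer :: i, m

  call random_number(x)
  call random_number(y)

  s1 = 0.0
  s2 = 0.0
  s3 = 0.0
  s4 = 0.0

  ! clean-up loop for the first mod(n,4) elements
  m = mod(n, 4)
  do i = 1, m
     s1 = s1 + x(i)*y(i)
  end do

  ! main loop: four elements per iteration, one accumulator each
  do i = m + 1, n, 4
     s1 = s1 + x(i)  *y(i)
     s2 = s2 + x(i+1)*y(i+1)
     s3 = s3 + x(i+2)*y(i+2)
     s4 = s4 + x(i+3)*y(i+3)
  end do

  total = (s1 + s2) + (s3 + s4)

  print *, "batched loop:", total
  print *, "DOT_PRODUCT :", dot_product(x, y)
end program batched_dot
```

Because the additions are grouped differently, the result can differ from DOT_PRODUCT in the last digits.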


Jan C. Vorbrüggen

2004-03-27, 12:18 am

> Those compilers may have been tuned for facerec.

No way no how - I did that test while _writing_ the benchmark - at least six
months before it was published!

> I'd be disappointed, if
> dot_product didn't provide a good combination of speed and accuracy.


Yeah, that's what I would have thought. I do seem to remember there was one
case where DOT_PRODUCT was the worst implementation - likely because it was
careful with intermediate precision etc.

Jan