For Programmers: Free Programming Magazines  


Home > Archive > Fortran > October 2006 > How to Simplify This in Fortran 90?









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author How to Simplify This in Fortran 90?
qquito

2006-10-01, 10:01 pm

Hello, All:

I am working with the Empirical Orthogonal Functions (EOFs) and need a
lot of array manipulations. I am trying to take the advantages of
Fortran 90 in array manipulation, but still feel inadequate when facing
complex manipulations.

For instance, I need to reconstruct the data with the first nc EOFs,
and I would write


integer, parameter:: nd=100, nt=150, nc=5

real, dimension(nc, nt):: c
real, dimension(nd, nc):: eof
real, dimension(nd, nt):: rec=0.0

integer:: it, ic

do it=1, nt
do ic=1, nc
rec(:, it)=rec(:, it)+c(ic, it)*eof(:, ic)
enddo
enddo
.......

In other words, the it'th column of rec is equal to the sum all columns
of eof multiplied by each corresponding element of c in its it'th
column.

As you can see, I still need a double do-loop---in spite of the use of
Fortran 90 features.

Can I further simplify the above double do-loop to a single do-loop? Or
even to no do-loop at all?

Thanks for reading and replying!

--Roland

Tim Prince

2006-10-02, 4:04 am

qquito wrote:

>
> integer, parameter:: nd=100, nt=150, nc=5
>
> real, dimension(nc, nt):: c
> real, dimension(nd, nc):: eof
> real, dimension(nd, nt):: rec=0.0
>
> integer:: it, ic
>
> do it=1, nt
> do ic=1, nc
> rec(:, it)=rec(:, it)+c(ic, it)*eof(:, ic)
> enddo
> enddo


If you are using the explicit outer loop for better performance, or
avoiding matmul entirely for performance reasons, this looks OK to me.
I suppose it would perform better yet if you force unrolling of the loop
of length 5.
I just got through finding that I no longer get performance improvements
in Polyhedron benchmarks by "simplifying" with matmul.
Ron Shepard

2006-10-02, 4:04 am

In article <1159760883.631393.203100@m7g2000cwm.googlegroups.com>,
"qquito" <qquito@hotmail.com> wrote:

> do it=1, nt
> do ic=1, nc
> rec(:, it)=rec(:, it)+c(ic, it)*eof(:, ic)
> enddo
> enddo
> ......

[...]
> Can I further simplify the above double do-loop to a single do-loop? Or
> even to no do-loop at all?


It looks to me like

rec = matmul( eof, c )

will work.

$.02 -Ron Shepard
highegg

2006-10-02, 4:04 am

> rec = matmul( eof, c )
>
> will work.

Or, if performance is crucial, check out whether tuned BLAS are
available on target machine.
GEMM usually outperforms hand-coded matrix multiplication (unless the
compiler uses
BLAS for matmul, but AFAIK few compilers do so).

Tim Prince

2006-10-02, 8:03 am

highegg wrote:
> Or, if performance is crucial, check out whether tuned BLAS are
> available on target machine.
> GEMM usually outperforms hand-coded matrix multiplication (unless the
> compiler uses
> BLAS for matmul, but AFAIK few compilers do so).
>


I guess you are assuming this sequence is executed only once; otherwise
your result differs from OP posted code.
"usually" may be true, but this isn't a usual case. How does your
favorite tuned BLAS do on this one? On more than one CPU model? Did you try

do it=1, nt
do id=1, nd
rec(id, it)=rec(id, it)+dot_product(c(:, it),eof(id, :))
enddo
enddo

assuming a compiler which will unroll the dot_product fully. You might
even find a compiler which could optimize the matmul version.
highegg

2006-10-02, 7:03 pm

> I guess you are assuming this sequence is executed only once; otherwise
> your result differs from OP posted code.
> "usually" may be true, but this isn't a usual case. How does your
> favorite tuned BLAS do on this one? On more than one CPU model? Did you try
>
> do it=1, nt
> do id=1, nd
> rec(id, it)=rec(id, it)+dot_product(c(:, it),eof(id, :))
> enddo
> enddo
>
> assuming a compiler which will unroll the dot_product fully. You might
> even find a compiler which could optimize the matmul version.


Oops, I didn't notice that nc is so small. OK, I agree with you, I have
a similar experience:
for "thin" matrix-matrix multiplication, i.e., when one dimension is
small and known at compile time, hand-coded (or, perhaps matmul) matrix
multiplication with the tiny loop as the innermost one gets better than
BLAS - probably because compiler eliminates the tiny loop completely.
BLAS become drastically better for "thicker" matrices.
Following your advice, I tested this with my production-work compiler,
Intel 9.0
on a quad-core Itanium 2, and with Intel MKL 7.2:

program mmtest
use blas95
integer, parameter:: nd=100, nt=150, nc=5

real, dimension(nc, nt):: c
real, dimension(nd, nc):: eof
real, dimension(nd, nt):: rec
real:: sm

integer:: it, ic
real:: t_start,t_end
call random_number(c)
call random_number(eof)

rec = 0
call cpu_time(t_start)
dumbloop1: do iter=1,5000
!$OMP PARALLEL DO
do it=1, nt
do id=1, nd
rec(id, it)=rec(id, it)+dot_product(c(:, it),eof(id, :))
enddo
enddo
!$OMP END PARALLEL DO
end do dumbloop1
call cpu_time(t_end)
sm = sum(rec) ! prevent dead code elimination
print *,sm
print *,'hand coded: ',t_end - t_start

rec = 0
call cpu_time(t_start)
dumbloop3: do iter=1,5000
rec = rec + matmul(eof,c)
end do dumbloop3
call cpu_time(t_end)
print *,'matmul: ',t_end - t_start

rec = 0
call cpu_time(t_start)
dumbloop2: do iter=1,5000
call gemm(eof,c,rec,beta=0.)
end do dumbloop2
call cpu_time(t_end)
print *,'gemm: ',t_end - t_start


end program

compiled using
ifort -O3 -ip -openmp mmtest.f90 blas95.o -L
/opt/intel/mkl72cluster/lib/64/ -lmkl
I got:
9.9819832E+07
hand coded: 8.3936006E-02
matmul: 0.1952000
gemm: 0.1786080

without OpenMP, it gets
9.9819832E+07
hand coded: 0.1815360
matmul: 0.2108160
gemm: 0.1844640

with nc = 50 (and OpenMP), the score changes drastically:

9.3374029E+08
hand coded: 4.082608
matmul: 1.557696
gemm: 0.6109762

interesting ...

Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2009 codecomments.com