Home > Archive > Fortran > October 2006 > How to Simplify This in Fortran 90?
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
How to Simplify This in Fortran 90?
|
|
| qquito 2006-10-01, 10:01 pm |
| Hello, All:
I am working with the Empirical Orthogonal Functions (EOFs) and need a
lot of array manipulations. I am trying to take the advantages of
Fortran 90 in array manipulation, but still feel inadequate when facing
complex manipulations.
For instance, I need to reconstruct the data with the first nc EOFs,
and I would write
integer, parameter:: nd=100, nt=150, nc=5
real, dimension(nc, nt):: c
real, dimension(nd, nc):: eof
real, dimension(nd, nt):: rec=0.0
integer:: it, ic
do it=1, nt
do ic=1, nc
rec(:, it)=rec(:, it)+c(ic, it)*eof(:, ic)
enddo
enddo
.......
In other words, the it'th column of rec is equal to the sum all columns
of eof multiplied by each corresponding element of c in its it'th
column.
As you can see, I still need a double do-loop---in spite of the use of
Fortran 90 features.
Can I further simplify the above double do-loop to a single do-loop? Or
even to no do-loop at all?
Thanks for reading and replying!
--Roland
| |
| Tim Prince 2006-10-02, 4:04 am |
| qquito wrote:
>
> integer, parameter:: nd=100, nt=150, nc=5
>
> real, dimension(nc, nt):: c
> real, dimension(nd, nc):: eof
> real, dimension(nd, nt):: rec=0.0
>
> integer:: it, ic
>
> do it=1, nt
> do ic=1, nc
> rec(:, it)=rec(:, it)+c(ic, it)*eof(:, ic)
> enddo
> enddo
If you are using the explicit outer loop for better performance, or
avoiding matmul entirely for performance reasons, this looks OK to me.
I suppose it would perform better yet if you force unrolling of the loop
of length 5.
I just got through finding that I no longer get performance improvements
in Polyhedron benchmarks by "simplifying" with matmul.
| |
| Ron Shepard 2006-10-02, 4:04 am |
| In article <1159760883.631393.203100@m7g2000cwm.googlegroups.com>,
"qquito" <qquito@hotmail.com> wrote:
> do it=1, nt
> do ic=1, nc
> rec(:, it)=rec(:, it)+c(ic, it)*eof(:, ic)
> enddo
> enddo
> ......
[...]
> Can I further simplify the above double do-loop to a single do-loop? Or
> even to no do-loop at all?
It looks to me like
rec = matmul( eof, c )
will work.
$.02 -Ron Shepard
| |
| highegg 2006-10-02, 4:04 am |
| > rec = matmul( eof, c )
>
> will work.
Or, if performance is crucial, check out whether tuned BLAS are
available on target machine.
GEMM usually outperforms hand-coded matrix multiplication (unless the
compiler uses
BLAS for matmul, but AFAIK few compilers do so).
| |
| Tim Prince 2006-10-02, 8:03 am |
| highegg wrote:
> Or, if performance is crucial, check out whether tuned BLAS are
> available on target machine.
> GEMM usually outperforms hand-coded matrix multiplication (unless the
> compiler uses
> BLAS for matmul, but AFAIK few compilers do so).
>
I guess you are assuming this sequence is executed only once; otherwise
your result differs from OP posted code.
"usually" may be true, but this isn't a usual case. How does your
favorite tuned BLAS do on this one? On more than one CPU model? Did you try
do it=1, nt
do id=1, nd
rec(id, it)=rec(id, it)+dot_product(c(:, it),eof(id, :))
enddo
enddo
assuming a compiler which will unroll the dot_product fully. You might
even find a compiler which could optimize the matmul version.
| |
| highegg 2006-10-02, 7:03 pm |
| > I guess you are assuming this sequence is executed only once; otherwise
> your result differs from OP posted code.
> "usually" may be true, but this isn't a usual case. How does your
> favorite tuned BLAS do on this one? On more than one CPU model? Did you try
>
> do it=1, nt
> do id=1, nd
> rec(id, it)=rec(id, it)+dot_product(c(:, it),eof(id, :))
> enddo
> enddo
>
> assuming a compiler which will unroll the dot_product fully. You might
> even find a compiler which could optimize the matmul version.
Oops, I didn't notice that nc is so small. OK, I agree with you, I have
a similar experience:
for "thin" matrix-matrix multiplication, i.e., when one dimension is
small and known at compile time, hand-coded (or, perhaps matmul) matrix
multiplication with the tiny loop as the innermost one gets better than
BLAS - probably because compiler eliminates the tiny loop completely.
BLAS become drastically better for "thicker" matrices.
Following your advice, I tested this with my production-work compiler,
Intel 9.0
on a quad-core Itanium 2, and with Intel MKL 7.2:
program mmtest
use blas95
integer, parameter:: nd=100, nt=150, nc=5
real, dimension(nc, nt):: c
real, dimension(nd, nc):: eof
real, dimension(nd, nt):: rec
real:: sm
integer:: it, ic
real:: t_start,t_end
call random_number(c)
call random_number(eof)
rec = 0
call cpu_time(t_start)
dumbloop1: do iter=1,5000
!$OMP PARALLEL DO
do it=1, nt
do id=1, nd
rec(id, it)=rec(id, it)+dot_product(c(:, it),eof(id, :))
enddo
enddo
!$OMP END PARALLEL DO
end do dumbloop1
call cpu_time(t_end)
sm = sum(rec) ! prevent dead code elimination
print *,sm
print *,'hand coded: ',t_end - t_start
rec = 0
call cpu_time(t_start)
dumbloop3: do iter=1,5000
rec = rec + matmul(eof,c)
end do dumbloop3
call cpu_time(t_end)
print *,'matmul: ',t_end - t_start
rec = 0
call cpu_time(t_start)
dumbloop2: do iter=1,5000
call gemm(eof,c,rec,beta=0.)
end do dumbloop2
call cpu_time(t_end)
print *,'gemm: ',t_end - t_start
end program
compiled using
ifort -O3 -ip -openmp mmtest.f90 blas95.o -L
/opt/intel/mkl72cluster/lib/64/ -lmkl
I got:
9.9819832E+07
hand coded: 8.3936006E-02
matmul: 0.1952000
gemm: 0.1786080
without OpenMP, it gets
9.9819832E+07
hand coded: 0.1815360
matmul: 0.2108160
gemm: 0.1844640
with nc = 50 (and OpenMP), the score changes drastically:
9.3374029E+08
hand coded: 4.082608
matmul: 1.557696
gemm: 0.6109762
interesting ...
|
|
|
|
|