For Programmers: Free Programming Magazines  


Home > Archive > Fortran > January 2006 > fast matrix multiplication









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author fast matrix multiplication
andrea campera

2006-01-10, 9:58 pm

Hi all,

I have a question regardin matrix multipllication. In my program I call
a subroutine to multiplies to n*n matrices thousands of times

SUBROUTINE multi(Nc,m1,m2,mtot)
INTEGER i,j,l,Nc
complex aa,m1(Nc,Nc),m2(Nc,Nc),mtot(Nc,Nc)
do i=1,Nc
do j=1,Nc
aa=0.
do l=1,Nc
aa=aa+m1(i,l)*m2(l,j)
enddo
mtot(i,j)=aa
enddo
enddo
return
end

my question is: is there a faster algorithm to solve this simple
multiplication? where can I find the code (fortran 77 if possible)?
Thanks in advance for any help

Regards

Andrea

Dave Seaman

2006-01-10, 9:58 pm

On 10 Jan 2006 15:13:16 -0800, andrea campera wrote:
> Hi all,


> I have a question regardin matrix multipllication. In my program I call
> a subroutine to multiplies to n*n matrices thousands of times


> SUBROUTINE multi(Nc,m1,m2,mtot)
> INTEGER i,j,l,Nc
> complex aa,m1(Nc,Nc),m2(Nc,Nc),mtot(Nc,Nc)
> do i=1,Nc
> do j=1,Nc
> aa=0.
> do l=1,Nc
> aa=aa+m1(i,l)*m2(l,j)
> enddo
> mtot(i,j)=aa
> enddo
> enddo
> return
> end


> my question is: is there a faster algorithm to solve this simple
> multiplication? where can I find the code (fortran 77 if possible)?
> Thanks in advance for any help


<http://www.netlib.org/lapack/>



--
Dave Seaman
U.S. Court of Appeals to review three issues
concerning case of Mumia Abu-Jamal.
<http://www.mumia2000.org/>
Gordon Sande

2006-01-10, 9:58 pm

On 2006-01-10 19:13:16 -0400, "andrea campera" <andreacampera@gmail.com> said:

> Hi all,
>
> I have a question regardin matrix multipllication. In my program I call
> a subroutine to multiplies to n*n matrices thousands of times
>
> SUBROUTINE multi(Nc,m1,m2,mtot)
> INTEGER i,j,l,Nc
> complex aa,m1(Nc,Nc),m2(Nc,Nc),mtot(Nc,Nc)
> do i=1,Nc
> do j=1,Nc
> aa=0.
> do l=1,Nc
> aa=aa+m1(i,l)*m2(l,j)
> enddo
> mtot(i,j)=aa
> enddo
> enddo
> return
> end
>
> my question is: is there a faster algorithm to solve this simple
> multiplication? where can I find the code (fortran 77 if possible)?
> Thanks in advance for any help
>
> Regards
>
> Andrea


You can either

1. realize that this is not worth trying to improve on for your
immediate needs. In fact thousands is not very interesting
anymore for things faster than hand operated abacuses. Is
five days programming and debugging really worth saving a
couple hours time on a computer that cost less that $2000?
I am not in favor of N factorial methods (like Cramer's rule
of algebra) when there are standard N cubed methods around
but I do believe that some elementary but hard cost benefit
accounting should be done.

2. polish your code so it does not have cache problems, bad compiler
options or other arcane issues that would require you to tell us
what computer, compiler, etc you are using. A first cure of this
sort is to stop using interpreted Basic (or Java) but you are
already using a Fortran compiler given where you asked the
question. See above on cost benefit accounting.

3. use a "fast matrix multiply" of the Stassen or Pan type. The trouble
is they are rather complicated and only come into their own for
rather big problems and have too much overhead for small problems.
(But they really give you good bragging rights in some circles!)
See above on cost benefit accounting. Strassen is N to about 2.7
and the last I noticed Pan had it down to N to about 2.3 (IIRC).







Joost

2006-01-11, 3:57 am

use optimised blas library routines (i.e. not netlib) for the
multiplication. You're looking for the routine cgemm. You can (freely)
obtain GOTO blas, ATLAS blas, MKL, ACML, ... and they are f77
compatible. They will, depending on your Nc and computer be 2-100 times
faster than what you have now.

Joost

Andrea Campera

2006-01-11, 7:56 am

To be more precise I have to do an operation like (A*B+I)^ -1. up to now I
have used

call multi(...)
call summa(...)
call gaussj(...)

I post also the profile:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ks/call Ks/call name
35.98 6490.45 6490.45 3577 0.00 0.00 multi_
30.85 12055.08 5564.63 100 0.06 0.06 density_
18.67 15422.26 3367.18 695 0.00 0.00 gaussj_
4.92 16309.68 887.42 101 0.01 0.01 tqli_
4.22 17070.11 760.43 1 0.76 0.91 dens_
3.65 17728.57 658.46 1 0.66 16.15 calcoladens_
0.93 17896.45 167.88 575596800 0.00 0.00 modul_
0.21 17934.09 37.64 101 0.00 0.01 scho1_
0.11 17954.49 20.40 1 0.02 0.95 solvscho_
0.09 17970.36 15.87 1 0.02 0.02 tred2_
0.08 17984.88 14.52 1 0.01 0.01 tqli2_
0.08 17998.66 13.78 596 0.00 0.00 diffe_
0.07 18011.67 13.01 198 0.00 0.03 combination_
0.07 18024.52 12.85 596 0.00 0.00 summa_
0.04 18031.49 6.97 100 0.00 0.03 matri_
0.03 18037.03 5.54 101 0.00 0.00 eigsr1_
0.00 18037.11 0.08 1 0.00 0.03 traccia_
0.00 18037.18 0.07 1 0.00 18.04 MAIN__
0.00 18037.22 0.04 1 0.00 0.00 eigsrt_
0.00 18037.26 0.04 1 0.00 17.09 smatrix_

if I increase the dimension Nc of the matrices gaussj and multi become the
most time consuming routines. I have seen the subroutine cgemm.f of Blas for
the multiplication, but what about the inversion? is ther anything faster
than gaussj? I use g77 (3.2 or 3.4) Pentium IV or Xeon or Opteron or AMD64
(we have several machines).

Andrea



"Joost" <jv244@cam.ac.uk> ha scritto nel messaggio
news:1136956436.318850.151600@g14g2000cwa.googlegroups.com...
> use optimised blas library routines (i.e. not netlib) for the
> multiplication. You're looking for the routine cgemm. You can (freely)
> obtain GOTO blas, ATLAS blas, MKL, ACML, ... and they are f77
> compatible. They will, depending on your Nc and computer be 2-100 times
> faster than what you have now.
>
> Joost
>



David Jones

2006-01-11, 7:56 am

Andrea Campera wrote:
> To be more precise I have to do an operation like (A*B+I)^ -1.


Are you aware that you may be able to speed things up by using a known identity for this type of expression ? See 3.2.2 of the following:
http://www2.imm.dtu.dk/pubdb/views/...pdf/imm3274.pdf

Some of the libraries suggested may have this type of thing already built in, but you need to be able to recognise that your problem has the right structure.
Joost

2006-01-11, 7:56 am

Also here, use a library routine. IIRC the inverse can be computed
using the LAPACK CGESV routine. The usual remark is that often you
might not need the inverse, but you really want to solve A*X=B instead.
CGESV does exactly that, so solve with B=I if you need the inverse.
Also here, CGESV will be part of MKL (typically used on intel chips)
and ACML (used on AMD), and should be relatively fast also with netlib
lapack if combined with atlas or GOTO blas (available for both brands).

And yes, the reference provided by David Jones might be an interesting
additional read...

Joost

David Flower

2006-01-11, 7:56 am


andrea campera wrote:
> Hi all,
>
> I have a question regardin matrix multipllication. In my program I call
> a subroutine to multiplies to n*n matrices thousands of times
>
> SUBROUTINE multi(Nc,m1,m2,mtot)
> INTEGER i,j,l,Nc
> complex aa,m1(Nc,Nc),m2(Nc,Nc),mtot(Nc,Nc)
> do i=1,Nc
> do j=1,Nc
> aa=0.
> do l=1,Nc
> aa=aa+m1(i,l)*m2(l,j)
> enddo
> mtot(i,j)=aa
> enddo
> enddo
> return
> end
>
> my question is: is there a faster algorithm to solve this simple
> multiplication? where can I find the code (fortran 77 if possible)?
> Thanks in advance for any help
>
> Regards
>
> Andrea


Well, I can suggest one improvement - lower case L in many fonts can
look identical to 1 (numeral one), so never use it!

Dave Flower

Gordon Sande

2006-01-11, 7:56 am

On 2006-01-11 04:14:42 -0400, "Andrea Campera"
<andrea.campera@iet.unipi.it> said:

> To be more precise I have to do an operation like (A*B+I)^ -1. up to
> now I have used
>
> call multi(...)
> call summa(...)
> call gaussj(...)
>
> I post also the profile:
>
> Each sample counts as 0.01 seconds.
> % cumulative self self total
> time seconds seconds calls Ks/call Ks/call name


<snip>

> if I increase the dimension Nc of the matrices gaussj and multi become
> the most time consuming routines. I have seen the subroutine cgemm.f of
> Blas for the multiplication, but what about the inversion? is ther
> anything faster than gaussj? I use g77 (3.2 or 3.4) Pentium IV or Xeon
> or Opteron or AMD64 (we have several machines).
>
> Andrea



The fastest way to invert matrices is to not invert matrices. This apparent
piece of self contradictory advice really means that most uses of inversion
are better done some other way.

Solving is usually what is wanted and that can be better done by factoring
and solving with the the factors. The question is "Why to you think you
actually need the inverse?" Other that your simple derivation in matrix
algebra.

Matrix multiply will always be n^3 and inversion and/or solving will
also be n^3. You can fiddle the coefficient of multiply by polishing
code, or having someone else do it for you in a library. You can also
polish inversion but beware the "highly polished" codes that pay no
attention to numerical stability. Many Gauss-Jordan matrix inversion
programs are the trivial algebra coded naively so you have to read and
understand the documentation. Factoring has a bigger reduction in
coefficient over inversion than you are ever going to see with polishing.
And it may be that with a bit of insight into you real problem that
there are other things that can be done. Thee are examples of easy
problems with awkward derivations that are turned into awkward, and
numerically harder, problems with easy derivations. For more details
consult a competent numerical analyst.




robin

2006-01-11, 9:57 pm

"andrea campera" <andreacampera@gmail.com> wrote in message
news:1136934796.383711.298240@f14g2000cwb.googlegroups.com...
> I have a question regardin matrix multipllication. In my program I call
> a subroutine to multiplies to n*n matrices thousands of times
>
> SUBROUTINE multi(Nc,m1,m2,mtot)
> INTEGER i,j,l,Nc
> complex aa,m1(Nc,Nc),m2(Nc,Nc),mtot(Nc,Nc)
> do i=1,Nc
> do j=1,Nc
> aa=0.
> do l=1,Nc
> aa=aa+m1(i,l)*m2(l,j)
> enddo
> mtot(i,j)=aa
> enddo
> enddo
> return
> end
>
> my question is: is there a faster algorithm to solve this simple
> multiplication? where can I find the code (fortran 77 if possible)?


You are better off using a Fortran 90 or later compiler because
matrix multiplication is built in. The implementations are,
typically faster than what you have written


Andrea Campera

2006-01-12, 3:58 am

I know that matrix multiplication alforithm can be lower than n^3, for
example thew Strassen algorithm is n^2.81 and there exists algorithm of
order n^2.3. I have also read somewhere that those algorithm are slightly
unstable in some cases. I have also tried with MATMUL (f90) but the code
runs slower. MAy there is a smart way to resolve (I+A*B)^(-1). Thanks again

Andrea

"Gordon Sande" <g.sande@worldnet.att.net> ha scritto nel messaggio
news:2006011109410016807-gsande@worldnetattnet...
> On 2006-01-11 04:14:42 -0400, "Andrea Campera"
> <andrea.campera@iet.unipi.it> said:
>
>
> <snip>
>
>
>
> The fastest way to invert matrices is to not invert matrices. This
> apparent
> piece of self contradictory advice really means that most uses of
> inversion
> are better done some other way.
>
> Solving is usually what is wanted and that can be better done by factoring
> and solving with the the factors. The question is "Why to you think you
> actually need the inverse?" Other that your simple derivation in matrix
> algebra.
>
> Matrix multiply will always be n^3 and inversion and/or solving will
> also be n^3. You can fiddle the coefficient of multiply by polishing
> code, or having someone else do it for you in a library. You can also
> polish inversion but beware the "highly polished" codes that pay no
> attention to numerical stability. Many Gauss-Jordan matrix inversion
> programs are the trivial algebra coded naively so you have to read and
> understand the documentation. Factoring has a bigger reduction in
> coefficient over inversion than you are ever going to see with polishing.
> And it may be that with a bit of insight into you real problem that
> there are other things that can be done. Thee are examples of easy
> problems with awkward derivations that are turned into awkward, and
> numerically harder, problems with easy derivations. For more details
> consult a competent numerical analyst.
>
>
>
>



Joost

2006-01-12, 3:58 am

You might want to read this paper:

http://surj.stanford.edu/2004/pdfs/kakaradov.pdf
'Ultra-Fast Matrix Multiplication:An Empirical Analysis of Highly
Optimized Vector Algorithms'

It has the following interesting sentence:

The value of n for which Strassen's algorithm (in this
implementation) becomes
better than the naive algorithm is n = 60,750.

Having said this, I have the following timings comparing multi with
cgemm on an opteron using GOTO blas:

Nc=351
multi 0.28795699999999996
CGEMM 0.04499200000000003

Nc=1351
multi 35.40061800000001
CGEMM 2.246658999999994

Cheers,

Joost

David Jones

2006-01-12, 3:58 am

Andrea Campera wrote:
> I know that matrix multiplication alforithm can be lower than n^3,

for
> example thew Strassen algorithm is n^2.81 and there exists algorithm
> of order n^2.3. I have also read somewhere that those algorithm are
> slightly unstable in some cases. I have also tried with MATMUL (f90)
> but the code runs slower. MAy there is a smart way to resolve
> (I+A*B)^(-1). Thanks again
>

The result wanted is

(I+A*B)^(-1)= I-A*( (I+B*A)^(-1) )*B

savings arise if B*A is smaller in dimension than A*B.


Andrea Campera

2006-01-12, 7:58 am

"unfortunately" A and B are square matrices, hence A*B and B*A have the same
dimensions. Using cgemm or multi plus summa to do (I+A*B) with Nc=300 gives
both 0.34 seconds on a 2.4 GHz pentium IV. but I don't use goto BLAS. for
N=1000 cgemm and multi plus summa gives, respectively, 8.5 and 13.2 seconds.
May be implemented goto BLAS can save lots of time in the multiplication.

Andrea


"David Jones" <dajxxx@ceh.ac.uk> ha scritto nel messaggio
news:43c6268d$1@news.nwl.ac.uk...
> Andrea Campera wrote:
> for
> The result wanted is
>
> (I+A*B)^(-1)= I-A*( (I+B*A)^(-1) )*B
>
> savings arise if B*A is smaller in dimension than A*B.
>
>



Joost

2006-01-12, 7:58 am

> Using cgemm or multi plus summa to do (I+A*B) with Nc=300 gives
> both 0.34 seconds on a 2.4 GHz pentium IV. but I don't use goto BLAS.
> May be implemented goto BLAS can save lots of time in the multiplication.


:-) The last sentence makes the point...

In my opinion, BLAS is just a specification of an interface to some
basic linear algebra operations, the implementation made available in
netlib is just a reference implementation that provides 'correct
results' in a reasonable time, but not fast (i.e. the fortran code for
cgemm in the reference implementation is very similar to your multi).
Certain specific implementations of BLAS do much better when it comes
to the 'fast' bit. In these implementations people make sure that the
operations are ordered in such a way the code is optimal for the CPU.

Joost

Janne Blomqvist

2006-01-12, 7:04 pm

Joost wrote:
> Also here, CGESV will be part of MKL (typically used on intel chips)
> and ACML (used on AMD), and should be relatively fast also with netlib
> lapack if combined with atlas or GOTO blas (available for both brands).


Going off on a tangent wrt your "relatively fast" comment, a while ago
I stumbled upon a paper where it was mentioned that in solving Ax=B
for large systems, >99.9 % of the flops is actually spent in
*GEMM. That would suggest that having a really good *GEMM
implementation is all that matters, the speed of *GESV isn't really
significant.

Of course, for a parallel solver things change, and the paper was
actually about SGI:s implementation of linpack for their big shared
memory computers. You can see it here:

http://amrit.ittc.ku.edu/tclark/eur...apers/panzi.pdf

and the slides here:

http://www.isc2005.org/download/cp/Panziera_Baron.pdf


--
Janne Blomqvist
Joost

2006-01-12, 7:04 pm

Hi Janne,

I think that's right (even though they cite N>100000 for 99.9%).
I decided to give it a try on an opteron using 3 combinations

1) g95 compiled LAPACK + Goto BLAS
2) ifort + mkl (721/emt64)
3) pgf90 + acml (the version that seems to come with pgi 6.0-5)

(notice that the point is that acml/mkl might be optimised, whereas g95
uses netlib blas. Of course, precise timings might change depending on
the version of the library used and so on)

For N=351
goto/netlib mkl acml
CGEMM 0.042 0.046 0.045
CGESV 0.069 0.079 0.100

For N=1351
goto/netlib mkl acml
CGEMM 2.21 2.48 2.29
CGESV 3.27 3.66 3.77

For N=3351
goto/netlib mkl acml
CGEMM 32.8 37.2 34.3
CGESV 46.6 52.4 51.2

So, indeed, a fast blas is enough in this case.

Joost

Joost

2006-01-12, 7:04 pm

> (notice that the point is that acml/mkl might be optimised, whereas g95
> uses netlib blas)

--------------------^
read: lapack

Joost

Victor Eijkhout

2006-01-26, 7:03 pm

Gordon Sande <g.sande@worldnet.att.net> wrote:

> The fastest way to invert matrices is to not invert matrices.


One of the reasons for that being that inversion is not numerically
stable, whereas solution with a factored matrix is.

Victor.
--
Victor Eijkhout -- eijkhout at tacc utexas edu
ph: 512 471 5809
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com