For Programmers: Free Programming Magazines  


Home > Archive > Fortran > May 2004 > Element Wise Matrix Multiplication, BLAS and Fortran Arrays









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Element Wise Matrix Multiplication, BLAS and Fortran Arrays
Matthias Kehlenbeck

2004-05-12, 9:09 pm

Hello,

I am new to Fortran and would very much appreciate if someone could help
me with a small problem:

I have to do an element wise matrix multiplication:

| A11, A12 | (*) | B11, B12 | = | A11 * B11, A12 * B12 |
| A21, A22 | | B21, B22 | | A21 * B21, A22 * B22 |

Matrix A (*) Matrix B = Matrix C

Unfortunately, this is not part of the Basic Linear Algebra Subroutines.

Therefore, I studied the reference implementation and came to the
conclusion that the following should work:

CALL DGEMV('N', 1, itM*itN, 1.0D0, dbMatrixA, 1, dbMatrixB, 1, 0.0D0,
dbMatrixC, 1)

[itM and itN are the dimensions of the three matrices.]

Luckily, this returns the right result for all but the first element of
Matrix C.

Now, my question is: Why does it not work for the first element, too?

Any ideas?

Matthias Kehlenbeck
Michel OLAGNON

2004-05-12, 9:09 pm


In article <c7nk1o$8m4$05$1@news.t-online.com>, Matthias Kehlenbeck <matthias.kehlenbeck@stud.uni-hannover.de> writes:
>Hello,
>
>I am new to Fortran and would very much appreciate if someone could help
>me with a small problem:
>
>I have to do an element wise matrix multiplication:
>
>| A11, A12 | (*) | B11, B12 | = | A11 * B11, A12 * B12 |
>| A21, A22 | | B21, B22 | | A21 * B21, A22 * B22 |
>
> Matrix A (*) Matrix B = Matrix C
>
>Unfortunately, this is not part of the Basic Linear Algebra Subroutines.


Fortunately, it is part of the Fortran language ;-)

dbMatrixC = dbMatrixA * dbMatrixB





>
>Therefore, I studied the reference implementation and came to the
>conclusion that the following should work:
>
>CALL DGEMV('N', 1, itM*itN, 1.0D0, dbMatrixA, 1, dbMatrixB, 1, 0.0D0,
>dbMatrixC, 1)
>
>[itM and itN are the dimensions of the three matrices.]
>
>Luckily, this returns the right result for all but the first element of
>Matrix C.
>
>Now, my question is: Why does it not work for the first element, too?
>
>Any ideas?
>
>Matthias Kehlenbeck




Matthias Kehlenbeck

2004-05-12, 9:09 pm

Hello Michel,

> Fortunately, it is part of the Fortran language ;-)
>
> dbMatrixC = dbMatrixA * dbMatrixB


Yes, I know. But it is slow.
However, I just found out, that the stuff I just described is not
working. I forgot to reset matrix c in my test program and it turned
out that the rest of c is not changing at all.

Sorry guys,

Matthias
Michel OLAGNON

2004-05-12, 9:09 pm


In article <c7nqju$dc5$00$1@news.t-online.com>, Matthias Kehlenbeck <matthias.kehlenbeck@stud.uni-hannover.de> writes:
>Hello Michel,
>
>
>Yes, I know. But it is slow.


Then may be you should try to understand why it is slow.
There are very few chances that you can find a faster way
than the intrinsic elementwise operation of the
language if you use it wisely.


Michel



>However, I just found out, that the stuff I just described is not
>working. I forgot to reset matrix c in my test program and it turned
>out that the rest of c is not changing at all.
>
>Sorry guys,
>
>Matthias




Ron Shepard

2004-05-12, 9:09 pm

In article <c7nk1o$8m4$05$1@news.t-online.com>,
Matthias Kehlenbeck <matthias.kehlenbeck@stud.uni-hannover.de>
wrote:

> Hello,
>
> I am new to Fortran and would very much appreciate if someone could help
> me with a small problem:
>
> I have to do an element wise matrix multiplication:
>
> | A11, A12 | (*) | B11, B12 | = | A11 * B11, A12 * B12 |
> | A21, A22 | | B21, B22 | | A21 * B21, A22 * B22 |
>
> Matrix A (*) Matrix B = Matrix C
>
> Unfortunately, this is not part of the Basic Linear Algebra Subroutines.
>
> Therefore, I studied the reference implementation and came to the
> conclusion that the following should work:
>
> CALL DGEMV('N', 1, itM*itN, 1.0D0, dbMatrixA, 1, dbMatrixB, 1, 0.0D0,
> dbMatrixC, 1)
>
> [itM and itN are the dimensions of the three matrices.]


I think this is just a complicated way of doing a dot product. There
are 8 elements in, 4 adds, 4 multiplies, 1 result out. This isn't
anything at all like what you want. Or maybe I'm not remembering
the arguments to gemv()?

In f90 and newer, you should simply use the array syntax

matrixc = matrixa * matrixb

In f77, you should write out the two do-loops. In either case,
there are 8 elements in, 4 multiplies, and 4 results out.

$.02 -Ron Shepard
Matthias Kehlenbeck

2004-05-12, 9:09 pm

Hello Ron,

you are right. I made a rather stupid fault. Please look at the other
part of the thread for that.

However, the array multiplication you and Michel suggest seems not to be
very fast. I wrote a little benchmark program to test this. (I have to
admit, I did not care for cache flushing, so far). Please look at the
following table (MFLOPS, Intel Fortran 8.0):

--8<---------------------------------------------------------------------

BENCHMARK 'MATRIX C = MATRIX A (*) MATRIX B' PERFORMANCE
========================================
================

RANDOM SEED:
2147483562 2147483398

DIMENSION LINE BY LINE COL. BY COL. ELEMENTARY
-------------------------------------------------------------
2 x 2 66.6667 106.6667 65.3061
3 x 3 88.8887 152.3807 79.9999
4 x 4 100.0000 177.7777 100.0000
5 x 5 114.2857 200.0000 118.5185
6 x 6 123.0757 246.1512 133.3321
8 x 8 133.3334 266.6668 152.3808
10 x 10 133.3334 246.1536 145.4545
12 x 12 139.1294 290.9052 159.9986
16 x 16 152.3808 320.0003 177.7777
21 x 21 159.9949 355.5439 199.9931
24 x 24 159.9841 399.9592 199.9803
32 x 32 177.7777 399.9992 200.0003
42 x 42 177.7719 457.1261 228.5640
48 x 48 177.6639 532.9936 213.1970
64 x 64 49.2150 110.3096 88.8603
85 x 85 159.6727 245.6498 199.5917
96 x 96 77.9988 139.0414 122.9980
128 x 128 32.2715 245.7598 187.9346
170 x 170 24.4538 36.5402 28.9000
192 x 192 19.2140 34.8385 27.5679
256 x 256 4.3509 34.1927 27.3542
341 x 341 13.0273 34.8842 30.4814
384 x 384 4.7276 33.6584 27.4033
512 x 512 1.9478 31.7750 26.8867
682 x 682 13.6134 35.3259 28.1893
768 x 768 3.2301 33.1361 27.3067
1024 x 1024 1.8310 32.0994 27.1183
1365 x 1365 8.7475 32.1248 30.5447
1536 x 1536 1.8403 32.7678 26.8101
2048 x 2048 1.7404 32.0175 26.2144
2730 x 2730 3.1158 34.9902 27.9135
3072 x 3072 1.7412 31.7751 26.4347
4096 x 4096 2.0299 33.8934 27.2801

--8<---------------------------------------------------------------------

As you can see, the typical double loop is faster than "c=a*b". However,
its performance breaks down as soon as the L1 cache is exhaustet. I have
to multiply matrices which are bigger than that.

Does anyone know, how to get reasonable performance beyond that point?

Bye,

Matthias
M&M

2004-05-12, 9:09 pm


Uzytkownik "Matthias Kehlenbeck" <matthias.kehlenbeck@stud.uni-hannover.de>
> ...
> However, the array multiplication you and Michel suggest seems not to be
> very fast. I wrote a little benchmark program to test this. (I have to
> admit, I did not care for cache flushing, so far). Please look at the
> following table (MFLOPS, Intel Fortran 8.0):
>
> --8<---------------------------------------------------------------------
>
> BENCHMARK 'MATRIX C = MATRIX A (*) MATRIX B' PERFORMANCE
> ========================================
================
>
> RANDOM SEED:
> 2147483562 2147483398
>
> DIMENSION LINE BY LINE COL. BY COL. ELEMENTARY
> -------------------------------------------------------------
> 2 x 2 66.6667 106.6667 65.3061
> 3 x 3 88.8887 152.3807 79.9999
> 4 x 4 100.0000 177.7777 100.0000
> 5 x 5 114.2857 200.0000 118.5185
> 6 x 6 123.0757 246.1512 133.3321
> 8 x 8 133.3334 266.6668 152.3808
> 10 x 10 133.3334 246.1536 145.4545
> 12 x 12 139.1294 290.9052 159.9986
> 16 x 16 152.3808 320.0003 177.7777
> 21 x 21 159.9949 355.5439 199.9931
> 24 x 24 159.9841 399.9592 199.9803
> 32 x 32 177.7777 399.9992 200.0003
> 42 x 42 177.7719 457.1261 228.5640
> 48 x 48 177.6639 532.9936 213.1970
> 64 x 64 49.2150 110.3096 88.8603
> 85 x 85 159.6727 245.6498 199.5917
> 96 x 96 77.9988 139.0414 122.9980
> 128 x 128 32.2715 245.7598 187.9346
> 170 x 170 24.4538 36.5402 28.9000
> 192 x 192 19.2140 34.8385 27.5679
> 256 x 256 4.3509 34.1927 27.3542
> 341 x 341 13.0273 34.8842 30.4814
> 384 x 384 4.7276 33.6584 27.4033
> 512 x 512 1.9478 31.7750 26.8867
> 682 x 682 13.6134 35.3259 28.1893
> 768 x 768 3.2301 33.1361 27.3067
> 1024 x 1024 1.8310 32.0994 27.1183
> 1365 x 1365 8.7475 32.1248 30.5447
> 1536 x 1536 1.8403 32.7678 26.8101
> 2048 x 2048 1.7404 32.0175 26.2144
> 2730 x 2730 3.1158 34.9902 27.9135
> 3072 x 3072 1.7412 31.7751 26.4347
> 4096 x 4096 2.0299 33.8934 27.2801
>
> --8<---------------------------------------------------------------------
>
> As you can see, the typical double loop is faster than "c=a*b". However,
> its performance breaks down as soon as the L1 cache is exhaustet. I have
> to multiply matrices which are bigger than that.
>
> Does anyone know, how to get reasonable performance beyond that point?



What is your system configuration, especially memory speed?
With this it'll be quite easy to compute maximum theoretical throughput.
The operation you want to make is memory speed limited (not procesor).

Maciej Nawrocki, Ph.D.
Wroclaw University of Technology


Michael Metcalf

2004-05-12, 9:09 pm


"M&M" <nawrockimaciej@poczta.onet.pl> wrote in message
news:c7ofoi$r1t$1@nemesis.news.tpi.pl...
> What is your system configuration, especially memory speed?
> With this it'll be quite easy to compute maximum theoretical throughput.
> The operation you want to make is memory speed limited (not procesor).
>


Note that for the Fortran syntax version (C = A * B), the compiler may well
allocate a temporay array to hold the result of the RHS. If it does, this
could affect performance for large N. Depends on the compiler and
optimization level, of course.

Regards,

Mike Metcalf


Dick Hendrickson

2004-05-12, 9:09 pm



Matthias Kehlenbeck wrote:
> Hello Ron,
>
> you are right. I made a rather stupid fault. Please look at the other
> part of the thread for that.
>
> However, the array multiplication you and Michel suggest seems not to be
> very fast. I wrote a little benchmark program to test this. (I have to
> admit, I did not care for cache flushing, so far). Please look at the
> following table (MFLOPS, Intel Fortran 8.0):
>
> --8<---------------------------------------------------------------------
>
> BENCHMARK 'MATRIX C = MATRIX A (*) MATRIX B' PERFORMANCE
> ========================================
================
>
> RANDOM SEED:
> 2147483562 2147483398
>
> DIMENSION LINE BY LINE COL. BY COL. ELEMENTARY
> -------------------------------------------------------------
> 2 x 2 66.6667 106.6667 65.3061
> 3 x 3 88.8887 152.3807 79.9999
> 4 x 4 100.0000 177.7777 100.0000
> 5 x 5 114.2857 200.0000 118.5185
> 6 x 6 123.0757 246.1512 133.3321
> 8 x 8 133.3334 266.6668 152.3808
> 10 x 10 133.3334 246.1536 145.4545
> 12 x 12 139.1294 290.9052 159.9986
> 16 x 16 152.3808 320.0003 177.7777
> 21 x 21 159.9949 355.5439 199.9931
> 24 x 24 159.9841 399.9592 199.9803
> 32 x 32 177.7777 399.9992 200.0003
> 42 x 42 177.7719 457.1261 228.5640
> 48 x 48 177.6639 532.9936 213.1970
> 64 x 64 49.2150 110.3096 88.8603
> 85 x 85 159.6727 245.6498 199.5917
> 96 x 96 77.9988 139.0414 122.9980
> 128 x 128 32.2715 245.7598 187.9346
> 170 x 170 24.4538 36.5402 28.9000
> 192 x 192 19.2140 34.8385 27.5679
> 256 x 256 4.3509 34.1927 27.3542
> 341 x 341 13.0273 34.8842 30.4814
> 384 x 384 4.7276 33.6584 27.4033
> 512 x 512 1.9478 31.7750 26.8867
> 682 x 682 13.6134 35.3259 28.1893
> 768 x 768 3.2301 33.1361 27.3067
> 1024 x 1024 1.8310 32.0994 27.1183
> 1365 x 1365 8.7475 32.1248 30.5447
> 1536 x 1536 1.8403 32.7678 26.8101
> 2048 x 2048 1.7404 32.0175 26.2144
> 2730 x 2730 3.1158 34.9902 27.9135
> 3072 x 3072 1.7412 31.7751 26.4347
> 4096 x 4096 2.0299 33.8934 27.2801
>
> --8<---------------------------------------------------------------------
>
> As you can see, the typical double loop is faster than "c=a*b". However,
> its performance breaks down as soon as the L1 cache is exhaustet. I have
> to multiply matrices which are bigger than that.
>
> Does anyone know, how to get reasonable performance beyond that point?

The normal way to increase performance for large arrays is
to do some sort of blocking into submatrices. Since 48 by
48 looks like the max performance try something like

Do I_outer = 1,N,48
Do J_outer = 1,N,48
Do I = I_outer,max(N,I_outer+47)
Do J = J_outer,max(N,J_outer+47)
C(J,I) = ...
EndDo**4

That way each inner loop pair works on 48X48 matrices.
The downsides to this is that it's hard to read and
understand, it's wasy to get an off-by-one error in
soem of the loop indexes, you'll need to experiment
with the loop orders, it's harder to get the initial
values into C, it's hard to get the last terms if
the matrix size isn't a multiple of 48 (I'm not sure
I got it right ;-{ ), the various divides by 48 and
max functions might slow things down somewhat (it's
possible to set things up so that there is no divide
by 48 or max in the DO statements, but it's a bugger
to get correct the first time ;-( ), and finally I've
lost your original post, so I'm not sure this will
actually work on your code.

Dick Hendrickson

>
> Bye,
>
> Matthias


James Giles

2004-05-12, 9:09 pm

Michael Metcalf wrote:
> "M&M" <nawrockimaciej@poczta.onet.pl> wrote in message
> news:c7ofoi$r1t$1@nemesis.news.tpi.pl...
>
> Note that for the Fortran syntax version (C = A * B), the compiler may
> well allocate a temporay array to hold the result of the RHS. If it
> does, this could affect performance for large N. Depends on the compiler
> and optimization level, of course.



Note also that the copy may improve performance. In fact,
it's still not clear from the thread what's really being tested
here. When you write C = A * B, are A, B, and C contiguous
arrays? Does the compiler *know* they're contiguous? If not,
the extra manipulation to accommodate their (possible) lack of
contiguity may be the cause of the slowdown. If they're contiguous,
and the compiler knows it, the internal implementation of the
multiply should be a single loop with stride one on each of the
operands. After all, elementwise on arrays of any rank is the
same as elementwise on the rank-one memory images of those
arrays. So, if the code were written to make a copy, the copy
would have a cost, but the multiply would be faster. (On such a
simple benchmark which way is fastest may vary. But if more
calculations were involved, the single copy-in/out cycle on each
operand should be cheaper than the time required to constantly
worry about contiguity in the operations.)

Your best speed would be when no copy is made or necessary,
and the compiler *knows* that the arrays are contiguous anyway:
in other words, all the arrays are declared and all procedure arguments
are passed using F77 style declarations and argument associations.
By introducing the (possible) lack of contiguity, the F90 whole array
features didn't necessarily do us any favors as far as efficiency goes.
On the other hand, if your problem really involves the access of
discontiguous slices, the whole array stuff is arguably no worse
(efficiency-wise) than making the copies yourself, or carrying
around the strides and writing the loops yourself - and it's a lot
more convenient. The problem arises when there really isn't any
lack of contiguity but the compiler can't tell. What's really needed
is an explicit way to tell the compiler when to copy (and force
contiguity), when to assume contiguity, and when to just cope
with stride dynamically. Then the programmer could choose the
best for the actual situation.

--
J. Giles


Richard Maine

2004-05-12, 9:09 pm

"James Giles" <jamesgiles@worldnet.att.net> writes:

> What's really needed is an explicit way to tell the compiler when to
> copy (and force contiguity), when to assume contiguity, and when to
> just cope with stride dynamically. Then the programmer could choose
> the best for the actual situation.


Sounds a lot like NAG's -Oassumed command-line option. From the man page:

-Oassumed=shape
Optimises assumed-shape array dummy arguments according to
the value of shape, which must be one of

always_contig
Optimised for contiguous actual arguments. If the
actual argument is not contiguous a runtime error
will occur (the compiler is not standard-conforming
under this option).

contig Optimised for contiguous actual arguments; if the
actual argument is not contiguous (i.e. it is an
array section) a contiguous local copy is made.
This may speed up array section accessing if a suf-
ficiently large number of array element or array
operations is performed (i.e. if the cost of making
the local copy is less than the overhead of discon-
tiguous array accesses), but usually makes such
accesses slower. Note that this option does not
affect dummy arguments with the TARGET attribute;
these are always accessed via the dope vector.

section Optimised for low-moderate accesses to array sec-
tion (discontiguous) actual arguments. This is the
default.

Note that CHARACTER arrays are not affected by these options.

--
Richard Maine | Good judgment comes from experience;
email: my first.last at org.domain | experience comes from bad judgment.
org: nasa, domain: gov | -- Mark Twain
Maciej Nawrocki

2004-05-12, 9:09 pm

Uzytkownik "Matthias Kehlenbeck" <matthias.kehlenbeck@stud.uni-hannover.de>
napisal w wiadomosci news:c7nk1o$8m4$05$1@news.t-online.com...
> Hello,
>
> I am new to Fortran and would very much appreciate if someone could help
> me with a small problem:
>
> I have to do an element wise matrix multiplication:
>
> | A11, A12 | (*) | B11, B12 | = | A11 * B11, A12 * B12 |
> | A21, A22 | | B21, B22 | | A21 * B21, A22 * B22 |
>
> Matrix A (*) Matrix B = Matrix C
>
> Unfortunately, this is not part of the Basic Linear Algebra Subroutines.
>
> Therefore, I studied the reference implementation and came to the
> conclusion that the following should work:
>
> CALL DGEMV('N', 1, itM*itN, 1.0D0, dbMatrixA, 1, dbMatrixB, 1, 0.0D0,
> dbMatrixC, 1)
>
> [itM and itN are the dimensions of the three matrices.]
>
> Luckily, this returns the right result for all but the first element of
> Matrix C.
>
> Now, my question is: Why does it not work for the first element, too?
>
> Any ideas?
>
> Matthias Kehlenbeck



As far as I know ;-) xGEMV does vector-matrix multiplication in 'matrix'
sense, not elementwise.

Maciej Nawrocki, Ph.D.
Wroclaw Univeristy of Technology


Maciej Nawrocki

2004-05-12, 9:09 pm

Użytkownik "James Giles" <jamesgiles@worldnet.att.net> napisał w wiadomości
news:usQnc.72338$Xj6.1215799@bgtnsc04-news.ops.worldnet.att.net...
> Michael Metcalf wrote:
throughput.[color=darkred]
>
>
> Note also that the copy may improve performance. In fact,
> it's still not clear from the thread what's really being tested
> here. When you write C = A * B, are A, B, and C contiguous
> arrays? Does the compiler *know* they're contiguous? If not,
> the extra manipulation to accommodate their (possible) lack of
> contiguity may be the cause of the slowdown. If they're contiguous,
> and the compiler knows it, the internal implementation of the
> multiply should be a single loop with stride one on each of the
> operands. After all, elementwise on arrays of any rank is the
> same as elementwise on the rank-one memory images of those
> arrays. So, if the code were written to make a copy, the copy
> would have a cost, but the multiply would be faster. (On such a
> simple benchmark which way is fastest may vary. But if more
> calculations were involved, the single copy-in/out cycle on each
> operand should be cheaper than the time required to constantly
> worry about contiguity in the operations.)
>
> Your best speed would be when no copy is made or necessary,
> and the compiler *knows* that the arrays are contiguous anyway:
> in other words, all the arrays are declared and all procedure arguments
> are passed using F77 style declarations and argument associations.
> By introducing the (possible) lack of contiguity, the F90 whole array
> features didn't necessarily do us any favors as far as efficiency goes.
> On the other hand, if your problem really involves the access of
> discontiguous slices, the whole array stuff is arguably no worse
> (efficiency-wise) than making the copies yourself, or carrying
> around the strides and writing the loops yourself - and it's a lot
> more convenient. The problem arises when there really isn't any
> lack of contiguity but the compiler can't tell. What's really needed
> is an explicit way to tell the compiler when to copy (and force
> contiguity), when to assume contiguity, and when to just cope
> with stride dynamically. Then the programmer could choose the
> best for the actual situation.
>
> --
> J. Giles
>


Yes guys, but note that whatever programer chooses it'll be always (except
of expensive supercomputers and old PCs) limited by memory speed for this
kind of calculation since there is no reuse of data already loaded into
cache (thanks to this reuse xGEMM becomes processor speed limited). You can
use various compilers, compiling options, assembler, libraries etc. but
processor waits most of the time for data from memory.

Best wishes,
Maciej Nawrocki, Ph.D.
Wroclaw Univeristy of Technology


Matthias Kehlenbeck

2004-05-12, 9:09 pm

Hello Maciej,

> What is your system configuration, especially memory speed?
> With this it'll be quite easy to compute maximum theoretical throughput.
> The operation you want to make is memory speed limited (not procesor).


I use an Athlon XP-M 2800+ (64KB L1 data cache, 64KB L2 instruction
cache, 512KB L2 cache, 2133MHz), running SuSE Linux 9.0 (Kernel 2.4.21).
How do I compute the theoretical throughput?


Bye,

Matthias
Matthias Kehlenbeck

2004-05-12, 9:09 pm

Hello Marciej,

> As far as I know ;-) xGEMV does vector-matrix multiplication in 'matrix'
> sense, not elementwise.


Well yes, I wanted to be clever...


Maybe I have more luck next time,

Matthias
Matthias Kehlenbeck

2004-05-12, 9:09 pm

Hello James,

> Note also that the copy may improve performance. In fact,
> it's still not clear from the thread what's really being tested
> here. When you write C = A * B, are A, B, and C contiguous
> arrays?


I declared the arrays as follows:

--8<-------------------------------------------------------------

DOUBLEPRECISION, ALLOCATABLE :: dbMatrixA(:, :), dbMatrixB(: ,:),
dbMatrixC(:, :)

--8<-------------------------------------------------------------

Bye,

Matthias
Matthias Kehlenbeck

2004-05-12, 9:09 pm

Hello Dick,

> The normal way to increase performance for large arrays is
> to do some sort of blocking into submatrices. Since 48 by
> 48 looks like the max performance try something like
>
> Do I_outer = 1,N,48
> Do J_outer = 1,N,48
> Do I = I_outer,max(N,I_outer+47)
> Do J = J_outer,max(N,J_outer+47)
> C(J,I) = ...
> EndDo**4


This sounds like a good idea. I will try that tomorrow.


Good night,

Matthias
James Giles

2004-05-12, 9:09 pm

Maciej Nawrocki wrote:
....
> Yes guys, but note that whatever programer chooses it'll be always
> (except of expensive supercomputers and old PCs) limited by memory speed
> for this kind of calculation since there is no reuse of data already
> loaded into cache (thanks to this reuse xGEMM becomes processor speed
> limited). You can use various compilers, compiling options, assembler,
> libraries etc. but processor waits most of the time for data from memory.



Yes, there will always be an asymptotic limit that is
related to memory speed. But, there are reasons to
believe that's not all that's going on in this case. For
one thing, the three columns are different, even when
the arrays are large, rather than converging to the same
value. And, a double-loop version yields the best speed
even for those large arrays. As I pointed out, the best speed
should be expected from doing the element-wise operation
on the equivalent rank-one memory images of the arrays.
So, there's still some improvement to be expected.

--
J. Giles


Maciej Nawrocki

2004-05-12, 9:09 pm

Użytkownik "James Giles" <jamesgiles@worldnet.att.net> napisał w wiadomości
news:aVRnc.40358$Ut1.1157891@bgtnsc05-news.ops.worldnet.att.net...
> Maciej Nawrocki wrote:
> ...
memory.[color=darkred]
>
>
> Yes, there will always be an asymptotic limit that is
> related to memory speed. But, there are reasons to
> believe that's not all that's going on in this case. For
> one thing, the three columns are different, even when
> the arrays are large, rather than converging to the same
> value. And, a double-loop version yields the best speed
> even for those large arrays. As I pointed out, the best speed
> should be expected from doing the element-wise operation
> on the equivalent rank-one memory images of the arrays.
> So, there's still some improvement to be expected.
>


You are right that memory speed is one of the limits (there are many
others). Concerning that three columns we don't know what options were used
in compiler so we can't discus about the results. That's right the speed
will be the best when we use data from contignous memory byte-by-byte and I
think that it doesn't matter if we use single loop for 1x(N^2) matrix or
double loop for NxN matrix.

Maciej Nawrocki, Ph.D.
Wroclaw University of Technology


Maciej Nawrocki

2004-05-12, 9:09 pm

Uzytkownik "Matthias Kehlenbeck" <matthias.kehlenbeck@stud.uni-hannover.de>
napisal w wiadomosci news:c7ooi6$d8t$04$1@news.t-online.com...
> Hello Maciej,
>
>
> I use an Athlon XP-M 2800+ (64KB L1 data cache, 64KB L2 instruction
> cache, 512KB L2 cache, 2133MHz), running SuSE Linux 9.0 (Kernel 2.4.21).
> How do I compute the theoretical throughput?


So, you probably have 266MHz DDR memory that gives 266 MHz x 8 bytes (64
bits) = 2.1 GBps (shared for both directions). To multiply two matrices
(elementwise) you need 2 arguments to read from memory and 1 result to write
to memory (3 memory acceses). Every element is double, so it uses 8 bytes.
So, the limit is 2.1 / (3 * 8) = 0.0875 Gflops = 87.5 Mflops.

Note, this is VERY rough estimete and other factors can limit this value.
Cache doesn't help you because you don't reuse your data. You can speed up
your program by interleaving this operations with others that are limited by
the processor (operations on reused data, trigonometry, etc) but this is
rather tricky and requires a lot of experience as well as your own tests and
often requires use of assembler.

Maciej Nawrocki, Ph.D.
Wroclaw University of Technology


Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com