For Programmers: Free Programming Magazines  


Home > Archive > Fortran > June 2005 > Improve efficiency of 3 big array ops.









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Improve efficiency of 3 big array ops.
andy2O@hotmail.com

2005-06-01, 9:03 am

Hi all,

I have a working code which spends nearly all its time performing array
operations on big 2-D (densely populated) arrays. I've coded the
operations to directly express the mathematical algorithm, but I feel
sure there must be better ways of structuring the operations to make
them run more efficiently on real silicon.

(The application, by the way, is a basic easy-to-use 3-D Lattice
Boltzman fluid dynamics code that I want to tidy up and improve it's
efficiency now before to release on the web.... better variable names
might be in order too :-) )

Can anyone suggest bottlenecks / improvements / better data structures
for the following 3 key operations on modern PC (i.e. Pentium 4) type
processors? My knowledge of CPU (in)efficiencies and memory access
issues is too limited to spot them myself and I'm happy to recode for
decent gains.

By the way the array declarations are shown at the bottom of this post.
I use allocatable arrays for everything. If I've missed out critical
information, please let me know and I'll post it asap!

Many thanks for any advice,
andy.


1) Lots of linear combinations....

I'm calculating new values for ui. The integer variable ni (the number
of internal mesh nodes) is approx 200000, and num_vel is typically 19.
Note e(1:num_vel,1:3) is an integer array containing values 0,1,-1,
f(1:ni,1:num-vel) is a double precision array, c is a double precision
scalar and rhoi(1:ni) is a double precision array. Any thoughts?

Here's the code, which gprof says takes a long time:

ui(1:ni,1) = c*e(1,1)*f(1:ni,1)
ui(1:ni,2) = c*e(1,2)*f(1:ni,1)
ui(1:ni,3) = c*e(1,3)*f(1:ni,1)
DO iv=2,num_vel
ui(1:ni,1) = ui(1:ni,1) + c*e(iv,1)*f(1:ni,iv)
ui(1:ni,2) = ui(1:ni,2) + c*e(iv,2)*f(1:ni,iv)
ui(1:ni,3) = ui(1:ni,3) + c*e(iv,3)*f(1:ni,iv)
END DO
!NB: rhoi must be up-to-date for this next line to be correct.
tempvec_ni = 1/rhoi
ui(1:ni,1) = ui(1:ni,1)*tempvec_ni
ui(1:ni,2) = ui(1:ni,2)*tempvec_ni
ui(1:ni,3) = ui(1:ni,3)*tempvec_ni



2) Big data shuffles:

This one is the most mysterious to me, as it shuffles array elements
around a lot. Do pointers have a role here? I've never used them much.
I need to transfer the value of f(j,iv) at to mesh node j to one its 19
or so neighbouring nodes. This needs to be repeated for all i=1,ni,
where ni~200000 and for all iv=1,19.

Here neighi(1:ni,1:num_vel) is an array which contains the integer
valued indices of the appropriate neighbour node in each of the num_vel
directions.

!We stream in each velocity direction in turn.
DO iv=1,num_vel

!We stream the iv'th population *from* the neighbour lying in
!the direction -e(iv,:) to the node in question. That is to
!say, we stream the iv'th population from the neighbour in
!direction op_vel(iv).
v = op_vel(iv)

!The streaming step defines the new values of f at the
!interior nodes.
f(1:ni,iv) = f(neighi(1:ni,v),iv)

END DO


3) Non-linear combinations:

I'm calculating new values for f_eq(1:ni,1:num_vel) from ui(1:ni,1:3)
and rhoi(1:ni). Note, ni = approx 200000 and num_vel = approx 19. Note
e(1:num_vel,1:3) is an integer array containing values 0,1,-1. The
array ui(1:ni,1:num_vel) is described above. wi(1:num_vel) is a small
array of double precision values. The coeff_... values are double
precision scalars.

tempvec_ni2 = ui(1:ni,1)**2+ui(1:ni,2)**2+ui(1:ni,3)**
2
DO iv=1,num_vel

tempvec_ni = e(iv,1)*ui(1:ni,1) + e(iv,2)*ui(1:ni,2) +
e(iv,3)*ui(1:ni,3)
f_eq(1:ni,iv) = w(iv)*rhoi*( 1 + coeff_e_dot_u*tempvec_ni &
& + coeff_e_dot_u_sq*tempvec_ni*tempvec_ni &
& + coeff_u_dot_u*tempvec_ni2)

END DO






Here are the variable declarations and allocations. They've been culled
from 3 modules:

REAL(kind=dp), PRIVATE, ALLOCATABLE :: f(:,:)
REAL(kind=dp), PRIVATE, ALLOCATABLE :: f_eq(:,:)
REAL(kind=dp), PRIVATE, ALLOCATABLE :: tempvec_ni(:),tempvec_ni2(:)

REAL(kind=dp), ALLOCATABLE :: rhoi(:),ui(:,:),ub(:,:),pb(:)
REAL(kind=dp) :: c,cs,t,dt

INTEGER, ALLOCATABLE :: e(:,:), op_vel(:), bc_spatialvariabletypes(:)
REAL(kind=dp), ALLOCATABLE :: w(:)

!nn=total number of mesh nodes => typically around 200000
!ni=number of internal mesh nodes => comparable to nn.
!num_vel = typically 19, could be 15 or 27.

ALLOCATE(e(1:num_vel,1:3))
ALLOCATE(op_vel(1:num_vel))
ALLOCATE(w(1:num_vel))

e(1 ,1:3) = (/ 1, 0, 0 /)
e(2 ,1:3) = (/ -1, 0, 0 /)
e(3 ,1:3) = (/ 0, 0, 1 /)
[...snip similar values for e(7:19,1:3)...]

w(1:6) = 1.0_dp/9.0_dp
w(7:14) = 1.0_dp/72.0_dp
w(15) = 2.0_dp/9.0_dp

ALLOCATE(f(1:nn,1:num_vel))
ALLOCATE(f_eq(1:nn,1:num_vel))
ALLOCATE(rhoi(1:ni))
ALLOCATE(ui(1:ni,1:3))
ALLOCATE(tempvec_ni(1:ni))
ALLOCATE(tempvec_ni2(1:ni))

Michel OLAGNON

2005-06-01, 4:00 pm



andy2O@hotmail.com wrote:

>
> 1) Lots of linear combinations....
>
> I'm calculating new values for ui. The integer variable ni (the number
> of internal mesh nodes) is approx 200000, and num_vel is typically 19.
> Note e(1:num_vel,1:3) is an integer array containing values 0,1,-1,
> f(1:ni,1:num-vel) is a double precision array, c is a double precision
> scalar and rhoi(1:ni) is a double precision array. Any thoughts?
>
> Here's the code, which gprof says takes a long time:
>
> ui(1:ni,1) = c*e(1,1)*f(1:ni,1)
> ui(1:ni,2) = c*e(1,2)*f(1:ni,1)
> ui(1:ni,3) = c*e(1,3)*f(1:ni,1)
> DO iv=2,num_vel
> ui(1:ni,1) = ui(1:ni,1) + c*e(iv,1)*f(1:ni,iv)
> ui(1:ni,2) = ui(1:ni,2) + c*e(iv,2)*f(1:ni,iv)
> ui(1:ni,3) = ui(1:ni,3) + c*e(iv,3)*f(1:ni,iv)
> END DO
> !NB: rhoi must be up-to-date for this next line to be correct.
> tempvec_ni = 1/rhoi
> ui(1:ni,1) = ui(1:ni,1)*tempvec_ni
> ui(1:ni,2) = ui(1:ni,2)*tempvec_ni
> ui(1:ni,3) = ui(1:ni,3)*tempvec_ni
>


Some idea for that one:

DO id = 1, 3
DO ini = 1, ni
ui(ini,id) = c / rhoi(ini) * &
dot_product(e(1:num_vel,id), f(ini,1:num_vel))
END DO
END DO

The underlying idea is that on PCs, better try to keep memory access,
and especially stores, to a minimum.

You might also define logical maskn and maskp (1:num_vel), and
DO id = 1, 3
maskn = e(:,id) == -1.0_dp
maskp = e(:,id) == +1.0_dp
DO ini = 1, ni
ui(ini,id) = c / rhoi(ini) * &
(sum(f(ini,1:num_vel), MASK=maskp) - &
sum(f(ini,1:num_vel), MASK=maskn))
END DO
END DO

Of course, it would even be better if you could exchange
the dimensions of f, (f(1:num-vel,1:ni)), but you would
probably then loose the time elsewhere...

Michel OLAGNON

2005-06-01, 4:00 pm



andy2O@hotmail.com wrote:

>
> 2) Big data shuffles:
>
> This one is the most mysterious to me, as it shuffles array elements
> around a lot. Do pointers have a role here? I've never used them much.
> I need to transfer the value of f(j,iv) at to mesh node j to one its 19
> or so neighbouring nodes. This needs to be repeated for all i=1,ni,
> where ni~200000 and for all iv=1,19.
>
> Here neighi(1:ni,1:num_vel) is an array which contains the integer
> valued indices of the appropriate neighbour node in each of the num_vel
> directions.
>
> !We stream in each velocity direction in turn.
> DO iv=1,num_vel
>
> !We stream the iv'th population *from* the neighbour lying in
> !the direction -e(iv,:) to the node in question. That is to
> !say, we stream the iv'th population from the neighbour in
> !direction op_vel(iv).
> v = op_vel(iv)
>
> !The streaming step defines the new values of f at the
> !interior nodes.
> f(1:ni,iv) = f(neighi(1:ni,v),iv)
>
> END DO
>


I had some experience where using one's own temporary work arrays
improved significantly the performance :
DO iv=1,num_vel

v = op_vel(iv)
work(1:ni) = f(1:ni,iv)
f(1:ni,iv) = work (neighi(1:ni,v))

END DO

Michel OLAGNON

2005-06-01, 4:00 pm



andy2O@hotmail.com wrote:

>
> 3) Non-linear combinations:
>
> I'm calculating new values for f_eq(1:ni,1:num_vel) from ui(1:ni,1:3)
> and rhoi(1:ni). Note, ni = approx 200000 and num_vel = approx 19. Note
> e(1:num_vel,1:3) is an integer array containing values 0,1,-1. The
> array ui(1:ni,1:num_vel) is described above. wi(1:num_vel) is a small
> array of double precision values. The coeff_... values are double
> precision scalars.
>
> tempvec_ni2 = ui(1:ni,1)**2+ui(1:ni,2)**2+ui(1:ni,3)**
2
> DO iv=1,num_vel
>
> tempvec_ni = e(iv,1)*ui(1:ni,1) + e(iv,2)*ui(1:ni,2) +
> e(iv,3)*ui(1:ni,3)
> f_eq(1:ni,iv) = w(iv)*rhoi*( 1 + coeff_e_dot_u*tempvec_ni &
> & + coeff_e_dot_u_sq*tempvec_ni*tempvec_ni &
> & + coeff_u_dot_u*tempvec_ni2)
>
> END DO
>
>
>


Array operations are nice, but sometimes loops are more efficient:



DO iv=1,num_vel
DO ini = 1, ni
temp = e(iv,1)*ui(ini,1) + e(iv,2)*ui(ini,2) + &
e(iv,3)*ui(ini,3)
temp2 = ui(ini,1)**2+ui(ini,2)**2+ui(ini,3)**2
f_eq(ini,iv) = w(iv)*rhoi(ini)*( 1 + coeff_e_dot_u*temp &
& + coeff_e_dot_u_sq*temp*temp &
& + coeff_u_dot_u*temp2)

END DO
END DO






Tim Prince

2005-06-01, 4:00 pm

andy2O@hotmail.com wrote:

> Can anyone suggest bottlenecks / improvements / better data structures
> for the following 3 key operations on modern PC (i.e. Pentium 4) type
> processors? My knowledge of CPU (in)efficiencies and memory access
> issues is too limited to spot them myself and I'm happy to recode for
> decent gains.
>


> I'm calculating new values for ui. The integer variable ni (the number
> of internal mesh nodes) is approx 200000, and num_vel is typically 19.
> Note e(1:num_vel,1:3) is an integer array containing values 0,1,-1,
> f(1:ni,1:num-vel) is a double precision array, c is a double precision
> scalar and rhoi(1:ni) is a double precision array. Any thoughts?


> DO iv=2,num_vel
> ui(1:ni,1) = ui(1:ni,1) + c*e(iv,1)*f(1:ni,iv)
> ui(1:ni,2) = ui(1:ni,2) + c*e(iv,2)*f(1:ni,iv)
> ui(1:ni,3) = ui(1:ni,3) + c*e(iv,3)*f(1:ni,iv)
> END DO

Maybe it would be better to write these out as dot_product() operations, if
you could reverse the order of subscripts on f(:,:) so as to permit use of
parallel instructions.

> 3) Non-linear combinations:
>


> tempvec_ni2 = ui(1:ni,1)**2+ui(1:ni,2)**2+ui(1:ni,3)**
2
> DO iv=1,num_vel
>
> tempvec_ni = e(iv,1)*ui(1:ni,1) + e(iv,2)*ui(1:ni,2) +
> e(iv,3)*ui(1:ni,3)
> f_eq(1:ni,iv) = w(iv)*rhoi*( 1 + coeff_e_dot_u*tempvec_ni &
> & + coeff_e_dot_u_sq*tempvec_ni*tempvec_ni &
> & + coeff_u_dot_u*tempvec_ni2)
>
> END DO

tempvec_ni2 = (ui(1:ni,1)**2+ui(1:ni,2)**2+ui(1:ni,3)*
*2)*coeff_u_dot_u
DO iv=1,num_vel

tempvec_ni = e(iv,1)*ui(1:ni,1) + e(iv,2)*ui(1:ni,2) + e(iv,3)*ui(1:ni,3)
temp = w(iv)*rhoi
f_eq(1:ni,iv) = temp + temp*( (coeff_e_dot_u &
& + coeff_e_dot_u_sq*tempvec_ni)*tempvec_ni &
& + tempvec_ni2)

END DO
It looks unlikely that your compiler will see the simple optimizations to
remove 50% of your multiply operations.
Distribution of the (1 + ...) operation is more for the benefit of accuracy
than for speed.

In view of the size of your problem, cache behavior is likely to be
important. You may have to block your loops on iv so that they don't
overflow L1 cache, if multiple loops are using the same data.


--
Tim Prince
beliavsky@aol.com

2005-06-01, 4:00 pm

Tim Prince wrote:

<snip>

> In view of the size of your problem, cache behavior is likely to be
> important. You may have to block your loops on iv so that they don't
> overflow L1 cache, if multiple loops are using the same data.


I don't understand the last sentence, having little knowledge of code
optimization. What does it mean to "block a loop"? Thanks.

Joost

2005-06-01, 4:00 pm

I would explain it by example. Code a normal (i.e. using the most
straightforward do loops 1..512) matrix matrix multiply of 512 x 512
matrices, this will be a slow, unblocked way of doing things. Now, try
an alternative approach, in which you partition your matrix in
subblocks of 16x16 matrices and multiply the full matrix together
computing subblock after subblock (i.e. as a function of matrix
multiplies 16x16 in size). You'll have blocked the loops. Try it,
you'll notice it runs 10 times faster (exact number of course depending
on e.g. the size of the subblocks, and the cache of your CPU).

Cheers,

Joost

Greg Lindahl

2005-06-01, 4:00 pm

In article <1117639978.567691.187140@z14g2000cwz.googlegroups.com>,
Joost <jv244@cam.ac.uk> wrote:

> Try it,
> you'll notice it runs 10 times faster (exact number of course depending
> on e.g. the size of the subblocks, and the cache of your CPU).


Unless, of course, your compiler does cache blocking. Blocked code is
often a bit slower with a good compiler.

-- greg


andy2O@hotmail.com

2005-06-01, 4:00 pm

Michel, Tim,

Thanks for replies. I'll try coding up the changes you suggested asap
and see how the timings change.

Tim - About loop blocking: A google search turned up the snippet from
an old post to c.l.f. from Dan Nagle (thanks Dan), copied below. Is
this what you meant?

Also, and perhaps of interest to beliavsky too, I found Intel has an
article on this topic at

http://www.intel.com/cd/ids/develop...36.htm?page=3D1

which I'm working my way through (I note the quote from this article
that "This explains why a na=EFve data blocking may slow down an
application." with apprehension!) . But I have a few very newbie level
CPU queries....

i) How do I work out how big the blocks should be? Is it time to look
up processor cache sizes, find the number of bytes in a double and get
out my calculator?

ii) Can I safely assume that my array data can have the whole L1 cache
to itself? Or do I have to play safe, assuming there might be other
stuff cluttering it up...

Thanks again.
andy


Dan Nagle wrote:

Hello,

Off the top of my head,

do i =3D 1, n ! vector code
a( i) =3D b( i) + c( i)
d( i) =3D b( i) * c( i)
enddo

becomes

isize =3D (fits in cache 4 times in this example) ! figure somehow
nblocks =3D n / isize ! note: must do remainder
do i =3D 1, nblocks, isize ! nblocks of isize
a( i: i+isize-1) =3D b( i: i+isize-1) + c( i: i+isize-1)
d( i: i+isize-1) =3D b( i: i+isize-1) * c( i: i+isize-1)
enddo

I hope that's close, at least +-1 on the indices :-)

Here, b and c can remain in cache where in the "vector" case,
they may be flushed.

andy2O@hotmail.com

2005-06-01, 4:00 pm

andy2O@hotmail.com wrote:

[snip]

> I found Intel has an
> article on this topic at
>
> http://www.intel.com/cd/ids/develop...36.htm?page=3D1
>
> which I'm working my way through (I note the quote from this article
> that "This explains why a na=EFve data blocking may slow down an

^^^^^
Sorry that should read naive - I cut-n-pasted non ascii by mistake.

> application." with apprehension!) .


Dr Ivan D. Reid

2005-06-01, 4:00 pm

On 1 Jun 2005 02:59:23 -0700, andy2O@hotmail.com <andy2O@hotmail.com>
wrote in <1117619963.208629.261180@o13g2000cwo.googlegroups.com>:
> 1) Lots of linear combinations....


> I'm calculating new values for ui. The integer variable ni (the number
> of internal mesh nodes) is approx 200000, and num_vel is typically 19.
> Note e(1:num_vel,1:3) is an integer array containing values 0,1,-1,
> f(1:ni,1:num-vel) is a double precision array, c is a double precision
> scalar and rhoi(1:ni) is a double precision array. Any thoughts?


> Here's the code, which gprof says takes a long time:


> ui(1:ni,1) = c*e(1,1)*f(1:ni,1)
> ui(1:ni,2) = c*e(1,2)*f(1:ni,1)
> ui(1:ni,3) = c*e(1,3)*f(1:ni,1)
> DO iv=2,num_vel
> ui(1:ni,1) = ui(1:ni,1) + c*e(iv,1)*f(1:ni,iv)
> ui(1:ni,2) = ui(1:ni,2) + c*e(iv,2)*f(1:ni,iv)
> ui(1:ni,3) = ui(1:ni,3) + c*e(iv,3)*f(1:ni,iv)
> END DO
> !NB: rhoi must be up-to-date for this next line to be correct.
> tempvec_ni = 1/rhoi
> ui(1:ni,1) = ui(1:ni,1)*tempvec_ni
> ui(1:ni,2) = ui(1:ni,2)*tempvec_ni
> ui(1:ni,3) = ui(1:ni,3)*tempvec_ni


What does a simple-minded:

if (e(iv,1).gt.0) then
ui(1:n1,1) = ui(1:ni,1) + c*f(1:ni,iv)
elseif (e(iv,1).lt.0) then
ui(1:n1,1) = ui(1:ni,1) - c*f(1:ni,iv)
endif

....etc. buy you here? Going further, is there any pattern in e() that
you can exploit? I always try to avoid known multiplications by 0, or
+/- 1.

One could also try a temporary array

tc(1:ni)=c*f(1:ni,iv)
....
ui(1:n1,1) = ui(1:n1,1) * tc(1:ni)
.... etc.

to see if that helps. (Unless, of course, my eyes have let me down again
and I've misinterpreted something...)

--
Ivan Reid, Electronic & Computer Engineering, ___ CMS Collaboration,
Brunel University. Ivan.Reid@brunel.ac.uk Room 40-1-B12, CERN
KotPT -- "for stupidity above and beyond the call of duty".
Joost

2005-06-01, 8:58 pm

The other thing you might want to try is to declare num_vel, e (and
less importantly op_vel and w) with a parameter attribute. This
requires a little rework of your code (the easiest solution will
require recompilation of the code for each different operatore 'e') but
might bring quite something, since it will eliminate all
multiplications with e at compile time, and might give the compiler a
good hint on how much e.g. unrolling might bring.

It's a little messy but worth a try for production stuff

Joost

Greg Lindahl

2005-06-01, 8:58 pm

In article <slrnd9rqrk.2kg.Ivan.Reid@loki.brunel.ac.uk>,
Dr Ivan D. Reid <Ivan.Reid@brunel.ac.uk> wrote:

> I always try to avoid known multiplications by 0, or +/- 1.


Multiplication can be cheaper than an if/elseif, especially if branch
prediction doesn't work well with the actual data pattern.

-- greg
Richard E Maine

2005-06-01, 8:58 pm

In article <429e0013$1@news.meer.net>, lindahl@pbm.com (Greg Lindahl)
wrote:

> In article <slrnd9rqrk.2kg.Ivan.Reid@loki.brunel.ac.uk>,
> Dr Ivan D. Reid <Ivan.Reid@brunel.ac.uk> wrote:
>
>
> Multiplication can be cheaper than an if/elseif,


Been there. Done that. About 30 years ago. My "optimized" version turned
out to be much slower than the "dumb" one, at least on the particular
machine I was using at the time.

--
Richard Maine | Good judgment comes from experience;
email: my first.last at org.domain | experience comes from bad judgment.
org: nasa, domain: gov | -- Mark Twain
Christoph Arns

2005-06-02, 3:59 am

andy2O@hotmail.com wrote:

the problemsize is rather small, see the response of Tim Prince. For
larger runs of D3Q19 or similar I think you will find that the slowest
part is 2) Big data shuffles. I played around with a D3Q19 code, and
found that programming for specific cases (e.g. tau=1, force direction
along coordinate directions) gives a fair speedup - this might be
relevant for your problem size. Also, switching the dimensions in f
might help. Further, if you use mirror boundary conditions, you could
split regions where a body-force is applied from those, where it is not.

cheers,
christoph

> Hi all,
>
> I have a working code which spends nearly all its time performing array
> operations on big 2-D (densely populated) arrays. I've coded the
> operations to directly express the mathematical algorithm, but I feel
> sure there must be better ways of structuring the operations to make
> them run more efficiently on real silicon.
>
> (The application, by the way, is a basic easy-to-use 3-D Lattice
> Boltzman fluid dynamics code that I want to tidy up and improve it's
> efficiency now before to release on the web.... better variable names
> might be in order too :-) )
>
> Can anyone suggest bottlenecks / improvements / better data structures
> for the following 3 key operations on modern PC (i.e. Pentium 4) type
> processors? My knowledge of CPU (in)efficiencies and memory access
> issues is too limited to spot them myself and I'm happy to recode for
> decent gains.
>
> By the way the array declarations are shown at the bottom of this post.
> I use allocatable arrays for everything. If I've missed out critical
> information, please let me know and I'll post it asap!
>
> Many thanks for any advice,
> andy.
>
>
> 1) Lots of linear combinations....
>
> I'm calculating new values for ui. The integer variable ni (the number
> of internal mesh nodes) is approx 200000, and num_vel is typically 19.
> Note e(1:num_vel,1:3) is an integer array containing values 0,1,-1,
> f(1:ni,1:num-vel) is a double precision array, c is a double precision
> scalar and rhoi(1:ni) is a double precision array. Any thoughts?
>
> Here's the code, which gprof says takes a long time:
>
> ui(1:ni,1) = c*e(1,1)*f(1:ni,1)
> ui(1:ni,2) = c*e(1,2)*f(1:ni,1)
> ui(1:ni,3) = c*e(1,3)*f(1:ni,1)
> DO iv=2,num_vel
> ui(1:ni,1) = ui(1:ni,1) + c*e(iv,1)*f(1:ni,iv)
> ui(1:ni,2) = ui(1:ni,2) + c*e(iv,2)*f(1:ni,iv)
> ui(1:ni,3) = ui(1:ni,3) + c*e(iv,3)*f(1:ni,iv)
> END DO
> !NB: rhoi must be up-to-date for this next line to be correct.
> tempvec_ni = 1/rhoi
> ui(1:ni,1) = ui(1:ni,1)*tempvec_ni
> ui(1:ni,2) = ui(1:ni,2)*tempvec_ni
> ui(1:ni,3) = ui(1:ni,3)*tempvec_ni
>
>
>
> 2) Big data shuffles:
>
> This one is the most mysterious to me, as it shuffles array elements
> around a lot. Do pointers have a role here? I've never used them much.
> I need to transfer the value of f(j,iv) at to mesh node j to one its 19
> or so neighbouring nodes. This needs to be repeated for all i=1,ni,
> where ni~200000 and for all iv=1,19.
>
> Here neighi(1:ni,1:num_vel) is an array which contains the integer
> valued indices of the appropriate neighbour node in each of the num_vel
> directions.
>
> !We stream in each velocity direction in turn.
> DO iv=1,num_vel
>
> !We stream the iv'th population *from* the neighbour lying in
> !the direction -e(iv,:) to the node in question. That is to
> !say, we stream the iv'th population from the neighbour in
> !direction op_vel(iv).
> v = op_vel(iv)
>
> !The streaming step defines the new values of f at the
> !interior nodes.
> f(1:ni,iv) = f(neighi(1:ni,v),iv)
>
> END DO
>
>
> 3) Non-linear combinations:
>
> I'm calculating new values for f_eq(1:ni,1:num_vel) from ui(1:ni,1:3)
> and rhoi(1:ni). Note, ni = approx 200000 and num_vel = approx 19. Note
> e(1:num_vel,1:3) is an integer array containing values 0,1,-1. The
> array ui(1:ni,1:num_vel) is described above. wi(1:num_vel) is a small
> array of double precision values. The coeff_... values are double
> precision scalars.
>
> tempvec_ni2 = ui(1:ni,1)**2+ui(1:ni,2)**2+ui(1:ni,3)**
2
> DO iv=1,num_vel
>
> tempvec_ni = e(iv,1)*ui(1:ni,1) + e(iv,2)*ui(1:ni,2) +
> e(iv,3)*ui(1:ni,3)
> f_eq(1:ni,iv) = w(iv)*rhoi*( 1 + coeff_e_dot_u*tempvec_ni &
> & + coeff_e_dot_u_sq*tempvec_ni*tempvec_ni &
> & + coeff_u_dot_u*tempvec_ni2)
>
> END DO
>
>
>
>
>
>
> Here are the variable declarations and allocations. They've been culled
> from 3 modules:
>
> REAL(kind=dp), PRIVATE, ALLOCATABLE :: f(:,:)
> REAL(kind=dp), PRIVATE, ALLOCATABLE :: f_eq(:,:)
> REAL(kind=dp), PRIVATE, ALLOCATABLE :: tempvec_ni(:),tempvec_ni2(:)
>
> REAL(kind=dp), ALLOCATABLE :: rhoi(:),ui(:,:),ub(:,:),pb(:)
> REAL(kind=dp) :: c,cs,t,dt
>
> INTEGER, ALLOCATABLE :: e(:,:), op_vel(:), bc_spatialvariabletypes(:)
> REAL(kind=dp), ALLOCATABLE :: w(:)
>
> !nn=total number of mesh nodes => typically around 200000
> !ni=number of internal mesh nodes => comparable to nn.
> !num_vel = typically 19, could be 15 or 27.
>
> ALLOCATE(e(1:num_vel,1:3))
> ALLOCATE(op_vel(1:num_vel))
> ALLOCATE(w(1:num_vel))
>
> e(1 ,1:3) = (/ 1, 0, 0 /)
> e(2 ,1:3) = (/ -1, 0, 0 /)
> e(3 ,1:3) = (/ 0, 0, 1 /)
> [...snip similar values for e(7:19,1:3)...]
>
> w(1:6) = 1.0_dp/9.0_dp
> w(7:14) = 1.0_dp/72.0_dp
> w(15) = 2.0_dp/9.0_dp
>
> ALLOCATE(f(1:nn,1:num_vel))
> ALLOCATE(f_eq(1:nn,1:num_vel))
> ALLOCATE(rhoi(1:ni))
> ALLOCATE(ui(1:ni,1:3))
> ALLOCATE(tempvec_ni(1:ni))
> ALLOCATE(tempvec_ni2(1:ni))
>



--
--------------------------------------------------------------------------
Dr. Christoph Arns
Dept. Applied Mathematics
Research School of Physical Sciences and Engineering
The Australian National University
Canberra ACT 0200 AUSTRALIA
Ph.: +61 (2) 6125 5170
Fax: +61 (2) 6125 0732
http://rsphysse.anu.edu.au/appmaths

glen herrmannsfeldt

2005-06-02, 3:59 am

Joost wrote:

> I would explain it by example. Code a normal (i.e. using the most
> straightforward do loops 1..512) matrix matrix multiply of 512 x 512
> matrices, this will be a slow, unblocked way of doing things. Now, try
> an alternative approach, in which you partition your matrix in
> subblocks of 16x16 matrices and multiply the full matrix together
> computing subblock after subblock (i.e. as a function of matrix
> multiplies 16x16 in size). You'll have blocked the loops. Try it,
> you'll notice it runs 10 times faster (exact number of course depending
> on e.g. the size of the subblocks, and the cache of your CPU).


Yes, it is the effect of the cache that makes the answer to this
problem hard in general. If you are not doing an operation that uses
matrix elements more than once per pass, though, it isn't so hard.
Still, it is hard to make any general rules.

-- glen

Dr Ivan D. Reid

2005-06-03, 9:07 am

On 1 Jun 2005 11:36:03 -0700, Greg Lindahl <lindahl@pbm.com>
wrote in <429e0013$1@news.meer.net>:
> In article <slrnd9rqrk.2kg.Ivan.Reid@loki.brunel.ac.uk>,
> Dr Ivan D. Reid <Ivan.Reid@brunel.ac.uk> wrote:


[color=darkred]
> Multiplication can be cheaper than an if/elseif, especially if branch
> prediction doesn't work well with the actual data pattern.


But in thus case, unless I'm misreading it or the compiler unrolls
things contrary to my simple expectations, one branch covers 20,000*3
arithmetic operations which can be eliminated in two-thirds of the cases
(extrapolating from three DATA declarations...) and reduced to 20,000*2 in
the rest. There may be better methods, but I tend to try the simpler ones
first.

Looking again, if indeed only one of e(iv,1:3) is non-zero, I'd
try a slightly different tack again since only one slice of ui(1:ni,1:3)
is updated in each loop.

e(1:19)=(/1,-1,3,...

do iv=1,19
ii=abs(e(iv))
if (e(iv).gt.0) then
ui(1:n1,ii) = ui(1:ni,ii) + c*f(1:ni,iv)
else
ui(1:n1,ii) = ui(1:ni,ii) - c*f(1:ni,iv)
endif

--
Ivan Reid, Electronic & Computer Engineering, ___ CMS Collaboration,
Brunel University. Ivan.Reid@brunel.ac.uk Room 40-1-B12, CERN
KotPT -- "for stupidity above and beyond the call of duty".
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com