Code Comments

Programming Forum and web based access to our favorite programming groups.
For Programmers: Free Programming Magazines | New: Database administration forum
Registration is free! Edit your profileCalendarFind other membersFrequently Asked QuestionsSearch -> 
Post New Thread











Thread
Author

Improve efficiency of 3 big array ops.
Hi all,

I have a working code which spends nearly all its time performing array
operations on big 2-D (densely populated) arrays. I've coded the
operations to directly express the mathematical algorithm, but I feel
sure there must be better ways of structuring the operations to make
them run more efficiently on real silicon.

(The application, by the way, is a basic easy-to-use 3-D Lattice
Boltzman fluid dynamics code that I want to tidy up and improve it's
efficiency now before to release on the web.... better variable names
might be in order too :-) )

Can anyone suggest bottlenecks / improvements / better data structures
for the following 3 key operations on modern PC (i.e. Pentium 4) type
processors? My knowledge of CPU (in)efficiencies and memory access
issues is too limited to spot them myself and I'm happy to recode for
decent gains.

By the way the array declarations are shown at the bottom of this post.
I use allocatable arrays for everything. If I've missed out critical
information, please let me know and I'll post it asap!

Many thanks for any advice,
andy.


1) Lots of linear combinations....

I'm calculating new values for ui. The integer variable ni (the number
of internal mesh nodes) is approx 200000, and num_vel is typically 19.
Note e(1:num_vel,1:3) is an integer array containing values 0,1,-1,
f(1:ni,1:num-vel) is a double precision array, c is a double precision
scalar and rhoi(1:ni) is a double precision array. Any thoughts?

Here's the code, which gprof says takes a long time:

ui(1:ni,1) = c*e(1,1)*f(1:ni,1)
ui(1:ni,2) = c*e(1,2)*f(1:ni,1)
ui(1:ni,3) = c*e(1,3)*f(1:ni,1)
DO iv=2,num_vel
ui(1:ni,1) = ui(1:ni,1) + c*e(iv,1)*f(1:ni,iv)
ui(1:ni,2) = ui(1:ni,2) + c*e(iv,2)*f(1:ni,iv)
ui(1:ni,3) = ui(1:ni,3) + c*e(iv,3)*f(1:ni,iv)
END DO
!NB: rhoi must be up-to-date for this next line to be correct.
tempvec_ni = 1/rhoi
ui(1:ni,1) = ui(1:ni,1)*tempvec_ni
ui(1:ni,2) = ui(1:ni,2)*tempvec_ni
ui(1:ni,3) = ui(1:ni,3)*tempvec_ni



2) Big data shuffles:

This one is the most mysterious to me, as it shuffles array elements
around a lot. Do pointers have a role here? I've never used them much.
I need to transfer the value of f(j,iv) at to mesh node j to one its 19
or so neighbouring nodes. This needs to be repeated for all i=1,ni,
where ni~200000 and for all iv=1,19.

Here neighi(1:ni,1:num_vel) is an array which contains the integer
valued indices of the appropriate neighbour node in each of the num_vel
directions.

!We stream in each velocity direction in turn.
DO iv=1,num_vel

!We stream the iv'th population *from* the neighbour lying in
!the direction -e(iv,:) to the node in question. That is to
!say, we stream the iv'th population from the neighbour in
!direction op_vel(iv).
v = op_vel(iv)

!The streaming step defines the new values of f at the
!interior nodes.
f(1:ni,iv) = f(neighi(1:ni,v),iv)

END DO


3) Non-linear combinations:

I'm calculating new values for f_eq(1:ni,1:num_vel) from ui(1:ni,1:3)
and rhoi(1:ni). Note, ni = approx 200000 and num_vel = approx 19. Note
e(1:num_vel,1:3) is an integer array containing values 0,1,-1. The
array ui(1:ni,1:num_vel) is described above. wi(1:num_vel) is a small
array of double precision values. The coeff_... values are double
precision scalars.

tempvec_ni2 =  ui(1:ni,1)**2+ui(1:ni,2)**2+ui(1:ni,3)**
2
DO iv=1,num_vel

tempvec_ni = e(iv,1)*ui(1:ni,1) + e(iv,2)*ui(1:ni,2) +
e(iv,3)*ui(1:ni,3)
f_eq(1:ni,iv) = w(iv)*rhoi*( 1 + coeff_e_dot_u*tempvec_ni &
& + coeff_e_dot_u_sq*tempvec_ni*tempvec_ni  &
& + coeff_u_dot_u*tempvec_ni2)

END DO






Here are the variable declarations and allocations. They've been culled
from 3 modules:

REAL(kind=dp), PRIVATE, ALLOCATABLE :: f(:,:)
REAL(kind=dp), PRIVATE, ALLOCATABLE :: f_eq(:,:)
REAL(kind=dp), PRIVATE, ALLOCATABLE :: tempvec_ni(:),tempvec_ni2(:)

REAL(kind=dp), ALLOCATABLE :: rhoi(:),ui(:,:),ub(:,:),pb(:)
REAL(kind=dp) :: c,cs,t,dt

INTEGER, ALLOCATABLE :: e(:,:), op_vel(:), bc_spatialvariabletypes(:)
REAL(kind=dp), ALLOCATABLE :: w(:)

!nn=total number of mesh nodes => typically around 200000
!ni=number of internal mesh nodes => comparable to nn.
!num_vel = typically 19, could be 15 or 27.

ALLOCATE(e(1:num_vel,1:3))
ALLOCATE(op_vel(1:num_vel))
ALLOCATE(w(1:num_vel))

e(1 ,1:3) = (/  1,  0,  0 /)
e(2 ,1:3) = (/ -1,  0,  0 /)
e(3 ,1:3) = (/  0,  0,  1 /)
[...snip similar values for e(7:19,1:3)...]

w(1:6)  = 1.0_dp/9.0_dp
w(7:14) = 1.0_dp/72.0_dp
w(15)   = 2.0_dp/9.0_dp

ALLOCATE(f(1:nn,1:num_vel))
ALLOCATE(f_eq(1:nn,1:num_vel))
ALLOCATE(rhoi(1:ni))
ALLOCATE(ui(1:ni,1:3))
ALLOCATE(tempvec_ni(1:ni))
ALLOCATE(tempvec_ni2(1:ni))


Report this thread to moderator Post Follow-up to this message
Old Post
andy2O@hotmail.com
06-01-05 02:03 PM


Re: Improve efficiency of 3 big array ops.

andy2O@hotmail.com wrote:

>
> 1) Lots of linear combinations....
>
> I'm calculating new values for ui. The integer variable ni (the number
> of internal mesh nodes) is approx 200000, and num_vel is typically 19.
> Note e(1:num_vel,1:3) is an integer array containing values 0,1,-1,
> f(1:ni,1:num-vel) is a double precision array, c is a double precision
> scalar and rhoi(1:ni) is a double precision array. Any thoughts?
>
> Here's the code, which gprof says takes a long time:
>
>  ui(1:ni,1) = c*e(1,1)*f(1:ni,1)
>  ui(1:ni,2) = c*e(1,2)*f(1:ni,1)
>  ui(1:ni,3) = c*e(1,3)*f(1:ni,1)
>  DO iv=2,num_vel
>     ui(1:ni,1) = ui(1:ni,1) + c*e(iv,1)*f(1:ni,iv)
>     ui(1:ni,2) = ui(1:ni,2) + c*e(iv,2)*f(1:ni,iv)
>     ui(1:ni,3) = ui(1:ni,3) + c*e(iv,3)*f(1:ni,iv)
>  END DO
>  !NB: rhoi must be up-to-date for this next line to be correct.
>  tempvec_ni = 1/rhoi
>  ui(1:ni,1) = ui(1:ni,1)*tempvec_ni
>  ui(1:ni,2) = ui(1:ni,2)*tempvec_ni
>  ui(1:ni,3) = ui(1:ni,3)*tempvec_ni
>

Some idea for that one:

DO id = 1, 3
DO ini = 1, ni
ui(ini,id) = c / rhoi(ini) * &
dot_product(e(1:num_vel,id), f(ini,1:num_vel))
END DO
END DO

The underlying idea is that on PCs, better try to keep memory access,
and especially stores, to a minimum.

You might also define logical maskn and maskp (1:num_vel), and
DO id = 1, 3
maskn = e(:,id) == -1.0_dp
maskp = e(:,id) == +1.0_dp
DO ini = 1, ni
ui(ini,id) = c / rhoi(ini) * &
(sum(f(ini,1:num_vel), MASK=maskp) - &
sum(f(ini,1:num_vel), MASK=maskn))
END DO
END DO

Of course, it would even be better if you could exchange
the dimensions of f, (f(1:num-vel,1:ni)), but you would
probably then loose the time elsewhere...


Report this thread to moderator Post Follow-up to this message
Old Post
Michel OLAGNON
06-01-05 09:00 PM


Re: Improve efficiency of 3 big array ops.

andy2O@hotmail.com wrote:

>
> 2) Big data shuffles:
>
> This one is the most mysterious to me, as it shuffles array elements
> around a lot. Do pointers have a role here? I've never used them much.
> I need to transfer the value of f(j,iv) at to mesh node j to one its 19
> or so neighbouring nodes. This needs to be repeated for all i=1,ni,
> where ni~200000 and for all iv=1,19.
>
> Here neighi(1:ni,1:num_vel) is an array which contains the integer
> valued indices of the appropriate neighbour node in each of the num_vel
> directions.
>
>   !We stream in each velocity direction in turn.
>   DO iv=1,num_vel
>
>     !We stream the iv'th population *from* the neighbour lying in
>     !the direction -e(iv,:) to the node in question. That is to
>     !say, we stream the iv'th population from the neighbour in
>     !direction op_vel(iv).
>     v = op_vel(iv)
>
>     !The streaming step defines the new values of f at the
>     !interior nodes.
>     f(1:ni,iv) = f(neighi(1:ni,v),iv)
>
>  END DO
>

I had some experience where using one's own temporary work arrays
improved significantly the performance :
DO iv=1,num_vel

v = op_vel(iv)
work(1:ni) = f(1:ni,iv)
f(1:ni,iv) = work (neighi(1:ni,v))

END DO


Report this thread to moderator Post Follow-up to this message
Old Post
Michel OLAGNON
06-01-05 09:00 PM


Re: Improve efficiency of 3 big array ops.

andy2O@hotmail.com wrote:

>
> 3) Non-linear combinations:
>
> I'm calculating new values for f_eq(1:ni,1:num_vel) from ui(1:ni,1:3)
> and rhoi(1:ni). Note, ni = approx 200000 and num_vel = approx 19. Note
> e(1:num_vel,1:3) is an integer array containing values 0,1,-1. The
> array ui(1:ni,1:num_vel) is described above. wi(1:num_vel) is a small
> array of double precision values. The coeff_... values are double
> precision scalars.
>
>  tempvec_ni2 =  ui(1:ni,1)**2+ui(1:ni,2)**2+ui(1:ni,3)**
2
>  DO iv=1,num_vel
>
>   tempvec_ni = e(iv,1)*ui(1:ni,1) + e(iv,2)*ui(1:ni,2) +
> e(iv,3)*ui(1:ni,3)
>   f_eq(1:ni,iv) = w(iv)*rhoi*( 1 + coeff_e_dot_u*tempvec_ni &
>        & + coeff_e_dot_u_sq*tempvec_ni*tempvec_ni  &
>        & + coeff_u_dot_u*tempvec_ni2)
>
>  END DO
>
>
>

Array operations are nice, but sometimes loops are more efficient:



DO iv=1,num_vel
DO ini = 1, ni
temp = e(iv,1)*ui(ini,1) + e(iv,2)*ui(ini,2) + &
e(iv,3)*ui(ini,3)
temp2 = ui(ini,1)**2+ui(ini,2)**2+ui(ini,3)**2
f_eq(ini,iv) = w(iv)*rhoi(ini)*( 1 + coeff_e_dot_u*temp &
& + coeff_e_dot_u_sq*temp*temp  &
& + coeff_u_dot_u*temp2)

END DO
END DO







Report this thread to moderator Post Follow-up to this message
Old Post
Michel OLAGNON
06-01-05 09:00 PM


Re: Improve efficiency of 3 big array ops.
andy2O@hotmail.com wrote:

> Can anyone suggest bottlenecks / improvements / better data structures
> for the following 3 key operations on modern PC (i.e. Pentium 4) type
> processors? My knowledge of CPU (in)efficiencies and memory access
> issues is too limited to spot them myself and I'm happy to recode for
> decent gains.
>

> I'm calculating new values for ui. The integer variable ni (the number
> of internal mesh nodes) is approx 200000, and num_vel is typically 19.
> Note e(1:num_vel,1:3) is an integer array containing values 0,1,-1,
> f(1:ni,1:num-vel) is a double precision array, c is a double precision
> scalar and rhoi(1:ni) is a double precision array. Any thoughts?

>  DO iv=2,num_vel
>     ui(1:ni,1) = ui(1:ni,1) + c*e(iv,1)*f(1:ni,iv)
>     ui(1:ni,2) = ui(1:ni,2) + c*e(iv,2)*f(1:ni,iv)
>     ui(1:ni,3) = ui(1:ni,3) + c*e(iv,3)*f(1:ni,iv)
>  END DO
Maybe it would be better to write these out as dot_product() operations, if
you could reverse the order of subscripts on f(:,:) so as to permit use of
parallel instructions.

> 3) Non-linear combinations:
>

>  tempvec_ni2 =  ui(1:ni,1)**2+ui(1:ni,2)**2+ui(1:ni,3)**
2
>  DO iv=1,num_vel
>
>   tempvec_ni = e(iv,1)*ui(1:ni,1) + e(iv,2)*ui(1:ni,2) +
> e(iv,3)*ui(1:ni,3)
>   f_eq(1:ni,iv) = w(iv)*rhoi*( 1 + coeff_e_dot_u*tempvec_ni &
>        & + coeff_e_dot_u_sq*tempvec_ni*tempvec_ni  &
>        & + coeff_u_dot_u*tempvec_ni2)
>
>  END DO
tempvec_ni2 =  (ui(1:ni,1)**2+ui(1:ni,2)**2+ui(1:ni,3)*
*2)*coeff_u_dot_u
DO iv=1,num_vel

tempvec_ni = e(iv,1)*ui(1:ni,1) + e(iv,2)*ui(1:ni,2) + e(iv,3)*ui(1:ni,3)
temp = w(iv)*rhoi
f_eq(1:ni,iv) = temp + temp*( (coeff_e_dot_u &
& + coeff_e_dot_u_sq*tempvec_ni)*tempvec_ni  &
& + tempvec_ni2)

END DO
It looks unlikely that your compiler will see the simple optimizations to
remove 50% of your multiply operations.
Distribution of the (1 + ...) operation is more for the benefit of accuracy
than for speed.

In view of the size of your problem, cache behavior is likely to be
important.  You may have to block your loops on iv so that they don't
overflow L1 cache, if multiple loops are using the same data.


--
Tim Prince

Report this thread to moderator Post Follow-up to this message
Old Post
Tim Prince
06-01-05 09:00 PM


Re: Improve efficiency of 3 big array ops.
Tim Prince wrote:

<snip>

> In view of the size of your problem, cache behavior is likely to be
> important.  You may have to block your loops on iv so that they don't
> overflow L1 cache, if multiple loops are using the same data.

I don't understand the last sentence, having little knowledge of code
optimization. What does it mean to "block a loop"? Thanks.


Report this thread to moderator Post Follow-up to this message
Old Post
beliavsky@aol.com
06-01-05 09:00 PM


Re: Improve efficiency of 3 big array ops.
I would explain it by example. Code a normal (i.e. using the most
straightforward do loops 1..512) matrix matrix multiply of 512 x 512
matrices, this will be a slow, unblocked way of doing things. Now, try
an alternative approach, in which you partition your matrix in
subblocks of 16x16 matrices and multiply the full matrix together
computing subblock after subblock (i.e. as a function of matrix
multiplies 16x16 in size). You'll have blocked the loops. Try it,
you'll notice it runs 10 times faster (exact number of course depending
on e.g. the size of the subblocks, and the cache of your CPU).

Cheers,

Joost


Report this thread to moderator Post Follow-up to this message
Old Post
Joost
06-01-05 09:00 PM


Re: Improve efficiency of 3 big array ops.
In article <1117639978.567691.187140@z14g2000cwz.googlegroups.com>,
Joost <jv244@cam.ac.uk> wrote:

> Try it,
> you'll notice it runs 10 times faster (exact number of course depending
> on e.g. the size of the subblocks, and the cache of your CPU).

Unless, of course, your compiler does cache blocking. Blocked code is
often a bit slower with a good compiler.

-- greg



Report this thread to moderator Post Follow-up to this message
Old Post
Greg Lindahl
06-01-05 09:00 PM


Re: Improve efficiency of 3 big array ops.
Michel, Tim,

Thanks for replies. I'll try coding up the changes you suggested asap
and see how the timings change.

Tim - About loop blocking: A google search turned up the snippet from
an old post to c.l.f. from Dan Nagle (thanks Dan), copied below. Is
this what you meant?

Also, and perhaps of interest to beliavsky too, I found Intel has an
article on this topic at

http://www.intel.com/cd/ids/develop...36.htm?page=3D1

which I'm working my way through (I note the quote from this article
that "This explains why a na=EFve data blocking may slow down an
application." with apprehension!) . But I have a few very newbie level
CPU queries....

i) How do I work out how big the blocks should be? Is it time to look
up processor cache sizes, find the number of bytes in a double and get
out my calculator?

ii) Can I safely assume that my array data can have the whole L1 cache
to itself? Or do I have to play safe, assuming there might be other
stuff cluttering it up...

Thanks again.
andy


Dan Nagle wrote:

Hello,

Off the top of my head,

do i =3D 1, n              ! vector code
a( i) =3D b( i) + c( i)
d( i) =3D b( i) * c( i)
enddo

becomes

isize =3D (fits in cache 4 times in this example) ! figure somehow
nblocks =3D n / isize               ! note: must do remainder
do i =3D 1, nblocks, isize          ! nblocks of isize
a( i: i+isize-1) =3D b( i: i+isize-1) + c( i: i+isize-1)
d( i: i+isize-1) =3D b( i: i+isize-1) * c( i: i+isize-1)
enddo

I hope that's close, at least +-1 on the indices :-)

Here, b and c can remain in cache where in the "vector" case,
they may be flushed.


Report this thread to moderator Post Follow-up to this message
Old Post
andy2O@hotmail.com
06-01-05 09:00 PM


Re: Improve efficiency of 3 big array ops.
andy2O@hotmail.com wrote:

[snip]

> I found Intel has an
> article on this topic at
>
> http://www.intel.com/cd/ids/develop...36.htm?page=3D1
>
> which I'm working my way through (I note the quote from this article
> that "This explains why a na=EFve data blocking may slow down an
^^^^^
Sorry that should read naive - I cut-n-pasted non ascii by mistake.

> application." with apprehension!) .


Report this thread to moderator Post Follow-up to this message
Old Post
andy2O@hotmail.com
06-01-05 09:00 PM


Sponsored Links




Last Thread Next Thread Next
Pages (2): [1] 2 »
Search this forum -> 
Post New Thread

Fortran archive

Show a Printable Version Send to friend Email This Page to Someone! subscribe to this thread Receive updates to this thread
Computer Consultants
Programming Jobs
Visual Basic Controls
SQL Server Programming
Webservices
Java Security
Visual Studio
C# Programming
Visual J++
Software engineering
Open source Software
Perl Programming
PHP Programming
ASP Programming
ASP .NET Programming
Visual Basic Programming
Windows Scripting Host
Java Programming
Java Help
Java Beans
VBScript
Cobol
MAC Applications
Unix Programming
Forum Jump:
All times are GMT. The time now is 06:40 PM.

 
Free MCSE Braindumps | Real Estate Topics

Programming forum archive

Copyrights CodeComments.com 2004 - 2006

Powered by vBulletin Copyright 2000-2006 Jelsoft Enterprises Limited.