Code Comments
Programming Forum and web based access to our favorite programming groups.Hi all, I have a working code which spends nearly all its time performing array operations on big 2-D (densely populated) arrays. I've coded the operations to directly express the mathematical algorithm, but I feel sure there must be better ways of structuring the operations to make them run more efficiently on real silicon. (The application, by the way, is a basic easy-to-use 3-D Lattice Boltzman fluid dynamics code that I want to tidy up and improve it's efficiency now before to release on the web.... better variable names might be in order too :-) ) Can anyone suggest bottlenecks / improvements / better data structures for the following 3 key operations on modern PC (i.e. Pentium 4) type processors? My knowledge of CPU (in)efficiencies and memory access issues is too limited to spot them myself and I'm happy to recode for decent gains. By the way the array declarations are shown at the bottom of this post. I use allocatable arrays for everything. If I've missed out critical information, please let me know and I'll post it asap! Many thanks for any advice, andy. 1) Lots of linear combinations.... I'm calculating new values for ui. The integer variable ni (the number of internal mesh nodes) is approx 200000, and num_vel is typically 19. Note e(1:num_vel,1:3) is an integer array containing values 0,1,-1, f(1:ni,1:num-vel) is a double precision array, c is a double precision scalar and rhoi(1:ni) is a double precision array. Any thoughts? Here's the code, which gprof says takes a long time: ui(1:ni,1) = c*e(1,1)*f(1:ni,1) ui(1:ni,2) = c*e(1,2)*f(1:ni,1) ui(1:ni,3) = c*e(1,3)*f(1:ni,1) DO iv=2,num_vel ui(1:ni,1) = ui(1:ni,1) + c*e(iv,1)*f(1:ni,iv) ui(1:ni,2) = ui(1:ni,2) + c*e(iv,2)*f(1:ni,iv) ui(1:ni,3) = ui(1:ni,3) + c*e(iv,3)*f(1:ni,iv) END DO !NB: rhoi must be up-to-date for this next line to be correct. tempvec_ni = 1/rhoi ui(1:ni,1) = ui(1:ni,1)*tempvec_ni ui(1:ni,2) = ui(1:ni,2)*tempvec_ni ui(1:ni,3) = ui(1:ni,3)*tempvec_ni 2) Big data shuffles: This one is the most mysterious to me, as it shuffles array elements around a lot. Do pointers have a role here? I've never used them much. I need to transfer the value of f(j,iv) at to mesh node j to one its 19 or so neighbouring nodes. This needs to be repeated for all i=1,ni, where ni~200000 and for all iv=1,19. Here neighi(1:ni,1:num_vel) is an array which contains the integer valued indices of the appropriate neighbour node in each of the num_vel directions. !We stream in each velocity direction in turn. DO iv=1,num_vel !We stream the iv'th population *from* the neighbour lying in !the direction -e(iv,:) to the node in question. That is to !say, we stream the iv'th population from the neighbour in !direction op_vel(iv). v = op_vel(iv) !The streaming step defines the new values of f at the !interior nodes. f(1:ni,iv) = f(neighi(1:ni,v),iv) END DO 3) Non-linear combinations: I'm calculating new values for f_eq(1:ni,1:num_vel) from ui(1:ni,1:3) and rhoi(1:ni). Note, ni = approx 200000 and num_vel = approx 19. Note e(1:num_vel,1:3) is an integer array containing values 0,1,-1. The array ui(1:ni,1:num_vel) is described above. wi(1:num_vel) is a small array of double precision values. The coeff_... values are double precision scalars. tempvec_ni2 = ui(1:ni,1)**2+ui(1:ni,2)**2+ui(1:ni,3)** 2 DO iv=1,num_vel tempvec_ni = e(iv,1)*ui(1:ni,1) + e(iv,2)*ui(1:ni,2) + e(iv,3)*ui(1:ni,3) f_eq(1:ni,iv) = w(iv)*rhoi*( 1 + coeff_e_dot_u*tempvec_ni & & + coeff_e_dot_u_sq*tempvec_ni*tempvec_ni & & + coeff_u_dot_u*tempvec_ni2) END DO Here are the variable declarations and allocations. They've been culled from 3 modules: REAL(kind=dp), PRIVATE, ALLOCATABLE :: f(:,:) REAL(kind=dp), PRIVATE, ALLOCATABLE :: f_eq(:,:) REAL(kind=dp), PRIVATE, ALLOCATABLE :: tempvec_ni(:),tempvec_ni2(:) REAL(kind=dp), ALLOCATABLE :: rhoi(:),ui(:,:),ub(:,:),pb(:) REAL(kind=dp) :: c,cs,t,dt INTEGER, ALLOCATABLE :: e(:,:), op_vel(:), bc_spatialvariabletypes(:) REAL(kind=dp), ALLOCATABLE :: w(:) !nn=total number of mesh nodes => typically around 200000 !ni=number of internal mesh nodes => comparable to nn. !num_vel = typically 19, could be 15 or 27. ALLOCATE(e(1:num_vel,1:3)) ALLOCATE(op_vel(1:num_vel)) ALLOCATE(w(1:num_vel)) e(1 ,1:3) = (/ 1, 0, 0 /) e(2 ,1:3) = (/ -1, 0, 0 /) e(3 ,1:3) = (/ 0, 0, 1 /) [...snip similar values for e(7:19,1:3)...] w(1:6) = 1.0_dp/9.0_dp w(7:14) = 1.0_dp/72.0_dp w(15) = 2.0_dp/9.0_dp ALLOCATE(f(1:nn,1:num_vel)) ALLOCATE(f_eq(1:nn,1:num_vel)) ALLOCATE(rhoi(1:ni)) ALLOCATE(ui(1:ni,1:3)) ALLOCATE(tempvec_ni(1:ni)) ALLOCATE(tempvec_ni2(1:ni))
Post Follow-up to this messageandy2O@hotmail.com wrote: > > 1) Lots of linear combinations.... > > I'm calculating new values for ui. The integer variable ni (the number > of internal mesh nodes) is approx 200000, and num_vel is typically 19. > Note e(1:num_vel,1:3) is an integer array containing values 0,1,-1, > f(1:ni,1:num-vel) is a double precision array, c is a double precision > scalar and rhoi(1:ni) is a double precision array. Any thoughts? > > Here's the code, which gprof says takes a long time: > > ui(1:ni,1) = c*e(1,1)*f(1:ni,1) > ui(1:ni,2) = c*e(1,2)*f(1:ni,1) > ui(1:ni,3) = c*e(1,3)*f(1:ni,1) > DO iv=2,num_vel > ui(1:ni,1) = ui(1:ni,1) + c*e(iv,1)*f(1:ni,iv) > ui(1:ni,2) = ui(1:ni,2) + c*e(iv,2)*f(1:ni,iv) > ui(1:ni,3) = ui(1:ni,3) + c*e(iv,3)*f(1:ni,iv) > END DO > !NB: rhoi must be up-to-date for this next line to be correct. > tempvec_ni = 1/rhoi > ui(1:ni,1) = ui(1:ni,1)*tempvec_ni > ui(1:ni,2) = ui(1:ni,2)*tempvec_ni > ui(1:ni,3) = ui(1:ni,3)*tempvec_ni > Some idea for that one: DO id = 1, 3 DO ini = 1, ni ui(ini,id) = c / rhoi(ini) * & dot_product(e(1:num_vel,id), f(ini,1:num_vel)) END DO END DO The underlying idea is that on PCs, better try to keep memory access, and especially stores, to a minimum. You might also define logical maskn and maskp (1:num_vel), and DO id = 1, 3 maskn = e(:,id) == -1.0_dp maskp = e(:,id) == +1.0_dp DO ini = 1, ni ui(ini,id) = c / rhoi(ini) * & (sum(f(ini,1:num_vel), MASK=maskp) - & sum(f(ini,1:num_vel), MASK=maskn)) END DO END DO Of course, it would even be better if you could exchange the dimensions of f, (f(1:num-vel,1:ni)), but you would probably then loose the time elsewhere...
Post Follow-up to this messageandy2O@hotmail.com wrote: > > 2) Big data shuffles: > > This one is the most mysterious to me, as it shuffles array elements > around a lot. Do pointers have a role here? I've never used them much. > I need to transfer the value of f(j,iv) at to mesh node j to one its 19 > or so neighbouring nodes. This needs to be repeated for all i=1,ni, > where ni~200000 and for all iv=1,19. > > Here neighi(1:ni,1:num_vel) is an array which contains the integer > valued indices of the appropriate neighbour node in each of the num_vel > directions. > > !We stream in each velocity direction in turn. > DO iv=1,num_vel > > !We stream the iv'th population *from* the neighbour lying in > !the direction -e(iv,:) to the node in question. That is to > !say, we stream the iv'th population from the neighbour in > !direction op_vel(iv). > v = op_vel(iv) > > !The streaming step defines the new values of f at the > !interior nodes. > f(1:ni,iv) = f(neighi(1:ni,v),iv) > > END DO > I had some experience where using one's own temporary work arrays improved significantly the performance : DO iv=1,num_vel v = op_vel(iv) work(1:ni) = f(1:ni,iv) f(1:ni,iv) = work (neighi(1:ni,v)) END DO
Post Follow-up to this messageandy2O@hotmail.com wrote: > > 3) Non-linear combinations: > > I'm calculating new values for f_eq(1:ni,1:num_vel) from ui(1:ni,1:3) > and rhoi(1:ni). Note, ni = approx 200000 and num_vel = approx 19. Note > e(1:num_vel,1:3) is an integer array containing values 0,1,-1. The > array ui(1:ni,1:num_vel) is described above. wi(1:num_vel) is a small > array of double precision values. The coeff_... values are double > precision scalars. > > tempvec_ni2 = ui(1:ni,1)**2+ui(1:ni,2)**2+ui(1:ni,3)** 2 > DO iv=1,num_vel > > tempvec_ni = e(iv,1)*ui(1:ni,1) + e(iv,2)*ui(1:ni,2) + > e(iv,3)*ui(1:ni,3) > f_eq(1:ni,iv) = w(iv)*rhoi*( 1 + coeff_e_dot_u*tempvec_ni & > & + coeff_e_dot_u_sq*tempvec_ni*tempvec_ni & > & + coeff_u_dot_u*tempvec_ni2) > > END DO > > > Array operations are nice, but sometimes loops are more efficient: DO iv=1,num_vel DO ini = 1, ni temp = e(iv,1)*ui(ini,1) + e(iv,2)*ui(ini,2) + & e(iv,3)*ui(ini,3) temp2 = ui(ini,1)**2+ui(ini,2)**2+ui(ini,3)**2 f_eq(ini,iv) = w(iv)*rhoi(ini)*( 1 + coeff_e_dot_u*temp & & + coeff_e_dot_u_sq*temp*temp & & + coeff_u_dot_u*temp2) END DO END DO
Post Follow-up to this messageandy2O@hotmail.com wrote: > Can anyone suggest bottlenecks / improvements / better data structures > for the following 3 key operations on modern PC (i.e. Pentium 4) type > processors? My knowledge of CPU (in)efficiencies and memory access > issues is too limited to spot them myself and I'm happy to recode for > decent gains. > > I'm calculating new values for ui. The integer variable ni (the number > of internal mesh nodes) is approx 200000, and num_vel is typically 19. > Note e(1:num_vel,1:3) is an integer array containing values 0,1,-1, > f(1:ni,1:num-vel) is a double precision array, c is a double precision > scalar and rhoi(1:ni) is a double precision array. Any thoughts? > DO iv=2,num_vel > ui(1:ni,1) = ui(1:ni,1) + c*e(iv,1)*f(1:ni,iv) > ui(1:ni,2) = ui(1:ni,2) + c*e(iv,2)*f(1:ni,iv) > ui(1:ni,3) = ui(1:ni,3) + c*e(iv,3)*f(1:ni,iv) > END DO Maybe it would be better to write these out as dot_product() operations, if you could reverse the order of subscripts on f(:,:) so as to permit use of parallel instructions. > 3) Non-linear combinations: > > tempvec_ni2 = ui(1:ni,1)**2+ui(1:ni,2)**2+ui(1:ni,3)** 2 > DO iv=1,num_vel > > tempvec_ni = e(iv,1)*ui(1:ni,1) + e(iv,2)*ui(1:ni,2) + > e(iv,3)*ui(1:ni,3) > f_eq(1:ni,iv) = w(iv)*rhoi*( 1 + coeff_e_dot_u*tempvec_ni & > & + coeff_e_dot_u_sq*tempvec_ni*tempvec_ni & > & + coeff_u_dot_u*tempvec_ni2) > > END DO tempvec_ni2 = (ui(1:ni,1)**2+ui(1:ni,2)**2+ui(1:ni,3)* *2)*coeff_u_dot_u DO iv=1,num_vel tempvec_ni = e(iv,1)*ui(1:ni,1) + e(iv,2)*ui(1:ni,2) + e(iv,3)*ui(1:ni,3) temp = w(iv)*rhoi f_eq(1:ni,iv) = temp + temp*( (coeff_e_dot_u & & + coeff_e_dot_u_sq*tempvec_ni)*tempvec_ni & & + tempvec_ni2) END DO It looks unlikely that your compiler will see the simple optimizations to remove 50% of your multiply operations. Distribution of the (1 + ...) operation is more for the benefit of accuracy than for speed. In view of the size of your problem, cache behavior is likely to be important. You may have to block your loops on iv so that they don't overflow L1 cache, if multiple loops are using the same data. -- Tim Prince
Post Follow-up to this messageTim Prince wrote: <snip> > In view of the size of your problem, cache behavior is likely to be > important. You may have to block your loops on iv so that they don't > overflow L1 cache, if multiple loops are using the same data. I don't understand the last sentence, having little knowledge of code optimization. What does it mean to "block a loop"? Thanks.
Post Follow-up to this messageI would explain it by example. Code a normal (i.e. using the most straightforward do loops 1..512) matrix matrix multiply of 512 x 512 matrices, this will be a slow, unblocked way of doing things. Now, try an alternative approach, in which you partition your matrix in subblocks of 16x16 matrices and multiply the full matrix together computing subblock after subblock (i.e. as a function of matrix multiplies 16x16 in size). You'll have blocked the loops. Try it, you'll notice it runs 10 times faster (exact number of course depending on e.g. the size of the subblocks, and the cache of your CPU). Cheers, Joost
Post Follow-up to this messageIn article <1117639978.567691.187140@z14g2000cwz.googlegroups.com>, Joost <jv244@cam.ac.uk> wrote: > Try it, > you'll notice it runs 10 times faster (exact number of course depending > on e.g. the size of the subblocks, and the cache of your CPU). Unless, of course, your compiler does cache blocking. Blocked code is often a bit slower with a good compiler. -- greg
Post Follow-up to this messageMichel, Tim, Thanks for replies. I'll try coding up the changes you suggested asap and see how the timings change. Tim - About loop blocking: A google search turned up the snippet from an old post to c.l.f. from Dan Nagle (thanks Dan), copied below. Is this what you meant? Also, and perhaps of interest to beliavsky too, I found Intel has an article on this topic at http://www.intel.com/cd/ids/develop...36.htm?page=3D1 which I'm working my way through (I note the quote from this article that "This explains why a na=EFve data blocking may slow down an application." with apprehension!) . But I have a few very newbie level CPU queries.... i) How do I work out how big the blocks should be? Is it time to look up processor cache sizes, find the number of bytes in a double and get out my calculator? ii) Can I safely assume that my array data can have the whole L1 cache to itself? Or do I have to play safe, assuming there might be other stuff cluttering it up... Thanks again. andy Dan Nagle wrote: Hello, Off the top of my head, do i =3D 1, n ! vector code a( i) =3D b( i) + c( i) d( i) =3D b( i) * c( i) enddo becomes isize =3D (fits in cache 4 times in this example) ! figure somehow nblocks =3D n / isize ! note: must do remainder do i =3D 1, nblocks, isize ! nblocks of isize a( i: i+isize-1) =3D b( i: i+isize-1) + c( i: i+isize-1) d( i: i+isize-1) =3D b( i: i+isize-1) * c( i: i+isize-1) enddo I hope that's close, at least +-1 on the indices :-) Here, b and c can remain in cache where in the "vector" case, they may be flushed.
Post Follow-up to this messageandy2O@hotmail.com wrote: [snip] > I found Intel has an > article on this topic at > > http://www.intel.com/cd/ids/develop...36.htm?page=3D1 > > which I'm working my way through (I note the quote from this article > that "This explains why a na=EFve data blocking may slow down an ^^^^^ Sorry that should read naive - I cut-n-pasted non ascii by mistake. > application." with apprehension!) .
Post Follow-up to this messagePowered by vBulletin
Copyright 2000-2006 Jelsoft Enterprises Limited.