Code Comments
Programming Forum and web based access to our favorite programming groups.Derived data types are one of the more useful new features of modern Fortran. However, a few months ago I had to revert critical sections of one program back to separate arrays when I discovered that accesses to derived type components were heavily impacting computation times (IIRC replacing with a few standard arrays reduced the computation time for a large section of code by a factor of about 5). At the time I was a bit "annoyed" but did not take too much notice - just please to significantly reduce the computation time. But I am now finding the experience is making me reluctant to use derived types anywhere near computationally intensive sections of code (or even moderately intensive code). Frustrating because the alternative of multiple separate arrays can be cumbersome, untidy, and harder to follow. The latter problems can be avoided with suitable array naming- eg: site_ID, site_name, site_lat, site_long, site_elevation, etc. (indeed is more legible than the silly, often near illegible component selector "%" used for derived types.) Separate arrays are not so convenient though in other situations - eg: site(i) = tempsite; and READ(line,*) site(i), become a lot messier with separate arrays, especially where there are multiple components. The main reason for this post is to find out whether the computational deficiencies I encountered with derived types are typical, or compiler specific (ivf8), or only occur under specific circumstances (mixed type components, etc)? Any suggestions? David
Post Follow-up to this messagedud_address@xparadise.net.nz wrote: > Derived data types are one of the more useful new features of modern > Fortran. However, a few months ago I had to revert critical sections > of one program back to separate arrays when I discovered that accesses > to derived type components were heavily impacting computation times > (IIRC replacing with a few standard arrays reduced the computation > time for a large section of code by a factor of about 5). > At the time I was a bit "annoyed" but did not take too much notice - > just please to significantly reduce the computation time. The cache on modern processors assumes locality of reference, that is, if you reference one data item you will likely soon reference the following data item. If you reference one part in an array of structures in a computation, but not others then it might slow down by a large factor. You might be able to use a structure of arrays, though that might not help so much. There are stories of C compilers on some SPEC benchmarks that internally convert between array of structures and structure of arrays, though it is not normally allowed by the standard. (With some kinds of pointer arithmetic you can tell the difference. The compiler has to know that the program doesn't do that.) It might be that you can copy to another array, do the computation, and copy back faster than in the structure. -- glen
Post Follow-up to this messageOn Tue, 05 Oct 2004 14:31:41 +1300, dud_address@xparadise.net.nz <dud_address@xparadise.net.nz> wrote in <ssq3m0h385guifna0f7trqemgc4gos2j04@4ax.com>: > The main reason for this post is to find out whether the computational > deficiencies I encountered with derived types are typical, or compiler > specific (ivf8), or only occur under specific circumstances (mixed > type components, etc)? Any suggestions? One thing to be aware of with many types of aggregates (derived types, common blocks, etc) is that the order of the components may be important. Specifically if an entity is at an unsuitable address boundary (say, a 4-byte integer at an address that's not divisible by 4), some modern processors take a performance penalty to retrieve the data, e.g. make two separate reads and then reconstruct the misaligned object. One should strive to order the members in descending order of size to allow them to be always aligned on their "natural" boundaries. Similarly, if the aggregate's size is not a multiple of its largest member's, you should not force arrays to be packed in memory but allow the compiler to start each element on its natural boundary. -- Ivan Reid, Electronic & Computer Engineering, ___ CMS Collaboration , Brunel University. Ivan.Reid@brunel.ac.uk Room 40-1-B12, CER N KotPT -- "for stupidity above and beyond the call of duty".
Post Follow-up to this message> There are stories of C compilers on some SPEC benchmarks that > internally convert between array of structures and structure of > arrays, though it is not normally allowed by the standard. That would be the Sun C compiler mangling art, which is probably the worst element of SPEC CFP2000 in any case. Apart from the whole-program analysis required to re-write it a-of-s to s-of-a, it also does coalescing of malloc() calls, which apparently also gains a substantial factor. Jan
Post Follow-up to this message> The main reason for this post is to find out whether the computational > deficiencies I encountered with derived types are typical, or compiler > specific (ivf8), or only occur under specific circumstances (mixed > type components, etc)? Any suggestions? Some considerations: - Alignment: depending on the order of components and the specification, or not, of SEQUENCE in the definition, the compiler may or may be forced to generate slow code for accessing some or all components. General rule: order by size, larger items first. - A second alignment issue might occur when the total size does not fit well with the alignment required by the hardware - this can happen in arrays of structures when, for instance, the second array element is mis-aligned because of the size of the first element. Padding to some suitable total size can help here. Also, adverse interactions with cache line sizes or replacement policies can occur. - If you access most or all of a structures components in one basic block, an array of structures will usually be best. If, on the other hand, you first access multiple - perhaps even consecutive - elements of one com- ponent, a structure of arrays usually will be best. - I agree that there is this problem of handling one element's bits and pieces in the case of a structure of arrays. Perhaps you can build two helper functions that, given an element number, scatter and gather the components to and from the arrays...sort of achieving the best of both worlds. In the end, it's a matter of performance measurement to decide between the two approaches. Jan
Post Follow-up to this messageOn Tue, 05 Oct 2004 14:31:41 +1300, dud_address@xparadise.net.nz wrote: >The main reason for this post is to find out whether the computational >deficiencies I encountered with derived types are typical, or compiler >specific (ivf8), or only occur under specific circumstances (mixed >type components, etc)? Any suggestions? You have not provided examples of what you are referring to, so I am not sur e of the context. I would not expect any noticeable performance penalty for accessing components of a derived type variable. If you have a specific example that shows a problem in an Intel compiler, we'd love to see it at Intel Premier Support. One issue we are aware of is operations involving pointer or allocatable array components, where sometimes unnecessary array copies are made. Steve Lionel Software Products Division Intel Corporation Nashua, NH User communities for Intel Software Development Products http://softwareforums.intel.com/ Intel Fortran Support http://developer.intel.com/software/products/support/
Post Follow-up to this messageJan Vorbrüggen wrote: > That would be the Sun C compiler mangling art, which is probably > the worst element of SPEC CFP2000 in any case. Apart from the > whole-program analysis required to re-write it a-of-s to s-of-a, > it also does coalescing of malloc() calls, which apparently also > gains a substantial factor. How can that be ? SPEC programs are supposed to spend only a small fraction of their time in libraries not supplied (and thus compiled) with the program. -- Toon Moene - e-mail: toon@moene.indiv.nluug.nl - phone: +31 346 214290 Saturnushof 14, 3738 XG Maartensdijk, The Netherlands Maintainer, GNU Fortran 77: http://gcc.gnu.org/onlinedocs/g77_news.html A maintainer of GNU Fortran 95: http://gcc.gnu.org/fortran/
Post Follow-up to this messageOn Tue, 05 Oct 2004 09:55:49 -0400, Steve Lionel <Steve.Lionel@REMOVEintelME.com> wrote: >On Tue, 05 Oct 2004 14:31:41 +1300, dud_address@xparadise.net.nz wrote: > > >You have not provided examples of what you are referring to, so I am not su re >of the context. I would not expect any noticeable performance penalty for >accessing components of a derived type variable. If you have a specific >example that shows a problem in an Intel compiler, we'd love to see it at >Intel Premier Support. One issue we are aware of is operations involving >pointer or allocatable array components, where sometimes unnecessary array >copies are made. > >Steve Lionel >Software Products Division >Intel Corporation >Nashua, NH > >User communities for Intel Software Development Products > http://softwareforums.intel.com/ >Intel Fortran Support > http://developer.intel.com/software/products/support/ Thanks for the various replies. In respect of "examples", the actual code in question is as below. Do not seem to any alignment issues? The attraction of derived types is the ability to hang related data together in a rational manner; eg. in this case EQ date & time, location, depth, magnitude, etc. It would be nice if I can use them without significant impact on computation time; eg data for a given event (i) can easily be read via READ(line,*) EQ(i) - not so simple with a structure of arrays? Steve, in respect of Intel Premier Support, my experience with that to date is not encouraging. This is probably not the place to go into detail, so I will send you an email directly. David ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Subroutine code below NB: This is an internal subroutine, hence not all variables declared here - only first half of subroutine below - should be adequate to provide context. As indicated in my original post, substituting the <<<< OFFENDING LINE >>>> with rsq = (EQ_east(i)-znx(j))**2 + (EQ_north(i)-zny(j))**2 had a large impact on the time for this code to run - despite the "log10" & "EXP" in the rest of the code. Compiler is ivf8.05. Because there is a tight deadline on this work, I am very reluctant to change to 8.1 midstream, especially as a number of people seem to have had problems. For the same reason, I am holding back on applying SP2 for WinXP. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ SUBROUTINE GAUSSIAN ( EQ, nEQ, znx, zny, nG0, nG1, gridA, gridB, cSmooth, rmax ) IMPLICIT NONE integer i, j integer na, nb real pi ! EQ Catalogue data integer nEQ ! EQ source zones integer nG0, nG1 real znx(*), zny(*) ! EQ model variables real cSmooth, csq, const1 real rmax, rmax1, rmax2, rsq real sumxy_b, sumx_b, sumy_b, sumxsq_b real sumeqs, sumx_a, sumy_a real x, y, z, eqs, freq real gridA(nG1), gridB(nG1) TYPE :: eqRec integer year integer month integer day integer time real x real y integer depth real mag real mSD END TYPE eqRec TYPE(eqRec) EQ(nEQ) !===================================== ! Model constants rmax1 = rmax ! limit on extent of spreading rmax2 = rmax**2 csq = cSmooth*cSmooth ! SQ of smoothing parameter pi = 3.1415926535 const1 = csq*pi loopGrid: DO j = nG0, nG1 ! Loops source grid sumxy_b = 0.0 sumx_b = 0.0 sumy_b = 0.0 sumxsq_b = 0.0 freq = 0.0 sumeqs = 0.0 sumx_a = 0.0 sumy_a = 0.0 na = 0 nb = 0 loopEQ: do i = 1, nEQ ! <<<< OFFENDING LINE >>>> rsq = (EQ(i)%x-znx(j))**2 + (EQ(i)%y-zny(j))**2 ! <<<< OFFENDING LINE >>>> if(rsq > rmax2) CYCLE loopEQ ! probably c. 10% of instances pass this test nb = nb+1 x = EQ(i)%mag z = trec freq = freq + 1./z y = log10(freq) sumxy_b = sumxy_b + x*y sumx_b = sumx_b + x sumy_b = sumy_b + y sumxsq_b = sumxsq_b + x*x eqs = (EXP(-rsq/csq))/const1/z if (eqs < 0.1E-20) CYCLE loopEq sumeqs = sumeqs + eqs y = log10(sumeqs) na = na + 1 sumy_a = sumy_a + y sumx_a = sumx_a + x end do loopEQ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Post Follow-up to this message>> That would be the Sun C compiler mangling art, which is probably > How can that be ? SPEC programs are supposed to spend only a small > fraction of their time in libraries not supplied (and thus compiled) > with the program. From the standard's point of view, malloc() is part of the language, just as ALLOCATE is part of the Fortran language. I think the formal restriction is that something 98 % of a SPEC CPU program should be in user mode. Jan
Post Follow-up to this message> ! <<<< OFFENDING LINE >>>> > rsq = (EQ(i)%x-znx(j))**2 + (EQ(i)%y-zny(j))**2 > ! <<<< OFFENDING LINE >>>> > > if(rsq > rmax2) CYCLE loopEQ > ! probably c. 10% of instances pass this test This is one of the situations I described: You are accessing only a small part of your structure EQ(i) here - two consecutive elements - in 90% of the cases, if I understand your comment correctly. Given that on any access to memory that is not already cached, a cache line's worth of data - typically 32-64 bytes - will be fetched, in the case above you are fetching the whole structure (approximately), of which only a quarter is being used. In the alternative case, one fetch will load consecutive elements of the _north and _east arrays, which will be used on the next three turns through the loop. Given that your program is limited by bandwidth to memory, for a 32-byte cache line expect a speedup of 4, and for a 64-byte cache line a speedup of 4.5 - which seems to fit with what you have observed. Given this access pattern, do the following: Define two structure types, EQ_coord and EQ_data, the first containing the current components x and y, and the second containing the other current structure elements. That should give the performance of the structure of arrays, while having much less handling issues - you just need to read, write, copy... two items instead of a single one. Jan
Post Follow-up to this messagePowered by vBulletin
Copyright 2000-2006 Jelsoft Enterprises Limited.