Code Comments
We have a program that we are running on both Xeon 32 bit and Opteron 64 bit
cpus. The program runs much faster on the 32 bit Xeon processors. The run time
(wall clock) is as follows :

Xeon 32 bit = .017 wall clock-hours
Opteron 64 bit = .309 wall clock-hours

The code is being compiled with Portland Group Compiler version 5.2-4.

The internal customer then examined the code and changed the way arrays are
allocated.

Old Way :

real, dimension (:,:), pointer :: p
integer, dimension(:), pointer :: jmax,kmax,Lmax,ipbeg

New Way :

allocatable :: p(:,:)
allocatable :: jmax(:), kmax(:),Lmax(:),ipbeg(:)

Changing the array allocation decreased the wall clock time to almost nothing on
both types of cpus according to our internal customer.

Can anyone explain why the Old Way code would run much slower on a 64 bit
opteron CPU and can someone explain why the code would run much faster
on both Xeon 32 bit and Opteron 64 bit using the allocatable statement instead
of the pointer method.

Thanx,
BB
The Boeing Company
In article <v61l611sl8b9p3v6djo4gv6uu3nhjif697@4ax.com>, Joe Hill wrote:
> We have a program that we are running on both Xeon 32 bit and Opteron 64 bit
> cpus. The program runs much faster on the 32 bit Xeon processors. The run time
> (wall clock) is as follows :
>
> Xeon 32 bit = .017 wall clock-hours
> Opteron 64 bit = .309 wall clock-hours
>
> The code is being compiled with Portland Group Compiler version 5.2-4.

With a difference that big (almost 20x!) I'd guess at a compiler
(optimization) bug. Can you try a 32-bit executable on the opteron? Or
try some other compiler?

> The internal customer then examined the code and changed the way arrays are
> allocated.
>
> Old Way :
>
> real, dimension (:,:), pointer :: p
> integer, dimension(:), pointer :: jmax,kmax,Lmax,ipbeg
>
> New Way :
>
> allocatable :: p(:,:)
> allocatable :: jmax(:), kmax(:),Lmax(:),ipbeg(:)
>
> Changing the array allocation decreased the wall clock time to almost nothing on
> both types of cpus according to our internal customer.
>
> Can anyone explain why the Old Way code would run much slower on a 64 bit
> opteron CPU and can someone explain why the code would run much faster
> on both Xeon 32 bit and Opteron 64 bit using the allocatable statement instead
> of the pointer method.

Pointers allow aliasing, but allocatable arrays don't, thus the
optimizer has to be more conservative regarding pointers. However, I
wouldn't expect to see a 20x performance difference due to aliasing;
perhaps a few tens of percent or maybe a factor of two could be
explained with aliasing, but not 20x IMHO.

--
Janne Blomqvist
In article <v61l611sl8b9p3v6djo4gv6uu3nhjif697@4ax.com>,
Joe Hill <georgecostanz50@hotmail.com> wrote:

> Can anyone explain why the Old Way code would run much slower on a 64 bit
> opteron CPU

Without knowing any more details about the hardware, I would guess
that the Xeon CPU is simply faster than the Opteron CPU. It might
also be cache size, memory speed, total memory, or other features of
the two machines, or it could be that the compiler optimizations are
different for the two machines.

> and can someone explain why the code would run much faster
> on both Xeon 32 bit and Opteron 64 bit using the allocatable statement instead
> of the pointer method.

With pointer arrays, the compiler must often assume that the arrays
(at least the ones of the same type) can overlap through hidden
aliases, and this inhibits many optimizations (including, for
example, storage into fast registers rather than slower memory).
With other array types (static, dummy, allocatable, common,
automatic), the compiler can assume that there are no hidden aliases
and can optimize much more aggressively. This is why fortran has
traditionally been faster than other languages such as C and C++ in
which arrays are really just a different syntax for pointer
addressing, and the pointers are assumed to be wild. That is not to
say that it is impossible to get good performance with pointer
arrays, but it is more effort for the programmer to recognize and
work around the various problems.

$.02 -Ron Shepard
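To make the aliasing point concrete, here is a minimal sketch. It is not the original poster's code: the names `storage`, `p`, `q`, `a`, and `b` and the size `n` are made up for illustration. It shows two pointer arrays that overlap, so the first loop is a genuine recurrence that the compiler has to respect, next to two allocatable arrays that by definition cannot overlap.

```fortran
program alias_sketch
  implicit none
  integer, parameter :: n = 1000            ! hypothetical size
  real, dimension(:), pointer :: p, q       ! pointer style: p and q may alias
  real, allocatable, target :: storage(:)   ! hypothetical backing store for p and q
  real, allocatable :: a(:), b(:)           ! allocatable style: always distinct
  integer :: i

  allocate(storage(n+1), a(n), b(n))
  storage = 1.0
  a = 1.0

  ! Overlapping views: q(i) is the same memory location as p(i+1).
  p => storage(1:n)
  q => storage(2:n+1)

  ! Every store to q(i) changes the value the next iteration reads
  ! from p(i+1), so the iterations must run in order. A compiler that
  ! cannot rule out such overlap has to treat any pointer loop this way.
  do i = 1, n
     q(i) = p(i) + 1.0
  end do

  ! a and b cannot overlap, so the iterations are independent and the
  ! compiler is free to vectorize or keep values in registers.
  do i = 1, n
     b(i) = a(i) + 1.0
  end do

  print *, q(n), b(n)
  deallocate(storage, a, b)
end program alias_sketch
```

Whether the loops in the original program look anything like this is not known from the thread; the sketch only shows why the compiler has less freedom in the pointer case.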
Ron Shepard wrote:
> In article <v61l611sl8b9p3v6djo4gv6uu3nhjif697@4ax.com>,
> Joe Hill <georgecostanz50@hotmail.com> wrote:
>
> Without knowing any more details about the hardware, I would guess
> that the Xeon CPU is simply faster than the Opteron CPU. It might
> also be cache size, memory speed, total memory, or other features of
> the two machines, or it could be that the compiler optimizations are
> different for the two machines.

Assuming equal quality compilers (almost certainly not true here) the
64 bit code will have more memory traffic for code and pointers
so may run a bit slower. This was true back when 16 bit processors
were being replaced by 32 bit ones, etc, etc. The extra memory traffic
may also cause more cache misses. Cache issues were not a big deal at
the previous 16 to 32 bit transition time.

All of this is masked by the 64 bit processors having faster hardware
than the previous generation processors, so it will only show up when
running on processors of the same electronic speed, as may be true with
compatibility modes.

> With pointer arrays, the compiler must often assume that the arrays
> (at least the ones of the same type) can overlap through hidden
> aliases, and this inhibits many optimizations (including, for
> example, storage into fast registers rather than slower memory).
> With other array types (static, dummy, allocatable, common,
> automatic), the compiler can assume that there are no hidden aliases
> and can optimize much more aggressively. This is why fortran has
> traditionally been faster than other languages such as C and C++ in
> which arrays are really just a different syntax for pointer
> addressing, and the pointers are assumed to be wild. That is not to
> say that it is impossible to get good performance with pointer
> arrays, but it is more effort for the programmer to recognize and
> work around the various problems.

This assumes compilers of equal quality, which is still a very heroic
assumption. It was not that long ago when allocatables were slower and
even when assumed shape (":"s) was slower than the assumed size or
explicit size of F77 style code.

When compiler vendors are claiming that the new version is many many
percent faster than the previous version one should understand it to mean
that compiler improvements are still underway and it is reasonable
to expect more to come. When the improvements are only incremental
then the mature state may be close at hand. F90 is a bigger language
than F77, and has been around for a shorter period of time, so its
optimizers are both harder to write and less complete.

> $.02 -Ron Shepard
In article <v61l611sl8b9p3v6djo4gv6uu3nhjif697@4ax.com>,
Joe Hill <georgecostanz50@hotmail.com> wrote:

> We have a program that we are running on both Xeon 32 bit and Opteron 64 bit
> cpus. The program runs much faster on the 32 bit Xeon processors. The run
> time (wall clock) is as follows :
>
> Xeon 32 bit = .017 wall clock-hours
> Opteron 64 bit = .309 wall clock-hours

Some difference could be explained several ways, but that's a pretty
darned big difference for any of the explanations. You say you were
using the same compiler for both (or anyway, that's how I interpreted
what you said), but maybe it is just the same version number. Anyway, I
can't explain that part.

> The internal customer then examined the code and changed the way arrays are
> allocated.
>
> Old Way : [pointers]
> New Way : [allocatables]
> Changing the array allocation decreased the wall clock time to almost nothing
> on both types of cpus according to our internal customer.

> Can anyone explain

Others have talked about aliasing, but I'd guess that to be the wrong
explanation here. Aliasing can be important, but I wouldn't expect to
see changes quite as big as you describe except possibly in the most
contrived special cases. However...

I have personally seen *HUGE* differences between allocatable and
pointer arrays because allocatables are known at compile time to be
contiguous, whereas pointers are not. In some compilers, this causes
unnecessary copy-in/copy-out operations. That can result in performance
penalties that are almost arbitrarily large when huge arrays get copied
around just to perform trivial operations on single elements.

A good quality (in my opinion) compiler ought to notice at run-time that
the arrays are actually contiguous and don't need copying. I know that
the NAG compiler was doing that well over a decade ago. Other compilers
were slower to pick up that it was important, but I thought most of them
had it right by now.

Have you tried other compilers? I've generally not been very happy with
the PGI ones, though admittedly my gripes with it have been about
excessive bugginess instead of speed. It has a reputation for decent
speed if you can get your code to work (I couldn't say first hand as I
got tired of working around bugs before I got into speed tests).

--
Richard Maine                     | Good judgment comes from experience;
email: my first.last at org.domain | experience comes from bad judgment.
org: nasa, domain: gov            |        -- Mark Twain
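The copy-in/copy-out effect described above can be sketched as follows. This is an illustration under assumptions, not the original poster's code: the module `touch_mod`, the routine `set_first`, and the sizes are hypothetical. The routine takes an explicit-shape dummy argument, which expects contiguous storage, and touches a single element.

```fortran
module touch_mod
  implicit none
contains
  ! Hypothetical routine: explicit-shape dummy, so the callee expects
  ! the actual argument to occupy contiguous storage.
  subroutine set_first(x, n)
    integer, intent(in) :: n
    real, intent(inout) :: x(n)
    x(1) = 0.0                    ! trivial work on one element
  end subroutine set_first
end module touch_mod

program copy_sketch
  use touch_mod
  implicit none
  integer, parameter :: n = 1000000   ! hypothetical array size
  real, dimension(:), pointer :: p    ! pointer style
  real, allocatable :: a(:)           ! allocatable style
  integer :: i

  allocate(p(n), a(n))
  p = 1.0
  a = 1.0

  do i = 1, 1000
     ! p might point at a strided section, so a compiler that does not
     ! check contiguity at run time may copy all n elements into a
     ! temporary before the call and copy them back afterwards.
     call set_first(p, n)
     ! a is known to be contiguous, so it can be passed by address
     ! with no temporary.
     call set_first(a, n)
  end do

  print *, p(1), a(1)
  deallocate(p, a)
end program copy_sketch
```

With a compiler that makes the temporary, each call on `p` moves the whole array twice to change one element, which is how the penalty can grow almost without bound. Timing the two calls separately, or reading the generated assembly, would show whether a given compiler actually does the copy.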
"Richard E Maine" <nospam@see.signature> wrote in message
news:nospam-0F7BD0.12463223042005@news.supernews.com...
> In article <v61l611sl8b9p3v6djo4gv6uu3nhjif697@4ax.com>,
> Joe Hill <georgecostanz50@hotmail.com> wrote:
>
> Some difference could be explained several ways, but that's a pretty
> darned big difference for any of the explanations. You say you were
> using the same compiler for both (or anyway, that's how I interpreted
> what you said), but maybe it is just the same version number. Anyway, I
> can't explain that part.
>
> Others have talked about aliasing, but I'd guess that to be the wrong
> explanation here. Aliasing can be important, but I wouldn't expect to
> see changes quite as big as you describe except possibly in the most
> contrived special cases. However...
>
> I have personally seen *HUGE* differences between allocatable and
> pointer arrays because allocatables are known at compile time to be
> contiguous, whereas pointers are not. In some compilers, this causes
> unnecessary copy-in/copy-out operations. That can result in performance
> penalties that are almost arbitrarily large when huge arrays get copied
> around just to perform trivial operations on single elements.

Aliasing could account for as much as a factor of 5 in performance on the
Xeon, if it makes the difference between vectorizing or not. Not as much
difference on the Opteron, but still significant, for single precision. A
larger factor might come about if temporary arrays were allocated in an
inner loop but can be eliminated by optimization with the new
declaration.
In article <mmxae.59805$VF5.16452@edtnps89>,
Gordon Sande <g.sande@worldnet.att.net> wrote:

>Assuming equal quality compilers (almost certainly not true here) the
>64 bit code will be have more memory traffic for code and pointers
>so may run a bit slower.

Remember that on the Opteron and EM64T, 64-bit mode has twice as many
registers. I've never seen a Fortran program that ran slower in 64-bit
mode. Lots of C programs do get slower in 64-bit mode, our compiler is
an example -- SGI found it was 2x slower in 64-bit mode, and we've
never even tried it. But you'd have to use an awful lot of pointers to
get a Fortran program to slow down.

By the way, while 64-bit instructions on the Opteron are sometimes one
byte bigger, there are usually fewer of them because there are fewer
register spills. And there are often fewer register-memory operations,
which also shrinks instructions. So code size generally doesn't change
dramatically between 32-bit and 64-bit on the Opteron.

As a generalization, what you said is true, but in this case, the
details reverse the generalization.

-- greg
Joe Hill wrote:
> We have a program that we are running on both Xeon 32 bit and Opteron 64 bit
> cpus. The program runs much faster on the 32 bit Xeon processors. The run time
> (wall clock) is as follows :

(snip)

> Can anyone explain why the Old Way code would run much slower on a 64 bit
> opteron CPU and can someone explain why the code would run much faster
> on both Xeon 32 bit and Opteron 64 bit using the allocatable statement instead
> of the pointer method.

Compile with the option to show the generated assembly code and post that.
Hopefully for a small program.

-- glen