Program speed execution question
Joe Hill

2005-04-23, 3:58 pm

We have a program that we are running on both Xeon 32 bit and Opteron 64 bit
cpus. The program runs much faster on the 32 bit Xeon processors. The run time
(wall clock) is as follows :

Xeon 32 bit = .017 wall clock-hours
Opteron 64 bit = .309 wall clock-hours

The code is being compiled with Portland Group Compiler version 5.2-4.

The internal customer then examined the code and changed the way arrays are
allocated.

Old Way :

real, dimension (:,:), pointer :: p
integer, dimension(:), pointer :: jmax,kmax,Lmax,ipbeg

New Way :

allocatable :: p(:,:)
allocatable :: jmax(:), kmax(:),Lmax(:),ipbeg(:)

Changing the array allocation decreased the wall clock time to almost nothing on
both types of cpus according to our internal customer.
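
For illustration, the two declaration styles can be put side by side in a small compilable sketch. The module and function names (decl_demo, sum_via_pointer, sum_via_allocatable) are hypothetical and not from the original program, and the "New Way" lines above are assumed to be ALLOCATABLE attribute statements whose types are declared elsewhere; here the type and attribute are written on one line.

```fortran
module decl_demo
  implicit none
contains

  ! "Old Way": the array is reached through a POINTER.
  function sum_via_pointer(n, m) result(total)
    integer, intent(in) :: n, m
    real :: total
    real, dimension(:,:), pointer :: p

    allocate(p(n, m))
    p = 1.0
    total = sum(p)
    deallocate(p)      ! a pointer target is not freed automatically
  end function sum_via_pointer

  ! "New Way": the array is ALLOCATABLE.
  function sum_via_allocatable(n, m) result(total)
    integer, intent(in) :: n, m
    real :: total
    real, allocatable :: p(:,:)

    allocate(p(n, m))
    p = 1.0
    total = sum(p)
    ! a local allocatable is deallocated automatically on return
  end function sum_via_allocatable

end module decl_demo
```

Both functions return the same value; only the information the compiler has about the storage differs.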

Can anyone explain why the Old Way code would run much slower on a 64 bit
Opteron CPU and can someone explain why the code would run much faster
on both Xeon 32 bit and Opteron 64 bit using the allocatable statement instead
of the pointer method?

Thanx,

BB
The Boeing Company
Janne Blomqvist

2005-04-24, 3:57 am

In article <v61l611sl8b9p3v6djo4gv6uu3nhjif697@4ax.com>, Joe Hill wrote:
> We have a program that we are running on both Xeon 32 bit and Opteron 64 bit
> cpus. The program runs much faster on the 32 bit Xeon processors. The run time
> (wall clock) is as follows :
>
> Xeon 32 bit = .017 wall clock-hours
> Opteron 64 bit = .309 wall clock-hours
>
> The code is being compiled with Portland Group Compiler version 5.2-4.


With a difference that big (almost 20x!) I'd guess at a compiler
(optimization) bug. Can you try a 32-bit executable on the Opteron? Or
try some other compiler?
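
For reference, the reported times work out to about 61 s (.017 h) on the Xeon and about 1112 s (.309 h) on the Opteron, a ratio of roughly 18. When comparing compilers or 32-bit and 64-bit builds, it can help to time a small kernel in isolation. The following is only a sketch: the module name timing_demo, the function time_kernel and the kernel itself are made up for illustration and have nothing to do with the original program.

```fortran
module timing_demo
  use iso_fortran_env, only: int64
  implicit none
contains

  ! Wall-clock seconds taken by `reps` passes of a simple
  ! array update over a(:), measured with SYSTEM_CLOCK.
  function time_kernel(a, reps) result(seconds)
    real, intent(inout) :: a(:)
    integer, intent(in) :: reps
    real :: seconds
    integer(int64) :: t0, t1, rate
    integer :: r

    call system_clock(t0, rate)       ! start count and ticks per second
    do r = 1, reps
      a = a * 1.0001 + 0.5            ! placeholder kernel
    end do
    call system_clock(t1)             ! end count
    seconds = real(t1 - t0) / real(rate)
  end function time_kernel

end module timing_demo
```

Building the same source with each compiler and option set, and calling time_kernel with a large array, gives numbers that are easier to compare than whole-program run times.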

> The internal customer then examined the code and changed the way arrays are
> allocated.
>
> Old Way :
>
> real, dimension (:,:), pointer :: p
> integer, dimension(:), pointer :: jmax,kmax,Lmax,ipbeg
>
> New Way :
>
> allocatable :: p(:,:)
> allocatable :: jmax(:), kmax(:),Lmax(:),ipbeg(:)
>
> Changing the array allocation decreased the wall clock time to almost nothing on
> both types of cpus according to our internal customer.
>
> Can anyone explain why the Old Way code would run much slower on a 64 bit
> Opteron CPU and can someone explain why the code would run much faster
> on both Xeon 32 bit and Opteron 64 bit using the allocatable statement instead
> of the pointer method?


Pointers allow aliasing, but allocatable arrays don't, so the
optimizer has to be more conservative with pointers. However, I
wouldn't expect to see a 20x performance difference due to aliasing.
Perhaps a few tens of percent, or maybe a factor of two, could be
explained by aliasing, but not 20x IMHO.



--
Janne Blomqvist
Ron Shepard

2005-04-24, 3:57 am

In article <v61l611sl8b9p3v6djo4gv6uu3nhjif697@4ax.com>,
Joe Hill <georgecostanz50@hotmail.com> wrote:

> Can anyone explain why the Old Way code would run much slower on a 64 bit
> Opteron CPU


Without knowing any more details about the hardware, I would guess
that the Xeon CPU is simply faster than the Opteron CPU. It might
also be cache size, memory speed, total memory, or other features of
the two machines, or it could be that the compiler optimizations are
different for the two machines.

> and can someone explain why the code would run much faster
> on both Xeon 32 bit and Opteron 64 bit using the allocatable statement instead
> of the pointer method.


With pointer arrays, the compiler must often assume that the arrays
(at least the ones of the same type) can overlap through hidden
aliases, and this inhibits many optimizations (including, for
example, storage into fast registers rather than slower memory).
With other array types (static, dummy, allocatable, common,
automatic), the compiler can assume that there are no hidden aliases
and can optimize much more aggressively. This is why Fortran has
traditionally been faster than other languages such as C and C++ in
which arrays are really just a different syntax for pointer
addressing, and the pointers are assumed to be wild. That is not to
say that it is impossible to get good performance with pointer
arrays, but it is more effort for the programmer to recognize and
work around the various problems.
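
A minimal sketch of the aliasing point, with hypothetical names (alias_demo, shift_add_ptr, shift_add_plain) that do not come from the original program: the two loops are textually identical, but in the pointer version a and b may refer to the same storage, so each iteration can depend on the previous one. In the non-pointer version the language rules let the compiler assume that b is not changed through a, so the iterations are independent.

```fortran
module alias_demo
  implicit none
contains

  ! POINTER dummies: a and b may overlap. If they point at the same
  ! array, a(i) depends on the a(i-1) just stored, so the loop has to
  ! be treated as sequential unless the compiler can prove otherwise.
  subroutine shift_add_ptr(a, b)
    real, dimension(:), pointer :: a, b
    integer :: i
    do i = 2, size(a)
      a(i) = a(i) + b(i-1)
    end do
  end subroutine shift_add_ptr

  ! Ordinary assumed-shape dummies: a is modified, so the caller must
  ! not pass overlapping arguments, and the compiler may keep values in
  ! registers or vectorize the loop.
  subroutine shift_add_plain(a, b)
    real, intent(inout) :: a(:)
    real, intent(in)    :: b(:)
    integer :: i
    do i = 2, size(a)
      a(i) = a(i) + b(i-1)
    end do
  end subroutine shift_add_plain

end module alias_demo
```

Calling shift_add_ptr with both pointers associated with the same array of ones turns it into a running sum, which is exactly the case the optimizer has to allow for.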

$.02 -Ron Shepard
Gordon Sande

2005-04-24, 3:57 am



Ron Shepard wrote:
> In article <v61l611sl8b9p3v6djo4gv6uu3nhjif697@4ax.com>,
> Joe Hill <georgecostanz50@hotmail.com> wrote:
>
> Without knowing any more details about the hardware, I would guess
> that the Xeon CPU is simply faster than the Opteron CPU. It might
> also be cache size, memory speed, total memory, or other features of
> the two machines, or it could be that the compiler optimizations are
> different for the two machines.
>


Assuming equal quality compilers (almost certainly not true here) the
64 bit code will have more memory traffic for code and pointers
so may run a bit slower. This was true back when 16 bit processors were
being replaced by 32 bit ones, etc, etc. The extra memory traffic may
also cause more cache misses. Cache issues were not a big deal at the
previous 16 to 32 bit transition time. All of this is masked by the
64 bit processors having faster hardware than the previous generation
processors, so it will only show up when running on processors of the
same electronic speed, as may be true with compatibility modes.

>
> With pointer arrays, the compiler must often assume that the arrays
> (at least the ones of the same type) can overlap through hidden
> aliases, and this inhibits many optimizations (including, for
> example, storage into fast registers rather than slower memory).
> With other array types (static, dummy, allocatable, common,
> automatic), the compiler can assume that there are no hidden aliases
> and can optimize much more aggressively. This is why Fortran has
> traditionally been faster than other languages such as C and C++ in
> which arrays are really just a different syntax for pointer
> addressing, and the pointers are assumed to be wild. That is not to
> say that it is impossible to get good performance with pointer
> arrays, but it is more effort for the programmer to recognize and
> work around the various problems.
>


This assumes compilers of equal quality, which is still a very
heroic assumption. It was not that long ago that allocatables
were slower, and even assumed shape (":"s) was slower than
the assumed size or explicit size of F77 style code.

When compiler vendors are claiming that the new version is many many
percent faster than the previous version one should understand it to
mean that compiler improvements are still underway and it is reasonable
to expect more to come. When the improvements are only incremental
then the mature state may be close at hand.

F90 is a bigger language than F77, and has been around for a shorter
period of time, so its optimizers are both harder to write and
less complete.

> $.02 -Ron Shepard

Richard E Maine

2005-04-24, 3:57 am

In article <v61l611sl8b9p3v6djo4gv6uu3nhjif697@4ax.com>,
Joe Hill <georgecostanz50@hotmail.com> wrote:

> We have a program that we are running on both Xeon 32 bit and Opteron 64 bit
> cpus. The program runs much faster on the 32 bit Xeon processors. The run
> time
> (wall clock) is as follows :
>
> Xeon 32 bit = .017 wall clock-hours
> Opteron 64 bit = .309 wall clock-hours


Some difference could be explained several ways, but that's a pretty
darned big difference for any of the explanations. You say you were
using the same compiler for both (or anyway, that's how I interpreted
what you said), but maybe it is just the same version number. Anyway, I
can't explain that part.

> The internal customer then examined the code and changed the way arrays are
> allocated.
>
> Old Way : [pointers]
> New Way : [allocatables]


> Changing the array allocation decreased the wall clock time to almost nothing
> on
> both types of cpus according to our internal customer.
> Can anyone explain


Others have talked about aliasing, but I'd guess that to be the wrong
explanation here. Aliasing can be important, but I wouldn't expect to
see changes quite as big as you describe except possibly in the most
contrived special cases. However...

I have personally seen *HUGE* differences between allocatable and
pointer arrays because allocatables are known at compile time to be
contiguous, whereas pointers are not. In some compilers, this causes
unnecessary copy-in/copy-out operations. That can result in performance
penalties that are almost arbitrarily large when huge arrays get copied
around just to perform trivial operations on single elements.

A good quality (in my opinion) compiler ought to notice at run-time that
the arrays are actually contiguous and don't need copying. I know that
the NAG compiler was doing that well over a decade ago. Other compilers
were slower to pick up that it was important, but I thought most of them
had it right by now.
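
A sketch of the situation described above, under stated assumptions: the names (contig_demo, bump_first, needs_copy) are hypothetical, and IS_CONTIGUOUS is a Fortran 2008 intrinsic that was not available to compilers of the time. A pointer associated with a strided section is not contiguous, so passing it to an explicit-shape dummy generally forces the compiler to copy the data in and out around the call. A pointer associated with a whole array is contiguous, and a run-time check like the one in needs_copy is what lets a compiler skip the copy.

```fortran
module contig_demo
  implicit none
contains

  ! Explicit-shape dummy: the callee expects n contiguous elements,
  ! although it only touches the first one.
  subroutine bump_first(x, n)
    integer, intent(in) :: n
    real, intent(inout) :: x(n)
    x(1) = x(1) + 1.0
  end subroutine bump_first

  ! True when the pointer's target is not contiguous, i.e. when a call
  ! to bump_first would need a temporary copy of the whole array.
  function needs_copy(p) result(copy)
    real, dimension(:), pointer, intent(in) :: p
    logical :: copy
    copy = .not. is_contiguous(p)
  end function needs_copy

end module contig_demo
```

If a compiler makes the copy unconditionally for every pointer argument, a call like bump_first on a huge array costs a full copy in and a full copy out in order to update one element.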

Have you tried other compilers? I've generally not been very happy with
the PGI ones, though admittedly my gripes with it have been about
excessive bugginess instead of speed. It has a reputation for decent
speed if you can get your code to work (I couldn't say first hand as I
got tired of working around bugs before I got into speed tests).

--
Richard Maine | Good judgment comes from experience;
email: my first.last at org.domain | experience comes from bad judgment.
org: nasa, domain: gov | -- Mark Twain
Tim Prince

2005-04-24, 3:57 am


"Richard E Maine" <nospam@see.signature> wrote in message
news:nospam-0F7BD0.12463223042005@news.supernews.com...
> In article <v61l611sl8b9p3v6djo4gv6uu3nhjif697@4ax.com>,
> Joe Hill <georgecostanz50@hotmail.com> wrote:
>
>
> Some difference could be explained several ways, but that's a pretty
> darned big difference for any of the explanations. You say you were
> using the same compiler for both (or anyway, that's how I interpreted
> what you said), but maybe it is just the same version number. Anyway, I
> can't explain that part.
>
> Others have talked about aliasing, but I'd guess that to be the wrong
> explanation here. Aliasing can be important, but I wouldn't expect to
> see changes quite as big as you describe except possibly in the most
> contrived special cases. However...
>
> I have personally seen *HUGE* differences between allocatable and
> pointer arrays because allocatables are known at compile time to be
> contiguous, whereas pointers are not. In some compilers, this causes
> unnecessary copy-in/copy-out operations. That can result in performance
> penalties that are almost arbitrarily large when huge arrays get copied
> around just to perform trivial operations on single elements.

Aliasing could account for as much as a factor of 5 in performance on the
Xeon, if it makes the difference between vectorizing or not. Not as much
difference on the Opteron, but still significant, for single precision. A
larger factor might come about if temporary arrays were being allocated in
an inner loop and the optimizer can eliminate them with the new declaration.


Greg Lindahl

2005-04-24, 3:57 am

In article <mmxae.59805$VF5.16452@edtnps89>,
Gordon Sande <g.sande@worldnet.att.net> wrote:

>Assuming equal quality compilers (almost certainly not true here) the
>64 bit code will have more memory traffic for code and pointers
>so may run a bit slower.


Remember that on the Opteron and EM64T, 64-bit mode has twice as many
registers. I've never seen a Fortran program that ran slower in 64-bit
mode. Lots of C programs do get slower in 64-bit mode; our compiler is
an example. SGI found it was 2x slower in 64-bit mode, and we've
never even tried it. But you'd have to use an awful lot of pointers to
get a Fortran program to slow down.

By the way, while 64-bit instructions on the Opteron are sometimes one
byte bigger, there are usually fewer of them because there are fewer
register spills. And there are often fewer register-memory operations,
which also shrinks instructions. So code size generally doesn't change
dramatically between 32-bit and 64-bit on the Opteron.

As a generalization, what you said is true, but in this case, the
details reverse the generalization.

-- greg

glen herrmannsfeldt

2005-04-27, 8:58 am

Joe Hill wrote:

> We have a program that we are running on both Xeon 32 bit and Opteron 64 bit
> cpus. The program runs much faster on the 32 bit Xeon processors. The run time
> (wall clock) is as follows :


(snip)

> Can anyone explain why the Old Way code would run much slower on a 64 bit
> Opteron CPU and can someone explain why the code would run much faster
> on both Xeon 32 bit and Opteron 64 bit using the allocatable statement instead
> of the pointer method?


Compile with the option to show the generated assembly
code and post that, preferably for a small program.

-- glen


Copyright 2008 codecomments.com