| Author |
AMD Opteron + intel fortran (linux)
|
|
|
| Hi,
We've purchased an Opteron workstation. We've been using Intel Fortran
(present version is 9.1) on several Pentium-based Linux workstations
for several years with no problems. We've noticed that the Intel
compiler fails to run code compiled with SSE3 (-m sse3) extensions
using the 32-bit and 64-bit versions of ifort. There's a program
that was written to strip executables of a CPU check that Intel places
in the files (http://www.swallowtail.org/naughty-intel.html) . When
we run patched binaries compiled with SSE3 (64-bit) we see runtimes
fall from 3.2 sec to 2.8 sec (which will add up in the way we use our
code for research).
What is the general consensus within this forum on Intel compilers on
AMD hardware? We are very displeased that Intel HAS crippled their
compiler on non-Intel hardware.
Which compilers do people here recommend for the AMD platform?
Thank you,
jmj75
| |
| Steven G. Kargl 2007-03-12, 7:06 pm |
| In article <1173719696.056309.132450@s48g2000cws.googlegroups.com>,
"jmj75" <jmorales75@gmail.com> writes:
>
> Which compilers do people here recommend for the AMD platform?
>
I prefer gfortran, but I may also be biased because it is the
only native compiler available for FreeBSD-amd64 and I contribute
patches to gfortran.
You can see performance comparisons for various compilers at
Polyhedron's website.
--
Steve
http://troutmask.apl.washington.edu/~kargl/
| |
| Steve Lionel 2007-03-12, 7:06 pm |
| jmj75 wrote:
> Hi,
>
> We've purchased an Opteron workstation, we've been using Intel Fortran
> (present version is 9.1) on several pentium based linux workstations
> for several years with no problems. We've noticed that the Intel
> compiler fails to run code compiled with SSE3 (-m sse3) extensions
> using the 32-bit and 64-bit versions of ifort.
Have you tried compiling with -xW ? Do you know for certain that it is
the SSE3 instructions specifically that are making the difference?
Steve Lionel
Developer Products Division
Intel Corporation
Nashua, NH
User communities for Intel Software Development Products
http://softwareforums.intel.com/
Intel Fortran Support
http://developer.intel.com/software/products/support/
My Fortran blog
http://www.intel.com/software/drfortran
| |
| Sebastian Hanigk 2007-03-12, 7:06 pm |
| "jmj75" <jmorales75@gmail.com> writes:
> What is the general consensus within this forum on Intel compilers on
> AMD hardware. We are very displeased that Intel HAS crippled their
> compiler on non-Intel hardware.
Although I am in a similar situation to yours, I can understand that
Intel has its reasons; they are, after all, not only in the business of
producing compilers but also of selling processors.
Perhaps a strained analogy, but Apple does not sell its OS in a version
runnable on non-Apple hardware.
> Which compilers do people here recommend for the AMD platform?
We have a cluster of Opteron 850s (more detail under
<http://infiniband.in.tum.de/> ) with the Intel 9.0 (20050430) and PGI
5.2 and 6.2 commercial compilers, if necessary, one could install
gfortran or g95 and I'm also using Sun's Studio 11 compilers for Linux
from December 2006.
If your code depends on BLAS/LAPACK routines heavily, I would recommend
the PGI compilers, gfortran or g95 (AMD has versions of its ACML
compiled for those); with my tests, the Intel binaries ran slower than
PGI-compiled ones, but as already stated, my code mostly uses BLAS
routines.
The Sun compiler has advantages, i.e. the collect/analyze framework is
nice and - most important - it has a working ISO_C_BINDING module from
the F2003 standard.
Sebastian
| |
| Greg Lindahl 2007-03-12, 7:06 pm |
| In article <et48el$ou7$1@news.lrz-muenchen.de>,
Sebastian Hanigk <hanigk@in.tum.de> wrote:
> (AMD has versions of its ACML compiled for those);
.... and a version for PathScale, too.
-- greg
(employed by, not speaking for, QLogic/PathScale.)
| |
| joehill@hotmail.com 2007-03-12, 7:06 pm |
| Sebastian,
There is now a downloadable ACML built for the Intel compilers. I
had a program that ran very slowly with Intel when using the gcc
compiled version of ACML, but with the ACML natively compiled with
intel, it runs just as fast as PGI.
Joe
On Mon, 12 Mar 2007 20:08:40 +0100, Sebastian Hanigk
<hanigk@in.tum.de> wrote:
>"jmj75" <jmorales75@gmail.com> writes:
>
>
>As though I am in a similar situation as you, I can understand that
>Intel has its reasons, they are after all not only in the business of
>producing compilers but also selling processors.
>
>Perhaps a strained analogy, but Apple does not sell its OS in a version
>runnable on non-Apple hardware.
>
>
>We have a cluster of Opteron 850s (more detail under
><http://infiniband.in.tum.de/> ) with the Intel 9.0 (20050430) and PGI
>5.2 and 6.2 commercial compilers, if necessary, one could install
>gfortran or g95 and I'm also using Sun's Studio 11 compilers for Linux
>from December 2006.
>
>If your code depends on BLAS/LAPACK routines heavily, I would recommend
>the PGI compilers, gfortran or g95 (AMD has versions of its ACML
>compiled for those); with my tests, the Intel binaries ran slower than
>PGI-compiled ones, but as already stated, my code mostly uses BLAS
>routines.
>
>The Sun compiler has advantages, i.e. the collect/analyze framework is
>nice and - most important - it has a working ISO_C_BINDING module from
>the F2003 standard.
>
>
>Sebastian
| |
| Jan Vorbrüggen 2007-03-13, 4:15 am |
| > When we run patched binaries compiled with SSE3 (64-bit) we see runtimes
> fall from 3.2 sec to 2.8 sec (which will add up in the way we use our
> code for research).
>
> What is the general consensus within this forum on Intel compilers on
> AMD hardware. We are very displeased that Intel HAS crippled their
> compiler on non-Intel hardware.
Turn that example around: There is no check in the Intel library code, and
your program runs fine on an AMD processor, but silently produces wrong
results because AMD processors implement some corner case differently from the
Intel processors. Unlikely? Sure. Possible? Yes. So who are you going to sue
in this case, because your PhD results turn out to be crap, or something even
worse (not that that is really possible 8-)) has happened?
You solved your problem by using the patched library, you get a speed
improvement, and you take the risk of incompatibility yourself, as it should
be. So why are you complaining?
And no conspiracy theories, please. I work in a multinational that's large
enough, even if it's significantly smaller than Intel. In such cases, the legal
department has veto power, I can tell you. The marketing guys just go along,
and the engineers make sure the necessary knowledge gets out.
Jan
| |
| Sebastian Hanigk 2007-03-13, 7:10 pm |
| joehill@hotmail.com writes:
> There is now a downloadable ACML built for the Intel compilers. I
> had a program that ran very slowly with Intel when using the gcc
> compiled version of ACML, but with the ACML natively compiled with
> intel, it runs just as fast as PGI.
Nice, thanks!
By the way: does anyone have tips for a more structured approach to building
against multiple versions of libraries with different compilers? At the
moment, I have a Makefile for each interesting combination (only three
at the moment), but this is clearly unscalable and with the addition of
new math libraries and MPI variants, I really have to do it in a more
modularly structured way.
Sebastian
| |
| Keith Refson 2007-03-14, 8:08 am |
| jmj75 wrote:
[snip]
> There's a program that was written to strip executables of a CPU
> check that Intel places in the files
> (http://www.swallowtail.org/naughty-intel.html) . When we run
> patched binaries compiled with SSE3 (64-bit) we see runtimes fall
> from 3.2 sec to 2.8 sec
> What is the general consensus within this forum on Intel compilers on
> AMD hardware. We are very displeased that Intel HAS crippled their
> compiler on non-Intel hardware.
I don't speak for the consensus, only my own experience! Which
is that even the "crippled" Intel compiler code performance is
highly competitive with its, er, competitors. Take a look at
the Polyhedron website benchmarks (http://www.polyhedron.com).
But you didn't say which intel compiler flags you used. (You
mentioned -m sse3 but that is a gcc flag and NOT an intel one).
I am told (not 100% sure, but doubtless someone will correct me) that
if you use the "-axW" flag (i.e. "use SSE2 instructions if the
processor has SSE2"), the code will NOT choose to run the
SSE3 variant on AMD processors, and you do not get the benefit
on Opterons. But if you use "-xW" (i.e. "use SSE2 and damn the
consequences"), then you do get the benefit of the SSE2 instruction
set on AMD Opteron processors. I can confirm the latter works (*), and
I use "-xW" compiled code on Opterons for performance.
I would hazard a guess that the behaviour with respect to the "-xP"
flag for SSE3 is similar.
(*) - apart from one specific case where incorrect code is generated and
the results are incorrect - but that happens with Xeon processors
too so it's just a bug.
> Which compilers do people here recommend for the AMD platform?
Keith Refson
--
Dr Keith Refson,
Building R3
Rutherford Appleton Laboratory
Chilton
Didcot kr AT
Oxfordshire OX11 0QX isise D@T rl D.T ac D?T uk
| |
| Keith Refson 2007-03-14, 8:08 am |
| Sebastian Hanigk wrote:
> If your code depends on BLAS/LAPACK routines heavily, I would recommend
> the PGI compilers, gfortran or g95 (AMD has versions of its ACML
> compiled for those); with my tests, the Intel binaries ran slower than
> PGI-compiled ones, but as already stated, my code mostly uses BLAS
> routines.
and others noted that ACML is now available for Intel and Pathscale compilers
too. I plan to test the Intel version now, but I would be rather surprised if
it was the fastest BLAS. Intel's MKL has always benchmarked faster for me,
even on AMD processors, but is significantly outperformed by Kazushige Goto's
assembler BLAS, which is what I always recommend. It's also free of charge.
http://www.tacc.utexas.edu/resources/software/
Keith Refson
--
Dr Keith Refson,
Building R3
Rutherford Appleton Laboratory
Chilton
Didcot kr AT
Oxfordshire OX11 0QX isise D@T rl D.T ac D?T uk
| |
| Keith Refson 2007-03-14, 8:08 am |
| Sebastian Hanigk wrote:
> By the way: has anyone tips for a more structured approach to build
> against multiple versions of libraries with different compilers? At the
> moment, I have a Makefile for each interesting combination (only three
> at the moment), but this is clearly unscalable and with the addition of
> new math libraries and MPI variants, I really have to do it in a more
> modularly structured way.
If anyone else knows of a decent approach to this I'd be interested to hear
it too. Standard tools like GNU autoconf are absolutely useless for this
task.
The approach I have taken with the CASTEP code is to (a) mandate GNU make.
This allows for conditional setting of options and arguments inside the
makefile and (b) Generic makefiles with separate include files for each
separate compiler/OS combination. The include files can share conditional
code for library switching. With source/objects separated this allows for
multiple builds using the same source tree. It keeps the scalability
manageable with just a single OS/compiler dependent file for each port.
This approach is not ideal but "good enough". The one thing it can't handle
is finding the link paths for the different libraries automatically. I resort
to prompting on the first compile and storing the result in a file. Later builds
look up the file, so you only get the prompt on the first "make" for any given
combination.
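
The prompt-once-and-store step described above can be sketched roughly as
follows. This is only a minimal illustration in Python, not the CASTEP build
system itself: the function name library_path, the JSON cache file and the
combination label "ifort-acml" are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def library_path(combo, cache_file, ask):
    """Return the link path for a compiler/library combination.

    `ask` is called only when no stored answer exists for `combo`
    (hypothetical helper, e.g. a function wrapping input()).
    """
    cache_file = Path(cache_file)
    # Load answers stored by earlier builds, if the file exists.
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    if combo not in cache:
        # First build for this combination: ask once and remember the answer.
        cache[combo] = ask(combo)
        cache_file.write_text(json.dumps(cache, indent=2))
    return cache[combo]


if __name__ == "__main__":
    # Demo in a throwaway directory; the library path is a placeholder.
    with tempfile.TemporaryDirectory() as tmp:
        cache_path = Path(tmp) / "libpaths.json"
        first = library_path("ifort-acml", cache_path, lambda c: "/path/to/acml/lib")
        # The second call reads the stored value and never calls `ask`.
        second = library_path("ifort-acml", cache_path, lambda c: "not used")
        print(first, second)
```

In a real setup the stored value would be handed to make (for example as a
variable on the command line), so only the very first build of a given
combination needs any interaction.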
Keith Refson
--
Dr Keith Refson,
Building R3
Rutherford Appleton Laboratory
Chilton
Didcot kr AT
Oxfordshire OX11 0QX isise D@T rl D.T ac D?T uk
| |
| Sebastian Hanigk 2007-03-14, 8:08 am |
| Keith Refson <kr@isise.rl.ac.uk> writes:
>
> If anyone else knows of a decent approach to this I'd be interested to hear
> it too. Standard tools like GNU autoconf are absolutely useless for this
> task.
Short update: I have searched a bit for make replacements and in the
next week or two I will look into SCons (<http://www.scons.org/> ); it's
based on Python and with builtin support for multiple environments, easy
separation of source and build directories and a few more features, it
seems worth a test.
> The approach I have taken with the CASTEP code is to (a) mandate GNU make.
> This allows for conditional setting of options and arguments inside the
> makefile and (b) Generic makefiles with separate include files for each
> separate compiler/OS combination. The include files can share conditional
> code for library switching. With source/objects separated this allows for
> multiple builds using the same source tree. It keeps the scalability
> manageable with just a single OS/compiler dependent file for each
> port.
With respect to make, I have taken the approach suggested in "Recursive
Make Considered Harmful"
(<http://members.pcug.org.au/~millerp...-cons-harm.html> ),
i.e. one root Makefile, sourcing sub-Makefiles for
modules/directories. It works for me better than recursive make, but as
stated in my previous post, the whole situation is not really suited for
test and development with different build environments.
Sebastian
| |
| Sebastian Hanigk 2007-03-14, 8:08 am |
| Keith Refson <kr@isise.rl.ac.uk> writes:
> and others noted that ACML is now available for Intel and Pathscale
> compilers too. I plan to test the Intel version now, but I would be
> rather surprised if it was the fastest BLAS. Intel's MKL has always
> benchmarked faster for me, even on AMD processors, but is
> significantly outperformed by Kazushige Goto's assembler BLAS, which
> is what I always recommend. It's also free of charge.
Simple tests of DGEMM (most used routine in my code) have resulted in
the following: on Itanium processors, ifort + MKL reach a constant 95%
of theoretical peak performance with matrix size of 200 elements per
dimension and up; no major wobble. Second best efficiency-wise was ifort
+ SGI's SCSL (on Itanium) with around 85%, but more dependent on the
matrix size.
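
Efficiency figures like these are normally the achieved floating-point rate
divided by the theoretical peak. A minimal Python sketch of that arithmetic,
assuming the conventional count of 2*n^3 operations for an n-by-n DGEMM; the
run time and peak rate below are placeholders, not measurements from these
tests.

```python
def dgemm_flops(n):
    """Floating-point operations for an n x n times n x n product.

    Uses the conventional count: n^3 multiplications plus n^3 additions.
    """
    return 2 * n**3


def efficiency(n, seconds, peak_flops_per_s):
    """Fraction of theoretical peak reached by one DGEMM call."""
    achieved = dgemm_flops(n) / seconds
    return achieved / peak_flops_per_s


if __name__ == "__main__":
    # Placeholder values for illustration only (not measured data).
    n = 200
    seconds = 0.004
    peak = 6.4e9
    print(f"n={n}: {efficiency(n, seconds, peak):.1%} of peak")
```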
On Opteron processors, the combination of PGI and ACML resulted in an
efficiency asymptotically comparable to the SCSL, but around a size of
200, a major performance impact with following ramp-up until peak
efficiency can be observed. Cutting the curve into two around the impact
point, both look very similar, so I suspect that around that size, the
library switches internally to another set of routines.
I don't know if there is any interest in the performance data, but if
so, I can provide data for Itanium and Opteron performance of the ACML
and MKL with the Intel and PGI compilers around next week.
Sebastian
| |
| Pierre Asselin 2007-03-14, 7:12 pm |
| Keith Refson <kr@isise.rl.ac.uk> wrote:
> Sebastian Hanigk wrote:
> If anyone else knows of a decent approach to this I'd be interested to hear
> it too. Standard tools like GNU autoconf are absolutely useless for this
> task.
GNU configure scripts routinely check for third-party libraries
and adjust the build accordingly. They also accept manual overrides.
If I unpack my gnuplot tarball and run "./configure --help", I get
...
--with-PACKAGE[=ARG] use PACKAGE [ARG=yes]
--without-PACKAGE do not use PACKAGE (same as --with-PACKAGE=no)
...
--with-plot=DIR use the Unix plot library
--with-png=DIR where to find the png library
--with-gd=DIR where to find Tom Boutell's gd library
--with-gif=png 'set term gif' produces png images instead
(libgd version >= 1.8)
--with-pdf=DIR enable pdf terminal
(requires PDFLib)
and many more. Mind you, *I* have no clue how to write the
corresponding autoconf inputs. The gnuplot ones were generated by
automake and therefore are illegible.
--
pa at panix dot com
| |
| Tim Prince 2007-03-14, 7:12 pm |
| Keith Refson wrote:
>
> But you didn't say which intel compiler flags you used. (You
> mentioned -m sse3 but that is a gcc flag and NOT an intel one).
>
> if you use "-xW" (ie "use SSE2 and damn the
> consequences") then you do get the benefit of the SSE2 instruction
> set on AMD Opteron processors. I can confirm the latter works (*), and
> use "-xW" compiled code on Opterons for performance.
>
> I would hazard a guess that the behaviour with respect to the "-xP"
> flag for SSE3 is similar.
>
Why guess about this? ifort for x86-64 supports the combination -xWP,
which would execute plain SSE2 codes on Opteron, and SSE3 code on recent
Intel processors.
The thread has become long and digressive, but I didn't yet see any
quotation of actual Fortran generated SSE3 code which is claimed to
improve performance over SSE2.
| |
| Helge Avlesen 2007-03-14, 7:12 pm |
| Sebastian Hanigk <hanigk@in.tum.de> writes:
> By the way: has anyone tips for a more structured approach to build
> against multiple versions of libraries with different compilers? At the
> moment, I have a Makefile for each interesting combination (only three
> at the moment), but this is clearly unscalable and with the addition of
> new math libraries and MPI variants, I really have to do it in a more
> modularly structured way.
I prefer to build libraries separately with their respective provided
build systems, and then use my home grown dependency generator to
build the set of files that are edited more regularly with a single,
shortish (GNU) makefile. (see www.ii.uib.no/~avle/mkdep.html for
details)
on our local cluster we have 4 fortran compilers and more than 5 mpi
libraries, where only one mpi can use all compilers (Scampi) so this
gives a few combinations... I find short, well structured makefiles
using make variables to choose the desired combination of flags, to be
the most convenient way out of the mess.
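
One way to picture the "make variables choose the combination" idea is a small
driver that lists the valid compiler/MPI pairs and the make command for each.
This is only a sketch under assumptions: "scampi" comes from the post above,
while "mpi_a", the compiler list, and the MPI and BUILD_DIR variable names are
hypothetical and would have to match whatever the makefile actually reads.

```python
from itertools import product

# Hypothetical compiler list; adjust to what is installed locally.
COMPILERS = ["ifort", "pgf90", "gfortran", "g95"]

# Which compilers each MPI library can be used with (illustrative only).
MPI_SUPPORT = {
    "scampi": set(COMPILERS),         # usable with every compiler
    "mpi_a": {"ifort", "gfortran"},   # placeholder for a more limited MPI
}


def valid_combinations(compilers, mpi_support):
    """All (compiler, mpi) pairs where the MPI supports the compiler."""
    return [
        (fc, mpi)
        for fc, mpi in product(compilers, sorted(mpi_support))
        if fc in mpi_support[mpi]
    ]


def make_command(fc, mpi):
    """Command line overriding make variables for one combination.

    Each combination gets its own build directory so object files
    from different compilers never mix.
    """
    return ["make", f"FC={fc}", f"MPI={mpi}", f"BUILD_DIR=build/{fc}-{mpi}"]


if __name__ == "__main__":
    for fc, mpi in valid_combinations(COMPILERS, MPI_SUPPORT):
        print(" ".join(make_command(fc, mpi)))
```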
Helge
| |
|
| On Mar 13, 3:21 am, Jan Vorbrüggen <jvorbrueg...@not-mediasec.de>
wrote:
> Turn that example around: There is no check in the Intel library code, and
> your program runs fine on an AMD processor, but silently produces wrong
> results because AMD processors implement some corner case differently from the
> Intel processors. Unlikely? Sure. Possible? Yes. So who are you going to sue
> in this case, because your PhD results turn out to be crap, or something even
> worse (not that that is really possible 8-)) has happened?
I will sue nobody, we check results before we publish them...compiler
optimizations are used with care and results are cross checked with
those obtained without them.
> You solved your problem by using the patched library, you get a speed
> improvement, and you take the risk of incompatibility yourself, as it should
> be. So why are you complaining.
Not complaining, simply providing observations and an opinion, and posing
a question to others who use AMD systems. Please read my original
post carefully.
> And no conspiracy theories, please. I work in a multinational that's large
> enough, even if it's significant smaller than Intel. In such cases, the legal
> department has veto power, I can tell you. The marketing guys just go along,
> and the engineers make sure the necessary knowledge gets out.
I have no conspiracy theory to offer you. Just an observation on a
product that does not function on compatible hardware, I don't care
about the details on how Intel is running their business. We are
displeased with the Intel Compiler's performance on our new hardware
and will move on to another compiler for our AMD system.
jmj
| |
| Mark Mackey 2007-03-14, 7:12 pm |
| In article <45F7C744.2060302@isise.rl.ac.uk>,
Keith Refson <kr@isise.rl.ac.uk> wrote:
>
>I am told that (not 100% sure but doubtless someone will
>correct me) that if you use the "-axW" flag (ie specify "use SSE2
>flags if the processor has sse2) will NOT choose to run the
>SSE3 variant on AMD processors, and you do not get the benefit
>on opterons, but if you use "-xW" (ie "use SSE2 and damn the
>consequences") then you do get the benefit of the SSE2 instruction
>set on AMD Opteron processors. I can confirm the latter works (*), and
>use "-xW" compiled code on Opterons for performance.
Nearly correct, but not quite :). Assuming we're talking 32-bit here,
-xW makes the compiler generate code which assumes that SSE2 is
available, so if using SSE2 instructions would result in faster code in
the compiler's opinion it will do so. So yes, -xW makes the main body of
code in your program use SSE2 even on AMD chips.
However, and it's a big however, the math libraries linked in by the
compiler were all effectively compiled with -axKWNBP: they contain
multiple different versions of the same routine and choose which to use
based on runtime CPU capability detection. As a result, with the
unpatched compiler, your calls to exp() on an Intel chip go to the SSE2
vectorised version of exp(), while your calls to exp() on an AMD chip go
to the 'Assume nothing and only use 386 instructions' version of exp().
So, if your program spends most of its time in its own code, -xW will do
just fine on either an Intel or an AMD chip. If you spend a lot of time
doing sin(), cos(), exp(), or anything else in libm, then patching the
compiler to remove the 'Is it an Intel chip?' test can give you big
performance increases.
In a 64-bit environment the picture is slightly different, as the 'base
level' instruction set assumed by the compiler is SSE2 (as opposed to
386 for the 32-bit compiler). As a result, even if your chip fails the
'is it made by Intel' test you'll still get SSE2 performance out of it,
and since SSE3 and so on provide only marginal performance benefits
patching the compiler to remove the check doesn't make much difference
usually.
--
Mark Mackey http://www.swallowtail.org/
code code code code code code code code code code code code code bug code co
de code code code bug code code code code code code code code code code code
code code code code code code code code code code code code code code code c
| |
|
| On Mar 14, 8:46 am, Tim Prince <timothypri...@sbcglobal.net> wrote:
>
>
> Why guess about this? ifort for x86-64 supports the combination -xWP,
> which would execute plain SSE2 codes on Opteron, and SSE3 code on recent
> Intel processors.
> The thread has become long and digressive, but I didn't yet see any
> quotation of actual Fortran generated SSE3 code which is claimed to
> improve performance over SSE2.
But below are run times, SSE2 vs SSE3, averaged across 5 runs.
compiler options: -O3 -axW
Ave. Time: 0.628 sec.
compiler options: -O3 -msse3 (patched)
Ave. Time: 0.568 sec.
compiler options: -O3 -msse3 (unpatched)
Fatal Error: This program was not built to run on the processor in
your system.
The allowed processors are: Intel Pentium 4 and compatible Intel
processors with Streaming SIMD Extensions 3 (SSE3) instruction
support.
Results with the option -O3 -xWP are the same as with -msse3 (patched
and unpatched)
Numerical results of all runs are the same.
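
For reference, the relative gain implied by these averages can be worked out
with a few lines of Python. This only does arithmetic on the numbers already
quoted in this thread; the function name is made up for the example.

```python
def percent_reduction(before, after):
    """Percentage by which the run time drops going from `before` to `after`."""
    return (before - after) / before * 100.0


# Average run times quoted in this thread, in seconds.
reported = {
    "-O3 -axW vs -O3 -msse3 (patched)": (0.628, 0.568),
    "original post (3.2 s to 2.8 s)": (3.2, 2.8),
}

for label, (before, after) in reported.items():
    print(f"{label}: {percent_reduction(before, after):.1f}% less run time")
```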
jmj
| |
| Tim Prince 2007-03-14, 7:12 pm |
| Mark Mackey wrote:
> In article <45F7C744.2060302@isise.rl.ac.uk>,
> Keith Refson <kr@isise.rl.ac.uk> wrote:
>
> Nearly correct, but not quite :). Assuming we're talking 32-bit here,
> -xW makes the compiler generate code which assumes that SSE2 is
> available, so if using SSE2 instructions would result in faster code in
> the compiler's opinion it will do so. So yes, -xW makes the main body of
> code in your program use SSE2 even on AMD chips.
Sorry, Keith is more nearly correct. Even on 32-bit, the -xW option
generates only the SSE2 code, no run-time check. It will fail "illegal
instruction" if a run is attempted on a machine which doesn't support
all those instructions.
Keith may have been thinking of -axP "use SSE3 on Intel machine which supports
it, otherwise no SSE at all" and -axW "use SSE2 on Intel P4 or newer,
otherwise no SSE." Those options aren't likely to be useful if you will
never run on a machine without full SSE2 support.
> So, if your program spends most of its time in its own code, -xW will do
> just fine on either an Intel or an AMD chip. If you spend a lot of time
> doing sin(), cos(), exp(), or anything else in libm, then patching the
> compiler to remove the 'Is it an Intel chip?' test can give you big
> performance increases.
Unlikely. If the compiler vectorizes one of those math functions, it
will use the SSE2 short vector library, and you should get at least as
much performance with an -xW compilation, as with -xP or -xT with the
cpu check disabled. -axW would generate multiple code paths, with the
vector library called only when running on the Intel machine.
>
> In a 64-bit environment the picture is slightly different, as the 'base
> level' instruction set assumed by the compiler is SSE2 (as opposed to
> 386 for the 32-bit compiler). As a result, even if your chip fails the
> 'is it made by Intel' test you'll still get SSE2 performance out of it,
> and since SSE3 and so on provide only marginal performance benefits
> patching the compiler to remove the check doesn't make much difference
> usually.
>
This refers to code compiled for 64-bit mode. 32-bit code runs the same
on either a 32- or 64-bit OS. In the 64-bit compiler, there are ways to
select non-vector SSE2 code for Opteron but vector code for Intel.
Again, the standard advice is to use -xW and get the same vector code
(where possible) for all machines.
In my usage, the main reason for selecting non-vector code is for
certain types of operations where the vector code increases compile time
and generated code size without improving run time. This tends to be
code which could have been written OK in C. I hope some readers share
the opinion that Fortran is viable even though C could have been used.
Of course, I express only my personal opinion here.
| |
|
| On Mar 14, 9:00 am, Mark Mackey <m...@chiark.greenend.org.uk> wrote:
> In article <45F7C744.2060...@isise.rl.ac.uk>,
> Keith Refson <k...@isise.rl.ac.uk> wrote:
>
>
>
>
> Nearly correct, but not quite :). Assuming we're talking 32-bit here,
> -xW makes the compiler generate code which assumes that SSE2 is
> available, so if using SSE2 instructions would result in faster code in
> the compiler's opinion it will do so. So yes, -xW makes the main body of
> code in your program use SSE2 even on AMD chips.
>
> However, and it's a big however, the math libraries linked in by the
> compiler were all effectively compiled with -axKWNBP: they contain
> multiple different versions of the same routine and choose which to use
> based on runtime CPU capability detection. As a result, with the
> unpatched compiler, your calls to exp() on an Intel chip go to the SSE2
> vectorised version of exp(), while your calls to exp() on an AMD chip go
> to the 'Assume nothing and only use 386 instructions' version of exp().
>
> So, if your program spends most of its time in its own code, -xW will do
> just fine on either an Intel or an AMD chip. If you spend a lot of time
> doing sin(), cos(), exp(), or anything else in libm, then patching the
> compiler to remove the 'Is it an Intel chip?' test can give you big
> performance increases.
>
> In a 64-bit environment the picture is slightly different, as the 'base
> level' instruction set assumed by the compiler is SSE2 (as opposed to
> 386 for the 32-bit compiler). As a result, even if your chip fails the
> 'is it made by Intel' test you'll still get SSE2 performance out of it,
> and since SSE3 and so on provide only marginal performance benefits
> patching the compiler to remove the check doesn't make much difference
> usually.
>
> --
> Mark Mackey http://www.swallowtail.org/
> code code code code code code code code code code code code code bug code co
> de code code code bug code code code code code code code code code code code
> code code code code code code code code code code code code code code code c
I have been trying to surf to www.swallowtail.org for several days
without success. Is the site down?
Thanks,
Peter
| |
|
|
| Mark Mackey 2007-03-16, 7:07 pm |
| In article <4EVJh.4066$tv6.1763@newssvr19.news.prodigy.net>,
Tim Prince <timothyprince@sbcglobal.net> wrote:
>
>Unlikely. If the compiler vectorizes one of those math functions, it
>will use the SSE2 short vector library, and you should get at least as
>much performance with an -xW compilation, as with -xP or -xT with the
>cpu check disabled. -axW would generate multiple code paths, with the
>vector library called only when running on the Intel machine.
Not true. If you compile with -xW, then the compiler vectorises calls to
e.g. expf() via the math library function vmlsExp4(). vmlsExp4() calls
get_cpu_indicator() and calls vmlsExp4.A() (the generic effectively
non-vectorised 386 version), vmlsExp4.I() (the SSE version) or
vmlsExp4.L() (the SSE2 version) depending on the return value from
get_cpu_indicator(). Disassemble one of your binaries and check for
yourself! The code produced with -xW never calls vmlsExp4.L() directly:
it always goes indirectly through vmlsExp4() and thus via the CPUID
check.
If you're running an AMD chip, then get_cpu_indicator() returns 'This is
a 386 at best', so you get the (effectively non-vectorised 386 code)
vmlsExp4.A() rather than the vectorised call. The difference is hundreds
of cycles.
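
The dispatch behaviour described above can be modelled in a few lines of
Python. This is only a toy model of what the post describes, not Intel's
library code: the function and variant names are invented for illustration,
and the only real-world values are the CPUID vendor strings "GenuineIntel"
and "AuthenticAMD".

```python
GENERIC, SSE, SSE2 = "generic", "sse", "sse2"

# Hypothetical stand-ins for the per-instruction-set library variants.
VARIANTS = {GENERIC: "exp4_generic", SSE: "exp4_sse", SSE2: "exp4_sse2"}


def best_supported(features):
    """Pick the best code path from the CPU's reported feature flags."""
    if "sse2" in features:
        return SSE2
    if "sse" in features:
        return SSE
    return GENERIC


def vendor_gated_indicator(vendor, features):
    """Model of the described check: non-Intel CPUs get the generic path."""
    if vendor != "GenuineIntel":
        return GENERIC
    return best_supported(features)


def feature_based_indicator(vendor, features):
    """Model of a check that ignores the vendor and trusts the feature flags."""
    return best_supported(features)


def dispatch_exp4(vendor, features, indicator):
    """Return the name of the variant a dispatcher would call."""
    return VARIANTS[indicator(vendor, features)]


if __name__ == "__main__":
    opteron = ("AuthenticAMD", {"sse", "sse2"})
    print(dispatch_exp4(*opteron, vendor_gated_indicator))
    print(dispatch_exp4(*opteron, feature_based_indicator))
```

In this model the patch discussed in the thread corresponds to swapping the
vendor-gated indicator for the feature-based one; the calling code stays the
same.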
If you look at the benchmarks in
http://www.swallowtail.org/naughty-intel.html
you can see that compiling with -xW patched is up to 10% faster on our
code than compiling with -xW unpatched. Note that that's -x, not -ax!
>Again, the standard advice is to use -xW and get the same vector code
>(where possible) for all machines.
Certainly good advice, but not always possible if you want to run the
same binary over multiple different architectures.
--
Mark Mackey http://www.swallowtail.org/
code code code code code code code code code code code code code bug code co
de code code code bug code code code code code code code code code code code
code code code code code code code code code code code code code code code c
| |
| Mark Mackey 2007-03-16, 7:07 pm |
| In article <1173935043.130395.137520@n76g2000hsh.googlegroups.com>,
Peter <petersamsimon2@hotmail.com> wrote:
>I have been trying to surf to www.swallowtail.org for several days
>without success. Is the site down?
It's up and running as far as I'm aware.
--
Mark Mackey http://www.swallowtail.org/
code code code code code code code code code code code code code bug code co
de code code code bug code code code code code code code code code code code
code code code code code code code code code code code code code code code c
| |
|
| On Mar 16, 6:32 am, Mark Mackey <m...@chiark.greenend.org.uk> wrote:
> In article <1173935043.130395.137...@n76g2000hsh.googlegroups.com>,
>
> Peter <petersamsim...@hotmail.com> wrote:
>
> It's up and running as far as I'm aware.
>
> --
> Mark Mackey http://www.swallowtail.org/
> code code code code code code code code code code code code code bug code co
> de code code code bug code code code code code code code code code code code
> code code code code code code code code code code code code code code code c
Thanks, guys, it seems to be working fine now. I just downloaded the
Intel executable patch and am looking forward to better performance on
our Opterons.
--Peter
| |
| pantarei 2007-03-16, 7:07 pm |
| Mark Mackey wrote:
>
> Tim Prince wrote:
>
> Not true. If you compile with -xW, then the compiler vectorises calls to
> e.g. expf() via the math library function vmlsExp4(). vmlsExp4() calls
> get_cpu_indicator() and...
> If you're running an AMD chip, then get_cpu_indicator() returns 'This is
> a 386 at best', so you get the (effectively non-vectorised 386 code)
> vmlsExp4.A() rather than the vectorised call. The difference is hundreds
> of cycles.
Beware you're getting a "party line" from an Intel mouthpiece
indoctrinated to think that *cheating* rather than competing on merits
is a way to go. Actually, that's relatively minor considering Intel
"switched off" CVF compiler upon realizing their "successor" was DOA.
| |
| Paul van Delst 2007-03-16, 7:07 pm |
| pantarei wrote:
> Mark Mackey wrote:
>
> Beware you're getting a "party line" from an Intel mouthpiece
> indoctrinated to think that *cheating* rather than competing on merits
> is a way to go. Actually, that's relatively minor considering Intel
> "switched off" CVF compiler upon realizing their "successor" was DOA.
I dunno if it's just me or what, but my kill file is getting really long.....
cheers,
paulv
--
Paul van Delst Ride lots.
CIMSS @ NOAA/NCEP/EMC Eddy Merckx
| |
| Beliavsky 2007-03-16, 7:07 pm |
| On Mar 16, 4:31 pm, Paul van Delst <Paul.vanDe...@noaa.gov> wrote:
> I dunno if it's just me or what, but my kill file is getting really long.....
Pantarei sounds much like the old "bv" (B. Voh) and "kia", who were
probably also in your kill files.
| |
| Rich Townsend 2007-03-16, 7:07 pm |
| Beliavsky wrote:
> On Mar 16, 4:31 pm, Paul van Delst <Paul.vanDe...@noaa.gov> wrote:
>
>
> Pantarei sounds much like the old "bv" (B. Voh) and "kia", who were
> probably also in your kill files.
>
Ah yes, I thought I recognized the stench of bitterness and disappointment.
| |
|
|
| Jan Vorbrüggen 2007-03-22, 7:04 pm |
| > I will sue nobody,
And Intel should rely on your promises?
How do you think the urban-legendary poodle got into the operating
instructions for microwave ovens?
> I have no conspiracy theory to offer you. Just an observation on a
> product that does not function on compatible hardware,
What compatible hardware? Only AMD might know about such compatibility, and
even that is not assured. And AMD is not a party to your purchase contract
with Intel.
> We are
> displeased with the Intel Compiler's performance on our new hardware
> and will move on to another compiler for our AMD system.
That, of course, you are free to do.
Jan
|
|
|
|
|