AMD reinvents the x86
| aleph0 2007-02-07, 7:56 am |
| AMD reinvents the x86
Tom Yager, InfoWorld
07/02/2007 10:33:48
//
Each of Barcelona's four cores incorporates a new vector math unit
referred to as SSE128 (128-bit streaming single-instruction-multiple-
data extensions). I am aware that you only do quantum physics on
weekends, but the potential for hardcore IT tasks such as encryption,
compression, real-time analysis of high volumes of streaming business
transactions, and wire-speed packet analysis is also the stuff of
science fiction. Barcelona gives floating point operations their own
schedulers (checkout lanes) and runs them twice as fast as 64-bit SSE
did. AMD claims that Barcelona's per-core floating point performance
is more than 80 percent faster than the present Opteron. Benchmark
that. And separating integer and floating-point schedulers also
accelerates this thing called virtualization, which you may notice is
a recurring theme for Barcelona.
...
...
//
http://www.pcworld.idg.com.au/index...795;fp;2;fpid;3
There's more info about energy-efficiency and Virtual Machine
efficiency..
// The CPU core flips page tables automatically and transparently.
This is another feature that's implemented for each core. //
AFAIK, production (delivery) is early H2/07.
Will any APL vendors be able to utilize any of these features soon
(e.g. the vector math unit)?
| On Feb 7, 6:17 am, "aleph0" <apl68...@tiscali.co.uk> wrote:
> AMD reinvents the x86
> Tom Yager, InfoWorld
> 07/02/2007 10:33:48
>
> //
> Each of Barcelona's four cores incorporates a new vector math unit
> referred to as SSE128 (128-bit streaming single-instruction-multiple-
> data extensions). I am aware that you only do quantum physics on
> weekends, but the potential for hardcore IT tasks such as encryption,
> compression, real-time analysis of high volumes of streaming business
> transactions, and wire-speed packet analysis is also the stuff of
> science fiction. Barcelona gives floating point operations their own
> schedulers (checkout lanes) and runs them twice as fast as 64-bit SSE
> did. AMD claims that Barcelona's per-core floating point performance
> is more than 80 percent faster than the present Opteron. Benchmark
> that. And separating integer and floating-point schedulers also
> accelerates this thing called virtualization, which you may notice is
> a recurring theme for Barcelona.
> ..
> ..
> //
> http://www.pcworld.idg.com.au/index.php/id;1086223795;fp;2;fpid;3
>
> There's more info about energy-efficiency and Virtual Machine
> efficiency..
> // The CPU core flips page tables automatically and transparently.
> This is another feature that's implemented for each core. //
>
> AFAIK, production (delivery) is early H2/07.
>
> Will any APL vendors be able to utilize any of these features soon
> (e.g. the vector math unit)?

Let's break the question into two parts:

Q1: Will any APL vendor be able to utilize any of these features in their interpreter?
Q2: Will that make any difference in performance?

A1: Perhaps. I don't see any detailed information about extended SSE available, but if I recall (never having written for it, so I am talking through my non-existent hat), SSE does not do 64-bit floating point. It may do 32-bit floating point, which is OK for some calculations, but I'd be loath to do something like matrix divide in 32-bit, especially if I cared about the result being close to right. The stuff I read about was mostly 8-bit or 16-bit arithmetic to improve the performance of screen-related stuff for games, etc.
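
To put a rough number on the 32-bit worry, here is a minimal Python sketch. It is not from the post, and the helper names `to_float32` and `running_sum` are made up for illustration. It emulates single precision by rounding every intermediate result to 32 bits with the standard `struct` module, then compares a long running sum against the same sum kept in 64-bit. It says nothing about matrix divide specifically, only about how quickly rounding error can pile up in 32-bit. Worth noting: SSE2, the successor to the original SSE, does include 64-bit double-precision operations, so the concern applies only when one is confined to single precision.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def running_sum(value, count, single_precision):
    """Add `value` to an accumulator `count` times.

    With single_precision=True, the addend and every partial sum are
    rounded to 32 bits, which mimics accumulating in a 32-bit register.
    """
    step = to_float32(value) if single_precision else value
    total = 0.0
    for _ in range(count):
        total += step
        if single_precision:
            total = to_float32(total)
    return total


if __name__ == "__main__":
    exact = 10000.0  # 0.1 added 100000 times, in exact arithmetic
    err32 = abs(running_sum(0.1, 100000, True) - exact)
    err64 = abs(running_sum(0.1, 100000, False) - exact)
    print("32-bit error:", err32)
    print("64-bit error:", err64)
```

On this toy case the 32-bit error should come out many orders of magnitude larger than the 64-bit one. A real matrix divide would behave differently in detail, but the direction of the effect is the point.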

A2: If it does 32-bit integer arithmetic, then you might see SOME improvement in large integer array ops. Ditto if it does do 64-bit floating arithmetic.

I have my doubts, since most of the time is still going to be spent doing syntax analysis and conformability checking, as well as dragging array-valued temps in and out of main memory.

Compiled code eliminates most of this overhead, but I suspect the potential for improvement there is also going to be marginal.

I have not thought about possible uses of SSE to speed up operations on Boolean arrays, stored one-bit-per-element, as is done on most APL interpreters (but not in J). Some primitives, such as population count (how many bits are 1 in this here byte?), could make operations such as +/boolean, compression, replicate, expand, and the like go a bit faster.

However, there are tricks for doing such operations in regular registers, so it may be that the benefit of SSE is low for these cases.
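
As a rough illustration of the "regular registers" point, the Python sketch below counts the 1 bits of a bit-packed Boolean vector in two ways: a 256-entry lookup table indexed by byte, and the well-known shift-and-mask (SWAR) population count on a 64-bit word. The names `pack_bits`, `plus_reduce_boolean`, and `popcount64` are hypothetical, and the most-significant-bit-first packing is an assumption; real interpreters may lay bits out differently.

```python
# Lookup table: POPCOUNT[b] is the number of 1 bits in byte value b.
POPCOUNT = bytes(bin(b).count("1") for b in range(256))

MASK64 = 0xFFFFFFFFFFFFFFFF


def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, 8 elements per byte.

    Assumed layout: first element in the most significant bit.
    Unused trailing bits are left as 0, so they never affect a count.
    """
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def plus_reduce_boolean(packed):
    """Equivalent of +/boolean on packed data: one table lookup per byte."""
    return sum(POPCOUNT[b] for b in packed)


def popcount64(x):
    """Count 1 bits in a 64-bit word using only shifts, masks, and adds."""
    x &= MASK64
    x = x - ((x >> 1) & 0x5555555555555555)                          # 2-bit sums
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)   # 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F                          # 8-bit sums
    # Multiplying adds all byte sums into the top byte; mask to 64 bits
    # because Python integers do not overflow.
    return ((x * 0x0101010101010101) & MASK64) >> 56
```

Either approach handles 8 or 64 Boolean elements per step without any vector unit, which is why the extra gain from SSE for a plain +/boolean may be modest.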

Generally, you will see better speedups from clever algorithms than from clever code. For example, a Boolean-float matrix product (summing subsets of stocks, perhaps) can run plenty quick if you reorder the FOR-loops and apply some simple algebra to the problem. Then, you end up with an inner product that (a) only examines each element from the left argument [the Boolean] once, so you avoid repeated fetch/type-convert overhead, and (b) permits a quick check of that element, which avoids doing an entire Scalar-Vector G and vector-F-reduce (for x F.G y) operation if the element is 0.

This approach resulted in SHARP APL PC [We're talking 5 MHz machines here.] running a huge matrix product of this sort in about a minute. The APL2 system, with a naive matrix product, or perhaps one that used the Vector Facility, on a LARGE [Read EXPENSIVE!] mainframe, ran for about 10 minutes (before the mainframe crashed).
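
The loop reordering described above can be sketched roughly as follows in Python, specialized to the +.× case. This is one reading of the description, not the actual SHARP APL code; the function names and the list-of-rows layout are assumptions made for illustration. The naive version is included only as a reference to check against.

```python
def naive_product(b, r):
    """Textbook inner product: every Boolean element is fetched once per
    result column and multiplied, even when it is 0."""
    cols = len(r[0]) if r else 0
    return [
        [sum(b_row[k] * r[k][j] for k in range(len(r))) for j in range(cols)]
        for b_row in b
    ]


def bool_float_product(b, r):
    """Reordered loops: examine each Boolean element exactly once.

    If the element is 0, skip the whole row operation. If it is 1, the
    multiply is unnecessary, so just add row k of r into the accumulator.
    """
    cols = len(r[0]) if r else 0
    result = []
    for b_row in b:
        acc = [0.0] * cols
        for k, flag in enumerate(b_row):
            if flag:  # the quick check on the Boolean element
                r_row = r[k]
                for j in range(cols):
                    acc[j] += r_row[j]
        result.append(acc)
    return result
```

For a sparse Boolean left argument, most iterations of the middle loop do nothing beyond the test, which is where the saving comes from.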

Clever algorithms have the advantage of not needing to be recoded to run on the hardware du jour. This lets implementors get on with useful work.

Bob