For Programmers: Free Programming Magazines  


Home > Archive > Fortran > March 2006 > status of quadruple precision arithmetic in g95 and gfortran?









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author status of quadruple precision arithmetic in g95 and gfortran?
Bart Vandewoestyne

2006-03-27, 7:02 pm

The g95 status page mentions that support for quad precision
arithmetic in g95 is `coming soon...'.

I can't find anything related to quadruple precision arithmetic
on the gfortran site.

Can somebody comment on the status of the quadruple precision
arithmetic support in both g95 and gfortran? I assume it isn't
available yet? Anybody who can tell me how far away we are from it?

Regards,
Bart

--
"Share what you know. Learn what you don't."
Tim Prince

2006-03-27, 7:02 pm

Bart Vandewoestyne wrote:
> The g95 status page mentions that support for quad precision
> arithmetic in g95 is `coming soon...'.
>
> I can't find anything related to quadruple precision arithmetic
> on the gfortran site.
>
> Can somebody comment on the status of the quadruple precision
> arithmetic support in both g95 and gfortran? I assume it isn't
> available yet? Anybody who can tell me how far away we are from it?
>

You may have to search the fortran@gcc.gnu.org list archives (to learn
about gfortran progress) yourself, as it's not clear what you want.
Surely, the question of quad arithmetic will always be target dependent.
gfortran presumably already supports it on a few targets where it's
readily available, but those aren't the most popular ones.
ia64 quad appears to depend mainly on someone doing the work on ia64.md,
where there is support for 80-bit but not 128-bit floating point. In
case you don't see the point, this is a language independent part of
gcc. I wouldn't bet on that changing, even with all the publicity about
financial commitments to gcc development.
Bart Vandewoestyne

2006-03-27, 7:02 pm

On 2006-03-27, Tim Prince <tprince@nospamcomputer.org> wrote:
>
> You may have to search the fortran@gcc.gnu.org list archives (to learn
> about gfortran progress) yourself, as it's not clear what you want.
> Surely, the question of quad arithmetic will always be target dependent.
> gfortran presumably already supports it on a few targets where it's
> readily available, but those aren't the most popular ones.


Sorry, you are right... i should specify more I guess...
If I say 'quadruple precision on i386 architectures', does that
make my question more clear then?

Actually, what I basically would like to know is when g95 will be
able to compute with the same maximum precision as for example
ifort 9.0 can do on my Linux i386 box. Suppose I use the
following numeric kinds:

integer, parameter, public :: sp = kind(1.0)
integer, parameter, public :: dp = selected_real_kind(2*precision(1.0_sp))
integer, parameter, public :: qp_preferred = &
selected_real_kind(2*precision(1.0_dp))
integer, parameter, public :: qp = (1+sign(1,qp_preferred))/2*qp_preferred+ &
(1-sign(1,qp_preferred))/2*dp

then qp is available and a higher precision then dp if i compile with
ifort 9.0 on my Debian GNU/Linux box on i386. If I compile this with g95
then qp is the same numeric kind as dp.

Regards,
Bart

--
"Share what you know. Learn what you don't."
Joost

2006-03-27, 7:02 pm

Bart,

right now, g95 doesn't support real*16, but does support real*10 on the
appropriate hardware. So I think you could use
qp_preferred = selected_real_kind(1+precision(1.0_dp))
to get real*10, but nothing more than that. An important issue is of
course that if the hardware doesn't support real*16 it is not only very
slow to compute with real*16 quantities, but also a lot of work to
implement all operations required for real*16 calculations.

Joost

Bernhard Enders

2006-03-27, 7:02 pm

Computing with quad precision is really slow by the fact that it is
software implemented, at least with Intel Fortran Compiler. It is
software implemented because the is no system (is there?) that supports
128 bits floating point arithmetics. In the case of extended precision
(80 bits fp arithmetics), several systems has hardware support for
this, that's true for IA32, EM64T, AMD64, etc. So you can expect less
performance hit from extended precision than from quad precision, when
compared to double precision.

Bernhard.

glen herrmannsfeldt

2006-03-27, 7:02 pm

Bernhard Enders <bgeneto@gmail.com> wrote:
> Computing with quad precision is really slow by the fact that it is
> software implemented, at least with Intel Fortran Compiler. It is
> software implemented because the is no system (is there?) that supports
> 128 bits floating point arithmetics.


IBM S/370 supports 128 but except for divide, which is done
in software. In ESA/390 and later divide is done in hardware.

The 360/85 was the original machine supporting this format.

VAX has H-float, but it is done through software emulation
on most models. I was told that the VAX 11/730 supports it
(in microcode), the slowest of the non-micro VAXs.

-- glen
Tim Prince

2006-03-27, 10:00 pm

Bernhard Enders wrote:
> Computing with quad precision is really slow by the fact that it is
> software implemented, at least with Intel Fortran Compiler. It is
> software implemented because the is no system (is there?) that supports
> 128 bits floating point arithmetics. In the case of extended precision
> (80 bits fp arithmetics), several systems has hardware support for
> this, that's true for IA32, EM64T, AMD64, etc. So you can expect less
> performance hit from extended precision than from quad precision, when
> compared to double precision.
>

Several architectures include hardware support for 128-bit floating
point by combinations of instructions. Among those still in production
are IA64 and IBM Power. The former has both 80-bit and 128-bit
IEEE-style, generally at most one of which is implemented with a single
set of options. The latter has a non-IEEE compliant version of somewhat
less precision.
The 80-bit arithmetic might be implemented with 128-bit storage (48 bits
unused), due to the alignment requirements for efficiency.
Bernhard Enders

2006-03-28, 7:02 pm

Thanks for your information concerning 64 bits architecture. This is a
bit OT. I would like to know where can I read more technical
information about the "alignment requirements for efficiency", i.e.,
the fact that 80 bits calculations are performed using 128 bits
registers (I have heard about this but can't remember where)? And why
in the world we don't have 128 bits arithmetics on 'popular'
architectures such as ia32 or on AMD64 if there exist 128 bits
registers on these architectures? Is it so difficult (or costly) to
implement 128 bits operations or there are no interest in doing this?
Just for information, it follows an exerpt from AMD64 architecture
manual vol. 1 showing that it has 16x128bits registers with media
instructions (why no fp support at all??):

"The AMD64 architecture provides three floating-point instruction
subsets, using three distinct register sets:
- 128-Bit Media Instructions support 32-bit single-precision and 64-bit
double-precision floating-point operations, in addition to integer
operations."

Best regards,

Bernhard.

Tim Prince

2006-03-28, 7:02 pm

Bernhard Enders wrote:
> Thanks for your information concerning 64 bits architecture. This is a
> bit OT. I would like to know where can I read more technical
> information about the "alignment requirements for efficiency", i.e.,
> the fact that 80 bits calculations are performed using 128 bits
> registers (I have heard about this but can't remember where)? And why
> in the world we don't have 128 bits arithmetics on 'popular'
> architectures such as ia32 or on AMD64 if there exist 128 bits
> registers on these architectures? Is it so difficult (or costly) to
> implement 128 bits operations or there are no interest in doing this?
> Just for information, it follows an exerpt from AMD64 architecture
> manual vol. 1 showing that it has 16x128bits registers with media
> instructions (why no fp support at all??):
>
> "The AMD64 architecture provides three floating-point instruction
> subsets, using three distinct register sets:
> - 128-Bit Media Instructions support 32-bit single-precision and 64-bit
> double-precision floating-point operations, in addition to integer
> operations."
>

The 80-bit fp calculations do use the 80-bit x87 registers. You may
find something about the recommendation for 128-bit data alignment in
the CPU manufacturers' software guides. A packed array of 80-bit data
would involve frequent multiple accesses to cache and memory, including
data which straddle cache line boundaries. Few Fortran compilers support
this format, in spite of it being supported in nearly all C compilers
for linux (but few for Windows).
Most Fortran compilers now do support the 128-bit parallel mode with
auto-vectorization, but that is a big diversion from the original topic
of quad precision. No one calls the 4 simultaneous single (or paired
double) precision operations "quad precision". The details of
implementation vary with hardware type; full width parallel operation on
Intel desktop CPUs, paired 64-bit width floating point units on AMD
desktops, splitting into a pair of closely pipelined 64-bit operations
on pentium-m, all with the same binary software code.
Herman D. Knoble

2006-03-28, 7:02 pm

You are right about the cost of QP. Skip Knoble

Program Qtime
! Sample Program to illustrate DP versus QP compute times:
! Intel Fortran V9.0-5748 on AMD Opteron 852 with O2.

integer, parameter :: QDP = selected_real_kind(30)
real(kind=QDP) :: x, sum
real :: T1,T2, Seconds
integer :: i, pulses, PPS

x=1.5_QDP
sum=0.0_QDP
CALL SYSTEM_CLOCK(COUNT=Pulses,COUNT_RATE=PPS
)
T1 = REAL(Pulses,QDP)/PPS

Do I=1,100000000
sum=sum+I*x
end do

CALL SYSTEM_CLOCK(COUNT=Pulses,COUNT_RATE=PPS
)
T2 = REAL(Pulses,QDP)/PPS
Seconds=T2-T1
print *, " Time in seconds: ",Seconds
print *, "QDP=",QDP
print *, "Sum=",Sum

end Program Qtime


Output for: integer, parameter :: QDP = selected_real_kind(15)

Time in seconds: 0.5800781
QDP= 8
Sum= 7.500000080627340E+015

Output for: integer, parameter :: QDP = selected_real_kind(30)
Time in seconds: 5.750000
QDP= 16
Sum= 7500000075000000.00000000000000000

On Tue, 28 Mar 2006 14:20:04 GMT, Tim Prince <tprince@nospamcomputer.org> wrote:

-|Bernhard Enders wrote:
-|> Thanks for your information concerning 64 bits architecture. This is a
-|> bit OT. I would like to know where can I read more technical
-|> information about the "alignment requirements for efficiency", i.e.,
-|> the fact that 80 bits calculations are performed using 128 bits
-|> registers (I have heard about this but can't remember where)? And why
-|> in the world we don't have 128 bits arithmetics on 'popular'
-|> architectures such as ia32 or on AMD64 if there exist 128 bits
-|> registers on these architectures? Is it so difficult (or costly) to
-|> implement 128 bits operations or there are no interest in doing this?
-|> Just for information, it follows an exerpt from AMD64 architecture
-|> manual vol. 1 showing that it has 16x128bits registers with media
-|> instructions (why no fp support at all??):
-|>
-|> "The AMD64 architecture provides three floating-point instruction
-|> subsets, using three distinct register sets:
-|> - 128-Bit Media Instructions support 32-bit single-precision and 64-bit
-|> double-precision floating-point operations, in addition to integer
-|> operations."
-|>
-|The 80-bit fp calculations do use the 80-bit x87 registers. You may
-|find something about the recommendation for 128-bit data alignment in
-|the CPU manufacturers' software guides. A packed array of 80-bit data
-|would involve frequent multiple accesses to cache and memory, including
-|data which straddle cache line boundaries. Few Fortran compilers support
-|this format, in spite of it being supported in nearly all C compilers
-|for linux (but few for Windows).
-|Most Fortran compilers now do support the 128-bit parallel mode with
-|auto-vectorization, but that is a big diversion from the original topic
-|of quad precision. No one calls the 4 simultaneous single (or paired
-|double) precision operations "quad precision". The details of
-|implementation vary with hardware type; full width parallel operation on
-|Intel desktop CPUs, paired 64-bit width floating point units on AMD
-|desktops, splitting into a pair of closely pipelined 64-bit operations
-|on pentium-m, all with the same binary software code.

Ian Gay

2006-03-28, 7:02 pm

Herman D. Knoble <SkipKnobleLESS@SPAMpsu.DOT.edu> wrote in
news:4dni22t33u3i6u6ctruig5iltu9he5d6im@
4ax.com:

> You are right about the cost of QP. Skip Knoble
>
> Program Qtime
> ! Sample Program to illustrate DP versus QP compute times:
> ! Intel Fortran V9.0-5748 on AMD Opteron 852 with O2.
>
> integer, parameter :: QDP = selected_real_kind(30)
> real(kind=QDP) :: x, sum
> real :: T1,T2, Seconds
> integer :: i, pulses, PPS
>
> x=1.5_QDP
> sum=0.0_QDP
> CALL SYSTEM_CLOCK(COUNT=Pulses,COUNT_RATE=PPS
)
> T1 = REAL(Pulses,QDP)/PPS
>
> Do I=1,100000000
> sum=sum+I*x
> end do
>
> CALL SYSTEM_CLOCK(COUNT=Pulses,COUNT_RATE=PPS
)
> T2 = REAL(Pulses,QDP)/PPS
> Seconds=T2-T1
> print *, " Time in seconds: ",Seconds
> print *, "QDP=",QDP
> print *, "Sum=",Sum
>
> end Program Qtime
>
>
> Output for: integer, parameter :: QDP = selected_real_kind(15)
>
> Time in seconds: 0.5800781
> QDP= 8
> Sum= 7.500000080627340E+015
>


Why doesn't double precision produce a better result here?


> Output for: integer, parameter :: QDP = selected_real_kind(30)
> Time in seconds: 5.750000
> QDP= 16
> Sum= 7500000075000000.00000000000000000
>
> On Tue, 28 Mar 2006 14:20:04 GMT, Tim Prince
> <tprince@nospamcomputer.org> wrote:
>
> -|Bernhard Enders wrote:
>

<snip>

--
*********** To reply by e-mail, make w single in address
**************
Richard E Maine

2006-03-28, 7:02 pm

Ian Gay <gay@sfuu.ca> wrote:

> Herman D. Knoble <SkipKnobleLESS@SPAMpsu.DOT.edu> wrote in
> news:4dni22t33u3i6u6ctruig5iltu9he5d6im@
4ax.com:

....
....[color=darkred]
>
> Why doesn't double precision produce a better result here?


Why should it produce a better result? If my quick check is correct (and
I might have misssed because it isn't very far off and my check was
pretty hasty) this goes past the limits for which IEEE double gives
perfect results. Therefore, you'll have round-off errors in the
addition. And since there are quite a lot of additions (100 million of
them), those round-off errors can work their way up by quite a few bits
from the low order one.

Are you perhaps assuming that just because double has about 15 digits of
precision, that you can count on roundoff always staying down in the
bottom few bits? If so, I suggest reading up on the subject of numerical
instability.

Sounds to me like just a typical case of life in the real world of
floatting point arithmetic.

--
Richard Maine | Good judgment comes from experience;
email: my first.last at org.domain| experience comes from bad judgment.
org: nasa, domain: gov | -- Mark Twain
Herman D. Knoble

2006-03-28, 7:02 pm

On Tue, 28 Mar 2006 09:54:09 -0800, nospam@see.signature (Richard E Maine) wrote:

-|Ian Gay <gay@sfuu.ca> wrote:
-|
-|> Herman D. Knoble <SkipKnobleLESS@SPAMpsu.DOT.edu> wrote in
-|> news:4dni22t33u3i6u6ctruig5iltu9he5d6im@
4ax.com:
-|...
-|> > Do I=1,100000000
-|> > sum=sum+I*x
-|> > end do
-|...
-|> > Sum= 7.500000080627340E+015
-|>
-|> Why doesn't double precision produce a better result here?
-|
-|Why should it produce a better result? If my quick check is correct (and
-|I might have misssed because it isn't very far off and my check was
-|pretty hasty) this goes past the limits for which IEEE double gives
-|perfect results. Therefore, you'll have round-off errors in the
-|addition. And since there are quite a lot of additions (100 million of
-|them), those round-off errors can work their way up by quite a few bits
-|from the low order one.

Richard: As always, thank you.

Ian, here are some additional notes and code that you may wish to
check out that may help you with this and other more complex cases.

First, I completely agree with Richard's analysis, namely that such sums
are not the best numerical computations. But, a loop like

sum=0
delta=(a fraction not representble in binary, like .1 for example)
do i=1,n
sum=sum+delta !method 1
end do
will accumulate the representation error (of delta).
From long time experience we know that representation
error (of decimal fractions) can also be magnified by subtracting
two nearly equal quantities; the most significant digits cancel
during the subtraction, where the least significant digits
(where variaous numerical errors can be) become the most
significant digits. The example (program fuzztest)
http://ftp.cac.psu.edu/pub/ger/fortran/hdk/eps.f90
and
http://ftp.cac.psu.edu/pub/ger/fortran/hdk/example1.txt
illustrate this somewhat dramatically.

The loop
do i=1,n
sum=sum+I*delta ! method 2
end do
may have round off error but at least will not maximize the
effect of (accumulate) representation error as the above method 1
loop does.

There's also a better way to sum as Kahn (and Giles) point out at:
http://ftp.cac.psu.edu/pub/ger/fortran/hdk/KahnSum.f90


-|
-|Are you perhaps assuming that just because double has about 15 digits of
-|precision, that you can count on roundoff always staying down in the
-|bottom few bits? If so, I suggest reading up on the subject of numerical
-|instability.
-|
-|Sounds to me like just a typical case of life in the real world of
-|floatting point arithmetic.

I agree with Richard here also. I included displaying the sum realizing that
roundoff would likely happen.

The real purpose for the posting was to illustrate the order of
magnitude difference in computation time between Double and Quadruple
precision, using the Intel compiler which supports Real*16 (and
Complex*32). I'd guess that a compiler can do a quad software implementa tion
faster than using Quad.f90: http://users.bigpond.net.au/amiller/quad.html
which uses Fortran derived Type (quad).

All the best.
Skip Knoble

Ian Gay

2006-03-28, 10:00 pm

nospam@see.signature (Richard E Maine) wrote in
news:1hcwlu7.17lxrqj1vfm668N%nospam@see.signature:

> Ian Gay <gay@sfuu.ca> wrote:
>
> ...
> ...
>
> Why should it produce a better result? If my quick check is
> correct (and I might have misssed because it isn't very far off
> and my check was pretty hasty) this goes past the limits for which
> IEEE double gives perfect results. Therefore, you'll have
> round-off errors in the addition. And since there are quite a lot
> of additions (100 million of them), those round-off errors can
> work their way up by quite a few bits from the low order one.
>


Recall that x had the value 1.5.
I was thinking that since 1.5 is exactly representable in binary
floating point, (and assuming a good compiler would represent it
exactly :-)) that the errors would be much smaller than this. On
further thought, I see that this is a (carefully constructed?)
pathological example of errors from denormazlization of the smaller
argument to a floating add.
For amusement: (qt is the op's program set for double precision)
(Windows xp on athlon)

C:\source\test>g95 -o qt qt.f95

C:\source\test>qt
Time in seconds: 0.5781
QDP= 8
Sum= 7.5000000806273400D+15

C:\source\test>g95 -O1 -o qt qt.f95

C:\source\test>qt
Time in seconds: 0.2187
QDP= 8
Sum= 7.5000000750000000D+15

If you don't specify optimization, g95 loads and stores SUM each
cycle. If you optimize, it's kept in the 80-bit floating point stack,
so you get the advantage of the longer accumulator, as well as the
speedup. (Unless you're foolish enough to force sse2 evaluation).

> Are you perhaps assuming that just because double has about 15
> digits of precision, that you can count on roundoff always staying
> down in the bottom few bits? If so, I suggest reading up on the
> subject of numerical instability.
>
> Sounds to me like just a typical case of life in the real world of
> floatting point arithmetic.
>




--
*********** To reply by e-mail, make w single in address
**************
Richard Maine

2006-03-28, 10:00 pm

Ian Gay <gay@sfuu.ca> wrote:

> nospam@see.signature (Richard E Maine) wrote in
> news:1hcwlu7.17lxrqj1vfm668N%nospam@see.signature:


[color=darkred]
> Recall that x had the value 1.5.
> I was thinking that since 1.5 is exactly representable in binary...


Just because 1.5 is exactly representable in IEEE double does not mean
that all the numbers involved in the calculation are. Specifically,
these numbers get big enough that sum is not exactly representable. But
it sounds like you now see that.

> On
> further thought, I see that this is a (carefully constructed?)
> pathological example of errors from denormazlization of the smaller
> argument to a floating add.


While I think you' have the facts right, I'm not so sure about your
evaluation of it as being carefully constructed or pathological. I'd say
it was much more like typical of floatting point roundoff issues.

--
Richard Maine | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle | -- Mark Twain
Ian Gay

2006-03-28, 10:00 pm

nospam@see.signature (Richard Maine) wrote in
news:1hcx96g.g4x32u1396xjcN%nospam@see.signature:

>
> Just because 1.5 is exactly representable in IEEE double does not
> mean that all the numbers involved in the calculation are.
> Specifically, these numbers get big enough that sum is not exactly
> representable. But it sounds like you now see that.


What's amusing is that the exact result (~7.5x10^15) _is_
representable; the largest exactly-representable integer in d.p. is
2^53-1 (~9x10^15). But you still can't compute it in d.p., using that
particular loop.

--
*********** To reply by e-mail, make w single in address
**************
robert.corbett@sun.com

2006-03-28, 10:00 pm


Ian Gay wrote:
> nospam@see.signature (Richard Maine) wrote in
> news:1hcx96g.g4x32u1396xjcN%nospam@see.signature:
>
>
> What's amusing is that the exact result (~7.5x10^15) _is_
> representable; the largest exactly-representable integer in d.p. is
> 2^53-1 (~9x10^15). But you still can't compute it in d.p., using that
> particular loop.


For more information about the error analysis of floating-point
summation, see Nicholas Higham's book *Accuracy and
Stability of Numerical Algorithms.* Note that not only is 1.5
exactly representable, but the products I*x are all exactly
representable. The error is entirely in the summation.

Bob Corbett

Herman D. Knoble

2006-03-29, 8:01 am

Using Kahn's summation with Double Precision:

Program QtimeK
integer, parameter :: QDP = selected_real_kind(15)
real(kind=QDP) :: x, sum
real :: T1,T2, Seconds
integer :: i, pulses, PPS
real(kind=QDP) :: t, y, c, ks !kahan

c=0.0_QDP
ks=0.0_QDP
x=1.5_QDP
CALL SYSTEM_CLOCK(COUNT=Pulses,COUNT_RATE=PPS
)
T1 = REAL(Pulses,QDP)/PPS

Do I=1,100000000
! kahan sum
y = x*I - c
t = ks + y
c = (t-ks)-y
ks = t
end do
sum=ks

CALL SYSTEM_CLOCK(COUNT=Pulses,COUNT_RATE=PPS
)
T2 = REAL(Pulses,QDP)/PPS
Seconds=T2-T1
print *, " Time in seconds: ",Seconds
print *, "QDP=",QDP
print *, "Sum=",Sum

end Program QtimeK

The results are:
Time in seconds: 0.6953125
QDP= 8
Sum= 7.500000075000000E+015

Skip

On 28 Mar 2006 19:20:59 -0800, robert.corbett@sun.com wrote:

-|
-|Ian Gay wrote:
-|> nospam@see.signature (Richard Maine) wrote in
-|> news:1hcx96g.g4x32u1396xjcN%nospam@see.signature:
-|>
-|> >
-|> > Just because 1.5 is exactly representable in IEEE double does not
-|> > mean that all the numbers involved in the calculation are.
-|> > Specifically, these numbers get big enough that sum is not exactly
-|> > representable. But it sounds like you now see that.
-|>
-|> What's amusing is that the exact result (~7.5x10^15) _is_
-|> representable; the largest exactly-representable integer in d.p. is
-|> 2^53-1 (~9x10^15). But you still can't compute it in d.p., using that
-|> particular loop.
-|
-|For more information about the error analysis of floating-point
-|summation, see Nicholas Higham's book *Accuracy and
-|Stability of Numerical Algorithms.* Note that not only is 1.5
-|exactly representable, but the products I*x are all exactly
-|representable. The error is entirely in the summation.
-|
-|Bob Corbett

Bernhard Enders

2006-03-29, 7:02 pm

That's right, not only all the products 'I*x' are exactly representable
in binary but also is every sum like the rhs 'sum + I*x'. I'm
considering here that if two fp numbers, say A and B, are exactly
representable so is their sum, A + B. So where are the errors coming
from? Sorry, we don't have the book you mention in our library. Thanks,

Bernhard.

robert.corbett@sun.com wrote:
> Ian Gay wrote:
>
> For more information about the error analysis of floating-point
> summation, see Nicholas Higham's book *Accuracy and
> Stability of Numerical Algorithms.* Note that not only is 1.5
> exactly representable, but the products I*x are all exactly
> representable. The error is entirely in the summation.
>
> Bob Corbett


Richard E Maine

2006-03-29, 7:02 pm

Bernhard Enders <bgeneto@gmail.com> wrote:

> I'm
> considering here that if two fp numbers, say A and B, are exactly
> representable so is their sum, A + B.


That is trivially false.... as in the case here. For a simpler example,
consider the numbers 2**100 and 2**(-100), both of which are exactly
representable. Their sum is not.

You are neglecting the fundamental matter of finite precision. If your
statement were true, then there would never be any rounding error from
addition. After all, every floating pojnt number is an exact
representation of some value (maybe not the value that was intended, but
of some value). Thus, if your thesis were true, every sum A+B would give
an exact answer. You would be implying that the only rounding errors
were in the initial approximations of A and B, with none introduced by
the addition. It ain't so.

--
Richard Maine | Good judgment comes from experience;
email: my first.last at org.domain| experience comes from bad judgment.
org: nasa, domain: gov | -- Mark Twain
Steven G. Kargl

2006-03-29, 7:02 pm

In article <1143651388.813075.315630@i40g2000cwc.googlegroups.com>,
"Bernhard Enders" <bgeneto@gmail.com> writes:
> That's right, not only all the products 'I*x' are exactly representable
> in binary but also is every sum like the rhs 'sum + I*x'. I'm
> considering here that if two fp numbers, say A and B, are exactly
> representable so is their sum, A + B. So where are the errors coming
> from? Sorry, we don't have the book you mention in our library. Thanks,
>


Google for "goldberg floating"

Read the document more than once.

--
Steve
http://troutmask.apl.washington.edu/~kargl/
Herman D. Knoble

2006-03-29, 7:02 pm

Some on-line resources are listed at:
http://www.personal.psu.edu/hdk/for...#Floating-Point

Here's another version of the program that terminates when the
sum is incorrect. It shows what happens, I believe, namely
that a DP register is not wide enough to accommodate an exact
sum after the number of digits needed exceed DP capacity.

Skip


Program Qtime
! Platform: Athalon, Intel Fortran 9.x with -O0
integer, parameter :: DP = selected_real_kind(15)
integer, parameter :: QP = selected_real_kind(30)
real(kind=DP) :: x, sum
real(kind=QP) :: t
real :: T1,T2, Seconds
integer :: i, pulses, PPS

print *, "DP=",DP
print *, "QP=",QP
x=1.5_DP
sum=0.0_DP
CALL SYSTEM_CLOCK(COUNT=Pulses,COUNT_RATE=PPS
)
T1 = REAL(Pulses,QP)/PPS

Do I=1,100000000
sum=sum+I*x
t=i
if (sum /= ((t*(t+1))/2)*x) then
print *, "I=",i," Sum=",sum
print *, "Sum should be: ",(t*(t+1)/2)*x
stop "sum rundoff"
end if
end do

CALL SYSTEM_CLOCK(COUNT=Pulses,COUNT_RATE=PPS
)
T2 = REAL(Pulses,QP)/PPS
Seconds=T2-T1
print *, " Time in seconds: ",Seconds
print *, "Sum=",Sum

end Program Qtime

Results are:

DP= 8
QP= 16
I= 77490641 Sum= 4.503599640061142E+015
Sum should be: 4503599640061141.50000000000000000
sum rundoff




On 29 Mar 2006 08:56:28 -0800, "Bernhard Enders" <bgeneto@gmail.com> wrote:

-|That's right, not only all the products 'I*x' are exactly representable
-|in binary but also is every sum like the rhs 'sum + I*x'. I'm
-|considering here that if two fp numbers, say A and B, are exactly
-|representable so is their sum, A + B. So where are the errors coming
-|from? Sorry, we don't have the book you mention in our library. Thanks,
-|
-|Bernhard.
-|
-|robert.corbett@sun.com wrote:
-|> Ian Gay wrote:
-|> > nospam@see.signature (Richard Maine) wrote in
-|> > news:1hcx96g.g4x32u1396xjcN%nospam@see.signature:
-|> >
-|> > >
-|> > > Just because 1.5 is exactly representable in IEEE double does not
-|> > > mean that all the numbers involved in the calculation are.
-|> > > Specifically, these numbers get big enough that sum is not exactly
-|> > > representable. But it sounds like you now see that.
-|> >
-|> > What's amusing is that the exact result (~7.5x10^15) _is_
-|> > representable; the largest exactly-representable integer in d.p. is
-|> > 2^53-1 (~9x10^15). But you still can't compute it in d.p., using that
-|> > particular loop.
-|>
-|> For more information about the error analysis of floating-point
-|> summation, see Nicholas Higham's book *Accuracy and
-|> Stability of Numerical Algorithms.* Note that not only is 1.5
-|> exactly representable, but the products I*x are all exactly
-|> representable. The error is entirely in the summation.
-|>
-|> Bob Corbett

robert.corbett@sun.com

2006-03-29, 10:00 pm

Bernhard Enders wrote:

> That's right, not only all the products 'I*x' are exactly representable
> in binary but also is every sum like the rhs 'sum + I*x'. I'm
> considering here that if two fp numbers, say A and B, are exactly
> representable so is their sum, A + B. So where are the errors coming
> from?


As I said, the errors are all in the summation. All of the roundings
occur when the products are added to the running total.

> Sorry, we don't have the book you mention in our library. Thanks,


I'm sorry to sound like an advertisement, but Higham's book *Accuracy
and Stability of Numerical Algorithms* is one of the most important
books for anyone dealing with floating-point to read. Sun's Numerical
Computation Guide is available as a free download from
http://docs.sun.com, and it covers the basics, but Higham's book is the
full course. It has a lot of pages devoted specifically to the
analysis
of summation.

Bob Corbett

Joost

2006-03-30, 4:00 am


robert.corbett@sun.com wrote:
> Bernhard Enders wrote:
>
>
> As I said, the errors are all in the summation. All of the roundings
> occur when the products are added to the running total.


That actually makes me wonder. Is there any Fortran compiler that
implements one of the more accurate ways of summing numbers (Kahan's
sum if I understand correctly) for the SUM intrinsic ?

Other question, can the use of Kahan's sum actually lead to lower
precision results for either special inputs, or hardware that e.g.
doesn't use IEEE.

Joost

James Giles

2006-03-30, 4:00 am

Joost wrote:
....
> Other question, can the use of Kahan's sum actually lead to lower
> precision results for either special inputs, or hardware that e.g.
> doesn't use IEEE.


If the compiler make *inconsistent* use of extended precision,
Kahan's method can actually be less precise than the simple
sum. This is because Kahan's algorithm is keeping track of
the discrepancy between the simple sum and what should be
calculated and adding it back in at each step. If the discrepancy
is calculated to the precision of the declared variables, but the
running sum is kept to some extended precision you would
essentially be adding the discrepancy twice. Other combinations
of inconsistent precision may have similarly bad results. If
the compiler *consistently* uses the declared precision, or
*consistently* uses the *same* extended precision throughout,
Kahan's algorithm does good work. It essentially does about
as good as double the precision that's actually used.

--
J. Giles

"I conclude that there are two ways of constructing a software
design: One way is to make it so simple that there are obviously
no deficiencies and the other way is to make it so complicated
that there are no obvious deficiencies." -- C. A. R. Hoare


Joost

2006-03-30, 4:00 am

That is useful to know. That means that I might implement
'ACCURATE_SUM' but not without having an automatic check in the code
that it actually works as expected.

Joost

robert.corbett@sun.com

2006-03-30, 4:00 am


Joost wrote:

> That actually makes me wonder. Is there any Fortran compiler that
> implements one of the more accurate ways of summing numbers (Kahan's
> sum if I understand correctly) for the SUM intrinsic ?


Sun Fortran used to use double-precision to accumulate
single-precision sums for the SUM intrinsic. Sun fell behind
on enough benchmarks because of it that we were forced to
go back to using single-precision summation. Using Kahan's
summation technique is slower than using extended precision.

> Other question, can the use of Kahan's sum actually lead to lower
> precision results for either special inputs, or hardware that e.g.
> doesn't use IEEE.


I imagine there might be cases where the straightforward
summation would, just by chance, be spot on, but the Kahan
sum would be off by a little. Such cases should be unusual.
It would take more analysis than I care to do right now to be
sure such cases exist.

Bob Corbett

Jan Vorbrüggen

2006-03-30, 4:00 am

> You are right about the cost of QP. Skip Knoble
> Time in seconds: 0.5800781
> QDP= 8
> Time in seconds: 5.750000
> QDP= 16


Only slower by an order of magnitude? That's really very good for a software
emulation!

Jan
robert.corbett@sun.com

2006-03-30, 8:01 am

Joost wrote:

> That is useful to know. That means that I might implement
> 'ACCURATE_SUM' but not without having an automatic check in the code
> that it actually works as expected.


How would you check the result?

There are ways to implement summation so that only a
single rounding error appears in the final result. Those
techniques tend to be slow without hardware assist.

Bob Corbett

Joost

2006-03-30, 8:01 am

> Joost wrote:
>
>
> How would you check the result?


I would call accurate_sum with some array for which I know the
'correct' result. Anything off by more than an expected amount would
mean that accurate_sum is somehow 'miscompiled'. I suppose that a short
array of 3 values could do the trick ((/1.0D0,epsilon(1.0D0)/2,-1.0D0/)
?). Of course, this wouldn't test all invocations of accurate_sum, but
for example once at program startup, our after building the library.

>
> There are ways to implement summation so that only a
> single rounding error appears in the final result. Those
> techniques tend to be slow without hardware assist.
>
> Bob Corbett


glen herrmannsfeldt

2006-03-30, 10:00 pm

James Giles <jamesgiles@worldnet.att.net> wrote:

> If the compiler make *inconsistent* use of extended precision,
> Kahan's method can actually be less precise than the simple
> sum. This is because Kahan's algorithm is keeping track of
> the discrepancy between the simple sum and what should be
> calculated and adding it back in at each step.



x87 implementations are well known for inconsistent use.

It is expensive to save/reload data frequently, and there
aren't enough x87 registers to keep complicated expressions
completely in registers.

The original 8087 was supposed to be designed to have an
infinite stack, with registers being spilled to memory when
needed. It turned out not to be possible to actually write
the needed software in the end. I don't know if the problems
were fixed in later 80x87 implementations, but it doesn't seem
that it has been done yet, as far as I know.

-- glen
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com