For Programmers: Free Programming Magazines  


Home > Archive > Fortran > December 2004 > small numerical differences in floating point result between wintel and Sun/SPARC









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author small numerical differences in floating point result between wintel and Sun/SPARC
JS

2004-12-14, 3:57 am

We have the same floating point intensive C++ program that runs on
Windows on Intel chip and on Sun Solaris on SPARC chips. The program
reads the exactly the same input files on the two platforms. However,
they generate slightly different results for floating point numbers.

Are they really supposed to generate exactly the same results? I
guess so because both platforms are supposed to be IEEE floating point
standard (754?) compliant. I have turned on the Visual C++ compile
flags which will make sure the Windows produce standard compliant code
(the /Op flags). However, they still produce different results. I
suspect that this may be due to a commerical mathematical library that
we use which can't be compiled using /Op option. If I had recompiled
everything using /Op option, the two should have produced the same
results.

Am I right?

Thanks a lot.
Steven G. Kargl

2004-12-14, 3:57 am

In article <stksr09qteq79uh39nrg8q2i7tti8eqvgo@4ax.com>,
JS <someone@somewhere.com> writes:
> We have the same floating point intensive C++ program that runs on
> Windows on Intel chip and on Sun Solaris on SPARC chips. The program
> reads the exactly the same input files on the two platforms. However,
> they generate slightly different results for floating point numbers.
>


What does this have to do with Fortran (or C)? Don't
cross post off-topic threads.

PS: Google on "floating goldberg"

--
Steve
Dik T. Winter

2004-12-14, 3:57 am

In article <stksr09qteq79uh39nrg8q2i7tti8eqvgo@4ax.com> JS <someone@somewhere.com> writes:
....
> Am I right?


No, you are wrong. The slight differences are there because the Intel
chips calculate expressions with 80 bits of precision, the Sparc uses
64 bits of precision. Both are allowed by the IEEE standard.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
James D. Veale

2004-12-14, 3:57 am

I have a customer who runs a simulation on both a PC and a UNIX box,
results are written to an ASCII data file. He has used
TFuzSerf, a fuzzy number file comparison, to verify similar results.

Demonstration versions and contact information are available
at the Complite File Comparison Family web page, at ...

http://world.std.com/~jdveale

This fuzzy number comparison utility allows you to specify both absolute
and relative tolerances(ranges) to numbers. ASCII numbers are
recognized and treated as true numbers, not just character strings.
As a result 2-digit and 3-digit exponents are handled automatically.

Please feel free to contact me if I can answer any questions.

Jim Veale

JS <someone@somewhere.com> writes:
>We have the same floating point intensive C++ program that runs on
>Windows on Intel chip and on Sun Solaris on SPARC chips. The program
>reads the exactly the same input files on the two platforms. However,
>they generate slightly different results for floating point numbers.


>Are they really supposed to generate exactly the same results? I
>guess so because both platforms are supposed to be IEEE floating point
>standard (754?) compliant. I have turned on the Visual C++ compile
>flags which will make sure the Windows produce standard compliant code
>(the /Op flags). However, they still produce different results. I
>suspect that this may be due to a commerical mathematical library that
>we use which can't be compiled using /Op option. If I had recompiled
>everything using /Op option, the two should have produced the same
>results.


>Am I right?


>Thanks a lot.

Gordon Burditt

2004-12-14, 3:57 am

>We have the same floating point intensive C++ program that runs on
>Windows on Intel chip and on Sun Solaris on SPARC chips. The program
>reads the exactly the same input files on the two platforms. However,
>they generate slightly different results for floating point numbers.
>
>Are they really supposed to generate exactly the same results? I


I don't believe the == operator applied to calculated floating-point
results is ever required to return 1. Nor does it have to be consistent
about it. An implementation like that probably won't sell well, but
ANSI C allows it.

int i;
double d1, d2;
... put a value in d1 and d2 ...

for (i = 0; (d1 + d2) == (d2 + d1); i++)
/* empty */ ;
printf("Iterations: %d\n", i);

There's nothing wrong with the code printing "Iterations: 13" here.

>guess so because both platforms are supposed to be IEEE floating point
>standard (754?) compliant.


Just because the hardware is IEEE floating point doesn't mean
the compiler has to keep the intermediate values in 80-bit long
double or has to lop off the extra precision consistently.


>I have turned on the Visual C++ compile
>flags which will make sure the Windows produce standard compliant code
>(the /Op flags). However, they still produce different results. I
>suspect that this may be due to a commerical mathematical library that
>we use which can't be compiled using /Op option. If I had recompiled
>everything using /Op option, the two should have produced the same
>results.


C offers no guarantee that an Intel platform will produce the
same results as an Intel platform with the same CPU serial number
and the same compiler serial number.

>Am I right?


No.

Gordon L. Burditt
Herman D. Knoble

2004-12-14, 9:01 pm

First, we assuming that the program in question here is not displaying
or writing results past the precision of the corresponding variables.

How many "places" (mantissa bits) are you talking about here?
If the results differ by only the last mantissa bit or two I would not
worry about this. If results differ by more than the least significant
bit or two, there could be reason to suspect compiler options, like /OP
(or, for example, XLF under AIX the compiler option -qfloat=nomaf).

For the case where results differ by more that a mantissa bit or two,
it could also be possible that the code itself is not completely
numerically stable. One of the ways to determine if a model is stable
is in fact to perturb the input (or platform) slightly and
compare results.

Skip

On Mon, 13 Dec 2004 22:00:03 -0500, JS <someone@somewhere.com> wrote:

-|We have the same floating point intensive C++ program that runs on
-|Windows on Intel chip and on Sun Solaris on SPARC chips. The program
-|reads the exactly the same input files on the two platforms. However,
-|they generate slightly different results for floating point numbers.
-|
-|Are they really supposed to generate exactly the same results? I
-|guess so because both platforms are supposed to be IEEE floating point
-|standard (754?) compliant. I have turned on the Visual C++ compile
-|flags which will make sure the Windows produce standard compliant code
-|(the /Op flags). However, they still produce different results. I
-|suspect that this may be due to a commerical mathematical library that
-|we use which can't be compiled using /Op option. If I had recompiled
-|everything using /Op option, the two should have produced the same
-|results.
-|
-|Am I right?
-|
-|Thanks a lot.


Herman D. (Skip) Knoble, Research Associate
(a computing professional for 38 years)
Email: SkipKnobleLESS at SPAMpsu dot edu
Web: http://www.personal.psu.edu/hdk
Penn State Information Technology Services
Academic Services and Emerging Technologies
Graduate Education and Research Services
Penn State University
214C Computer Building
University Park, PA 16802-21013
Phone:+1 814 865-0818 Fax:+1 814 863-7049
Ron Shepard

2004-12-14, 9:01 pm

In article <pan.2004.12.14.10.23.56.235000@netactive.co.uk>,
Lawrence Kirby <lknews@netactive.co.uk> wrote:

> In C (and again presumably C++) d = a + b +c
> is equivalent to d = (a + b) + c. A C optimiser cannot rearrange this
> unless it can prove the result is consistent with this for all possible
> input values. That's rarely possible in floating point.


I'm not sure when this changed, but at one time, a C compiler was
free to do just about anything it wanted to arithmetic expressions.
It was not even required to respect parentheses, so if you had an
expression like "a+(b+c)" it was legal, according to the language,
to evaluate it as if it had been written "(a+b)+c". This was a
terrible situation for floating point work, since much of it
involves controlling the order of evaluation with parentheses.
Fortran, as far as I know, has always been required to respect
parentheses, although the order of evaluation is otherwise fairly
arbitrary to allow for optimizations and sharing of subexpression
results between statements.

$.02 -Ron Shepard
James Giles

2004-12-14, 9:01 pm

Ron Shepard wrote:
> In article <pan.2004.12.14.10.23.56.235000@netactive.co.uk>,
> Lawrence Kirby <lknews@netactive.co.uk> wrote:
>
>
> I'm not sure when this changed, but at one time, a C compiler was
> free to do just about anything it wanted to arithmetic expressions.
> It was not even required to respect parentheses, so if you had an
> expression like "a+(b+c)" it was legal, according to the language,
> to evaluate it as if it had been written "(a+b)+c". [...]


The second C standard (C99) is where it changed. I don't have the
final document of the first C standard (C90), but I have a copy of
one of the last drafts before it was approved. In it, the above
regrouping was specifically mentioned. In fact, there was a
nearly identical example. They do point out a way to avoid the
regrouping of expressions. If you write "a+ +(b+c)" the sum on
the right must be done first (the unary "+" has precedence, the
space before it is required in this context so as not to mistake it
for a "++"operator.)

This was, of course, appalling. The C99 standard changed to the
present rule where no rearrangement is permitted at all (though
primaries, like function calls, may still occur in any order).
This may seem like overreacting to their previous mistake.
Keeping the integrity of parenthesis would have been sufficient
while still allowing the implementation some optimization room.
I think they were forced to their present rule by the existence
of their short-circuit operators and the need to clearly specify
their meaning.

--
J. Giles

"I conclude that there are two ways of constructing a software
design: One way is to make it so simple that there are obviously
no deficiencies and the other way is to make it so complicated
that there are no obvious deficiencies." -- C. A. R. Hoare


websnarf@gmail.com

2004-12-14, 9:01 pm

Sun and Intel should produce identical results for
negation,+,-,*,/,sqrt() operations, because this is required for IEEE
754 compliance, and both implement double and float on the same sized
floating point values (older x86 compilers used to use 80bit for all FP
temporaries which could definately be a source of differences, but VC++
doesn't do this anymore). You should also expect identical output for
things like fprem(), ceil(), floor(), modf(), and so on. All of these
operations have well known and finite ways of being calculated to the
best and closest possible result, which is what the IEEE 754 specifies.
Results can be different if you put the processors into different
rounding modes -- I thought that the ANSI C specification was supposed
to specify a consistent rounding mode, so there shouldn't be any
differences because of that, but I could be wrong.

The problem is everything else. sin(), cos(), log(), exp(), atanh()
etc, ... these guys are not even consistent between Athlons and
Pentiums. The reason is that there is no known finite algorithm for
computing these guys to the exactly corrected rounded result. There
are plenty of ways to compute them within an accuracy of "1 ulp"
(i.e., off by at most one unit in the last siginificant bit.)

But I am not the foremost expert on this stuff (and it looks like the
other posters here know even less than me.) This question is more
appropriate for a group like comp.arch.arithmetic where there are
plenty of posters there who are expert on this.
--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Robert Corbett

2004-12-14, 9:01 pm

In article <GRHvd.124998$7i4.87975@bgtnsc05-news.ops.worldnet.att.net>,
James Giles <jamesgiles@worldnet.att.net> wrote:
>Ron Shepard wrote:
>
>The second C standard (C99) is where it changed. I don't have the
>final document of the first C standard (C90), but I have a copy of
>one of the last drafts before it was approved. In it, the above
>regrouping was specifically mentioned. In fact, there was a
>nearly identical example. They do point out a way to avoid the
>regrouping of expressions. If you write "a+ +(b+c)" the sum on
>the right must be done first (the unary "+" has precedence, the
>space before it is required in this context so as not to mistake it
>for a "++"operator.)


The change was made in the first C standard, C89. The
unary + operator you describe was indeed specified as
you say in late drafts of the C89 standard, but it was
dropped about a year and a half before the C89 standard
was published. I believe that the winning argue in favor
of strictly enforcing the implicit associativity of
operators was that it made no difference in integer
expressions, and not much floating-point was done in C.

Sincerely,
Bob Corbett
James Giles

2004-12-15, 3:57 am

Robert Corbett wrote:
....
> The change was made in the first C standard, C89. The
> unary + operator you describe was indeed specified as
> you say in late drafts of the C89 standard, but it was
> dropped about a year and a half before the C89 standard
> was published. I believe that the winning argue in favor
> of strictly enforcing the implicit associativity of
> operators was that it made no difference in integer
> expressions, and not much floating-point was done in C.



Well, it does make a difference on integers. If you write
a+(b+c) and the compiler were to regroup that as (a+b)+c,
the second expression might overflow where the first did
not. It still doesn't make a difference if your integers are
all the same size unsigned type. Nor if they're the same
size 2's complement signed integers - provided your platform
doesn't raise exceptions for overflow. I don't know any
standardized languages that actually require 2's complement
or that require overflow to be ignored. And, C doesn't
require that all the operands be the same size (several
other languages don't either). So, yes it can still make a
difference for C in integer expressions.

And, I do still think it's overkill: preserving the integrity of
parenthesis should be sufficient.

--
J. Giles

"I conclude that there are two ways of constructing a software
design: One way is to make it so simple that there are obviously
no deficiencies and the other way is to make it so complicated
that there are no obvious deficiencies." -- C. A. R. Hoare


Gerry Thomas

2004-12-15, 3:57 am


"JS" <someone@somewhere.com> wrote in message
news:stksr09qteq79uh39nrg8q2i7tti8eqvgo@
4ax.com...
> We have the same floating point intensive C++ program that runs on
> Windows on Intel chip and on Sun Solaris on SPARC chips. The program
> reads the exactly the same input files on the two platforms. However,
> they generate slightly different results for floating point numbers.
>
> Are they really supposed to generate exactly the same results?


For fp intensive programs, whatever the language, no.

> I
> guess so because both platforms are supposed to be IEEE floating point
> standard (754?) compliant. I have turned on the Visual C++ compile
> flags which will make sure the Windows produce standard compliant code
> (the /Op flags). However, they still produce different results. I
> suspect that this may be due to a commerical mathematical library that
> we use which can't be compiled using /Op option. If I had recompiled
> everything using /Op option, the two should have produced the same
> results.
>
> Am I right?
>


In all probability, no, :-(.

For an update to the over-hackneyed Goldberg variations, see Parker,
Pierce, and Eggert , "Monte Carlo Arithmetic: exploiting randomness in
floating-point arithmetic," Comp. Sci. & Eng., pp. 58-68, July 2000.
..> Thanks a lot.

--
You're Welcome,
Gerry T.
______
"Competent engineers rightly distrust all numerical computations and s
corroboration from alternative numerical methods, from scale models, from
prototypes, from experience... ." -- William V. Kahan.



JS

2004-12-15, 3:57 am



On Tue, 14 Dec 2004 08:20:41 -0500, Herman D. Knoble
<SkipKnobleLESS@SPAMpsu.DOT.edu> wrote:

>First, we assuming that the program in question here is not displaying
>or writing results past the precision of the corresponding variables.
>
>How many "places" (mantissa bits) are you talking about here?


All the calculations are done in double. The output result has
precision (in the format of 999999999.666666) that is well within the
precision limit of double. So this is not due not incorrect precision
used for displaying or outputting the result.

>If the results differ by only the last mantissa bit or two I would not
>worry about this. If results differ by more than the least significant
>bit or two, there could be reason to suspect compiler options, like /OP
>(or, for example, XLF under AIX the compiler option -qfloat=nomaf).
>


Some posters don't seem to know about /Op option. From Microsoft doc:

"By default, the compiler uses the coprocessor’s 80-bit registers to
hold the intermediate results of floating-point calculations. This
increases program speed and decreases program size. However, because
the calculation involves floating-point data types that are
represented in memory by less than 80 bits, carrying the extra bits of
precision (80 bits minus the number of bits in a smaller
floating-point type) through a lengthy calculation can produce
inconsistent results.

With /Op, the compiler loads data from memory prior to each
floating-point operation and, if assignment occurs, writes the results
back to memory upon completion. Loading the data prior to each
operation guarantees that the data does not retain any significance
greater than the capacity of its type."

Thus the difference I got doesn't seem to come from the 80-bit vs 64
bit issue mentioned by some posters.

>For the case where results differ by more that a mantissa bit or two,
>it could also be possible that the code itself is not completely
>numerically stable. One of the ways to determine if a model is stable
>is in fact to perturb the input (or platform) slightly and
>compare results.
>
>Skip
>
>On Mon, 13 Dec 2004 22:00:03 -0500, JS <someone@somewhere.com> wrote:
>
>-|We have the same floating point intensive C++ program that runs on
>-|Windows on Intel chip and on Sun Solaris on SPARC chips. The program
>-|reads the exactly the same input files on the two platforms. However,
>-|they generate slightly different results for floating point numbers.
>-|
>-|Are they really supposed to generate exactly the same results? I
>-|guess so because both platforms are supposed to be IEEE floating point
>-|standard (754?) compliant. I have turned on the Visual C++ compile
>-|flags which will make sure the Windows produce standard compliant code
>-|(the /Op flags). However, they still produce different results. I
>-|suspect that this may be due to a commerical mathematical library that
>-|we use which can't be compiled using /Op option. If I had recompiled
>-|everything using /Op option, the two should have produced the same
>-|results.
>-|
>-|Am I right?
>-|
>-|Thanks a lot.
>
>
> Herman D. (Skip) Knoble, Research Associate
> (a computing professional for 38 years)
> Email: SkipKnobleLESS at SPAMpsu dot edu
> Web: http://www.personal.psu.edu/hdk
> Penn State Information Technology Services
> Academic Services and Emerging Technologies
> Graduate Education and Research Services
> Penn State University
> 214C Computer Building
> University Park, PA 16802-21013
> Phone:+1 814 865-0818 Fax:+1 814 863-7049


Gerry Thomas

2004-12-15, 3:57 am


"JS" <someone@somewhere.com> wrote in message
news:nk4vr0ttmj0rrk9ujlg0s49nluph7slt34@
4ax.com...

[Stuff from the ever-helpful Skip Knoble elided without prejudice]

>
> Some posters don't seem to know about /Op option.


Not me, and a bunch besides!

FYI, most clf's posters are rabidly anti Microsoft and could give a toss
what it has to say about anything, least of all in respect to it's own
products, :-).

> From Microsoft doc: (sic DOC!)


> "By default, the compiler uses the coprocessor's 80-bit registers to
> hold the intermediate results of floating-point calculations. This
> increases program speed and decreases program size. However, because
> the calculation involves floating-point data types that are
> represented in memory by less than 80 bits, carrying the extra bits of
> precision (80 bits minus the number of bits in a smaller
> floating-point type) through a lengthy calculation can produce
> inconsistent results.
>
> With /Op, the compiler loads data from memory prior to each
> floating-point operation and, if assignment occurs, writes the results
> back to memory upon completion. Loading the data prior to each
> operation guarantees that the data does not retain any significance
> greater than the capacity of its type."
>


IIRC (and I'm sure I do) this is a verbatim quote from Microsoft's Fortran
PowerStation documentation (sic DOC!). This product commands the least of
respect in these environs. (Word to the wise, look into VC++ ~2 for what
you s!)

> Thus the difference I got doesn't seem to come from the 80-bit vs 64
> bit issue mentioned by some posters.


How come? I've no idea why you're beginning to remind me of Ozmandias but
I'm wondering if it's that Alexander the Wimp movie I endured on the
wend.

--
You're Welcome,
Gerry T.
______
"Things are not what they seem; or, to be more accurate, they are not only
what they seem, but very much else besides." -- Aldous Huxley.



James Giles

2004-12-15, 3:57 am

Richard Maine wrote:
> gordon@hammy.burditt.org (Gordon Burditt) writes:
>
>
> I'd argue that the same thing applies to Fortran (ok, except that
> I'd talk about == returning .true. instead of 1). There are probably
> people who would disagree with me, but that would be my interpretation
> of the standard.


I'd argue that any language that claims IEEE compliance must
get a true result under some predictable circumstances. Not as
many such circumstances as originally intended by the IEEE
committee! But, *some* such circumstances exist. For example,
I think that 0.1 == 0.1 must be true, since the IEEE standard
places minimum requirements on the precision of decimal
to binary conversion. Of course, whether the value of a literal
constitutes a "calculated" result is maybe a matter of opinion.

--
J. Giles

"I conclude that there are two ways of constructing a software
design: One way is to make it so simple that there are obviously
no deficiencies and the other way is to make it so complicated
that there are no obvious deficiencies." -- C. A. R. Hoare


James Giles

2004-12-15, 3:57 am

Rich Townsend wrote:
....
> While we're on this note, is there anything in the standard that
> guarantees that
>
> a == TRANSFER(TRANSFER(a, 0), 0.)



Result Value. If the physical representation of the result has
the same length as that of SOURCE, the physical representation
of the result is that of SOURCE. If the physical representation
of the result is longer than that of SOURCE, the physical
representation of the leading part is that of SOURCE and
the remainder is processor dependent. If the physical
representation of the result is shorter than that of SOURCE,
the physical representation of the result is the leading part
of SOURCE. If D and E are scalar variables such that the
physical representation of D is as long as or longer than that
of E, the value of TRANSFER (TRANSFER (E, D), E) shall
be the value of E. IF D is an array and E is an array of rank one,
the value of TRANSFER (TRANSFER (E, D), E, SIZE (E))
shall be the value of E.

If A is default real, then the second to the last sentence above applies.
So, yes it's guaranteed.

I just finished posting a question about TRABSFER to the j3 mailing
list. Talk about coincidence, I had the appropriate quote from the
standard still in my clipboard.

--
J. Giles

"I conclude that there are two ways of constructing a software
design: One way is to make it so simple that there are obviously
no deficiencies and the other way is to make it so complicated
that there are no obvious deficiencies." -- C. A. R. Hoare


Gordon Burditt

2004-12-15, 3:57 am

>>> I don't believe the == operator applied to calculated floating-point
>
>I'd argue that any language that claims IEEE compliance must
>get a true result under some predictable circumstances. Not as
>many such circumstances as originally intended by the IEEE
>committee! But, *some* such circumstances exist. For example,


For example, d == d where d is not a NaN.

>I think that 0.1 == 0.1 must be true, since the IEEE standard
>places minimum requirements on the precision of decimal
>to binary conversion.


But does it place MAXIMUM requirements? 0.1 is an infinite repeating
decimal in binary. Any extra precision on one side but not the other
is likely to cause a mismatch.

>Of course, whether the value of a literal
>constitutes a "calculated" result is maybe a matter of opinion.


Ok, I didn't define "calculated result", but the intent was something
like "the result of a floating-point binary operator such as +, -, *, /".
Although multiplication by 0.0 should come out exact.

0.1 == 0.1 has a much better chance than, say, (0.1 + 0.0) == (0.1 + 0.0) .

Gordon L. Burditt
James Giles

2004-12-15, 3:57 am

Gordon Burditt wrote:
....
....[color=darkred]
>
> But does it place MAXIMUM requirements? 0.1 is an infinite repeating
> decimal in binary. Any extra precision on one side but not the other
> is likely to cause a mismatch.


The IEEE standard requires that, for numbers in this range, the
result must correctly rounded to the target precision. Since they're
both the same type and KIND (hence, the same target precision),
and since there is but one value that's closest to one tenth that's
representable as an IEEE single precision number, they must
give the same result.

--
J. Giles

"I conclude that there are two ways of constructing a software
design: One way is to make it so simple that there are obviously
no deficiencies and the other way is to make it so complicated
that there are no obvious deficiencies." -- C. A. R. Hoare


Gerry Thomas

2004-12-15, 3:57 am


"James Giles" <jamesgiles@worldnet.att.net> wrote in message
news:24Pvd.1095905$Gx4.30684@bgtnsc04-news.ops.worldnet.att.net...

[...]
>
> The IEEE standard requires that, for numbers in this range, the
> result must correctly rounded to the target precision. Since they're
> both the same type and KIND (hence, the same target precision),
> and since there is but one value that's closest to one tenth that's
> representable as an IEEE single precision number, they must
> give the same result.
>


How profound!

--
You're Welcome,
Gerry T.
______
"Facts are meaningless. You could use facts to prove anything that's even
remotely true." -- Homer Simpson.


James Van Buskirk

2004-12-15, 3:57 am

"James Giles" <jamesgiles@worldnet.att.net> wrote in message
news:RMOvd.126249$7i4.82612@bgtnsc05-news.ops.worldnet.att.net...

> Result Value. If the physical representation of the result has
> the same length as that of SOURCE, the physical representation
> of the result is that of SOURCE. If the physical representation
> of the result is longer than that of SOURCE, the physical
> representation of the leading part is that of SOURCE and
> the remainder is processor dependent. If the physical
> representation of the result is shorter than that of SOURCE,
> the physical representation of the result is the leading part
> of SOURCE. If D and E are scalar variables such that the
> physical representation of D is as long as or longer than that
> of E, the value of TRANSFER (TRANSFER (E, D), E) shall
> be the value of E. IF D is an array and E is an array of rank one,
> the value of TRANSFER (TRANSFER (E, D), E, SIZE (E))
> shall be the value of E.


Gack! I was afraid of this... shows what I get for not participating
in the standardization process... What if E is type default INTEGER
and D is type default REAL so that they have the same storage size
and the bit pattern of E corresponds to an SNAN? There have been
popular processors that by default silently convert SNANs to QNANs
when passing them through their FPUs. This would change one bit
of E in this case. Also what if D has holes in it, like a user-
defined type with internal padding for component alignment or
external padding for aligning arrays of the structure? Or how
about 80-bit REALs on x86 machines which occupy 128 bits in memory
leaving 48 bits of 'hole'?

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end


Ron Shepard

2004-12-15, 8:57 am

In article <1103067514.028386.32300@z14g2000cwz.googlegroups.com>,
websnarf@gmail.com wrote:

> Sun and Intel should produce identical results for
> negation,+,-,*,/,sqrt() operations, because this is required for IEEE
> 754 compliance, and both implement double and float on the same sized
> floating point values (older x86 compilers used to use 80bit for all FP
> temporaries which could definately be a source of differences, but VC++
> doesn't do this anymore).


I don't have a reference handy for the details, but I seem to
remember reading that the only way to guarantee correct results, to
the last bit, for floating point multiplication was to compute an
intermediate result that has twice the number of mantissa bits as
the final result, and then round as appropriate to the final
precisionm. Anything less than twice the number of bits will round
incorrectly for some input values. Presumably an 80-bit
intermediate will round correctly more often than simply keeping
track of a single guard bit, but it would still be incorrect in the
last bit in some cases.

$.02 -Ron Shepard
James Giles

2004-12-15, 8:57 am

James Van Buskirk wrote:
> "James Giles" <jamesgiles@worldnet.att.net> wrote in message
> news:RMOvd.126249$7i4.82612@bgtnsc05-news.ops.worldnet.att.net...
>
>
> Gack! I was afraid of this... shows what I get for not participating
> in the standardization process... What if E is type default INTEGER
> and D is type default REAL so that they have the same storage size
> and the bit pattern of E corresponds to an SNAN? There have been
> popular processors that by default silently convert SNANs to QNANs
> when passing them through their FPUs. [...]


Why would TRANSFER pass data through the FPU? Indeed,
TRANSFER is usually a "do nothing" operation (it merely alters
the compile-time type attributes in order to permit, say, assignment
without conversion). Without performing operations that have
real semantic meaning, bits is just bits.

> [...] Also what if D has holes in it, like a user-
> defined type with internal padding for component alignment or
> external padding for aligning arrays of the structure? Or how
> about 80-bit REALs on x86 machines which occupy 128 bits in memory
> leaving 48 bits of 'hole'?


Again, why would anything untoward happen unless there's some
intervening operation on the result of the TRANSFER?

--
J. Giles

"I conclude that there are two ways of constructing a software
design: One way is to make it so simple that there are obviously
no deficiencies and the other way is to make it so complicated
that there are no obvious deficiencies." -- C. A. R. Hoare


James Van Buskirk

2004-12-15, 8:57 am

"James Giles" <jamesgiles@worldnet.att.net> wrote in message
news:FlSvd.1096863$Gx4.783358@bgtnsc04-news.ops.worldnet.att.net...

> Again, why would anything untoward happen unless there's some
> intervening operation on the result of the TRANSFER?


Well, how about the case where where the source is discontiguous
and its size and holes are incommensurable with the mold's size
and holes? It seems to me that any compiler, unless it special-
cased the transfer back to the source, would end up copying the
source to contiguous memory during the transfer process. The
example in the standard document seems to require that you put
something meaningful in the holes of the mold, but there are
contexts where holes are inviolate. Like for example, one could
define an ISA where the address space has a certain amount of
extensibility built in. Every pointer would then have to have
some unused bits and the processor would be absolutely required
to bomb any program that tried to store nonzero bits in this
'extensible' area. Programs would consider, say, the low 32
bits to be address and the high 32 bits to be reserved. An
implementation of TRANSFER that filled in holes as described
would put nonzero bits into the high half of addresses in a
way that wouldn't be easy to remove, so the results of TRANSFER
would be at least more dangerous than usual for manipulating
addresses in this case. Maybe it's workable in an intuitive
way, but it makes me somewhat uneasy just thinking about it.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end


Christian Bau

2004-12-15, 8:57 am

In article <nk4vr0ttmj0rrk9ujlg0s49nluph7slt34@4ax.com>,
JS <someone@somewhere.com> wrote:

> On Tue, 14 Dec 2004 08:20:41 -0500, Herman D. Knoble
> <SkipKnobleLESS@SPAMpsu.DOT.edu> wrote:
>
>
> All the calculations are done in double. The output result has
> precision (in the format of 999999999.666666) that is well within the
> precision limit of double. So this is not due not incorrect precision
> used for displaying or outputting the result.
>
>
> Some posters don't seem to know about /Op option. From Microsoft doc:
>
> "By default, the compiler uses the coprocessor¹s 80-bit registers to
> hold the intermediate results of floating-point calculations. This
> increases program speed and decreases program size. However, because
> the calculation involves floating-point data types that are
> represented in memory by less than 80 bits, carrying the extra bits of
> precision (80 bits minus the number of bits in a smaller
> floating-point type) through a lengthy calculation can produce
> inconsistent results.
>
> With /Op, the compiler loads data from memory prior to each
> floating-point operation and, if assignment occurs, writes the results
> back to memory upon completion. Loading the data prior to each
> operation guarantees that the data does not retain any significance
> greater than the capacity of its type."


That isn't enough. First, the x86 FPU must be switched to 64 bit mode,
otherwise you will occasionally get different result due to double
rounding (about 1 in 2000 chance for a single random operation). The
load/store will make sure that overflows and underflows are reproduced
(let x = 1e300. (x * x) / x should be Inf / x = Inf. Without storing and
reloading the result of x * x, x * x = 1e600, (x * x) / x = 1e300. You
still run into trouble because underflow in a multiplication and
division can give different results due to double rounding. You can
always switch to Java + strict FP mode to get reproducible results.
(Which will be damned slow but it will get everything right, that is why
nonstrict FP mode had to be introduced. It is slow and gives sometimes
different results).
James Giles

2004-12-15, 8:57 am

James Van Buskirk wrote:
> "James Giles" <jamesgiles@worldnet.att.net> wrote in message
> news:FlSvd.1096863$Gx4.783358@bgtnsc04-news.ops.worldnet.att.net...
>
>
> Well, how about the case where where the source is discontiguous
> and its size and holes are incommensurable with the mold's size
> and holes? [...]


I wouldn't even expect that to work, much less yield predictable
results. I'm afraid I'm a simplistic sort that expects raw consecutive
bits to simply be interpreted differently after TRANSFER. If
TRANSFER does more than that, I don't want to hear about it.
I mean, what I use it for (and rarely at that) is things like copying
an exact bit pattern of a REAL into an INTEGER so I can clear the
least significant bit and then copy it back into the REAL. Or,
incrementing the exponent part without changing the significand.
Or, maybe something with characters that I don't recall at the
moment. Etc.

In fact, I'm afraid I always make sure that the SOURCE and the
MOLD on TRANSFER calls are the same size (in bits), or that
MOLD is an array and the size of SOURCE is an integral multiple
of the size of an element MOLD.

But, as I say, I don't think we need to worry about integers that
happen to have NaN bit patterns when interpreted as floats, or
other such issues. They will only arise when (and if) the result
of TRANSFER is actually used in a float operation.

(It's late - I had to retype that last sentence three times. I'll
check back on this thread in about ten hours.)

--
J. Giles

"I conclude that there are two ways of constructing a software
design: One way is to make it so simple that there are obviously
no deficiencies and the other way is to make it so complicated
that there are no obvious deficiencies." -- C. A. R. Hoare


James Van Buskirk

2004-12-15, 8:57 am

"James Giles" <jamesgiles@worldnet.att.net> wrote in message
news:BbTvd.1097101$Gx4.553006@bgtnsc04-news.ops.worldnet.att.net...

> But, as I say, I don't think we need to worry about integers that
> happen to have NaN bit patterns when interpreted as floats, or
> other such issues. They will only arise when (and if) the result
> of TRANSFER is actually used in a float operation.


Exactly this is a recurring problem. In the original Pentium
Classic, the widest registers you could access without RDMSR/
WRMSR were its x87 FP stack. Many times people would try to
move data around 64 bits at a time via FLD QWORD PTR x[8*ecx]/
FSTP QWORD PTR y[8*ecx] sequences. They would then be surprised
because some of the data would have one bit changed. The right
way to use the FP stack to move data, of course, was to use
FILD QWOR PTR x[8*ecx]/FISTP QWORD PTR y[8*ecx] because SNANs
would never get loaded into the FP registers in this way. Even
though the instruction timings were worse for the latter
sequence, if the data transfer was between memory locations
this was quite unimportant and using the FP stack with a
carefully unrolled and scheduled loop could be the fastest
way to move big aligned chunks of data around with that processor.
I would have thought it desirable when designing a low-level
primitive like TRANSFER that one might as well assume that the
user knows what he's doing so that for example an implementation
could assume that a user knows what kind of data he's TRANSFERring
into a floating point mold so that it would be OK for the Fortran
processor to touch it with its FPU. The TRANSFER back and forth
clause in the standard document seems to require otherwise,
though. It seems to put some kind of requirement on what a
Fortran processor has to do with the holes in data structures
under some circustances -- what if the result of TRANSFER is
used as an actual argument, then the associated dummy argument...

But maybe I am thinking about this all wrong, and the word 'shall'
in the standard is saying the the Fortran programmer had better
not write expressions where the values and appropriately preserved?

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

P.S. conversion from integer to real and resulting TRANSFER of
the real back to integer is the fastest way on some processors
to implement a LEADZ function. I'm tired too, so working out
an example for IEEE-754 single precision representation is
left to the early risers to enjoy with their worms.


Tim Prince

2004-12-15, 8:57 am


"Christian Bau" <christian.bau@cbau.freeserve.co.uk> wrote in message
news:christian.bau-01DCCE.08543615122004@slb-newsm1.svr.pol.co.uk...
> In article <nk4vr0ttmj0rrk9ujlg0s49nluph7slt34@4ax.com>,
> JS <someone@somewhere.com> wrote:
>


>
> That isn't enough. First, the x86 FPU must be switched to 64 bit mode,
> otherwise you will occasionally get different result due to double
> rounding (about 1 in 2000 chance for a single random operation). The
> load/store will make sure that overflows and underflows are reproduced
> (let x = 1e300. (x * x) / x should be Inf / x = Inf. Without storing and
> reloading the result of x * x, x * x = 1e600, (x * x) / x = 1e300. You
> still run into trouble because underflow in a multiplication and
> division can give different results due to double rounding.

That Microsoft doc is misleading, given that they always (at least since
VS6)
set the FPU to 53-bit precision, which I assume is what you mean by
"64-bit mode." I couldn't find any reference in the thread as to whether
SSE
code generation was invoked (no such option in VS6, but presumably preferred
in later versions).


Tim Prince

2004-12-15, 8:57 am


"Ron Shepard" <ron-shepard@NOSPAM.comcast.net> wrote in message
news:ron-shepard-B32758.01483815122004@comcast.dca.giganews.com...
> In article <1103067514.028386.32300@z14g2000cwz.googlegroups.com>,
> websnarf@gmail.com wrote:
>
>
> I don't have a reference handy for the details, but I seem to
> remember reading that the only way to guarantee correct results, to
> the last bit, for floating point multiplication was to compute an
> intermediate result that has twice the number of mantissa bits as
> the final result, and then round as appropriate to the final
> precisionm. Anything less than twice the number of bits will round
> incorrectly for some input values. Presumably an 80-bit
> intermediate will round correctly more often than simply keeping
> track of a single guard bit, but it would still be incorrect in the
> last bit in some cases.

The 80-bit multiplication presumably is done with "sticky bit" which records
whether any bits beyond the guard bits have been dropped. So, it can
guarantee to round correctly, without retaining all those bits in an
intermediate result. Rounding to 80 bits and then rounding again to 64 bits
produces occasionally incorrect double rounding. This problem is avoided by
the tactics of several commercial compilers of setting 53-bit rounding mode,
or using SSE code generation.


Lawrence Kirby

2004-12-15, 3:59 pm

On Tue, 14 Dec 2004 20:34:14 -0500, JS wrote:

>
>
> On Tue, 14 Dec 2004 08:20:41 -0500, Herman D. Knoble
> <SkipKnobleLESS@SPAMpsu.DOT.edu> wrote:
>
>
> All the calculations are done in double. The output result has
> precision (in the format of 999999999.666666) that is well within the
> precision limit of double. So this is not due not incorrect precision
> used for displaying or outputting the result.


Can you create a small program that demonstrates the problem? Perhaps a
little summation loop or something. With a particular example it would be
possible to pin down the source of any inconsistency.

....

> Some posters don't seem to know about /Op option. From Microsoft doc:
>
> "By default, the compiler uses the coprocessor's 80-bit registers to
> hold the intermediate results of floating-point calculations. This
> increases program speed and decreases program size. However, because
> the calculation involves floating-point data types that are
> represented in memory by less than 80 bits, carrying the extra bits of
> precision (80 bits minus the number of bits in a smaller
> floating-point type) through a lengthy calculation can produce
> inconsistent results.


I think somebody else pointed out that recent Microsoft compilers set the
FPU to 53 bits precision anyway. Another thing to consider is rounding
mode. They may well be the same but this needs to be checked.

Also I have an inherent mistrust of options like /Op. gcc certainly
doesn't get everything right with its version of that option, maybe other
compilers do but it can be hard on x86 especially and I wouldn't trust
it without some verification.

Lawrence
Dik T. Winter

2004-12-16, 3:57 am

In article <pan.2004.12.15.13.46.45.172000@netactive.co.uk> Lawrence Kirby <lknews@netactive.co.uk> writes:
....
> Also I have an inherent mistrust of options like /Op. gcc certainly
> doesn't get everything right with its version of that option, maybe other
> compilers do but it can be hard on x86 especially and I wouldn't trust
> it without some verification.


Indeed. Is an expression like a * b + c indeed calculated with 53 bits
mantissa only?
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
JS

2004-12-16, 3:57 am

How can you switch FPU to 64 bit mode? Thanks.

On Wed, 15 Dec 2004 08:54:36 +0000, Christian Bau
<christian.bau@cbau.freeserve.co.uk> wrote:


>That isn't enough. First, the x86 FPU must be switched to 64 bit mode,
>otherwise you will occasionally get different result due to double
>rounding (about 1 in 2000 chance for a single random operation). The
>load/store will make sure that overflows and underflows are reproduced
>(let x = 1e300. (x * x) / x should be Inf / x = Inf. Without storing and
>reloading the result of x * x, x * x = 1e600, (x * x) / x = 1e300. You
>still run into trouble because underflow in a multiplication and
>division can give different results due to double rounding. You can
>always switch to Java + strict FP mode to get reproducible results.
>(Which will be damned slow but it will get everything right, that is why
>nonstrict FP mode had to be introduced. It is slow and gives sometimes
>different results).


JS

2004-12-16, 3:57 am

Sorry, I hasn't been able to create a small program that demonstrates
the problem.

>
>Can you create a small program that demonstrates the problem? Perhaps a
>little summation loop or something. With a particular example it would be
>possible to pin down the source of any inconsistency.
>
>...
>


James Van Buskirk

2004-12-16, 3:57 am

"James Giles" <jamesgiles@worldnet.att.net> wrote in message
news:j81wd.1099991$Gx4.615663@bgtnsc04-news.ops.worldnet.att.net...

> No one so far even talked about dummy arguments or assumed
> shape. However, the shape of the MOLD is irrelevant. The result
> of TRANSFER is always either a scalar or a rank one array (a
> contiguous one). Only the type and array-hood of the MOLD
> argument is relevant to the use of TRANSFER. So, the situation
> JvB mentioned (I repeat below) that I was responding to would
> require a change to the definition of TRANSFER:


[color=darkred]
> And, I'll repeat that I wouldn't expect that to work, or produce
> repeatable results. The holes in the MOLD are not relevant
> (though "padding' in a single element of the MOLD, if it's
> a derived type, would be). The definition of TRANSFER would
> have to be changed in order for discontinuities and holes in
> MOLD to become relevant.


Unfortunately I didn't define my terms sufficiently in the
above quoted paragraph. By 'discontiguous' I meant to include
also the case of a substring of a character array. By 'size'
I was thinking about the size of an individual element of the
source or mold arrays -- really I should have been more
specific here. By 'holes' I meant the holes introduced as
a result of internal padding for structure component alignment
or external padding for record alignment within an array of
structures; I thought I had made this clear. Obviously an
example of some sort is required for us to focus our discussion
on:

module mykinds
implicit none
integer, parameter :: dp = selected_real_kind(15,300)
integer, parameter :: ik1 = selected_int_kind(2)
type type_3
integer a1_1
integer(ik1) :: a1_5
real(dp) a2_1
integer a3_2
end type type_3
type type_5
real(dp) a1_1
integer(ik1) a2_1(3)
real(dp) a3_1
integer a4_1(2)
real(dp) a5_1
end type type_5
end module mykinds

program transfer_ex
use mykinds
implicit none
type(type_3) t3(3)
integer(ik1), allocatable :: view(:)
integer size_3
character(80) fmt
type(type_5) t5(2)
integer size_5

t3 = type_3(-1,-1,-huge(1.0_dp),-1)
size_3 = size(transfer(t3,(/1_ik1/)))
write(*,'(a,i0)') ' size(t3) = ', size_3
allocate(view(size_3))
view = transfer(t3,view)
fmt = '(8(1x,z2.2))'
write(*,fmt) view
deallocate(view)
t5 = type_5(0,0,0,0,0)
size_5 = size(transfer(t5,(/1_ik1/)))
write(*,'(/a,i0)') ' size(t5) = ', size_5
allocate(view(size_5))
view = transfer(transfer(t3,t5),view)
write(*,fmt) view
deallocate(view)
size_5 = size(transfer(transfer(t3(1:3:2),t5),(/1_ik1/)))
write(*,'(/a,i0)') ' next size = ', size_5
allocate(view(size_5))
view = transfer(transfer(t3(1:3:2),t5),view)
write(*,fmt) view
end program transfer_ex

Output with g95:

size(t3) = 72
FF FF FF FF FF 00 23 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
FF FF FF FF FF 00 23 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
FF FF FF FF FF 00 23 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00

size(t5) = 80
FF FF FF FF FF 00 23 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
FF FF FF FF FF 00 23 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
FF FF FF FF FF 00 23 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
00 00 00 00 00 02 02 00

next size = 80
FF FF FF FF FF 00 23 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
FF FF FF FF FF 00 23 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
91 0D 00 00 FF FF FF FF
FF 00 23 00 FF FF FF FF
FF FF EF FF FF FF FF FF
00 00 00 00 00 00 00 00

Output with lf95:

size(t3) = 72
FF FF FF FF FF 00 00 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
FF FF FF FF FF 00 00 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
FF FF FF FF FF 00 00 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00

size(t5) = 80
FF FF FF FF FF 00 00 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
FF FF FF FF FF 00 00 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
FF FF FF FF FF 00 00 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
5C 50 72 6F 67 72 61 6D

next size = 80
FF FF FF FF FF 00 00 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
FF FF FF FF FF 00 00 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
FF FF FF FF FF 00 00 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
5C 50 72 6F 67 72 61 6D

Output with cvf:

size(t3) = 72
FF FF FF FF FF 00 00 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
FF FF FF FF FF 00 00 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
FF FF FF FF FF 00 00 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00

size(t5) = 80
FF FF FF FF FF 00 00 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
FF FF FF FF FF 00 00 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
FF FF FF FF FF 00 00 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
00 00 00 00 00 00 00 00

next size = 80
FF FF FF FF FF 00 00 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
FF FF FF FF FF 00 00 00
FF FF FF FF FF FF EF FF
FF FF FF FF 00 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00

It does look like these compilers are all copying hole data
faithfully...

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end


Tim Prince

2004-12-16, 3:57 am


"Dik T. Winter" <Dik.Winter@cwi.nl> wrote in message
news:I8sJ8q.B2B@cwi.nl...
> In article <pan.2004.12.15.13.46.45.172000@netactive.co.uk> Lawrence Kirby

<lknews@netactive.co.uk> writes:
> ...
other[color=darkred]
>
> Indeed. Is an expression like a * b + c indeed calculated with 53 bits
> mantissa only?

Yes, because 53-bit precision mode is set, not on account of /Op. floats
also will be done with effective promotion of intermediates to double,
unless SSE is selected.


Tim Prince

2004-12-16, 3:57 am


"Jentje Goslinga" <goslinga@telus.net> wrote in message
news:41C10364.4080208@telus.net...
>
>
> Lawrence Kirby wrote:
> <snip>
>
the[color=darkred]
>
> I doubt whether the compiler or more likely the linker ever
> inserts code manipulating the precision or rounding bits.

This has been the expected convention for commercial Windows compilers since
MSVC 6.0, that 53-bit precision is set initially by each program. Likewise,
many compilers for various architectures have had compile time switches to
insert or link in code which sets gradual or abrupt underflow, ever since
the distinction was invented.
> I assume that the OS sets the FPU to a known state whenever
> the OS boots up but I am not sure whether it does that
> whenever a program starts or exits and it may be different
> between various Windows's or AMD.

Yes, 64-bit Windows is responsible for setting the FPU state before passing
control to a program. Certain versions of freebsd may also do so. 32-bit
Windows, and both 32- and 64-bit linux, simply let it default. I know of no
OS which checks the CPU brand to determine the behavior.
> For the program designer, it is safest to set the FPU to
> the precision and rounding mode he wants at the start of the
> program but this requires ASM to the best of my knowledge.

The IEEE standard has always recommended provision of intrinsic functions to
accomplish this in a portable way, but those took may years to become
incorporated in language standards.
>
> The FPU precision setting does not affect floating point
> addition, subtraction and multiplication but only
> division and of the transcendentals it affects only the
> square root, see my other post in this thread.

Didn't see your other post. Precision setting certainly does affect
numerical results. True, it doesn't affect the speed of those operations on
popular CPU models.


Tim Prince

2004-12-16, 4:06 pm


"Jentje Goslinga" <goslinga@telus.net> wrote in message
news:41C1390B.8060102@telus.net...
>
> Tim Prince wrote:
>
> <earlier post snipped>
> passing
> 32-bit
>
> So you inherit the previous program's FPU settings.
> Hmmmm...

No, default FPU settings are defined by the hardware, and don't depend on
what came before.


Merrill & Michele

2004-12-20, 3:59 pm


"JS" <someone@somewhere.com> wrote in message
news:e8o1s0t43aemputvbudau75v4rbp0nn8sd@
4ax.com...[color=darkred]
> Sorry, I hasn't been able to create a small program that demonstrates
> the problem.
>

Does the Standard have anything to say about machine epsilon? MPJ


James Giles

2004-12-20, 8:57 pm

Victor Bazarov wrote:
> Merrill & Michele wrote:

....
>
> Which Standard do you mean to ask about? You cross-posted to three
> different language newsgroups. The C++ Standard says that the member
> 'epsilon' of 'std::numeric_limits' returns the difference between 1
> and the least value greater than 1 that is representable. It does not
> say specify the actual value returned from 'epsilon'. The C Standard
> does give the maximum acceptable values of 'FLT_EPSILON', 'DBL_EPSILON'
> and 'LDBL_EPSILON', but your implementation is free to provide its own,
> smaller values. And I have no idea about the Fortran Standard.



Fortran has a generic intrinsic function called EPSILON. The
following code prints the values of the machine epsilon for single
and double precision:

Print *, epsilon(1.0), epsilon(1.0d0)
End

The return value is b^(1-p), where b is the base of the floating point
numbers (usually 2) and p is the number of base-b digits in the significand
(including any hidden normalization digit). For IEEE single, the result
is 2^(-23). For IEEE double, the result is 2^(-52). Note that the result
doesn't depend on the argument's value, only it's type attributes, so
EPSILON(0.0) returns the same result as EPSILON(1.0), since both
arguments are default precision reals. If the implementation were to
support additional real precisions (like quad, or double-extended)
the same inquiry function could be applied to them. Similarly, if
an implementation were to support the proposed IEEE decimal
floating point, the same inquiry function could be applied to those
as well. The result is in the same floating-point precision (and
base) as the argument.

--
J. Giles

"I conclude that there are two ways of constructing a software
design: One way is to make it so simple that there are obviously
no deficiencies and the other way is to make it so complicated
that there are no obvious deficiencies." -- C. A. R. Hoare


Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2009 codecomments.com