Code Comments
On Mar 23, 11:12 am, Razii <DONTwhatever...@hotmail.com> wrote:
> I said
> nothing about flags in the last post.

As was pointed out in your other thread, these are important details
that you should not leave out; especially in this particular benchmark
(which is not I/O bound like the last one, and is heavily affected by
compiler optimizations).

Jason
<jason.cipriani@gmail.com> wrote in message
news:0d80b694-8609-4f80-9aeb-e55bcae9a8b0@s8g2000prg.googlegroups.com...
> (I am sorry my point
> wasn't clear, I had meant to show that the compiler can generate very
> fast code for you, in particular -ffast-math [which does not target
> specific hardware features] on GCC more so than the
> architecture-specific options).

Here are some key missing timings with GCC and VS, Razii:

$ g++ -O2 smooth.cpp
8677.623 ms
$ g++ -O2 -ffast-math smooth.cpp
1077.232 ms
$ g++ -O2 -ffast-math -funroll-loops smooth.cpp
919.622 ms

With CL 14 (VS2005):
7275.611 ms

No platform-specific optimizations were used. Note the order of
magnitude speed up using -ffast-math and -funroll-loops with GCC, which
generates code that can still be run on the least-common-denominator of
Intel-compatible platforms.

Also note that -O2 provides slightly better performance than -O3. Just
goes to show why I shouldn't be using -O3, I guess :-)

Jason
On Mar 23, 1:18 pm, "jason.cipri...@gmail.com" <jason.cipri...@gmail.com> wrote:
> Also note that -O2 provides slightly better performance than -O3. Just
> goes to show why I shouldn't be using -O3, I guess :-)

I take that back, I got it down to 903ms with -O3 over -O2. Perhaps my
machine was just in a bad mood yesterday.

I guess I shouldn't post in this thread for a while, I'm getting a
little uncomfortable with my post ratio here. :-(

Jason
On Mar 23, 5:00 pm, Razii <DONTwhatever...@hotmail.com> wrote:
> On Sun, 23 Mar 2008 15:45:21 GMT, red floyd <no.s...@here.dude> wrote:
>
> I used the proper flag /O2 in vc++. Also, when you are deploying a
> commercial software, you will have to use flags that target the
> least-common-denominator processor. That's a disadvantage of C++ vs
> JIT language. The JIT compiler knows what processor it is running on,
> and can generate code specifically for that processor. Thus, I won't
> use anything other than /O2 for c++... because, as I said, when you
> are deploying a commercial software, you will have to use flags that
> target the least-common-denominator processor anyway.

You are using Visual C++. You should use at least the following flags,
in order to enable whole program optimization and link-time code
generation:

cl /O2 /GL prog.cpp /link /ltcg

You should make again all your benchmarks with at least those options
enabled.

By the way, as I said earlier, you should use ints instead of doubles.
It will save you from a lot of troubles.

Alexandre Courpron.
On Mar 23, 1:18 pm, "jason.cipri...@gmail.com" <jason.cipri...@gmail.com> wrote:
> With CL 14 (VS2005):
> 7275.611 ms

With /O2 and LTCG enabled!
On 23 mar, 09:58, "jason.cipri...@gmail.com" <jason.cipri...@gmail.com> wrote:
> Also, wrt James' original post: / 3 ;
> I am not sure what you would expect in either language. I
> believe that James had been expecting it to cache [i+1] in an
> fpu register, and use that instead of accessing the value the
> next time through.

Exactly. It's a very common optimization.

> As it stands right now, VS at least (I didn't check GCC)
> generates assembler instructions that are more equivalent to
> this:
>   fpu_register = src[i - 1]
>   fpu_register += src[i]
>   fpu_register += src[i + 1]
>   fpu_register /= 3
>   dest[i - 1] = fpu_register
> Using fld, fadd, fdiv, and fstp on Intel machines. It never
> loads src[i + 1] anyways. I have not tested this or done any
> research, but I suspect this is still a bit faster than:
>   fpu_register = src[i - 1]
>   fpu_register += other_fpu_register
>   other_fpu_register = src[i + 1]
>   fpu_register += other_fpu_register
>   fpu_register /= 3
>   dest[i - 1] = fpu_register

The fastest solution would preload the first two values, and only read
one new value each time through the loop. (Don't forget that memory
bandwidth will be the limiting factor for this type of loop.) Intel's
stack architecture may make optimizing this somewhat more difficult,
but it should still be possible. You'll end up with more instructions,
but fewer memory accesses and better run time.

But of course, you can't do this in C++, because the write to
dest[ i - 1 ] might modify one of the values you're holding in a
register. (Of course, some compilers do generate two versions of the
loop, with code which checks for aliasing first, and uses one or the
other, depending on whether aliasing is present or not. But this isn't
the usual case.)
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
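The "read one new value each time through the loop" idea described above can be sketched by hand in C++. This is only an illustration, not code from the thread: the name smooth_window is hypothetical, and the function is only correct under the assumption that dest and src do not overlap, which is exactly the assumption a compiler is not allowed to make on its own.

```cpp
#include <cstddef>

// Hypothetical name, not from the thread.
// Assumption: dest and src do not overlap.
void smooth_window(double *dest, double const *src, std::size_t len) {
    if (len < 3) return;            // nothing to smooth
    double prev = src[0];           // preload the first two values
    double cur  = src[1];
    for (std::size_t i = 1; i < len - 1; ++i) {
        double next = src[i + 1];   // the only read from src per iteration
        dest[i - 1] = (prev + cur + next) / 3;
        prev = cur;                 // slide the window by one element
        cur  = next;
    }
}
```

The additions are performed in the same order as in the original loop, so for non-overlapping arrays the results should match it. Whether this is actually faster depends on the compiler and the hardware; no timing is claimed here.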
On Sun, 23 Mar 2008 10:29:44 -0700 (PDT), James Kanze <james.kanze@gmail.com> wrote:
>
>Exactly. It's a very common optimization.

Post an example that I can test where aliasing is a problem. In the
example that I posted, c++ was faster.
On Sun, 23 Mar 2008 10:23:07 -0700 (PDT), courpron@gmail.com wrote:
>You should make again all your benchmarks with at least those options
>enabled
What other benchmarks? :) I only posted one IO example. This
post was not a benchmark but a question to Kanze to prove by posting
an example (that I can test) where aliasing in c++ makes optimizing
arrays hard (or impossible). With the example that he gave (which I used),
for ( size_t i = 1 ; i < len - 1 ; ++ i ) {
dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
}
c++ was in fact faster. Obviously aliasing was no issue in this case,
as he claimed.
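For readers who want something testable: the sketch below does not measure speed, it only shows why the language forces src to be re-read. When the two pointers overlap, the loop from the thread gives a different answer than it does on separate arrays, so a compiler that kept src values in registers would change the result. The wrapper smooth and the helper aliasing_changes_result are hypothetical names chosen for this illustration.

```cpp
#include <cstddef>

// The loop from the thread, wrapped in a function (the name is an assumption).
void smooth(double *dest, double const *src, std::size_t len) {
    for (std::size_t i = 1; i < len - 1; ++i)
        dest[i - 1] = (src[i - 1] + src[i] + src[i + 1]) / 3;
}

// Hypothetical helper: true when the overlapping call and the
// non-overlapping call produce different outputs.
bool aliasing_changes_result() {
    double const pristine[5] = {27, 0, 0, 0, 0};
    double separate[3]       = {0, 0, 0};
    double buf[5]            = {27, 0, 0, 0, 0};

    smooth(separate, pristine, 5);  // no overlap: separate = 9, 0, 0
    smooth(buf + 1, buf, 5);        // dest[i - 1] is src[i]: buf[1..3] = 9, 3, 1

    return separate[1] != buf[2];   // 0 versus 3
}
```

In the overlapping call each store lands on the element that the next iteration reads as src[i - 1], which is the situation described earlier in the thread.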
On Mar 23, 11:40 am, Jon Harrop <use...@jdh30.plus.com> wrote:
>
> GCC's -ffast-math option breaks semantics, so it is not a valid optimization.
Only sometimes; and it's a valid optimization. Specifically, in this
case, the results are identical. Mostly, in my experience, you start
to lose precision with -ffast-math when you start doing things beyond
simple arithmetic, such as sqrt() and cos(), or when you get into the
realm of overflows and NaNs.
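One concrete way results can change: -ffast-math permits the compiler to reassociate floating-point additions, and floating-point addition is not associative. The sketch below uses two hypothetical helpers and assumes plain IEEE double arithmetic (for example SSE2 code on x86-64, compiled without -ffast-math); with x87 extended precision the intermediate values can differ.

```cpp
// Hypothetical helpers: the same three values, summed in two different orders.
double sum_left(double a, double b, double c)  { return (a + b) + c; }
double sum_right(double a, double b, double c) { return a + (b + c); }
```

With a = 1e16, b = -1e16 and c = 1.0, sum_left gives 1 because the two large values cancel first, while sum_right gives 0 because -1e16 + 1 is not representable as a double and rounds back to -1e16. The smoothing loop in this thread adds three values of similar magnitude, so the effect is far less dramatic there.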
In case anybody is curious, the Intel compiler yields similar results
to VS, and to GCC with SSE3 enabled (but no -ffast-math), which is the
expected result:
icl /Ox /QxP /Qipo /Qunroll-aggressive smooth.cpp
That took about 7400 ms for me. With:
icl /Ox /QxP /Qipo /Qprec-div- /Qunroll-aggressive smooth.cpp
it dropped to 1100 ms (ICC's /Qprec-div- is similar in spirit to
GCC's -ffast-math).
Following are 3 source files and a Makefile. I used MinGW GCC 3.4.5;
you will want to implement your own tick()/tock() functions; the
windows.h #include is only for those. The output, for me, is:
$ ./smooth.exe
no -ffast-math: 8796.27
-ffast-math: 923.052
1e-014
delta: 0
they are precisely equal.
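For anyone not on Windows, a portable pair of tick()/tock() functions might look like the sketch below. It is an assumption-laden stand-in, not the code used for the numbers above: it needs a C++11 compiler for std::chrono, and unlike the original, tock() also returns the elapsed milliseconds.

```cpp
#include <chrono>
#include <iostream>
#include <string>

// Portable stand-ins for the Windows-only tick()/tock(); assumes C++11.
static std::chrono::steady_clock::time_point s_tick;

// Record the start time.
void tick() {
    s_tick = std::chrono::steady_clock::now();
}

// Print and return the milliseconds elapsed since the last tick().
double tock(const std::string &msg) {
    std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - s_tick;
    std::cout << msg << ": " << elapsed.count() << std::endl;
    return elapsed.count();
}
```

steady_clock is used rather than system_clock because it is monotonic, so an adjustment of the wall clock during a run cannot produce a negative interval.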
==== Makefile ====
CFLAGS = -O2 -funroll-loops
.PHONY: clean
smooth.exe: smooth_main.cpp smooth_nofm.cpp smooth_fm.o
	g++ $(CFLAGS) smooth_main.cpp smooth_nofm.cpp smooth_fm.o -o $@
smooth_fm.o: smooth_fm.cpp
	g++ $(CFLAGS) -ffast-math -c $<
clean:
	rm -f smooth_fm.o smooth.exe
==== smooth_fm.cpp ====
void smooth_fm (double *dest, double const *src, int len) {
  for (int i = 1 ; i < len - 1 ; i++ )
    dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
}
==== smooth_nofm.cpp ====
void smooth_nofm (double *dest, double const *src, int len) {
  for (int i = 1 ; i < len - 1 ; i++ )
    dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
}
==== smooth_main.cpp ====
#include <algorithm>
#include <cstdlib>   // srand, rand
#include <ctime>
#include <iostream>
#include <string>
#include <windows.h>
using namespace std;

LARGE_INTEGER s_tick;

void smooth_nofm (double *, double const *, int);
void smooth_fm (double *, double const *, int);

// Record the start time. Nothing is returned, so the return type is void.
void tick (void) {
  QueryPerformanceCounter(&s_tick);
}

// Print the milliseconds elapsed since the last tick().
void tock (const string &msg) {
  LARGE_INTEGER now, freq;
  QueryPerformanceCounter(&now);
  QueryPerformanceFrequency(&freq);
  cout << msg << ": " <<
    ((double)(now.QuadPart - s_tick.QuadPart) /
     (double)(freq.QuadPart / 1000LL)) << endl;
}

// Fill src with pseudo-random values.
void fill (double *src, int len ) {
  srand(time(NULL));
  for (int i = 0; i < len; ++ i)
    src[i] = rand();
}

int main () {
  const int len = 50000;
  // static: three arrays of 50000 doubles (about 1.2 MB) can overflow
  // a default 1 MB stack.
  static double src_array1[len];
  static double src_array2[len];
  static double dest_array[len];
  double fm, nofm;

  fill(src_array1, len);
  copy(src_array1, src_array1 + len, src_array2);

  tick();
  for (int i = 0; i < 10000; i++)
    smooth_nofm(dest_array, src_array1, len);
  tock("no -ffast-math");
  nofm = dest_array[0];

  tick();
  for (int i = 0; i < 10000; i++)
    smooth_fm(dest_array, src_array2, len);
  tock("-ffast-math");
  fm = dest_array[0];

  cout << 0.00000000000001 << endl;
  cout << "delta: " << (fm - nofm) << endl;
  if (fm == nofm)
    cout << "they are precisely equal." << endl;
  return 0;
}
==== END ====
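The program above compares only the first output element. A slightly stronger check would compare every element of the two results; the helper below is a hypothetical addition (max_abs_diff is not part of the posted sources) and assumes the two runs write to separate destination arrays.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: largest absolute element-wise difference
// between two arrays of the same length.
double max_abs_diff(double const *a, double const *b, std::size_t len) {
    double worst = 0.0;
    for (std::size_t i = 0; i < len; ++i) {
        double d = std::fabs(a[i] - b[i]);
        if (d > worst) worst = d;
    }
    return worst;
}
```

With two destination arrays, say dest_nofm and dest_fm (names assumed here), calling max_abs_diff(dest_nofm, dest_fm, len - 2) covers exactly the elements the smoothing loop writes; a result of 0 would mean the two builds agree on every element for that input.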
On Mar 23, 2:55 pm, Razii <DONTwhatever...@hotmail.com> wrote:
> [snip]

Razii, you are benchmarking compiler optimization techniques, not
language differences. Again, just as with your I/O hardware benchmarks,
your tests have too many variables in them to be used as simply a
comparison between C++ and Java.

Jason