Code Comments
Programming Forum and web based access to our favorite programming groups.On Sun, 23 Mar 2008 10:23:07 -0700 (PDT), courpron@gmail.com wrote: >cl /O2 /GL prog.cpp /link /ltcg And there was no speed difference between cl /O2 /GL prog.cpp /link /ltcg and cl /O2 /prog.cpp in this case.
Post Follow-up to this messageOn Sun, 23 Mar 2008 13:22:59 -0500, Razii <DONTwhatevere3e@hotmail.com> wrote: >And there was no speed difference between > >cl /O2 /GL prog.cpp /link /ltcg And the time I am getting for (VC++) is Time smooth(): 5391 ms Time smooth(): 5375 ms Time smooth(): 5453 ms Time smooth(): 5421 ms For Java C:\>java -server Arrays Time smooth(): 5610 ms C:\>java -server Arrays Time smooth(): 5625 ms C:\>java -server Arrays Time smooth(): 5562 ms Note that I am using server VM (-server). The client VM is a bit slower in this case After changing double to int, (vc++ with cl /O2 /GL prog.cpp /link /ltcg) Time smooth(): 1000 ms Time smooth(): 1000 ms Time smooth(): 1000 ms and C:\>java -server Arrays Time smooth(): 2541 ms
Post Follow-up to this messagejason.cipriani@gmail.com wrote: > If you are going for blinding speed be sure to use proper optimization > flags; otherwise you are only calculating benchmarks for poorly > optimized code, which, like most of your benchmarks, is meaningless: > > $ g++ -O0 smooth.cpp > 9884.867 ms > > $ g++ -O3 smooth.cpp > 8791.123 ms > > $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math smooth.cpp > 1207.944 ms GCC's -ffast-math option breaks semantics, so it is not a valid optimization. I would not use it in a comparison against Java unless you know that accuracy is not required for the given task. > $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math -funroll- > loops smooth.cpp > 1084.385 ms > > This is your code slightly modified to use QueryPerformanceCounter for > timing (using Windows). Most of the speed up is a result of -ffast- > math. > > Based on the quality and rigor of the other tests I've seen you do, > I'd guess that you missed a few optimization flags on the Java side as > well. I have altered all implementations of this benchmark to perform progressive blurring from one array to the other and back rather than overwriting the destination array repeatedly. This slightly improved performance in all cases. On my Athlon 64 X2 4400+ (2.2Ghz) running AMD64 Debian Linux, I get best-of-three results: g++: 4.080 g++ -O3 bench.cpp -o bench; time ./bench g++2: 4.080 g++ -O3 bench2.cpp -o bench2; time ./bench2 ocaml: 4.048 ocamlopt bench.ml -o bench; time ./bench java: 4.016 javac Arrays.java; time java Arrays On the same hardware running 32-bit Windows XP Pro, I get: F#: 3.970 fsc -O3 This shows an insignificant difference in performance which, if nothing else, is interesting because the second C++ program uses bounds checking. I was also surprised to see OCaml do so well on this benchmark because it makes no attempt to hoist bounds checks. F# on .NET typically beats all open source compilers on Linux on this kind of benchmark (float array intensive). C# presumably would too. The bench2 program is the following, more idiomatic, C++ version that uses bounds checking: #include <vector> #include <iostream> #include <cstdlib> #include <ctime> void smooth (std::vector<double> &dest, std::vector<double> &src) { for (int i = 1; i < src.size() - 1; i++) dest.at(i - 1) = (src.at(i - 1) + src.at(i) + src.at(i + 1)) / 3 ; } void fill (std::vector<double> &src) { srand((unsigned)std::time(0)); for (int i = 0; i < src.size(); ++ i ) src.at(i) = rand(); } int main() { const int len = 50000; std::vector<double> src_array(len); std::vector<double> dest_array(len); fill(src_array); clock_t start=clock(); for (int i = 0; i < 5000; i++) { smooth(dest_array, src_array); smooth(src_array, dest_array); } clock_t endt=clock(); std::cout <<"Time smooth(): " << double(endt-start)/CLOCKS_PER_SEC * 1000 << " ms\n"; return 0; } Here is my OCaml equivalent: let smooth dst src = let n = Array.length dst in for i=1 to n-2 do dst.(i-1) <- (src.(i-1) +. src.(i) +. src.(i+1)) /. 3. done let () = let n = 50000 in let src = Array.init n (fun _ -> Random.float 1.) in let dst = Array.create n 0. in let t = Sys.time() in for i=1 to 5000 do smooth dst src; smooth src dst; done; Printf.printf "Time smooth(): %fs\n" (Sys.time() -. t) -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/products/?u
Post Follow-up to this messagejason.cipriani@gmail.com wrote: > If you are going for blinding speed be sure to use proper optimization > flags; otherwise you are only calculating benchmarks for poorly > optimized code, which, like most of your benchmarks, is meaningless: > > $ g++ -O0 smooth.cpp > 9884.867 ms > > $ g++ -O3 smooth.cpp > 8791.123 ms > > $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math smooth.cpp > 1207.944 ms GCC's -ffast-math option breaks semantics, so it is not a valid optimization. I would not use it in a comparison against Java unless you know that accuracy is not required for the given task. > $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math -funroll- > loops smooth.cpp > 1084.385 ms > > This is your code slightly modified to use QueryPerformanceCounter for > timing (using Windows). Most of the speed up is a result of -ffast- > math. > > Based on the quality and rigor of the other tests I've seen you do, > I'd guess that you missed a few optimization flags on the Java side as > well. I have altered all implementations of this benchmark to perform progressive blurring from one array to the other and back rather than overwriting the destination array repeatedly. This slightly improved performance in all cases. On my Athlon 64 X2 4400+ (2.2Ghz) running AMD64 Debian Linux, I get best-of-three results: g++: 4.080 g++ -O3 bench.cpp -o bench; time ./bench g++2: 4.080 g++ -O3 bench2.cpp -o bench2; time ./bench2 ocaml: 4.048 ocamlopt bench.ml -o bench; time ./bench java: 4.016 javac Arrays.java; time java Arrays On the same hardware running 32-bit Windows XP Pro, I get: F#: 3.970 fsc -O3 This shows an insignificant difference in performance which, if nothing else, is interesting because the second C++ program uses bounds checking. I was also surprised to see OCaml do so well on this benchmark because it makes no attempt to hoist bounds checks. F# on .NET typically beats all open source compilers on Linux on this kind of benchmark (float array intensive). C# presumably would too. The bench2 program is the following, more idiomatic, C++ version that uses bounds checking: #include <vector> #include <iostream> #include <cstdlib> #include <ctime> void smooth (std::vector<double> &dest, std::vector<double> &src) { for (int i = 1; i < src.size() - 1; i++) dest.at(i - 1) = (src.at(i - 1) + src.at(i) + src.at(i + 1)) / 3 ; } void fill (std::vector<double> &src) { srand((unsigned)std::time(0)); for (int i = 0; i < src.size(); ++ i ) src.at(i) = rand(); } int main() { const int len = 50000; std::vector<double> src_array(len); std::vector<double> dest_array(len); fill(src_array); clock_t start=clock(); for (int i = 0; i < 5000; i++) { smooth(dest_array, src_array); smooth(src_array, dest_array); } clock_t endt=clock(); std::cout <<"Time smooth(): " << double(endt-start)/CLOCKS_PER_SEC * 1000 << " ms\n"; return 0; } Here is my OCaml equivalent: let smooth dst src = let n = Array.length dst in for i=1 to n-2 do dst.(i-1) <- (src.(i-1) +. src.(i) +. src.(i+1)) /. 3. done let () = let n = 50000 in let src = Array.init n (fun _ -> Random.float 1.) in let dst = Array.create n 0. in let t = Sys.time() in for i=1 to 5000 do smooth dst src; smooth src dst; done; Printf.printf "Time smooth(): %fs\n" (Sys.time() -. t) -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/products/?u
Post Follow-up to this messageOn Mar 23, 6:54=A0pm, Razii <DONTwhatever...@hotmail.com> wrote:
> On Sun, 23 Mar 2008 10:29:44 -0700 (PDT), James Kanze
>
> <james.ka...@gmail.com> wrote:
>
>
> Post an example that I can test where aliasing is a problem. In the
> example that I posted, c++ was faster.
The smooth function must not be inlined. You can control this inlining
property in C++ but not in java afaik. Then, you can't directly
compare the performances of both language this way if you disable
inlining just for one. But I can show you the difference between 2 C++
programs, one caring about pointer aliasing and one applying no
aliasing optimization, by following those steps :
1. we will use a simpler version of the function smooth ( of course
the name "smooth" is now meaningless ).
2. we will make it not inlined in visual C++ to disable static
analysis of pointer aliasing
3. we won't pass the length as a parameter to the smooth, but use it
as a global variable. The compiler will do loop unrolling and caching
array access will be easier.
4. We must make a dummy call to the smooth function with dest_array as
an argument for both the first and the second parameter of foo. Global
analysis then won't detect that all calls to foo have no aliasing
issue.
Here is the modified version of your program :
#include <iostream>
#include <ctime>
//#define NO_ALIASING_OPTIMIZATION
const int len =3D 50000;
__declspec(noinline)
#ifdef NO_ALIASING_OPTIMIZATION
void smooth (int* dest, int * src )
#else
void smooth (int* __restrict dest, int * __restrict src )
#endif
{
for ( int i =3D 0 ; i < len - 2 ; i++ )
dest[ i ] =3D src[ i ] + src[ i + 1 ] + src[ i + 2 ];
}
void fill (int* src)
{
for (int i =3D 0 ; i < len ; ++ i )
src[i] =3D i;
}
int main()
{
int src_array [len] =3D {0} ;
int dest_array [len] =3D {0};
fill(src_array);
smooth (dest_array, dest_array); // dummy call
clock_t start=3Dclock();
for (int i =3D 0; i < 10000; i++)
smooth (dest_array, src_array);
clock_t endt=3Dclock();
std::cout <<"Time smooth(): " <<
double(endt-start)/CLOCKS_PER_SEC * 1000 << " ms\n";
// doesn't work without the following cout on vc++
std::cout << dest_array [0] ;
return 0;
}
Uncomment the line :
//#define NO_ALIASING_OPTIMIZATION
to see the performance gain with the no aliasing optimization.
Alexandre Courpron.
Post Follow-up to this messageOn Mar 23, 9:41=A0pm, courp...@gmail.com wrote:
> On Mar 23, 6:54=A0pm, Razii <DONTwhatever...@hotmail.com> wrote:
>
>
>
>
>
> The smooth function must not be inlined. You can control this inlining
> property in C++ but not in java afaik. Then, you can't directly
> compare the performances of both language this way if you disable
> inlining just for one. But I can show you the difference between 2 C++
> programs, one caring about pointer aliasing and one applying no
> aliasing optimization, by following those steps :
>
> 1. we will use a simpler version of the function smooth ( of course
> the name "smooth" is now meaningless ).
> 2. we will make it not inlined in visual C++ to disable static
> analysis of pointer aliasing
> 3. we won't pass the length as a parameter to the smooth, but use it
> as a global variable. The compiler will do loop unrolling and caching
> array access will be easier.
> 4. We must make a dummy call to the smooth function with dest_array as
> an argument for both the first and the second parameter of foo. Global
> analysis then won't detect that all calls to foo have no aliasing
> issue.
>
> Here is the modified version of your program :
>
> #include <iostream>
> #include <ctime>
>
> //#define NO_ALIASING_OPTIMIZATION
>
> const int len =3D 50000;
>
> __declspec(noinline)
> #ifdef NO_ALIASING_OPTIMIZATION
> void smooth (int* =A0 dest, =A0int * =A0src )
> #else
> void smooth (int* =A0__restrict dest, =A0int * __restrict src )
> #endif
> {
> =A0 =A0 for ( int i =3D 0 ; i < len - 2 ; i++ )
> =A0 =A0 =A0 =A0dest[ i ] =3D src[ i ] + src[ i + 1 ] + src[ i + 2 ];
>
> }
>
> void fill (int* src)
> {
> =A0 =A0 for (int i =3D 0 ; i < len ; ++ i )
> =A0 =A0 =A0 =A0 src[i] =3D i;
>
> }
>
> int main()
> {
>
> int src_array [len] =3D {0} ;
> int dest_array [len] =3D {0};
>
> fill(src_array);
>
> smooth (dest_array, dest_array); // dummy call
>
> clock_t start=3Dclock();
>
> for (int i =3D 0; i < 10000; i++)
> smooth (dest_array, src_array);
>
> clock_t endt=3Dclock();
>
> =A0 =A0 =A0 =A0 std::cout <<"Time smooth(): " <<
> =A0 =A0 =A0 =A0 =A0 double(endt-start)/CLOCKS_PER_SEC * 1000 << " ms\n";
>
> // doesn't work without the following cout on vc++
> std::cout << dest_array [0] ;
>
> return 0;
>
> }
>
> Uncomment the line :
> //#define NO_ALIASING_OPTIMIZATION
> to see the performance gain with the no aliasing optimization.
>
> Alexandre Courpron.
There is one mistake in the above code, in the smooth function:
instead of :
#ifdef NO_ALIASING_OPTIMIZATION
it should be :
#ifndef NO_ALIASING_OPTIMIZATION
the whole smooth function then should be :
__declspec(noinline)
#ifndef NO_ALIASING_OPTIMIZATION
void smooth (int* dest, int * src )
#else
void smooth (int* __restrict dest, int * __restrict src )
#endif
{
for ( int i =3D 0 ; i < len - 2 ; i++ )
dest[ i ] =3D src[ i ] + src[ i + 1 ] + src[ i + 2 ];
}
Alexandre Courpron.
Post Follow-up to this messageOn Mar 23, 4:41 pm, courp...@gmail.com wrote: > The smooth function must not be inlined. You can control this inlining > property in C++ but not in java afaik. Then, you can't directly > compare the performances of both language this way if you disable > inlining just for one. But I can show you the difference between 2 C++ > programs, one caring about pointer aliasing and one applying no > aliasingoptimization, by following those steps : What version of GCC are you using? I had already tried this earlier on the original post; the version with and without __restrict not only produced identical times, but generated identical code as well (diff'ing objdump disassembly). I am using MinGW GCC 3.4.5; it's could be the case that this version ignores __restrict. I also had tested a version that inlined smooth(), and used the same array for input and output -- using the first half for output, with index offsets inside the loop, something to the effect of: double buf[len * 2]; int i, r; copy(src, src + len, buf + len); for (r = 0; r < 10000; ++ r) for (i = 1 ; i < len - 1 ; i++ ) buf[ i - 1 ] = (buf[ len + i - 1 ] + buf[ len + i ] + buf[ len+ i + 1 ]) / 3 ; It gave identical timings as well. Jason
Post Follow-up to this messageIn article < MIednRYljpBPQ3vanZ2dnUVZ_rLinZ2d@comcast .com>, andreytarasevich@hotmail.com says... [ ... ] > I don't really see much sense in comparing Java performance with C++ > performance. What might be more interesting is the effect of the 'restrict ' > specifier supported by C99 compilers (and many C89/90 and C++ compielers a s an > extension), which is intended to assist compiler in performing exactly thi s kind > of optimizations. Of, if you're really interested in C++, you could throw in a comparison use a valarray, which was designed to give the same sort of assurance. Unfortunately, I don't know of anybody who seems to have gone to much (if any) trouble to optimize valarray at all -- rather the contrary, it was an idea that even its own creator admits was in the wrong place at the wrong time, so it's ignored almost to death, so to speak. -- Later, Jerry. The universe is a figment of its own imagination.
Post Follow-up to this messageOn Mar 23, 10:28=A0pm, "jason.cipri...@gmail.com"
<jason.cipri...@gmail.com> wrote:
> On Mar 23, 4:41 pm, courp...@gmail.com wrote:
>
>
> What version of GCC are you using? I had already tried this earlier on
> the original post; the version with and without __restrict not only
> produced identical times, but generated identical code as well
> (diff'ing objdump disassembly). I am using MinGW GCC 3.4.5; it's could
> be the case that this version ignores __restrict.
>
> I also had tested a version that inlined smooth(), and used the same
> array for input and output -- using the first half for output, with
> index offsets inside the loop, something to the effect of:
>
> =A0 double buf[len * 2];
> =A0 int i, r;
> =A0 copy(src, src + len, buf + len);
>
> =A0 for (r =3D 0; r < 10000; ++ r)
> =A0 =A0 for (i =3D 1 ; i < len - 1 ; i++ )
> =A0 =A0 =A0 buf[ i - 1 ] =3D (buf[ len + i - 1 ] +
> =A0 =A0 =A0 =A0 =A0 =A0 =A0buf[ len + i ] + buf[ len+ i + 1 ]) / 3 ;
>
> It gave identical timings as well.
>
> Jason
You must make sure that :
- the smooth function is not inlined
- the compiler applies loop unrolling, to make the caching process
easier.
I quickly test the modified program I gave earlier, with GCC 4.3 .
Apparently, GCC has difficulty to do partial loop unrolling, I then
reduced the number of iterations in the smooth function. I might
investigate this issue tomorrow. Anyway, here is the program for GCC
that should give different results depending on the use of the
"restrict" keyword :
#include <iostream>
#include <ctime>
//#define NO_ALIASING_OPTIMIZATION
const int len =3D 20;
__attribute__((noinline))
#ifndef NO_ALIASING_OPTIMIZATION
void smooth (int* dest, int * src )
#else
void smooth (int* __restrict dest, int * __restrict src )
#endif
{
for ( int i =3D 0 ; i < 17 ; ++i )
dest[ i ] =3D src[ i ] + src[ i + 1 ] + src[ i + 2 ];
}
void fill (int* src)
{
for (int i =3D 0 ; i < len ; ++ i )
src[i] =3D i;
}
int main()
{
int src_array [len] =3D {0} ;
int dest_array [len] =3D {0};
fill(src_array);
smooth (dest_array, dest_array); // dummy call
clock_t start=3Dclock();
for (int i =3D 0; i < 100000000; i++)
smooth (dest_array, src_array);
clock_t endt=3Dclock();
std::cout <<"Time smooth(): " <<
double(endt-start)/CLOCKS_PER_SEC * 1000 << " ms\n";
// doesn't work without the following cout on vc++
std::cout << dest_array [0] ;
return 0;
}
Uncommenting the line
//#define NO_ALIASING_OPTIMIZATION
should make the compiler apply no aliasing optimizations.
Alexandre Courpron.
Post Follow-up to this messageRazii wrote: > On Sun, 23 Mar 2008 15:40:44 +0000, Jon Harrop <usenet@jdh30.plus.com> > wrote: > > > Make sure you are using server VM to run it ... I am. > that is, java -server Arrays On my machine it is the default. -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/products/?u
Post Follow-up to this messagePowered by vBulletin
Copyright 2000-2006 Jelsoft Enterprises Limited.