Code Comments
Programming Forum and web based access to our favorite programming groups.
From an old post by James Kanze
On Apr 9 2003, 5:42 pm, ka...@gabi-soft.de (James Kanze) wrote:
>When I pass an "array" to a function in C/C++, I actually pass
>a pointer to the first element. And the compiler, when it compiles the
>function, only sees pointers -- it has no way of determining any
>relationship between them. Consider, for example, a smoothing function
>(in C or earlier C++):
>
> void
> smooth( double* dest,
> double const* src,
> size_t len )
> {
> for ( size_t i = 1 ; i < len - 1 ; ++ i ) {
> dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
> }
> }
>
>
>The compiler cannot arrange to use the src[ i + 1 ] value from the
>preceding pass through the loop as the src[ i ] value in the current
>pass, since the write to dest[ i - 1 ] might have changed it. In Java,
>it can, since two arrays are either identical or disjoint.
>
>
>This sort of code shows up frequently. In benchmarks from Java
>suppliers comparing Java to C++, of course:-). But also in any number
>of applications dealing with physical measures: meteorology, geophysical
>research (oil prospection), etc.
Out of curiosity, I tried to test if the above is true. It didn't make
any difference. In fact, C++ was a bit faster (not by much, just 6%).
Probably due to array bound check in Java, if there is in indeed an
issue with C++ arrays, overall there is no difference.
==c++==
#include <iostream>
#include <cstdlib>
#include <ctime>
void fill (double* src, int len);
void smooth (double* dest,
double const* src,
int len );
int main()
{
const int len = 50000;
double src_array [len] = {0};
double dest_array [len] = {0};
fill(src_array, len);
clock_t start=clock();
for (int i = 0; i < 10000; i++)
smooth (dest_array, src_array, len);
clock_t endt=clock();
std::cout <<"Time smooth(): " <<
double(endt-start)/CLOCKS_PER_SEC * 1000 << " ms\n";
// doesn't work without the following cout on vc++
std::cout << dest_array [0] ;
return 0;
}
void smooth (double* dest, double const* src, int len )
{
for (int i = 1 ; i < len - 1 ; i++ ) {
dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
}
}
void fill (double* src, int len)
{
srand((unsigned)std::time(0));
for (int i = 0 ; i < len ; ++ i ) {
src[i] = rand();
}
}
==== java ========
import java.util.*;
public class Arrays{
public static void main (String[] arg)
{
final int LEN = 50000;
double[] src_array = new double [LEN];
fill(src_array, LEN);
double[] dest_array = new double [LEN];
long start = System.currentTimeMillis();
for (int i = 0; i < 10000; i++)
smooth(dest_array, src_array, LEN);
long end = System.currentTimeMillis();
System.out.println("Time smooth(): " + (end - start) + " ms");
}
static void smooth (double[] dest, double[] src, int len )
{
for (int i = 1 ; i < len - 1 ; i++ ) {
dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
}
}
static void fill (double[] src, int len)
{
for (int i = 0 ; i < len; i++)
src[i] = Math.random();
}
}
Post Follow-up to this messageOn Mar 23, 4:08 am, Razii <DONTwhatever...@hotmail.com> wrote: > From an old post by James Kanze > > On Apr 9 2003, 5:42 pm, ka...@gabi-soft.de (James Kanze) wrote: > > > > > > > > Out of curiosity, I tried to test if the above is true. It didn't make > any difference. In fact, C++ was a bit faster (not by much, just 6%). > Probably due to array bound check in Java, if there is in indeed an > issue with C++ arrays, overall there is no difference. If you are going for blinding speed be sure to use proper optimization flags; otherwise you are only calculating benchmarks for poorly optimized code, which, like most of your benchmarks, is meaningless: $ g++ -O0 smooth.cpp 9884.867 ms $ g++ -O3 smooth.cpp 8791.123 ms $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math smooth.cpp 1207.944 ms $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math -funroll- loops smooth.cpp 1084.385 ms This is your code slightly modified to use QueryPerformanceCounter for timing (using Windows). Most of the speed up is a result of -ffast- math. Based on the quality and rigor of the other tests I've seen you do, I'd guess that you missed a few optimization flags on the Java side as well. Jason
Post Follow-up to this messageOn Mar 23, 4:35 am, "jason.cipri...@gmail.com" <jason.cipri...@gmail.com> wrote: > If you are going for blinding speed be sure to use proper optimization > flags; otherwise you are only calculating benchmarks for poorly > optimized code, which, like most of your benchmarks, is meaningless: > > $ g++ -O0 smooth.cpp > 9884.867 ms > > $ g++ -O3 smooth.cpp > 8791.123 ms > > $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math smooth.cpp > 1207.944 ms > > $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math -funroll- > loops smooth.cpp > 1084.385 ms I left one out: $ g++ -03 -mfpmath=sse -msse3 smooth.cpp 7379.883 ms Compare to -O3 without having GCC generate SSE instructions for you. Jason
Post Follow-up to this messageAlso, wrt James' original post: > I am not sure what you would expect in either language. I believe that James had been expecting it to cache [i+1] in an fpu register, and use that instead of accessing the value the next time through. As it stands right now, VS at least (I didn't check GCC) generates assembler instructions that are more equivalent to this: fpu_register = src[i - 1] fpu_register += src[i] fpu_register += src[i + 1] fpu_register /= 3 dest[i - 1] = fpu_register Using fld, fadd, fdiv, and fstp on Intel machines. It never loads src[i + 1] anyways. I have not tested this or done any research, but I suspect this is still a bit faster than: fpu_register = src[i - 1] fpu_register += other_fpu_register other_fpu_register = src[i + 1] fpu_register += other_fpu_register fpu_register /= 3 dest[i - 1] = fpu_register Which I believe is what the original expected "optimization" was. I have removed comp.lang.java.programmer from this one, since it is unrelated to Java. Jason
Post Follow-up to this messageOn Sun, 23 Mar 2008 01:35:06 -0700 (PDT), "jason.cipriani@gmail.com" <jason.cipriani@gmail.com> wrote: >If you are going for blinding speed be sure to use proper optimization >flags; otherwise you are only calculating benchmarks for poorly >optimized code, which, like most of your benchmarks, is meaningless: Where did you get that I am not using proper optimizing flag? I said nothing about flags in the last post. >$ g++ -O0 smooth.cpp >9884.867 ms I am using vc++ >I'd guess that you missed a few optimization flags on the Java side as >well. There are no optimization flags for java compiler or vm. Post proof and links.
Post Follow-up to this messageRazii wrote: > On Sun, 23 Mar 2008 01:35:06 -0700 (PDT), "jason.cipriani@gmail.com" > <jason.cipriani@gmail.com> wrote: > > > Where did you get that I am not using proper optimizing flag? I said > nothing about flags in the last post. > > > I am using vc++ > Well, then, you can't make sweeping statements about C++ vs. Java. You can only make statements about *a particular implementation* of C++ vs. Java.
Post Follow-up to this messagered floyd wrote: > Well, then, you can't make sweeping statements about C++ vs. Java. You > can only make statements about *a particular implementation* of C++ vs. > Java. This is one of the procedural flaws common to all benchmarks. Generally speaking, one can only hope for full disclosure of the conditions for a benchmark, and that those conditions approximate some meaningful aspect of one's actual business need. This does not mean that benchmarks are useless, only that one cannot decry them for limitations inherent to the benchmark process itself. -- Lew
Post Follow-up to this messageOn Sun, 23 Mar 2008 15:45:21 GMT, red floyd <no.spam@here.dude> wrote: >Well, then, you can't make sweeping statements about C++ vs. Java. You >can only make statements about *a particular implementation* of C++ vs. >Java. I used the proper flag /O2 in vc++. Also, when you are deploying a commercial software, you will have to use flags that target the least-common-denominator processor. That's a divantage of C++ vs JIT language. The JIT compiler knows what processor it is running on, and can generate code specifically for that processor. Thus, I won't use anything other than /O2 for c++... because, as I said, when you are deploying a commercial software, you will have to use flags that target the least-common-denominator processor anyway.
Post Follow-up to this messageOn Mar 23, 9:08=A0am, Razii <DONTwhatever...@hotmail.com> wrote: > From an old post by James Kanze > > On Apr 9 2003, 5:42 pm, ka...@gabi-soft.de (James Kanze) wrote: > > > > > > i + 1 ]) / 3 ; > > > > Out of curiosity, I tried to test if the above is true. It didn't make > any difference. In fact, C++ was a bit faster (not by much, just 6%). > Probably due to array bound check in Java, if there is in indeed an > issue with C++ arrays, overall there is no difference. The issue C++ has with arrays is known as pointer aliasing. C has the keyword "restrict" to deal with this problem but not C++. The "restrict" keyword should be added in the next C++ standard. Most C++ compilers currently propose their own "restrict" keywords though. Anyway, C++ has std::valarray. Arrays constructed with std::valarray are free from aliasing. So, in the absolute, the Java language has no performance edge over the current version of the C++ language in this domain. However, even if theorically, std:valarray is free from pointer aliasing, in practice, this isn't always the case, depending on the compiler. > if there is in indeed an issue with C++ arrays, overall there is no differ=[/color ] ence. With the right optimization flags, the smooth function would probably get inlined, and static analysis of pointer aliasing should happen, allowing the C++ compiler to perform no aliasing optimizations. > Probably due to array bound check in Java A good java compiler should have applied bounds checking elimination in this case especially since the smooth function should be inlined. As a side note, your program should handle ints instead of doubles, to focus more on the performance of array access. Alexandre Courpron.
Post Follow-up to this messageOn Mar 23, 12:00 pm, Razii <DONTwhatever...@hotmail.com> wrote: > On Sun, 23 Mar 2008 15:45:21 GMT, red floyd <no.s...@here.dude> wrote: > > I used the proper flag /O2 in vc++. Also, when you are deploying a > commercial software, you will have to use flags that target the > least-common-denominator processor. That's a divantage of C++ vs > JIT language. There are three things that are not correct about this: 1) The -ffast-math flag to GCC does not use any special processor instructions. It simply removes certain assumptions, which is safe when you know they aren't true. This is the flag with the major performance increase. Do not confuse "targeting the least-common- denominator processor" with "not using an ideal compiler for the application". Whether you compile your commercial code with GCC or VC+ + is irrelevant to what machine you are targetting. If your goal is to target the least-common-denominator, then you would leave off, for example, the SSE optimizations. However, in my example, it was -ffast- math that was responsible for the largest speedup. Target platform is not relevant. 2) It's more of a consequence of only distributing the least-common- denominator than of using C++. That's a choice you can make. It would be completely reasonable, for example, for a vendor of high- performance research applications to distribute SSE-optimized builds. E.g. Folding@Home (non-commercial, http://folding.stanford.edu) maintains a GPU-optimized version of their software as well as a "normal" one; and their non-GPU client checks for various processor extensions at run-time (such as SSE) and uses optimized code accordingly. This is the same with 32- and 64-bit applications. This is also common in OpenGL application development; where applications can look for specific OpenGL features and take advantage of them, rather than always only targeting the least-common-denominator of graphics hardware. The same thing applies to Java wrt OpenGL. Depending on how you *want* to look at it, an Intel machine with SSE support is just as different from an Intel machine without SSE as it is from a PowerPC with AltiVec (same with multicore machines). It just so happens that you can easily target the least-common-denominator for those Intel machines, but nothing stops you from releasing platform- optimized builds. So it is not a di
vantage of C++ vs. anything; it is a di
vantage of sticking to the last-common-denominator. 3) A JIT language has an extra layer of abstraction before the hardware. This makes it difficult to perform these particular kinds of optimizations, anyways. It's apples and oranges, the Java JIT compiler is a completely different kind of beast than a native code compiler, the optimizations available for a JIT compiler to make and the approaches it can take to making them are too different from those of a native compiler to make a valid comparison. They're just different tools. What you are talking about is a difference between targeting the least- common-denominator of platforms, and targeting specific platforms -- which is an issue present no matter what language you are using. You are not talking about a difference between any two specific languages. Furthermore, what *I* was talking about is a difference between different compilers, targeting the same platform (I am sorry my point wasn't clear, I had meant to show that the compiler can generate very fast code for you, in particular -ffast-math [which does not target specific hardware features] on GCC moreso than the the architecture- specific options). > The JIT compiler knows what processor it is running on, > and can generate code specifically for that processor. True, and a valid point. However, depending on the nature of what you are compiling, you easily could lose a lot of information about the original intention of the code in the original optimizing pass that limits how effective processor-specific optimizations made by the JIT compiler can be. For example (this is a very specific example but please extend it to general cases): let's say you have some Java code that, I don't know, multiplies the elements of two large arrays of floating-point values together in a loop. You compile it to bytecode, and in the process javac unrolls that loop to optimize a bit. Then you run it on a platform that, say, provides a specific instruction for multiplying large arrays of floating-point values together very quickly. If the original compiler had targeted this platform to begin with, it could have seen your access pattern and used that optimized instruction. Instead it unrolled the loop, a least-common-denominator optimization. Now the JIT compiler sees the unrolled loop, and because it does not see the sequential access pattern that was very apparent in the original code, it fails to analyze the unrolled loop and determine that it should be "re-rolled" and use the optimized instruction instead. Now this particular example might not be that great, I know, because you could conceivably construct a JIT compiler that can recognize and re-roll unrolled loops; but the general point is what I said: you can potentially lose a lot of information by compiling code that you can not recover, and this information could be critical to performing the best optimizations off the bat. Consequently, the JIT compiler must turn to other techniques for optimization, such as caching compiled code for faster startup times, etc. Again, it's a whole different beast. > Thus, I won't > use anything other than /O2 for c++... because, as I said, when you > are deploying a commercial software, you will have to use flags that > target the least-common-denominator processor anyway. I addressed that above: Again, 1) some optimizations, like -ffast-math or -funroll-loops (and be careful, of course, you can't blindly use these, you have to make sure they make sense for your code), do not target a specific processor, and 2) there is no rule that says you have to target the least-common-denominator processor; and a heavy duty research application, such as what James Kanze *originally* cited, would most certainly take advantage of special features on the target hardware. Jason
Post Follow-up to this messagePowered by vBulletin
Copyright 2000-2006 Jelsoft Enterprises Limited.