For Programmers: Free Programming Magazines  


Home > Archive > Unix Programming > March 2006 > removing a loop cause it to go at half the speed?









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author removing a loop cause it to go at half the speed?
tom fredriksen

2006-03-15, 6:58 pm

Hi

I was doing a simple test of the speed of a "maths" operation and when I
tested it I found that removing the loop that initialises the data array
for the operation caused the whole program to spend twice the time to
complete. If the loop is included it takes about 7.48 seconds to
complete, but when removed it takes about 11.48 seconds.

Does anybody have a suggestion as to why this is so and whether I can
trust the results of the code as it is below?

(I tried declaring total and data as unsigned, because of time taken on
overflows as someone suggested, but it did not affect it)

/tom


The code was compiled on linux 2.6.3 with gcc 3.3.2 and glibc 2.2.3
compiler arguments are "-O2 -Wall -D_LARGEFILE64_SOURCE -std=gnu99"

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main(int argc, char *argv[])
{
int total = 0;
int count = 65500;
signed int data[count];

/* Initialising array loop */
for(int c=0; c<count; c++) {
data[c]=1+(int) (2000000000.0*rand()/(RAND_MAX+1.0));
}

struct timeval start_time;
struct timeval end_time;

gettimeofday(&start_time, NULL);

for(int d=0; d<50000; d++) {
for(int c=0; c<count; c++) {
total += data[c];
}
}
gettimeofday(&end_time, NULL);

double t1=(start_time.tv_sec*1000)+(start_time.tv_usec/1000.0);
double t2=(end_time.tv_sec*1000)+(end_time.tv_usec/1000.0);

printf("Elapsed time (ms): %.6lf\n", t2-t1);
printf("Total: %u\n", total);

return(0);
}
Bjorn Reese

2006-03-15, 6:58 pm

tom fredriksen wrote:

> tested it I found that removing the loop that initialises the data array
> for the operation caused the whole program to spend twice the time to
> complete. If the loop is included it takes about 7.48 seconds to
> complete, but when removed it takes about 11.48 seconds.
>
> Does anybody have a suggestion as to why this is so and whether I can
> trust the results of the code as it is below?

[...]
> gettimeofday(&start_time, NULL);


You are measuring absolute time, not the time your program is using the
CPU, so everything else that is running while you measure is included in
the number; virtual memory swapping is also included.

Use getrusage() instead of gettimeofday().

--
mail1dotstofanetdotdk
tom fredriksen

2006-03-15, 6:58 pm

Bjorn Reese wrote:
> tom fredriksen wrote:
>
> [...]
>
> You are measuring absolute time, not the time your program is using the
> CPU, so everything else that is running while you measure is included in
> the number; virtual memory swapping is also included.
>
> Use getrusage() instead of gettimeofday().


I see your point, but its not really a problem, even if I set the
priority to the highest the difference is almost the same.

/tom
moi

2006-03-15, 6:58 pm

tom fredriksen wrote:
> Hi
>
> I was doing a simple test of the speed of a "maths" operation and when I
> tested it I found that removing the loop that initialises the data array
> for the operation caused the whole program to spend twice the time to
> complete. If the loop is included it takes about 7.48 seconds to
> complete, but when removed it takes about 11.48 seconds.
>
> Does anybody have a suggestion as to why this is so and whether I can
> trust the results of the code as it is below?
>


My guess would be: your initialisation fills the memory cache.
(all your data will probably fit into the L2 cache)
Without the initialisation, 1/4 (?) of the array-accesses will result in
a cache miss, and the processor will have to wait for the bus to grab
the data from RAM. (you could increase the arraysize to test this
hypothesis)

HTH,
AvK
tom fredriksen

2006-03-15, 6:58 pm

moi wrote:
> tom fredriksen wrote:
>
> My guess would be: your initialisation fills the memory cache.
> (all your data will probably fit into the L2 cache)
> Without the initialisation, 1/4 (?) of the array-accesses will result in
> a cache miss, and the processor will have to wait for the bus to grab
> the data from RAM. (you could increase the arraysize to test this
> hypothesis)


I think I figured it out, it seems that the array initialisation
statement contains float numbers, causing the entire program to be
optimised with float operations instead of integer operations as
in_cksum should be. I tested it with removing the offending statment,
ith then gave the same results as if I removed the loop.

Does anybody have a suggestion as to how to make the rand() statement
operate on integers only?

for(int c=0; c<count; c++) {
/* data[c]=1.0+(unsigned int) (2000000000.0*rand()/(RAND_MAX+1.0)); */
data2[c] = data[c];
}

(I used the data2 array at the end of the program to make usre it was
not optimised away.)

/tom
moi

2006-03-15, 6:58 pm

moi wrote:

BTW: A similar effect could also be caused by the MMU, which has to
map in the pages (probably COW from /dev/zero)

BTW2: I would not allocate a big array like that on the stack, but
that's a matter of taste, I guess.

AvK
moi

2006-03-15, 6:58 pm

tom fredriksen wrote:
> moi wrote:


>
> I think I figured it out, it seems that the array initialisation
> statement contains float numbers, causing the entire program to be
> optimised with float operations instead of integer operations as
> in_cksum should be. I tested it with removing the offending statment,
> ith then gave the same results as if I removed the loop.
>
> Does anybody have a suggestion as to how to make the rand() statement
> operate on integers only?


You are .
man 3 rand

AvK
tom fredriksen

2006-03-15, 6:58 pm

moi wrote:
> tom fredriksen wrote:
>
>
> You are .
> man 3 rand


I know, but gcc outputs code with 90% float instructions because of the
rand statement. Or is it just using integer instructions to do integer
operations?

/tom
tom fredriksen

2006-03-16, 3:57 am

tom fredriksen wrote:
> moi wrote:
>
> I know, but gcc outputs code with 90% float instructions because of the
> rand statement. Or is it just using integer instructions to do integer
> operations?


Sorry, meant float instructions to to integer operations

/tom
Nils O. Selåsdal

2006-03-16, 7:56 am

tom fredriksen wrote:
> Bjorn Reese wrote:
>
> I see your point, but its not really a problem, even if I set the
> priority to the highest the difference is almost the same.

Try measuring without compiling with optimizations. Who knows
what optimizations the compiler applies in the different
situations for the different code. It might eliminate loops
entierly in one of the situations, or unroll them.
P at the assembly.
Maxim Yegorushkin

2006-03-16, 7:56 am


Nils O. Sel=E5sdal wrote:

[]

> Try measuring without compiling with optimizations. Who knows
> what optimizations the compiler applies in the different
> situations for the different code. It might eliminate loops
> entierly in one of the situations, or unroll them.
> P at the assembly.


I tryed running his code with one modification: I made the
initialization loop run only when argc > 1. Having done that it ran
faster without the initialization loop, rather than with. I used gcc
4=2E0.2.

tom fredriksen

2006-03-16, 7:56 am

Maxim Yegorushkin wrote:
> Nils O. Sel=E5sdal wrote:
>=20
> []
>=20
>=20
> I tryed running his code with one modification: I made the
> initialization loop run only when argc > 1. Having done that it ran
> faster without the initialization loop, rather than with. I used gcc
> 4.0.2.


I rewrote the code a bit and had a look at the assembler and found that=20
the rand() statement is causing the problem, with it (test_orig) the=20
compiler produces float operations for the entire program which is=20
processed faster, while with a guaranteed integer operation it did not=20
(test_int), and if both functions are in the program at the same time=20
both produce the quick code while if I remove the call to test_orig() it =

produces the slow code. So the summary is that because of the rand code=20
the compiler optimises the entire program to use float operations...

/tom


#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>


void test_orig()
{
unsigned int total =3D 0;
int count =3D 65500;
unsigned int data[count];
struct timeval start_time;
struct timeval end_time;
int c,d;
double t1, t2;

for(c=3D0; c<count; c++) {
data[c]=3D1.0+(unsigned int) (2000000000.0*rand()/(RAND_MAX+1.0));
}

gettimeofday(&start_time, NULL);

for(d=3D0; d<50000; d++) {
for(c=3D0; c<count; c++) {
total +=3D data[c];
}
}
gettimeofday(&end_time, NULL);

t1=3D(start_time.tv_sec*1000)+(start_time.tv_usec/1000.0);
t2=3D(end_time.tv_sec*1000)+(end_time.tv_usec/1000.0);

printf("Elapsed time (ms): %.6lf\n", t2-t1);
printf("Total: %u\n", total);
}


void test_int()
{
unsigned int total =3D 0;
int count =3D 65500;
unsigned int data[count];
unsigned int data2[count];
struct timeval start_time;
struct timeval end_time;
int c,d;
double t1, t2;
=09
for(c=3D0; c<count; c++) {
data2[c] =3D data[c];
}

=09
gettimeofday(&start_time, NULL);

for(d=3D0; d<50000; d++) {
for(c=3D0; c<count; c++) {
total +=3D data[c];
}
}
gettimeofday(&end_time, NULL);

t1=3D(start_time.tv_sec*1000)+(start_time.tv_usec/1000.0);
t2=3D(end_time.tv_sec*1000)+(end_time.tv_usec/1000.0);

printf("Elapsed time (ms): %.6lf\n", t2-t1);
printf("Total: %u\n", total);

for(c=3D0; c<100; c++) {
printf("data2: %u ", data2[c]);
}
printf("\n");
}



int main(int argc, char *argv[])
{
/* test_orig(); */

test_int();
=09
return(0);
}
Maxim Yegorushkin

2006-03-16, 7:56 am


tom fredriksen wrote:
> Maxim Yegorushkin wrote:
>
> I rewrote the code a bit and had a look at the assembler and found that
> the rand() statement is causing the problem, with it (test_orig) the
> compiler produces float operations for the entire program which is
> processed faster, while with a guaranteed integer operation it did not
> (test_int), and if both functions are in the program at the same time
> both produce the quick code while if I remove the call to test_orig() it
> produces the slow code. So the summary is that because of the rand code
> the compiler optimises the entire program to use float operations...


I think the problem is in the code. It has too many variables known at
compile time so that a compiler tries to optimize it heavily. Try
passing the size of the array and the number of iterations as command
line arguments.

Logan Shaw

2006-03-18, 6:58 pm

moi wrote:
> My guess would be: your initialisation fills the memory cache.
> (all your data will probably fit into the L2 cache)
> Without the initialisation, 1/4 (?) of the array-accesses will result in
> a cache miss, and the processor will have to wait for the bus to grab
> the data from RAM.


I thought about that, and I think it should be a real effect. But I
don't see how the numbers add up: the data is only 256K total (65500
integers, which I will assume are 4 bytes each), and the time difference
is 4 seconds. It shouldn't take 4 seconds to read 256K into the cache;
if it does, that is one really slow computer!

- Logan
moi

2006-03-19, 7:56 am

Logan Shaw wrote:
> moi wrote:
>
>
>
> I thought about that, and I think it should be a real effect. But I
> don't see how the numbers add up: the data is only 256K total (65500
> integers, which I will assume are 4 bytes each), and the time difference
> is 4 seconds. It shouldn't take 4 seconds to read 256K into the cache;
> if it does, that is one really slow computer!
>
> - Logan


Yes, it seems to be a "real" effect.
Also, I misread the original post (which was rather confusing); I
thought it meant that initialisation resulted in *faster* code.
Also, I failed to see the nested loops in the "summing" loop.
(which would mean the initialisation loop is only 50k ins size, and the
adding takes 50k **2. )

I did some testing myself, (NB removed the C++ style declarations, moved
the array into global scope, replaced the getutime by getrusage(),
replaced the compile-time conditional by a run-time one, and added code
for int-only initialisation)


The effect appears to be linux-related. (I tested on linux/x86, and
True64/(both Dec cc && gcc), but I am omitting True64 here, since it is
not affected)


IIRC linux has a special flag in the process-structure, which is set
whenever the process uses the FPU. It allows the context switch to be
cheaper (The FPU registers don't have to be saved), in cases where the
process does*not* dirty the FPU.

So I altered the program a bit, I even initialised using zero or random
ints, but the results still confuse me. (maybe some strange interaction
with the memorycache &| scheduler ?)

Rsesults(I don't care about the Total or integer overflow, just the CPU
times :-):

Linux> cc -O6 cache.c
cache.c: In function `main':
cache.c:41: warning: integer overflow in expression
Linux> ./a.out
NOT Initialising...
Real: 8821.240967 User: 8553.699000 System: 3.000000 Cs: 874
Total= 0
Linux> ./a.out 0
Initialising(zero)...
Real: 43712.779053 User: 41974.618000 System: 16.997000 Cs: 4270
Total= 0
Linux> ./a.out 11
Initialising(int)...
Real: 47870.093994 User: 46008.005000 System: 9.999000 Cs: 4721
Total= -1019967296
Linux> ./a.out f
Initialising(float)...
Real: 45223.701172 User: 43268.422000 System: 38.994000 Cs: 4564
Total= 634642000
Linux>

The changed program:


#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/resource.h>


#define COUNT 65500
int count = COUNT;
int data[COUNT];


double tv2ms(struct timeval *tv);
int main(int argc, char *argv[])
{
int total = 0;
int dd,cc;
double t1,t2;
double u1,u2;
double s1,s2;
long cs1,cs2;
struct timeval start_time;
struct timeval end_time;
struct rusage start_usage;
struct rusage end_usage;
if (argv[1]) {
if (*argv[1] == 'f') { /* Initialising using casts */
printf("Initialising(float)...\n");
for(cc=0; cc<count; cc++) {
data[cc]=1+(int) (2000000000.0*rand()/(RAND_MAX+1.0));
}
}
else if (*argv[1] == '0') { /* Initialising using zero */
printf("Initialising(zero)...\n");
for(cc=0; cc<count; cc++) {
data[cc]= 0;
}
}
else { /* Initialising using ints */
printf("Initialising(int)...\n");
for(cc=0; cc<count; cc++) { /* This may overflow, but I don't
care */
data[cc]=1+(2000000000*rand()/(RAND_MAX+1) );
}
}
}
else {
printf("NOT Initialising...\n");
}
gettimeofday(&start_time, NULL);
getrusage(RUSAGE_SELF, &start_usage);


for(dd=0; dd<50000; dd++) {
for(cc=0; cc<count; cc++) {
total += data[cc];
}
}


gettimeofday(&end_time, NULL);
getrusage(RUSAGE_SELF, &end_usage);


t1 = tv2ms(&start_time);
t2 = tv2ms(&end_time);


u1 = tv2ms(&start_usage.ru_utime);
u2 = tv2ms(&end_usage.ru_utime);


s1 = tv2ms(&start_usage.ru_stime);
s2 = tv2ms(&end_usage.ru_stime);


cs1 = start_usage.ru_nivcsw;
cs2 = end_usage.ru_nivcsw;


printf("Real: %.6lf", (double) t2-t1);
printf(" User: %6lf" , (double) u2-u1);
printf(" System: %6lf" , (double) s2-s1);
printf(" Cs: %ld\n" , cs2-cs1);
printf("Total= %d\n", total);


return 0;
}
double tv2ms(struct timeval *tv)
{
return tv->tv_sec*1000.0 + tv->tv_usec/1000.0;
}

HTH,
AvK


moi

2006-03-19, 7:56 am


Ok, I think I found it.
IMHO it is just the copy on write at work.
For a non-initialised (stack or BSS) array, the loader
(or whoever) just makes it point to a map of /dev/zero
with the COW bit set. The initialisation (the first and only
memory-write) dirties this memory, causing a pagefault, remapping of
memorypage, etc. (I am not sure about the L1/L2 cache, they may be
involved, if they apply to physical pages, which they do ;-)

I added some code to initialise only one entry per memorypage, and the
results were similar to those that initialise with int.

HTH,
AvK

Result:

Linux> ./a.out 0
Initialising(zero)...
Real: 41112.656982 User: 39580.983000 System: 10.999000 Cs: 3981
Total= 0
Linux> ./a.out 1
Initialising(pagesize=4096)...
Real: 44131.698975 User: 42474.543000 System: 11.998000 Cs: 4285
Total= 0
Linux> ./a.out 2
Initialising(2*pagesize=4096)...
Real: 9466.657959 User: 9149.609000 System: 3.000000 Cs: 935
Total= 0


Code snippet:

#include <unistd.h>
long pagesize;
pagesize = sysconf(_SC_PAGESIZE);

else if (*argv[1] == '1') { /* Initialising 1 in every page */
printf("Initialising(pagesize=%ld)...\n", pagesize);
for(cc=0; cc<count; cc += pagesize/sizeof data[0] ) {
data[cc]= 0;
}
}
else if (*argv[1] == '2') { /* Initialising 1 in every 2 pages */
printf("Initialising(2*pagesize=%ld)...\n", pagesize);
for(cc=0; cc<count; cc += 2*pagesize/sizeof data[0] ) {
data[cc]= 0;
}
}
moi

2006-03-26, 7:01 pm

tom fredriksen wrote:
> Maxim Yegorushkin wrote:
>
[color=darkred]
>
> void test_int()
> {
> unsigned int total = 0;
> int count = 65500;
> unsigned int data[count];
> unsigned int data2[count];
> struct timeval start_time;
> struct timeval end_time;
> int c,d;
> double t1, t2;
>
> for(c=0; c<count; c++) {
> data2[c] = data[c];
> }
>
>
> gettimeofday(&start_time, NULL);
>
> for(d=0; d<50000; d++) {
> for(c=0; c<count; c++) {
> total += data[c];
> }
> }

[snip]

Now why are you initialising with uninitialised data ?
Why are you "initialising" data2[], but *using* data[] ?
What are you trying to prove ?

See my post later in this thread. It all seems to happen because your
array is copy-on-write mapped to /dev/zero. Because the
data[] is never written, all its pages map to the same physical address
,so it all fits nicely in your L1 cache, and COW never hits you.

AvK

Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com