For Programmers: Free Programming Magazines  


Home > Archive > Fortran > November 2007 > Pure curiosity about dot_product timing results









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Pure curiosity about dot_product timing results
jiho

2007-11-23, 7:09 pm

Hello all,

Trying to acquire some solid fortran basics I am doing the 3 days
course from the university of Liverpool:
http://www.liv.ac.uk/HPC/F90CourseEasy3DayHomePage.html
Seems old but solid. Anyway, there's an exercise using dtime to time a
hand-made, do based, dot product and compare it to the intrinsic dot
product function. Of course the expected result is that the intrinsic
function should be faster. However I see just the opposite. Consider
the following code:

program dotProduct

implicit none
integer, parameter :: n=2000000
real :: A(n),B(n)
real :: x(2), time1, time2, time3, time4, dotProd=0
integer :: i

call random_number(A)
call random_number(B)

! dtime reports the time from the start of the program
time1 = dtime(x)

! My hand made dot product
do i = 1, size(A), 1
dotProd = dotProd + A(i)*B(i)
end do
time2 = dtime(x)

! Sum based dot product
dotProd = sum(A*B)
time3 = dtime(x)

! dot_product intrinsic
dotProd = dot_product(A,B)
time4 = dtime(x)

print*, "Mine:", time2-time1, dotProd
print*, "Sum based:", time3-time2, dotProd
print*, "Intrinsic:",time4-time3, dotProd

! Not really what's expected: my own crappy computation works faster
than the dot_product intrinsic
end program dotProduct

The times I get are (reliably):
$ ./a.out
Mine: 9.5649958E-03 499762.6
Sum based: 1.0604002E-02 499762.6
Intrinsic: 1.0584995E-02 499762.6

So just by pure curiosity: does anyone have an idea of why?

Thanks in advance.
Tobias Burnus

2007-11-23, 7:09 pm

Hi,

On Nov 24, 12:56 am, jiho <jo.iris...@gmail.com> wrote:
> ! Not really what's expected: my own crappy computation works faster
> than the dot_product intrinsic
>
> The times I get are (reliably):
> $ ./a.out
> Mine: 9.5649958E-03 499762.6
> Sum based: 1.0604002E-02 499762.6
> Intrinsic: 1.0584995E-02 499762.6


Well, those numbers are pretty close. Here (x86-64 with gfortran and -
O3 or -O0) I get essentially the same numbers for all variants:

- Mine: 1.60010010E-02 499762.63
Sum based: 1.60010010E-02 499762.63
Intrinsic: 1.60009861E-02 499762.63
- Mine: 1.20000020E-02 499762.63
Sum based: 1.60010010E-02 499762.63
Intrinsic: 1.60010010E-02 499762.63
- Mine: 1.60009973E-02 499762.63
Sum based: 1.60010010E-02 499762.63
Intrinsic: 1.20010003E-02 499762.63

Besides, the compiler has do to at the end the same as you do in the
loop. Thus in general, I'd expect that the speed is comparably.
Additionally, the loop as that fast that the timing is unreliable (see
above). In general I'd advice to use the version which is most
readable and not a most-tuned version if it is less readable/
maintainable. (If the less readably version makes the *whole* program
much faster, one can consider to use it nonetheless.)

Some compilers have the possibility to use a special library for some
intrinsics; e.g. gfortran has the option -fexternal-blas, which
generates calls to the BLAS library instead of using the compiler's
version (e.g. for matmul). If you have a specially optimized BLAS
version, this makes the program faster. Otherwise the compiler
implementation is presumably not worse than e.g. the Netlib BLAS
version or a hand crafted version.

Tobias
mecej4

2007-11-23, 10:11 pm

jiho wrote:

> Hello all,
>
> Trying to acquire some solid fortran basics I am doing the 3 days
> course from the university of Liverpool:
> http://www.liv.ac.uk/HPC/F90CourseEasy3DayHomePage.html
> Seems old but solid. Anyway, there's an exercise using dtime to time a
> hand-made, do based, dot product and compare it to the intrinsic dot
> product function. Of course the expected result is that the intrinsic
> function should be faster. However I see just the opposite. Consider
> the following code:
>
> program dotProduct
>
> implicit none
> integer, parameter :: n=2000000
> real :: A(n),B(n)
> real :: x(2), time1, time2, time3, time4, dotProd=0
> integer :: i
>
> call random_number(A)
> call random_number(B)
>
> ! dtime reports the time from the start of the program
> time1 = dtime(x)
>
> ! My hand made dot product
> do i = 1, size(A), 1
> dotProd = dotProd + A(i)*B(i)
> end do
> time2 = dtime(x)
>
> ! Sum based dot product
> dotProd = sum(A*B)
> time3 = dtime(x)
>
> ! dot_product intrinsic
> dotProd = dot_product(A,B)
> time4 = dtime(x)
>
> print*, "Mine:", time2-time1, dotProd
> print*, "Sum based:", time3-time2, dotProd
> print*, "Intrinsic:",time4-time3, dotProd
>
> ! Not really what's expected: my own crappy computation works faster
> than the dot_product intrinsic
> end program dotProduct
>
> The times I get are (reliably):
> $ ./a.out
> Mine: 9.5649958E-03 499762.6
> Sum based: 1.0604002E-02 499762.6
> Intrinsic: 1.0584995E-02 499762.6
>
> So just by pure curiosity: does anyone have an idea of why?
>
> Thanks in advance.


You need more precision in your timing than dtime() is capable of providing,
as this shows (your code on my PC):

Mine: 0.0E+0 499946.62
Sum based: 0.0E+0 499946.62
Intrinsic: 0.0E+0 499946.62

There was a thread a few months ago about using the RDTSC instruction that
modern Intel and AMD CPUs provide.

-- mecej4
Wade Ward

2007-11-24, 4:29 am



"mecej4" <mecej4@operamail.com> wrote in message
news:47479a20$0$24343$4c368faf@roadrunne
r.com...
> jiho wrote:
[color=darkred]
> You need more precision in your timing than dtime() is capable of
> providing,
> as this shows (your code on my PC):
>
> Mine: 0.0E+0 499946.62
> Sum based: 0.0E+0 499946.62
> Intrinsic: 0.0E+0 499946.62
>
> There was a thread a few months ago about using the RDTSC instruction that
> modern Intel and AMD CPUs provide.

I can never remember all the specifics of timing from one syntax to the
next, except difftime in C. So I loop over the code in question 10**n
times, such that I can count it out in seconds or minutes.

dotproduct might not be as fast as something as a homecooked alternative,
but it sure is handy.
--
wade ward

wade@zaxfuuq.net
435 -838-7760



----== Posted via Newsfeeds.Com - Unlimited-Unrestricted-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! 120,000+ Newsgroups
----= East and West-Coast Server Farms - Total Privacy via Encryption =----
Craig Dedo

2007-11-24, 4:29 am

"Wade Ward" <wade@zaxfuuq.net> wrote in message
news:1195877901_3489@sp12lax.superfeed.net...
>
>
> "mecej4" <mecej4@operamail.com> wrote in message
> news:47479a20$0$24343$4c368faf@roadrunne
r.com...
>
> I can never remember all the specifics of timing from one syntax to the next,
> except difftime in C. So I loop over the code in question 10**n times, such
> that I can count it out in seconds or minutes.
>
> dotproduct might not be as fast as something as a homecooked alternative, but
> it sure is handy.
> --
> wade ward
> wade@zaxfuuq.net
> 435 -838-7760


Have you considered using the Fortran 95 standard intrinsic subroutine
CPU_TIME in place of the non-standard dtime? CPU_TIME takes one REAL argument,
of any kind. If you use double precision, or if available, quad precision, you
may get more accurate results.

--
Craig Dedo
17130 W. Burleigh Place
P. O. Box 423
Brookfield, WI 53008-0423
Voice: (262) 783-5869
Fax: (262) 783-5928
Mobile: (414) 412-5869
E-mail: <cdedo@wi.rr.com> or <craig@ctdedo.com>

jiho

2007-11-24, 4:29 am

On 24 nov, 05:46, "Craig Dedo" <cd...@wi.rr.com> wrote:
> "Wade Ward" <w...@zaxfuuq.net> wrote in message
>
>
>
>
>
>
> Have you considered using the Fortran 95 standard intrinsic subroutine
> CPU_TIME in place of the non-standard dtime? CPU_TIME takes one REAL argument,
> of any kind. If you use double precision, or if available, quad precision, you
> may get more accurate results.


Thank you very much to everyone for your answer. What I gather from
them is that it is mainly a problem of timing precision. The book I
took this from is old (1997) so the machines this code ran on were of
course much slower which make the results more reliable. As suggested,
I tried to do each computation a hundred times to increase precision
(but still use dtime). I also tested a vectorized version of my
computation (the product between A and B is done outside of the loop):
res = A*B
do j = 1, size(A), 1
dotProd = dotProd + res(j)
end do
The results are:
$ ./a.out
Mine: 4.761365
Mine vectorized: 9.469275
Sum based: 5.286102
Intrinsic: 5.240202
..... and reliable on repeats

$ gfortran -O3 ArrayMultiplyTry.f90
$ ./a.out
Mine: 2.062371
Mine vectorized: 5.794862
Sum based: 2.067438
Intrinsic: 2.064124
..... idem

So of course all results are close because they all do basically the
same thing but a simple loop, with no vectorization whatsoever is
faster than all the other methods (not by much when optimized, I
agree). That seems really odd. What surprises me the most is that the
"vectorized" version is always much slower while I would expect vector
operations to be faster (and intrinsic functions to be deeply
optimized). Strange.
Thanks for your time anyway. I am not planning to do timing code so I
am not after the best timing method but I will need speed in
processing large arrays in the code I am planning to write so I was
interested by such discrepancies between the expected timing and the
observed one.

Cheers,

JiHO
James Van Buskirk

2007-11-24, 8:07 am

"jiho" <jo.irisson@gmail.com> wrote in message
news:eaa986ef-21e1-416a-bec6-6730b8b25d3d@e10g2000prf.googlegroups.com...

> Thank you very much to everyone for your answer. What I gather from
> them is that it is mainly a problem of timing precision. The book I
> took this from is old (1997) so the machines this code ran on were of
> course much slower which make the results more reliable. As suggested,
> I tried to do each computation a hundred times to increase precision
> (but still use dtime). I also tested a vectorized version of my
> computation (the product between A and B is done outside of the loop):
> res = A*B
> do j = 1, size(A), 1
> dotProd = dotProd + res(j)
> end do
> The results are:
> $ ./a.out
> Mine: 4.761365
> Mine vectorized: 9.469275
> Sum based: 5.286102
> Intrinsic: 5.240202
> .... and reliable on repeats


> $ gfortran -O3 ArrayMultiplyTry.f90
> $ ./a.out
> Mine: 2.062371
> Mine vectorized: 5.794862
> Sum based: 2.067438
> Intrinsic: 2.064124
> .... idem


A problem with your benchmark is that your vector length if 2.0e6
so that each vector is 8.0 MB, for a total of 16.0 MB accessed for
one dot_product. If you went down to the store today, probably
the most cache the clerk could offer you is 8.0 MB, on a Q6700
http://www.intel.com/products/proce...cifications.htm
class processor. Even on that processor I think you have to be
running multiple threads to access more than 4.0 MB in one process.
Hopefully someone from Intel can correct any inaccuracies. Intel's
compiler has a no-effort multiprocessor option but I am not sure
whether it could be effective in this case even on a Q6700 with
vector lengths reduced to 1.0e6.

Given this, your benchmark is running out of memory rather than
out of cache, so just about all you will be seeing is the time
it takes to touch memory. In most versions you touch 16 MB to
read in both arrays, but in your 'vectorized' version you read
in all 16 MB, then write out 8 MB and read in 8 MB again. In
writing out the 8 MB the code may request that each write be
accompanied by a read, which takes your memory access up to
40 MB, for about a 5:2 time ratio. The first compilation has
a 2:1 ratio, so I suspect non-temporal writes (i.e. without the
extra 8 MB of reads) but the second compilation has a larger
ratio. I'm not sure what is going on in the second version in
that the compiler seems to be more clever about how it's
accessing memory. One thing you can do is to put each outer
loop body in a subroutine so that hopefully the compiler can't
use a single memory access to serve for multiple iterations of
a loop.

OK, so here's a full implementation of all that along with an
attempt to see what happens if you prefetch a chunk at a time:

C:\Program Files\Microsoft Visual Studio 8\James\clf\dot_time>type
dot_body.f90
! My hand made dot product
subroutine mine(A,B,n,dotProd)
implicit none
integer n
real A(n)
real B(n)
real dotProd
integer i

dotProd = 0
do i = 1, size(A), 1
dotProd = dotProd + A(i)*B(i)
end do
end subroutine mine

! Sum based dot product
subroutine sum_based(A,B,n,dotProd)
implicit none
integer n
real A(n)
real B(n)
real dotProd

dotProd = sum(A*B)
end subroutine sum_based

! dot_product intrinsic
subroutine intrinsic(A,B,n,dotProd)
implicit none
integer n
real A(n)
real B(n)
real dotProd

dotProd = dot_product(A,B)
end subroutine intrinsic

subroutine L1(A,B,n,dotProd)
implicit none
integer n
real A(n)
real B(n)
real dotProd
integer i
integer, parameter :: nb = 1000
real C(nb)
real D(nb)

dotProd = 0
do i = 1, size(A), nb
C(1:min(i+nb-1,size(A))-i+1) = A(i:min(i+nb-1,size(A)))
B(1:min(i+nb-1,size(A))-i+1) = B(i:min(i+nb-1,size(A)))
dotProd = dotProd+dot_product( &
C(1:min(i+nb-1,size(A))-i+1), &
D(1:min(i+nb-1,size(A))-i+1))
end do
end subroutine L1

C:\Program Files\Microsoft Visual Studio 8\James\clf\dot_time>ifort /QxT /O2
/c
dot_body.f90
Intel(R) Fortran Compiler for Intel(R) EM64T-based applications, Version 9.1
Build 20061104
Copyright (C) 1985-2006 Intel Corporation. All rights reserved.

C:\Program Files\Microsoft Visual Studio
8\James\clf\dot_time\dot_body.f90(11) :
(col. 4) remark: LOOP WAS VECTORIZED.
C:\Program Files\Microsoft Visual Studio
8\James\clf\dot_time\dot_body.f90(24) :
(col. 14) remark: LOOP WAS VECTORIZED.
C:\Program Files\Microsoft Visual Studio
8\James\clf\dot_time\dot_body.f90(35) :
(col. 14) remark: LOOP WAS VECTORIZED.
C:\Program Files\Microsoft Visual Studio
8\James\clf\dot_time\dot_body.f90(51) :
(col. 7) remark: LOOP WAS VECTORIZED.
C:\Program Files\Microsoft Visual Studio
8\James\clf\dot_time\dot_body.f90(52) :
(col. 7) remark: LOOP WAS VECTORIZED.
C:\Program Files\Microsoft Visual Studio
8\James\clf\dot_time\dot_body.f90(52) :
(col. 7) remark: LOOP WAS VECTORIZED.
C:\Program Files\Microsoft Visual Studio
8\James\clf\dot_time\dot_body.f90(53) :
(col. 25) remark: LOOP WAS VECTORIZED.

C:\Program Files\Microsoft Visual Studio 8\James\clf\dot_time>type
dot_time.f90
program dotProduct

implicit none
integer, parameter :: n=2000000
! integer, parameter :: n=200000
real :: A(n),B(n)
real :: time1, time2, time3, time4, time5
real dotProd
integer j
integer, parameter :: Niter = 100
! integer, parameter :: Niter = 1000
real overall ! Write out result so the compiler
! doesn't try to cheat us
external mine
external sum_based
external intrinsic
external L1

call random_seed() ! Randomizes the seed on many compilers
call random_number(A)
call random_number(B)
overall = 0

! Using CPU_time
call cpu_time(time1)

! My hand made dot product
do j = 1, Niter
call mine(A,B,n,dotProd)
overall = overall+dotProd
end do
call cpu_time(time2)

! Sum based dot product
do j = 1, Niter
call sum_based(A,B,n,dotProd)
overall = overall+dotProd
end do
call cpu_time(time3)

! dot_product intrinsic
do j = 1, Niter
call intrinsic(A,B,n,dotProd)
overall = overall+dotProd
end do
call cpu_time(time4)

! L1 cache attempt
do j = 1, Niter
call L1(A,B,n,dotProd)
overall = overall+dotProd
end do
call cpu_time(time5)

print*, "Mine:", time2-time1
print*, "Sum based:", time3-time2
print*, "Intrinsic:",time4-time3
print*, "L1 stuff:",time5-time4
print*, "The number:",overall

end program dotProduct

C:\Program Files\Microsoft Visual Studio 8\James\clf\dot_time>ifort /QxT /O2
dot
_time.f90 dot_body.obj
Intel(R) Fortran Compiler for Intel(R) EM64T-based applications, Version 9.1
Build 20061104
Copyright (C) 1985-2006 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 8.00.40310.39
Copyright (C) Microsoft Corporation. All rights reserved.

-out:dot_time.exe
-subsystem:console
dot_time.obj
dot_body.obj

C:\Program Files\Microsoft Visual Studio 8\James\clf\dot_time>dot_time
Mine: 0.2500000
Sum based: 0.2500000
Intrinsic: 0.2500000
L1 stuff: 0.3437500
The number: 1.5009331E+08

Just to show what the effect of switching from memory to cache is in
this instance I decreased the vector length down to 200000 so that
the 1.6 MB fits in my 4.0 MB L2 while at the same time raising the
number of iterations to 1000 so that the total flop count was about
the same:

C:\Program Files\Microsoft Visual Studio 8\James\clf\dot_time>dot_time
Mine: 7.8125000E-02
Sum based: 9.3750000E-02
Intrinsic: 7.8125000E-02
L1 stuff: 0.2187500
The number: 1.4987749E+08

As can be seen, it's slow to run from memory. My 'L1' attempt didn't
do any good, either.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end


glen herrmannsfeldt

2007-11-24, 8:07 am

jiho wrote:

> On 24 nov, 05:46, "Craig Dedo" <cd...@wi.rr.com> wrote:



Compiler optimization of such DO loops has been going on
for over 50 years now. (It was one of the things that the
first Fortran compiler was expected to do well.)

One thing that was removed from Fortran since the first compiler
in the 1950's is the FREQUENCY statement. This was intended to
help the compiler determine how many times a DO loop will
be executed.
[color=darkred]

This one could require a temporary array, but most likely
the compiler can figure it out and loop pretty much like
you did.
[color=darkred]

Again, a compiler could figure this out and generate loops,
or it could call a subroutine to do it.
(snip)
[color=darkred]

It won't do any better than the times that the OS supplies, which
is unlikely to be more than single precision for most current OS.
[color=darkred]
> Thank you very much to everyone for your answer. What I gather from
> them is that it is mainly a problem of timing precision. The book I
> took this from is old (1997) so the machines this code ran on were of
> course much slower which make the results more reliable. As suggested,
> I tried to do each computation a hundred times to increase precision
> (but still use dtime). I also tested a vectorized version of my
> computation (the product between A and B is done outside of the loop):


> res = A*B
> do j = 1, size(A), 1
> dotProd = dotProd + res(j)
> end do


This likely requires storing A*B in res, which is often
slower than keeping the intermediates in registers.
I believe even the CRAY vector machines could do this as a
pipelined operation without creating the intermediate vector.

> The results are:
> $ ./a.out
> Mine: 4.761365
> Mine vectorized: 9.469275


Storing into res is slower.

> Sum based: 5.286102
> Intrinsic: 5.240202
> .... and reliable on repeat


> $ gfortran -O3 ArrayMultiplyTry.f90
> $ ./a.out
> Mine: 2.062371
> Mine vectorized: 5.794862
> Sum based: 2.067438
> Intrinsic: 2.064124
> .... idem
>
> So of course all results are close because they all do basically the
> same thing but a simple loop, with no vectorization whatsoever is
> faster than all the other methods (not by much when optimized, I
> agree). That seems really odd. What surprises me the most is that the
> "vectorized" version is always much slower while I would expect vector
> operations to be faster (and intrinsic functions to be deeply
> optimized). Strange.


The intrinsic functions are internally optimized, but getting in
and out can be slower. Even more, many processors now have a
multiply and add instruction that can speed this up even more.

> Thanks for your time anyway. I am not planning to do timing code so I
> am not after the best timing method but I will need speed in
> processing large arrays in the code I am planning to write so I was
> interested by such discrepancies between the expected timing and the
> observed one.


The oversimplified rule that I usually use is that simple expressions
are faster as array expressions, while more complicated ones are
slower. Often, like your vectorized version, they require more work
to be done for the same result.

-- glen


glen herrmannsfeldt

2007-11-24, 8:07 am

James Van Buskirk wrote:
(snip)

(snip)
[color=darkred]
> A problem with your benchmark is that your vector length if 2.0e6
> so that each vector is 8.0 MB, for a total of 16.0 MB accessed for
> one dot_product. If you went down to the store today, probably
> the most cache the clerk could offer you is 8.0 MB, on a Q6700
> http://www.intel.com/products/proce...cifications.htm
> class processor.


Yes, but if it did fit in cache his benchmark would be invalid.

That is, the first loop would bring the data into cache and the
second and third would use the cached data and run unfairly fast.
(Or the first unfairly slow.)

If the purpose is to benchmark data in cache (and it fit in
cache on the given processor) one could use one loop to get
it into cache before timing the different methods.

The effects of cache and virtual storage greatly complicate
benchmarks. Quick, what is wrong with this benchmark:

real, allocatable:: a(:), b(:)
real start,end
integer i,n
n=10000000
allocate (a(n),b(n))
a=1
call cpu_time(start)
do i=1,10
a=b
enddo
call cpu_time(end)
print *,end-start
stop
end

Hint: According to top it runs with 80M virtual and 38M real
memory using 99% CPU.

-- glen



jiho

2007-11-24, 7:10 pm

Wow, thanks for the very detailed answers. I had no idea that this odd
result would prove to have so complicated basis. I know how to use a
computer but I don't the the details of how it works and I guess this
is what is biting me when I try to come a little closer to the
machine.

[...]
>
> The intrinsic functions are internally optimized, but getting in
> and out can be slower. Even more, many processors now have a
> multiply and add instruction that can speed this up even more.


I understood however that:
- when n (number of elements in the arrays) was high I was more
benchmarking memory access than anything else
- my simple do loop was the most basic calculation
- the "vectorized" version was slow because it needed to write an
intermediate object
- the sum based and dot_product versions basically end up doing the
same thing that the do loop with a little overhead needed to get in
and out of the functions

>
> The oversimplified rule that I usually use is that simple expressions
> are faster as array expressions, while more complicated ones are
> slower. Often, like your vectorized version, they require more work
> to be done for the same result.


Thanks for this simple rule (I need it!). In the future I'll try to
avoid methods which need intermediate objects and go simple when
needed.

Thanks again to everyone.

JiHO
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com