gains from vectorization
| student 2005-06-07, 4:02 pm |
| I am working with a g95 compiler and I need to increase the efficiency
of my code. One of the things that I came across while browsing the
archives of this group is vectorization. However, after enabling that
option using the flag -ftree-vectorize, I don't seem to be making any
significant gains in performance. Can anyone suggest something? Is it
also something that's processor dependent? If that's the case how would
I know if my processor can do that pipelining operation? uname -a on my
system gives the following output:
Linux s5 2.6.5-7.151-smp #1 SMP Fri Mar 18 11:31:21 UTC 2005 i686 i686
i386 GNU/Linux
| |
| Steven G. Kargl 2005-06-07, 8:58 pm |
| In article <1118168296.434772.39590@g47g2000cwa.googlegroups.com>,
"student" <adarsh@stat.tamu.edu> writes:
>
>
>
>
> Also, does g95 come with a profiler/debugger? If yes, is there an
> instruction manual for that?
I don't use g95, but it's close enough to gfortran that I'll
provide an answer. You've identified your system as some version
of linux, so try
g95 -pg -o prog prog.f90
gprof -l -b prog.gmon | more
man gprof
--
Steve
http://troutmask.apl.washington.edu/~kargl/
| |
| Ronald Benedik 2005-06-07, 8:58 pm |
|
"student" <adarsh@stat.tamu.edu> schrieb im Newsbeitrag
news:1118166052.339063.314050@z14g2000cwz.googlegroups.com...
>I am working with a g95 compiler and I need to increase the efficiency
> of my code. one of the things that I came across while browsing the
> archives of this group is vectorization.
> Linux s5 2.6.5-7.151-smp #1 SMP Fri Mar 18 11:31:21 UTC 2005 i686 i686
> i386 GNU/Linux
Take a look at /proc/cpuinfo to see what CPU is in your system.
The file also lists processor flags such as SSE and 3dnow,
which are essential for vectorisation. Common single-instruction
multiple-data processors only support 32- and 64-bit data types.
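
A minimal sketch of that check done from Fortran itself, assuming a Linux system where /proc/cpuinfo exists and has a line starting with "flags". The program name and unit number are arbitrary choices for illustration.

program check_simd_flags
  implicit none
  character(len=4096) :: line
  integer :: ios
  integer, parameter :: u = 10      ! arbitrary unit number

  open(unit=u, file='/proc/cpuinfo', status='old', action='read', iostat=ios)
  if (ios /= 0) stop 'cannot open /proc/cpuinfo'

  do
     read(u, '(A)', iostat=ios) line
     if (ios /= 0) exit
     ! The first "flags" line is enough; the other CPUs repeat it
     if (line(1:5) == 'flags') then
        ! Surrounding blanks avoid matching "sse" inside "sse2"
        print *, 'sse   present: ', index(line, ' sse ')   > 0
        print *, 'sse2  present: ', index(line, ' sse2 ')  > 0
        print *, '3dnow present: ', index(line, ' 3dnow ') > 0
        exit
     end if
  end do
  close(u)
end program check_simd_flags

On the command line, searching the same file for the flag names gives the same information.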
| |
| student 2005-06-07, 8:58 pm |
|
> vectorization certainly won't help every kind of code, so it is important
> to understand what part of the code is consuming most CPU time.
> (compile the code with -pg, run it, and examine the output of gprof
> executable_name gmon.out).
The flat profile generated by gprof is as follows. The total
computation time on my system was ~25 min = 1500 sec, yet the profiler
shows that the intrinsic function matmul alone took 2899.24 seconds.
How do I interpret that? Also, if I look at the percent time column,
the intrinsic matmul is taking most of the time. Are there matrix
multiplication routines that are faster than the intrinsic ones? There
are some multiplications with diagonal matrices in my code. Would it
increase the efficiency of the code if I wrote a separate routine for
multiplications involving diagonal matrices?
---------------------
Flat profile:
Each sample counts as 0.01 seconds.
     % cumulative      self               self    total
  time    seconds   seconds     calls  Ks/call  Ks/call  name
 43.59    2899.24   2899.24                              _g95_matmul22_r4r4
  9.77    3549.37    650.13    300000     0.00     0.00  matrix_MP_ludcmp_
  8.14    4091.03    541.66                              _g95_spread
  6.04    4492.47    401.44         1     0.40     1.96  MAIN_
  5.91    4885.51    393.04                              _g95_section_array
  5.02    5219.56    334.05   7200000     0.00     0.00  nrutil_MP_outerprod_r__
  2.80    5405.94    186.38   7200000     0.00     0.00  matrix_MP_lubksb_
  2.75    5588.73    182.80                              _g95_transpose
  2.21    5735.72    146.98                              _g95_dot_product_r4
  2.04    5871.72    136.00    150000     0.00     0.00  mcmc_MP_gibbsnorm_
  1.41    5965.81     94.09    300000     0.00     0.00  matrix_MP_pdsymminv_
  1.32    6053.45     87.64                              _g95_bump_element
  1.11    6127.09     73.64  80750000     0.00     0.00  random_MP_random_normal__
  0.85    6183.61     56.53                              _g95_random_4
  0.85    6240.11     56.50                              xorshf96
  0.79    6292.59     52.47   5691750     0.00     0.00  nrutil_MP_swap_rv__
  0.52    6326.93     34.34                              insert_mem
  0.51    6360.89     33.96                              malloc
  0.37    6385.62     24.73    300000     0.00     0.00  matrix_MP_identity_
  0.32    6407.11     21.49                              free
  0.32    6428.43     21.32                              _g95_maxvald1_r4
  0.26    6446.00     17.57                              _g95_maxloc_r4
  0.25    6462.96     16.96                              _g95_init_assumed_shape
  0.23    6478.44     15.48                              get_user_mem
  0.23    6493.64     15.20                              section_size
  0.21    6507.59     13.95                              compare
  0.21    6521.25     13.66                              delete_treap
  0.18    6533.13     11.88                              _g95_rand
  0.18    6544.86     11.73                              free_user_mem
  0.18    6556.50     11.64                              _g95_init_multipliers
  0.16    6567.10     10.60                              _g95_allocate_array
  0.15    6577.04      9.94                              _g95_array_from_section
  0.14    6586.56      9.52                              _g95_deallocate_array
  0.11    6593.94      7.38                              _g95_xorshift128
  0.11    6601.30      7.36   7200000     0.00     0.00  nrutil_MP_imaxloc_r__
  0.08    6606.59      5.29                              initialize_memory
  0.07    6611.44      4.85                              _g95_size
  0.07    6615.90      4.47                              delete_root
  0.06    6619.65      3.75                              largebin_index
  0.04    6622.64      2.98                              _g95_write_real
  0.04    6625.48      2.85                              put_field
  0.03    6627.80      2.31                              _g95_temp_array
  0.03    6629.81      2.01                              malloc_consolidate
  0.02    6631.45      1.65                              _g95_temp_alloc
  0.02    6632.94      1.49                              huge
  0.02    6634.34      1.39                              rotate_left
  0.02    6635.49      1.16                              _g95_huge_4
  0.01    6636.27      0.77                              _g95_temp_free
  0.01    6637.01      0.74                              get_field
  0.01    6637.70      0.69    100000     0.00     0.00  matrix_MP_normsquare_
  0.01    6638.39      0.69                              _g95_list_formatted_write
  0.01    6639.03      0.64    215596     0.00     0.00  ignlgi_
  0.01    6639.67      0.64                              _g95_write_block
  0.01    6640.31      0.64                              size_record_buffer
  0.01    6640.94      0.63                              _g95_any_4
  0.01    6641.50      0.56                              _g95_bump_element_dim
  0.01    6642.02      0.52    100000     0.00     0.00  sgamma_
  0.01    6642.53      0.51                              data_transfer_init
  0.01    6643.03      0.50    100000     0.00     0.00  snorm_
  0.01    6643.52      0.49         7     0.00     0.00  io_MP_writebuff_
  0.01    6644.01      0.48   7500000     0.00     0.00  nrutil_MP_assert_eq3__
  0.01    6644.48      0.47                              matrix_MP_choldc_
  0.01    6644.91      0.44                              write_fixed
  0.01    6645.32      0.41                              write_separator
  0.01    6645.70      0.38                              _g95_is_internal_unit
  0.01    6646.07      0.36                              random_MP_random_gamma__
  0.01    6646.41      0.34                              _g95_find_unit
  0.00    6646.70      0.29                              _g95_write_integer
  0.00    6646.99      0.29                              write_free
  0.00    6647.25      0.26    100000     0.00     0.00  gengam_
  0.00    6647.51      0.26                              _g95_salloc_w
  0.00    6647.72      0.21                              _g95_transfer_real
  0.00    6647.92      0.20                              _g95_get_float_flavor
  0.00    6648.11      0.19                              start_transfer
  0.00    6648.30      0.19                              write_formatted_sequential
  0.00    6648.48      0.18                              _g95_free_fnodes
  0.00    6648.65      0.17                              _g95_st_write
  0.00    6648.81      0.16                              _g95_extract_mint
  0.00    6648.97      0.16                              fd_flush
  0.00    6649.13      0.16                              rotate_right
  0.00    6649.29      0.16                              write_record
  0.00    6649.42      0.13                              _g95_get_ioparm
  0.00    6649.55      0.13                              _g95_get_sign
  0.00    6649.68      0.13                              _g95_st_write_done
  0.00    6649.80      0.12    215596     0.00     0.00  rgnqsd_
  0.00    6649.92      0.12                              _g95_get_unit
  0.00    6650.04      0.12                              _g95_sfree
  0.00    6650.15      0.11                              _g95_library_end
  0.00    6650.26      0.11                              free_fnode
  0.00    6650.36      0.10    215596     0.00     0.00  ranf_
  0.00    6650.46      0.10                              init_write
  0.00    6650.56      0.10                              nrutil_MP_outerprod_d__
  0.00    6650.65      0.09                              writen
  0.00    6650.73      0.08    215661     0.00     0.00  __g95_master_0__
  0.00    6650.81      0.07                              nrutil_MP_imaxloc_i__
  0.00    6650.88      0.07                              _g95_library_start
  0.00    6650.94      0.07                              _g95_transfer_integer
  0.00    6651.01      0.06                              recursive_io
  0.00    6651.06      0.05                              nrutil_MP_ifirstloc_
  0.00    6651.10      0.04    215597     0.00     0.00  __g95_master_0__
  0.00    6651.14      0.04                              itoa_4
  0.00    6651.18      0.04                              matrix_MP_printmatrix_
  0.00    6651.21      0.04    215629     0.00     0.00  getcgn_
  0.00    6651.24      0.04                              nrutil_MP_assert_eq4__
  0.00    6651.27      0.03                              matrix_MP_diag_
  0.00    6651.30      0.03        32     0.00     0.00  setcgn_
  0.00    6651.32      0.02    215630     0.00     0.00  __g95_master_0__
  0.00    6651.34      0.02     15411     0.00     0.00  sexpo_
  0.00    6651.35      0.01         3     0.00     0.00  io_MP_readbuff_
  0.00    6651.36      0.01                              _g95_next_list_char
  0.00    6651.37      0.01                              _g95_sign_r4
  0.00    6651.38      0.01                              finalize_transfer
  0.00    6651.38      0.00    215629     0.00     0.00  qrgnin_
  0.00    6651.38      0.00        62     0.00     0.00  mltmod_
  0.00    6651.38      0.00        32     0.00     0.00  initgn_
  0.00    6651.38      0.00         1     0.00     0.00  inrgcm_
  0.00    6651.38      0.00         1     0.00     0.00  qrgnsn_
  0.00    6651.38      0.00         1     0.00     0.00  setall_
| |
| Steven G. Kargl 2005-06-07, 8:58 pm |
| In article <1118177092.541862.65380@f14g2000cwb.googlegroups.com>,
"student" <adarsh@stat.tamu.edu> writes:
>
>
> the flat profile as generated by gmon is as follows. now, the total
> computation time on my system was ~25mins = 1500sec. This profiler
> shows that the intrinsic function matmul itself took 2899.24 seconds.
> how do I interpret that? also, if I look at the percent time column,
> intrinsic matmul is taking the most of the time. are there matrix
> multiplication routines that are faster than the intrinsic ones? there
> are some multiplications with diagonal matrices in my code. should it
> increase the efficiency of the code if I write a separate routine for
> multiplications involving diagonal matrices?
>
> ---------------------
> Flat profile:
>
> Each sample counts as 0.01 seconds.
> % cumulative self self total
> time seconds seconds calls Ks/call Ks/call name
> 43.59 2899.24 2899.24 _g95_matmul22_r4r4
> 9.77 3549.37 650.13 300000 0.00 0.00 matrix_MP_ludcmp_
> 8.14 4091.03 541.66 _g95_spread
> 6.04 4492.47 401.44 1 0.40 1.96 MAIN_
The profile is indeed telling you that the matmul intrinsic is
the pig in execution time. Yes, it may be profitable to implement
your own matmul to take advantage of the diagonal nature of your
matrices. You may find a useful routine in Golub's book on
matrix computations. I don't remember the exact citation off the
top of my head.
--
Steve
http://troutmask.apl.washington.edu/~kargl/
| |
| student 2005-06-07, 8:58 pm |
| > Take a look at /proc/cpuinfo to know what cpu is in your system.
> The file also provides the processor flags like SSE and 3dnow
> which are essential for vectorisation. Common single instruction
> multiple data processors only support 32 and 64 bit data types.
cat /proc/cpuinfo shows sse2 in the flags list but I don't see 3dnow.
Should that be a problem as far as vectorization goes?
| |
| Bart Vandewoestyne 2005-06-07, 8:58 pm |
| In article <d84r7k$mbi$1@gnus01.u.washington.edu>, Steven G. Kargl wrote:
>
>
> I don't use g95, but it's close enough to gfortran that I'll
> provide an answer. You've identified your system as some version
> of linux, so try
>
> g95 -pg -o prog prog.f90
> gprof -l -b prog.gmon | more
>
> man gprof
Does profiling already work for g95? I've just tried this:
bartv@vonneumann:~/fortran$ g95 -pg -o test_string test_string.f95
bartv@vonneumann:~/fortran$ ./test_string
<... some output of the program test_string...>
bartv@vonneumann:~/fortran$ ls *.gmon
ls: *.gmon: No such file or directory
I do have a gmon.out file, but how do i use it? The following does not work:
bartv@vonneumann:~/fortran$ gprof -l -b gmon.out
gprof: gmon.out: not in a.out format
Regards,
Bart
--
"Share what you know. Learn what you don't."
| |
| Steven G. Kargl 2005-06-07, 8:58 pm |
| In article <1118179588.491035@seven.kulnet.kuleuven.ac.be>,
Bart Vandewoestyne <MyFirstName.MyLastName@telenet.be> writes:
> In article <d84r7k$mbi$1@gnus01.u.washington.edu>, Steven G. Kargl wrote:
^^^^^^^^^^^^^^^
>
> Does profiling already work for g95?
Read VERY CAREFULLY the above text.
> I do have a gmon.out file, but how do i use it?
RTFM.
> The following does not work:
>
> bartv@vonneumann:~/fortran$ gprof -l -b gmon.out
> gprof: gmon.out: not in a.out format
gprof -l -b a.out gmon.out
--
Steve
http://troutmask.apl.washington.edu/~kargl/
| |
| Rich Townsend 2005-06-08, 4:00 am |
| student wrote:
>
>
>
>
> the flat profile as generated by gmon is as follows. now, the total
> computation time on my system was ~25mins = 1500sec. This profiler
> shows that the intrinsic function matmul itself took 2899.24 seconds.
> how do I interpret that? also, if I look at the percent time column,
> intrinsic matmul is taking the most of the time. are there matrix
> multiplication routines that are faster than the intrinsic ones? there
> are some multiplications with diagonal matrices in my code. should it
> increase the efficiency of the code if I write a separate routine for
> multiplications involving diagonal matrices?
Hell, yes! Assume that A is a rank-two array holding a general matrix,
and D is a rank-one array holding the components of the diagonal matrix.
Then this code will give you AD:
do j = 1,SIZE(A,2)
   AD(:,j) = A(:,j)*D(j)
end do
...and this will give you DA:
do i = 1,SIZE(A,1)
   DA(i,:) = D(i)*A(i,:)
end do
This latter code may be better (i.e., more optimally) expressed as
explicit loops with the row index innermost, so that the array accesses
follow Fortran's column-major storage order:
do j = 1,SIZE(A,2)
   do i = 1,SIZE(A,1)
      DA(i,j) = D(i)*A(i,j)
   end do
end do
This should give a very respectable speed up. Let me know how you get on...
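
A minimal, self-contained sketch of the loops above, checked against MATMUL with an explicitly built diagonal matrix. The size n = 4, the random test data and the helper array Dfull are assumptions made only for this check; they are not part of the original poster's code.

program diag_mult_check
  implicit none
  integer, parameter :: n = 4
  real :: A(n,n), D(n), Dfull(n,n), AD(n,n), DA(n,n)
  integer :: i, j

  call random_number(A)
  call random_number(D)

  ! Full diagonal matrix, built only to get a reference result
  Dfull = 0.0
  do i = 1, n
     Dfull(i,i) = D(i)
  end do

  ! A*D scales column j of A by D(j)
  do j = 1, size(A,2)
     AD(:,j) = A(:,j)*D(j)
  end do

  ! D*A scales row i of A by D(i); i is innermost for column-major access
  do j = 1, size(A,2)
     do i = 1, size(A,1)
        DA(i,j) = D(i)*A(i,j)
     end do
  end do

  ! Both differences should be zero or at rounding level
  print *, 'max |AD - matmul(A,Dfull)| =', maxval(abs(AD - matmul(A,Dfull)))
  print *, 'max |DA - matmul(Dfull,A)| =', maxval(abs(DA - matmul(Dfull,A)))
end program diag_mult_check

The scaling loops do n*n multiplications, whereas a general n-by-n product does on the order of n**3 multiply-adds, which is where the saving comes from.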
cheers,
Rich
| |
|
| Hi,
since almost all time is spent in the Fortran runtime routines
(_g95_*), compiler options will hardly affect execution time. To
answer the question about the matrix multiply, there are a few
options:
1) if the matrices are diagonal, write your own specific multiply
routine (see Steven/Rich)
2) if they are very small (~5x5), it depends on many things: try matmul,
try option 3, or try writing explicit code inline
3) if the matrices are not small, try using the BLAS libraries; in
particular, calling 'sgemm' will do the job in your case. You'll need a
BLAS library that is optimal for your machine, try goto blas
http://www.cs.utexas.edu/users/flame/goto/
or atlas
http://math-atlas.sourceforge.net/
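
A minimal sketch of option 3, assuming some BLAS library is installed and linked (for example with -lblas; the exact link flag depends on the library). The matrix sizes and program name are made up for illustration. The argument order follows the reference BLAS SGEMM interface, which computes C := alpha*op(A)*op(B) + beta*C.

program sgemm_demo
  implicit none
  integer, parameter :: m = 3, k = 4, n = 2   ! illustrative sizes
  real :: A(m,k), B(k,n), C(m,n)
  external sgemm

  call random_number(A)
  call random_number(B)
  C = 0.0

  ! 'N','N': use A and B as they are (no transpose)
  ! alpha = 1.0, beta = 0.0, so C = A*B
  ! leading dimensions are the declared row counts of A, B and C
  call sgemm('N', 'N', m, n, k, 1.0, A, m, B, k, 0.0, C, m)

  ! Should be zero or at rounding level
  print *, 'max |C - matmul(A,B)| =', maxval(abs(C - matmul(A,B)))
end program sgemm_demo

For matrices this small the call overhead probably dominates; the benefit of a tuned BLAS is expected for larger matrices.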
Joost
| |
|
| Hi Bart,
yes it should work, on my SUSE machine with:
> g95 -pg mytest.f90
> ./a.out
> gprof ./a.out gmon.out
Joost
| |
| student 2005-06-08, 4:00 am |
|
> Hell, yes! Assume that A is a rank-two array holding a general matrix,
> and D is a rank-one array holding the components of the diagonal matrix.
> Then this code will give you AD:
>
> do j = 1,SIZE(A,2)
> AD(:,j) = A(:,j)*D(j)
> end do
>
> ...and this will give you DA:
>
> do i = 1,SIZE(A,1)
> DA(i,:) = D(i)*A(i,:)
> end do
>
> This latter code may be better (ie, more optimally) expressed as
> explicit loops, to optimize the array accesses:
>
> do j = 1,SIZE(A,2)
> do i = 1,SIZE(A,1)
> DA(i,j) = D(i)*A(i,j)
> end do
> end do
>
> This should give a very respectable speed up. Let me know how you get on...
>
> cheers,
>
> Rich
Thanks for that post Rich. I already implemented that and my
computation speed almost doubled. But there are other matrix
multiplications for which this doesn't apply. MATMUL is slow, and an
explicit function that I wrote is about as slow. Any ideas on how
matrix multiplication could be made faster for general matrices?
| |
| Tim Prince 2005-06-08, 4:00 am |
|
"student" <adarsh@stat.tamu.edu> wrote in message
news:1118166052.339063.314050@z14g2000cwz.googlegroups.com...
>I am working with a g95 compiler and I need to increase the efficiency
> of my code. one of the things that I came across while browsing the
> archives of this group is vectorization. However, after enabling that
> option using the flag -ftree-vectorize, I don't seem to be making any
> significant gains in performance. Can anyone suggest something? Is it
> also something that's processor dependent? If that's the case how would
> I know if my processor can do that pipelining operation? uname -a on my
> system gives the following output:
I believe you must invoke SSE code generation as well as
throwing -ftree-vectorize, or it may be ignored. Yes, it depends on an
SSE/SSE2 capable CPU. Also, look up the flags which tell what vectorization
has been accomplished, or examine the output code. My experience showed gcc
vectorizing more effectively than gfortran, and the x86-64 implementation
doing so more effectively than the i386. This may or may not be associated
with SSE2 code generation being the default for x86-64, so that SSE options
aren't needed to enable vectorization.
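
A small timing sketch for trying this out, assuming gfortran rather than g95. The loop is a simple unit-stride update of the kind an auto-vectorizer can usually handle; the array size and repeat count are arbitrary. The flags in the comments are GCC options, and which vectorizer report flag is available depends on the GCC version.

! Possible builds to compare (gfortran, i386 target):
!   gfortran -O2 saxpy_timing.f90
!   gfortran -O2 -msse2 -ftree-vectorize saxpy_timing.f90
! Vectorizer reports: -ftree-vectorizer-verbose=2 on GCC 4.x,
! -fopt-info-vec on later versions.
program saxpy_timing
  implicit none
  integer, parameter :: n = 1000000, reps = 200
  real :: x(n), y(n), a, t0, t1
  integer :: i, r

  call random_number(x)
  y = 0.0
  a = 1.0e-3

  call cpu_time(t0)
  do r = 1, reps
     do i = 1, n
        y(i) = y(i) + a*x(i)     ! unit stride, no loop-carried dependence
     end do
  end do
  call cpu_time(t1)

  ! Printing the checksum keeps the compiler from discarding the loop
  print *, 'seconds:', t1 - t0, ' checksum:', sum(y)
end program saxpy_timing

With arrays this large the loop may be limited by memory bandwidth, so a small gain does not by itself mean the loop was not vectorized.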
gcc for x86-64 vectorizes matrix multiplication quite effectively, and ought
to be able to do the same for any SSE2 target. It may do OK if invoked
while building libgfortran matmul(), but I would expect the source to need
re-arrangement, since gcc avoids using many parallel memory operations.
Given that you are not allowed to see clear source for g95, the advantage of
gfortran should be evident.
| |
|
| > Given that you are not allowed to see clear source for g95, the advantage of
> gfortran should be evident.
I'm under the impression you're misinformed, and providing incorrect
information.
Please go to
www.g95.org ... click 'compilation notes' which brings you to
http://g95.sourceforge.net/src.html click 'g95 source' which will
download
http://g95.sourceforge.net/g95_source.tgz
tar -xvzf g95_source.tgz
cd g95-0.50
tar -xvzf libf95.a-0.50.tar.gz
vi libf95.a-0.50/intrinsics/matmul.c
The source gets updated roughly every two weeks, and is as clear as C
code typically is ...
The build process is relatively well documented and certainly under
linux quite easy. E.g. as you suggested in the other thread, I had no
real problem (I dislike editing makefiles) to rebuild the runtime using
-pg.
Joost
| |
| Ian Bush 2005-06-08, 8:58 am |
| student wrote:
> Thanks for that post Rich. I already implemented that and my
> computation speed almost doubled. But there are other matrix
> multiplications for which this doesn't apply. MATMUL is slow, and an
> explicit function that I wrote is about as slow. Any ideas on how
> matrix multiplication could be made faster for general matrices?
How big are your matrices? It sounds as though the implementation
of matmul may well not be blocked, so provided your matrices are not
small ( when compared to your cache size ) a good BLAS implementation
will almost certainly be very much quicker. Have a look into Atlas
and MKL,
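
To illustrate what "blocked" means here, a rough sketch of a cache-blocked multiply, not the g95 or BLAS implementation. The matrix size n and block size bs are guesses for illustration; a tuned BLAS chooses such parameters per machine and does considerably more than this.

program blocked_matmul_demo
  implicit none
  integer, parameter :: n = 200, bs = 32   ! bs is a guess, not a tuned value
  real :: A(n,n), B(n,n), C(n,n)
  integer :: i, j, k, jj, kk

  call random_number(A)
  call random_number(B)
  C = 0.0

  ! Work on bs-wide panels of columns so that the columns of A used by the
  ! inner loops are reused while they are still in cache
  do jj = 1, n, bs
     do kk = 1, n, bs
        do j = jj, min(jj+bs-1, n)
           do k = kk, min(kk+bs-1, n)
              do i = 1, n                    ! unit stride down a column
                 C(i,j) = C(i,j) + A(i,k)*B(k,j)
              end do
           end do
        end do
     end do
  end do

  ! Expect a small rounding-level difference, since the summation order
  ! may differ from the intrinsic
  print *, 'max |C - matmul(A,B)| =', maxval(abs(C - matmul(A,B)))
end program blocked_matmul_demo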
Ian
| |
| Bart Vandewoestyne 2005-06-08, 8:58 am |
| In article <d854v7$umi$1@gnus01.u.washington.edu>, Steven G. Kargl wrote:
>
> ^^^^^^^^^^^^^^^
>
> Read VERY CAREFULLY the above text.
I did, and it made me *suspect* that g95 had profiling support in it,
but because my attempts to profile failed, i was in doubt...
>
> RTFM.
Maybe indeed I should have read it instead of just following the above example...
>
> gprof -l -b a.out gmon.out
Thanks! This works. Now I know g95 has profiling support in it :-)
Regards,
Bart
--
"Share what you know. Learn what you don't."