For Programmers: Free Programming Magazines  


Home > Archive > Compilers > February 2008 > Reordering of functions









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Reordering of functions
Tim Frink

2008-02-18, 7:27 pm

Hi,

I've a question about the influence of compiler optimizations that
reorder functions on the system performance.

Assume a modern processor with all state-of-the art features like
prefetching, branch prediction and a superscalar pipeline. Further
assume that all caches are disabled. Will the program runtime change
when just the order of functions is changed (without any other code
transformation)?

I'm of the opinion that a reordering of function should have little
influence on the program execution, maybe due to some prefetch effects
but thes should be marginal. Of course, with caches this situation
would look different.

How do you see that?

Regards,
Tim

Nikolai Kim

2008-02-19, 7:41 pm

hi,

Function reordering depends heavily on the size of the functions, in
most cases such a reordering goes hand in hand with the
profile-information, and the latter is usually targeted at a "main"
application. whether such code-positioning makes sense, can't be said
definitely. Pettis/Hansen state in their paper, that such an
interprocedural optimization combined with bb-positioning within
functions may be very promising allowing gains from 2 to 15 %
depending on the application.

In fact, most compiler/frameworks support function-inlining which may
lead to even bigger functions, which again may lead to a cache being
to small to hold more than just one function, in such cases such a
reordering does not make much sense. the one could turn off inlining,
this may cause a slight efficiency- improvement and together with the
inner-procedural placement-strategies be of more meaning, but whether
it would be better if you had inlining turned on is an other question.

nk

"Tim Frink" <plfriko@yahoo.de> schrieb im Newsbeitrag
> Assume a modern processor with all state-of-the art features like
> prefetching, branch prediction and a superscalar pipeline. Further
> assume that all caches are disabled. Will the program runtime change
> when just the order of functions is changed (without any other code
> transformation)?
>
> I'm of the opinion that a reordering of function should have little
> influence on the program execution, maybe due to some prefetch effects
> but thes should be marginal. Of course, with caches this situation
> would look different.

Joel Yliluoma

2008-02-19, 7:41 pm

On Mon, 18 Feb 2008 17:22:52 +0100, Tim Frink wrote:
> I've a question about the influence of compiler optimizations that
> reorder functions on the system performance.
>
> Assume a modern processor with all state-of-the art features like
> prefetching, branch prediction and a superscalar pipeline. Further
> assume that all caches are disabled. Will the program runtime change
> when just the order of functions is changed (without any other code
> transformation)?


Aren't "with branch prediction" and "caches are disabled" somewhat
mutually exclusive statements?

As I understand it, the processor does branch prediction based on
past experience (which is cached using some hash generated
from the instruction address), or lacking experience, heuristics.

If such cache exists, then function reordering might help in cases
where multiple functions have jumps in identically hashing addresses
with identical probabilities. I find that to be unlikely.

> I'm of the opinion that a reordering of function should have little
> influence on the program execution, maybe due to some prefetch effects
> but thes should be marginal. Of course, with caches this situation
> would look different.


Function reordering might help also in situations where the processor
supports jumps that are differently encoded depending on the jump length
(such as the bra/brl opcodes on 65c816, or similar on x86). If a tailcall
occurs to a function that is located within a short distance, it might
be faster with than a tailcall with a larger distance, because the opcodes
differ.

In any case, I think the effects of cache (and/or demand loading) are
greatly more significant than those which I listed, regarding the function
ordering.

--
Joel Yliluoma - http://iki.fi/bisqwit/
Chris F Clark

2008-02-19, 7:41 pm

Tim Frink <plfriko@yahoo.de> writes:

> I've a question about the influence of compiler optimizations that
> reorder functions on the system performance.
>
> Assume a modern processor with all state-of-the art features like
> prefetching, branch prediction and a superscalar pipeline. Further
> assume that all caches are disabled. Will the program runtime change
> when just the order of functions is changed (without any other code
> transformation)?
>
> I'm of the opinion that a reordering of function should have little
> influence on the program execution, maybe due to some prefetch effects
> but thes should be marginal. Of course, with caches this situation
> would look different.


Maybe. You've set up a straw-man where you have tried to deny all
ways that changing the order of function calls might affect
preformance and then asked will performance remain unchanged.
However, even in your denial you have left holes open, what about
machines the execute a limited number of instructions out-of-order and
use register renaming. That's a cache-like effect (just like
pre-fetching, and branch prediction are). Reordering the function
calls can influence all of those. They may all be little effects in
isolation, but they may be larger in total than you expect (or they
may remain small even in combination).

More importantly, in rearranging the functions are you allowed to pull
them out-of-loops (in particular, loops caused by recursive
invocations). That's a very significant question.

To understand why consider quicksort. If you do the recursive calls
in the correct order (attacking the larger half non-recursively, and
pushing only the smaller half on the stack), you significantly change
the algorithm form O(n**2) to O(n*log(n)). If your function
reordering causes an effect like that, it will significantly affect
performance, even if no other optimizations are applied.

So, yes, there is a model where rearranging functions has little or no
effect on application performance. However, it doesn't model all the
real world accurately, and it isn't just random luck that function
reordering can improve performance--the reasons for improved
performance can be profound (yielding deep insight into the problem).

Now, I'm curious why you ask the question. What influence does
whether function reordering influences performance (or not) have on
some decision you have to make, or point you need to prove?

BTW, I presumed that you were talking about the rearrangement of the
order in which functions were executed. Changing the location of the
functions in memory, but not their execution order probably has a more
limited effect on performance, since you can't pull a function
out-of-a-loop that way.

Finally, some of the effects are hard to quantify and/or predict.
When I was working on the Alpha optimizer, there were some
optimizations we specifically turned on (and off) in specific orders
because they had an unfortunate interaction on one of the benchmarks
we were measuring against. The right combination of optimizations
yielded a fortuitous alignment of a critical loop, that was lost if we
did (a little) more optimization until we got significantly more
optimizations enabled and could remove enough code that the loop was
always aligned.

Hope this helps,
-Chris

****************************************
**************************************
Chris Clark Internet: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. or: compres@world.std.com
23 Bailey Rd Web Site: http://world.std.com/~compres
Berlin, MA 01503 voice: (508) 435-5016
USA fax: (978) 838-0263 (24 hours)
glen herrmannsfeldt

2008-02-24, 4:49 am

Joel Yliluoma wrote:

(snip)

> Aren't "with branch prediction" and "caches are disabled" somewhat
> mutually exclusive statements?


Well, to many people cache unqualified means data or instruction cache.

Note also that there are static branch prediction systems that
don't use a cache at all.

-- glen

Tim Frink

2008-02-24, 4:49 am

> Maybe. You've set up a straw-man where you have tried to deny all
> ways that changing the order of function calls might affect
> preformance and then asked will performance remain unchanged.
> However, even in your denial you have left holes open, what about
> machines the execute a limited number of instructions out-of-order and
> use register renaming. That's a cache-like effect (just like
> pre-fetching, and branch prediction are). Reordering the function
> calls can influence all of those.


I agree with you on that. That's obvious. But I was actually talking
about the rearrangement of functions in memory while the order of
function invocations remains the same.

> Finally, some of the effects are hard to quantify and/or predict. When I
> was working on the Alpha optimizer, there were some optimizations we
> specifically turned on (and off) in specific orders because they had an
> unfortunate interaction on one of the benchmarks we were measuring
> against. The right combination of optimizations yielded a fortuitous
> alignment of a critical loop, that was lost if we did (a little) more
> optimization until we got significantly more optimizations enabled and
> could remove enough code that the loop was always aligned.


Finding the right combination of optimizations which yields good
results for a set of different programs seems to be an unsolvable
problem. How do compiler designers decide which combinations/sequences
of optimizations seem to be promising and are put together into an
optimization level (like gcc's -O)?

Is this an trial-and-error approach assisted with some experience?

One "solution" I know is the ACOVEA project which uses an genetic
algorithm to find the "best" gcc options for a particular program
under test. But this is a particular case where you want to achieve
best results for one application.

Regards,
Tim

Tim Frink

2008-02-24, 4:49 am

> Aren't "with branch prediction" and "caches are disabled" somewhat
> mutually exclusive statements?


Depends. When you think of a branch target buffer then you are correct.
This buffer is a small cache which holds branch targets previously
computed. But there are also static branch predictions which just
rely on the instruction and the jump direction. For example,
conditional jumps with a backward displacement might be always predicted
as taken. Here the buffering of the branch target might be omitted.

> Function reordering might help also in situations where the processor
> supports jumps that are differently encoded depending on the jump length
> (such as the bra/brl opcodes on 65c816, or similar on x86). If a tailcall
> occurs to a function that is located within a short distance, it might
> be faster with than a tailcall with a larger distance, because the opcodes
> differ.


OK, I agree, that's a good point.
Tim Frink

2008-02-24, 4:49 am

Hi,

> Function reordering depends heavily on the size of the functions, in
> most cases such a reordering goes hand in hand with the
> profile-information, and the latter is usually targeted at a "main"
> application. whether such code-positioning makes sense, can't be said
> definitely. Pettis/Hansen state in their paper, that such an
> interprocedural optimization combined with bb-positioning within
> functions may be very promising allowing gains from 2 to 15 %
> depending on the application.


yes, but the gain they achieve has two reasons:

1) they use a direct-mapped instruction cache and the contiguously
placement of functions which call each other frequently might avoid
conflict cache misses since these functions are not mapped to the
same cache locations after reordering (I excluded this option in
my assumption)
2) they assume that the number of page faults will be minimized
since functions which call each other frequently will be likely
stored in the same page.

While writing this here another issue comes to my mind. When virtual
memory is used a reordering of functions might influence the TLB (if this
"cache" is not disabled) leading to different program execution.

> In fact, most compiler/frameworks support function-inlining which may
> lead to even bigger functions, which again may lead to a cache being
> to small to hold more than just one function, in such cases such a
> reordering does not make much sense. the one could turn off inlining,
> this may cause a slight efficiency- improvement and together with the
> inner-procedural placement-strategies be of more meaning, but whether
> it would be better if you had inlining turned on is an other question.


Well, this does not answer my question. ;-)
Sure, when a function is completely inlined into the code than its
position in the code is irrelevant. However, a compiler will not inline
large functions (you told the reason with caches) so that the question
of their placement in memory still remains.

Tim
glen herrmannsfeldt

2008-02-24, 7:33 pm

Tim Frink wrote:
> For example, conditional jumps with a backward displacement might
> be always predicted as taken. Here the buffering of the branch
> target might be omitted.


Someone doing timing tests on an IBM machine, I believe an ESA/390
machine, found that BXLE (branch on index less than or equal) predicts
the branch as taken, and BXH (branch on index high) predicts as not
taken.

The traditional use for BXLE is at the end of a loop (*), and BXH at
the beginning of a loop. The person doing the tests tried BXH at the
end and BXLE at the beginning with much slower results.

(*) It was used for DO loops in Fortran IV (Fortran 66) code. The
standard allowed, but did not require, the test at the end of the loop
(such that there is at least one trip through the loop even when the
end condition is already satisfied).

-- glen
Chris F Clark

2008-02-24, 7:33 pm

Tim Frink <plfriko@yahoo.de> writes:

....
[color=darkred]
> I agree with you on that. That's obvious. But I was actually talking
> about the rearrangement of functions in memory while the order of
> function invocations remains the same.


If you are talking simply about re-ordering the location of functions
in memory, but not about re-ordering the calls, then cache (paging,
tlb and related) penalties are in fact the main intents. (And my
apologies for mis-understanding previously.) One is trying to remove
useless locations in the instruction area from being in the working
set. For those cases cache (et. al.) effects are truly important.
Don't dismiss the significance of that.

One of the key results I note in that area comes from Steve Sharp who
works at Cadence Design on the nc-verilog compiler. Verilog
applications are very large (relative to other apps) with a very large
working set, and overflow the cache on even the largest processors.
Moreover, the nature of Verilog programs makes it appear that the code
"randomly" wanders about its instruction space, rather than staying in
some "small" tight loops. As such, the best optimization algorithms
to apply to generated code for Verilog programs is ones that reduce
the code or data size, because the effect of a cache miss takes about
50-cycles to recover from, and thus reducing the frequency of cache
misses is the most important thing one can do. In fact in Verilog the
problem is so extreme that one can waste a few instruction executions
in each code sequence to minimize the total code or data footprint and
still get a positive overall effect. There are very few optimizations
(other than those applied to inner loops of array algorithms) that
will save you 50 cycles.

What applies to Verilog applies to a lesser extent to most programs.
If you can reduce the frequency of cache misses (or page faults, less
frequent but more expensive, or TLB misses, somewhere in the middle),
then you will be saving large numbers of cycles. For the code outside
the inner loops, that is likely to be at least as big a factor as
removing instructions from the code path.

> How do compiler designers decide which combinations/sequences
> of optimizations seem to be promising and are put together into an
> optimization level (like gcc's -O)?
>
> Is this an trial-and-error approach assisted with some experience?


In general, one only has a limited set of optimizations that one
actually knows how to implement. Thus, one implements them all.
There are some hueristics (and background knowledge) that help one
decide the order in which to apply optimizations and which ones to
turn on at which levels.

For example, one generally does the "local" (within a basic block)
optimizations first (at the lowest optimization level) as they have
the most payback for the fewest resources and are also the safest.
Global "loop" optimizations are generally next, and there is some
ordering amongst them (e.g. common-sub-expression elimination is
usually one of the first global optimizations). Once, that set has
been mined, one finally applies the interprocedural optimizations.

Thus, one usually applies more local optimizations before more global
ones. Most global optimizations depend on the local optimizations
doing the correct thing as part of their effectiveness. One also
generally does simple optimizations before complex ones, again because
of the dependencies. If you don't get the local simple stuff right,
changing the global complicated stuff becomes more a case of rolling
the dice.

Hope this helps,
-Chris

****************************************
**************************************
Chris Clark Internet: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. or: compres@world.std.com
23 Bailey Rd Web Site: http://world.std.com/~compres
Berlin, MA 01503 voice: (508) 435-5016
USA fax: (978) 838-0263 (24 hours)
Jan Vorbrüggen

2008-02-25, 7:22 pm

Didn't see the original reply, but...:[color=darkred]

But remember that most current processors have more than one level of
cache. I doubt there are many, if any, functions in existence that would
exceed the capacity of, say, the Montecito L3 cache (16 MB or so).

Jan

George Neuner

2008-02-25, 7:22 pm

On Mon, 18 Feb 2008 17:22:52 +0100, Tim Frink <plfriko@yahoo.de>
wrote:

>I've a question about the influence of compiler optimizations that
>reorder functions on the system performance.


>Assume a modern processor with all state-of-the art features like
>prefetching, branch prediction and a superscalar pipeline. Further
>assume that all caches are disabled. Will the program runtime change
>when just the order of functions is changed (without any other code
>transformation)?


>I'm of the opinion that a reordering of function should have little
>influence on the program execution, maybe due to some prefetch effects
>but thes should be marginal. Of course, with caches this situation
>would look different.


It strikes me that your hypothetical processor is a pretty good match
to a modern DSP. Most DSP designs have neither branch prediction nor
traditional code/data cache because they have clocked matched internal
SRAM and if any external RAM is present, it is typically the same.

Some things DSPs typically do have that many traditional CPUs do not
are recognizable loop start/end instructions, a small prefetch code
cache dedicated to (suitably sized) loop bodies, near/far call
instructions, banked memory and multiple buses to fetch code and
(typically) multiple data from separate memory banks simultaneously.

It is very common in DSP development to deliberately place code and
data to take maximum advantage of simultaneous fetch, and to pack
related functions for faster near calling. On VLIW designs, overall
code size may change due to different instruction packing.

I've done a fair bit of DSP programming, and apart from the
aforementioned memory bank and near/far call issues, I can't really
say that I've seen any noticeable impact on speed just from reordering
of functions.

George
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com