For Programmers: Free Programming Magazines  


Home > Archive > Compilers > April 2004 > Optimizations to improve icache hit rate









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Optimizations to improve icache hit rate
Jeff Kenton

2004-04-21, 1:33 am

I had a discussion during an interview about optimizations designed to
decrease icache misses. It seems to me that code re-arrangement will
give small benefits at most, and that general optimizations,
especially those that decrease overall code size, will be more
productive in the end.

Comments?
[Sounds to me like it depends on the size of the cache lines. -John]
A Pietu Pohjalainen

2004-04-28, 4:48 pm

Jeff Kenton <Jeffrey.Kenton@comcast.net> wrote:
> I had a discussion during an interview about optimizations designed to
> decrease icache misses. It seems to me that code re-arrangement will
> give small benefits at most, and that general optimizations,
> especially those that decrease overall code size, will be more
> productive in the end.


> Comments?
> [Sounds to me like it depends on the size of the cache lines. -John]



In order to visualise the effect of size of the cache in optimization,
I did this experiment some time ago: I have a simple C-program that
loops for 2^20 times. Then the loop is manually unrolled for 2^1,
2^2, 2^3, .., 2^15 times.

As one might guess, the runtime for this loop first decreases, as the
unroll causes less calls to the exit-condition of the loop. However,
once the L1 and L2 caches on a x86 machine get full, the execution time
suddenly jumps.

The loop body in my experiment happened to be 87 bytes. For example, on
a PentiumIII / 1GHz machine with 16kB L1 cache and 256 kB L2 cache the
average runtimes with the following unroll counts are:

unrolls | execution
--------+----------
1 | 63 ms
2 | 49 ms
4 | 40 ms
8 | 30 ms
16 | 24 ms
32 | 24 ms
64 | 21 ms
128 | 20 ms
256 | 21 ms
512 | 32 ms
1024 | 48 ms
2048 | 50 ms
4096 | 90 ms
8192 | 140 ms
16384 | 140 ms
32768 | 140 ms

As we can calculate, the loop body can be unrolled to L1 cache to
16384 / 87 = 188 times. The first penalty is indeed placed between 128
and 215 unrolls. similarily, the L2 can contain 262144 / 87 = 3013
unrolls - and the second penalty in execution time is placed between
2048 and 4096 unrolls.

This penalty comes from the fact that when the execution reaches to the
end of the unrolled loop, the beginning of the loop has already been
removed from the cache. In the PIII case, the impact seems to take the
execution time from 20 ms to 140 ms.

To the question I'd believe that code re-arrangement optimizations are
orthogonal to those that decrease overall code size. With smaller code
fragments, you can fit more fragments to your limited cache size - which
means more possible orderings of the fragments.

(Full runtime information of my unroll experiment with few Pentium III,
Pentium IV, Athlon and Opteron processors is available at
http://www.cs.helsinki.fi/u/pohjala...ll-results.html )

br,
Pietu Pohjalainen
Anton Ertl

2004-04-29, 2:22 pm

Jeff Kenton <Jeffrey.Kenton@comcast.net> writes:
>I had a discussion during an interview about optimizations designed to
>decrease icache misses. It seems to me that code re-arrangement will
>give small benefits at most, and that general optimizations,
>especially those that decrease overall code size, will be more
>productive in the end.


Counterevidence:

- Mueller and Whalley [mueller&whalley92] found that an optimization
that increases code size by 50%, but increases spatial locality,
reduces the number of I-cache misses even for very small I-caches
(IIRC they simulated 2KB caches).

- We tried to use partial replication to reduce the code size and the
I-cache misses resulting from full replication [ertl&gregg03] without
losing most of its branch prediction accuracy advantage. We found
that the code size was reduced significantly, but the number of
I-cache misses increased, again due to worse spatial locality.

@InProceedings{mueller&whalley92,
author = "Frank Mueller and David B. Whalley",
title = "Avoiding Unconditional Jumps by Code Replication",
booktitle = "{SIGPLAN} '92 Conference on Programming Language
Design and Implementation",
year = "1992",
pages = "322--330",
annote = "Nearly all unconditional jumps can be eliminated by
code replication. This expands compiled C code by about
50\%, but reduces the number of executed instructions
and the cache work (miss ratio increases slightly).
There are some nontrivial problems in this technique,
which are solved heuristically in this paper."
}

@InProceedings{ertl&gregg03,
author = "M. Anton Ertl and David Gregg",
title = "Optimizing Indirect Branch Prediction Accuracy in
Virtual Machine Interpreters",
booktitle = "{SIGPLAN} '03 Conference on Programming Language
Design and Implementation",
year = "2003",
URL = "http://www.complang.tuwien.ac.at/papers/ertl%26gregg03.ps.gz",
abstract = "Interpreters designed for efficiency execute a huge
number of indirect branches and can spend more than
half of the execution time in indirect branch
mispredictions. Branch target buffers are the best
widely available\mn{on all recent general-purpose
machines?} form of indirect branch prediction; however,
their prediction accuracy for existing interpretes is
only 2\%--50\%. In this paper we investigate two
methods for improving the prediction accuracy of BTBs
for interpreters: replicating virtual machine (VM)
instructions and combining sequences of VM instructions
into superinstructions. We investigate static
(interpreter build-time) and dynamic (interpreter
run-time) variants of these techniques and compare them
and several combinations of these techniques. These
techniques can eliminate nearly all of the dispatch
branch mispredictions, and have other benefits,
resulting in speedups by a factor of up to 3.17 over
efficient threaded-code interpreters, and speedups by a
factor of up to 1.3 over techniques relying on
superinstructions alone."
}

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/home.html
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com