Home > Archive > Compilers > September 2007 > GC question
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
|
|
| Paul Rubin 2007-07-27, 7:10 pm |
| Suppose you build a big list of cons cells, say a billion of them
(you're on a large machine). This is in a runtime with traditional
marking or copying gc, no generation scavenging or region inference or
anything like that. The collector runs every N bytes of allocation
for some fixed N. Yes I know that's a dumb way to write an allocator
for a big-system implementation but it's fairly common for small ones.
It seems to me that the running time to allocate N cells is O(N**2)
because you run the collector O(N) times during the allocation, and
each collection costs O(N) on average.
I never realized this before. Is it a well-known phenemonon? Is the
main answer something like generation scavenging?
This relates to a real situation that came up on comp.lang.python
yesterday.
| |
| Ulf Wiger 2007-07-27, 7:10 pm |
| >>>>> "PR" == Paul Rubin <phr-2007@nightsong.com> writes:
PR> Suppose you build a big list of cons cells, say a billion of
PR> them (you're on a large machine). This is in a runtime with
PR> traditional marking or copying gc, no generation scavenging or
PR> region inference or anything like that. The collector runs
PR> every N bytes of allocation for some fixed N. Yes I know that's
PR> a dumb way to write an allocator for a big-system implementation
PR> but it's fairly common for small ones.
PR> It seems to me that the running time to allocate N cells is
PR> O(N**2) because you run the collector O(N) times during the
PR> allocation, and each collection costs O(N) on average.
PR> I never realized this before. Is it a well-known phenemonon?
Where I come from (industrial use of Erlang), we have design rules
warning about this very thing.
PR> Is the main answer something like generation scavenging?
<disclaimer>I have only superficial knowledge of GC techniques</..>
Not necessarily. It can actually make things worse.
If you, say, enter a loop that constructs a list one element
at a time, the garbage collector will have no way to tell
when you're going to stop. It will run out of heap, scan,
tenure data, etc., but as the heap just keeps growing, it's
not going to be able to reclaim much, if anything.
There may be different ways of going generational GC, but at
least in Erlang, the collector will keep an old heap and a new
heap, and revert to fullsweep GC if the generational scan fails
to reclaim enough data. If the fullsweep fails, it will resize
the heap. With constantly growing heap, this is likely to occur
at every GC stop. Extremely annoying.
A workaround specific to Erlang is that you can spawn a process
with a predefined minimum heap size - preferrably large enough
that GC will never kick in. It can build the data, then send it
in a big whopping message to the process that needs it. It
will be costly to copy the message, but not nearly as much as
paying for all those GC stops (which also copy).
Most of the time, you try to come up with an implementation that
doesn't build huge data structures. Again in Erlang, this is
usually achieved by exploiting the concurrency patterns in the
problem and dispatching data in small chunks to other processes.
Some seasoned GC expert will perhaps give you a good solution
to the problem. I can only tell you that you're not alone. (:
BR,
Ulf W
--
Ulf Wiger, Senior Specialist,
/ / / Architecture & Design of Carrier-Class Software
/ / / Team Leader, Software Characteristics
/ / / Ericsson AB, IMS Gateways
| |
|
| On Jul 27, 6:34 am, Paul Rubin <phr-2...@nightsong.com> wrote:
> Suppose you build a big list of cons cells, say a billion of them
> (you're on a large machine). This is in a runtime with traditional
> marking or copying gc, no generation scavenging or region inference or
> anything like that. The collector runs every N bytes of allocation
> for some fixed N. Yes I know that's a dumb way to write an allocator
> for a big-system implementation but it's fairly common for small ones.
>
> It seems to me that the running time to allocate N cells is O(N**2)
> because you run the collector O(N) times during the allocation, and
> each collection costs O(N) on average.
>
> I never realized this before. Is it a well-known phenemonon? Is the
> main answer something like generation scavenging?
>
> This relates to a real situation that came up on comp.lang.python
> yesterday.
You're correct of course. Dumb policy often leads to bad performance,
and not just in memory management. If you merely increase the heap
size by a factor greater than one every time it fills up, the copying/
tracing cost remains O(N).
| |
| Ray Dillinger 2007-07-28, 7:10 pm |
| Paul Rubin wrote:
> Suppose you build a big list of cons cells, say a billion of them
> (you're on a large machine). This is in a runtime with traditional
> marking or copying gc, no generation scavenging or region inference or
> anything like that. The collector runs every N bytes of allocation
> for some fixed N. Yes I know that's a dumb way to write an allocator
> for a big-system implementation but it's fairly common for small ones.
> It seems to me that the running time to allocate N cells is O(N**2)
> because you run the collector O(N) times during the allocation, and
> each collection costs O(N) on average.
Yes, that's true.
> I never realized this before. Is it a well-known phenemonon?
It's very well-known; when you implement garbage collection you
want to be able to say at least in broad terms how much it slows
down the execution and how well it scales to larger and smaller
memory arenas, so people profile their GC's routinely and test
them against real problems. When you profile looking for costs,
it's very easy to notice.
> Is the
> main answer something like generation scavenging?
Man, I don't know how to answer this one. Yes, generation
scavenging is one strategy that a fair number of implementers
use to try to hold down GC costs in general. But it isn't the
specific and true answer to your question.
There are a few truths about Garbage Collection. One of them
is that if your system generates garbage at all, then on average,
some fraction of your memory arena will consist of "floating
garbage" - that is, data that cannot be used by your program,
but which has not (yet) been reaped.
Another truth, again assuming a case where your system actually
does GC at all, is that the Garbage collector will, on average,
occupy some fraction of your CPU power.
Now, here's the harsh truth that applies to all conventional
Garbage Collectors. As your live data grows in size, the
smallest achievable product of those two fractions will tend
to increase. The smaller the fraction of the memory space
that your system allows to be floating garbage, the larger
the fraction of CPU power your GC will use. The smaller the
fraction of CPU power you devote to it, the greater will be
the fraction of your memory arena devoted to holding floating
garbage.
more precisely,
For All collectors B,
For All weighting factors A,
For All Memory arena sizes C,
(GAR%(B,C) X A)/ CPU%(B,C) = 1,
GAR%(B,C) X CPU%(B,C) = N,
(GAR%(B,C+1) X A)/ CPU%(B,C+1) = 1,
==> GAR%(B,C+1) X CPU%(B,C+1) > N.
That is, if you hold the ratio of CPU GC effort to proportion
of floating garbage constant, a larger memory arena means that
the product of the two must on average be greater.
By "conventional" garbage collectors, I mean all collectors
that work by tracing all live data - that includes
mark-and-sweep, stop-and-copy, generation-scavenging, and
since most other GC algorithms are some variation of these,
more often than not this is true of any given GC system.
The order of the increase varies from one collection algorithm
to another, but to avoid an increase entirely is very very
difficult and has only been achieved in the context of a very
small set of problems using very unconventional techniques.
The situation you describe (full trace for every N bytes
allocated) will result, on average, in "constant" floating
garbage of N/2. This N was probably carefully selected to
be near the optimum or smallest achievable product of CPU
and memory fractions, for a normal-sized memory arena.
But as the size of the memory arena grows, N/2 is a
smaller and smaller fraction of the arena size. So not
only are you moving to a larger arena, wherin the minimum
achievable cost is greater, but you are also moving away
from the arena size in which N/2 was a reasonable choice.
The product of N/2 and the CPU usage must increase -
and that works out in this particular case to mean the CPU
usage increases according to the square of the memory size.
The good news is that you can mostly fix this, because this
is nowhere near the "smallest achievable" product of CPU
fraction and memory fraction that that collector can achieve.
You can "address" this in a few different ways. First,
you can regulate how often you call your GC. Minimal use
of CPU (but maximal floating garbage) results from the old-
skool lisp-machine approach of calling the GC when you run
out of memory and not before. A "floating garbage" fraction
that's a relatively constant fraction of your memory arena
results from calling your collector once per N bytes
allocated where N is a constant fraction (say, half) of
your memory arena, but your CPU usage will increase
moderately with the size of live data (You are now seeing
a polynomial increase. Trust me when I say that "moderate"
increases are much less painful).
You can also fix the GC cost relative to the amount of
allocation done. Use an incremental version of your GC
that regulates how much work it does according to how
much allocation has happened since last time it was called
(for example, traversing, marking, and potentially freeing
no more than M objects for every one allocated).
You can experiment to try to find a GC regulation function
that stays fairly close the minimum-achievable-product over
a large variety of arena sizes.
Also, you can use a better collector. Other things being
equal, a generational collector is almost always better than
a mark-and-sweep or copying collector in that it scales
better to larger memory arenas and live data sets.
Finally, look at the "other things" I so cavalierly mentioned
last paragraph as being equal - how your allocator handles
out-of-memory errors, etc, can have a huge impact on your
GC's overall performance. This is what Ulf Wiger was
talking about in his response to you. Right now, you are
paying the CPU cost to keep garbage down to less-than-N-bytes.
This is a tiny fraction of your enlarged memory arena, so
you probably aren't seeing many slowdowns due to handling
out-of-memory conditions. If you allow floating garbage to
vary with memory size, which it should, then your memory
footprint overall will grow larger. You'll hit the end of your
memory arena (and whatever penalties your runtime incurs for
it) more often. This is true with your current collector and
the simple change of making N a constant fraction of the memory
arena size. And though a more sophisticated generational
collector will be better than that, it's still going to be
true, even of a generational collector.
For what it's worth, if you increase the memory arena by a
constant amount whenever you "run out", at a cost proportional
to the new size of your arena, then you will incur CPU costs
proportional to the square of your final arena size if you
resize many times. This is a variation of the same problem
you're having now with invoking your tracing collector at a
constant number of allocated bytes. It is much cheaper in
the long run to double the size of the memory arena each time
you run out of space - then you get costs linear with the size
of your final memory arena.
Bear
| |
| Paul Rubin 2007-07-31, 4:24 am |
| Ray Dillinger <bear@sonic.net> writes:
> Now, here's the harsh truth that applies to all conventional
> Garbage Collectors. As your live data grows in size, the
> smallest achievable product of those two fractions will tend
> to increase. The smaller the fraction of the memory space
> that your system allows to be floating garbage, the larger
> the fraction of CPU power your GC will use. The smaller the
> fraction of CPU power you devote to it, the greater will be
> the fraction of your memory arena devoted to holding floating
> garbage. ...
Thanks for this detailed reply, which I'm still trying to digest.
> It is much cheaper in the long run to double the size of the memory
> arena each time you run out of space - then you get costs linear
> with the size of your final memory arena.
Right, I think there's no way around this. I had thought of
generation scavenging in terms of doubling the time between
collections at each generational level, avoiding the O(N**2) time
complexity, but that has the same effect on memory consumption as
doubling the arena size on running out of space.
| |
| Torben Ęgidius Mogensen 2007-08-15, 7:18 pm |
| Paul Rubin <phr-2007@nightsong.com> writes:
> Suppose you build a big list of cons cells, say a billion of them
> (you're on a large machine). This is in a runtime with traditional
> marking or copying gc, no generation scavenging or region inference or
> anything like that. The collector runs every N bytes of allocation
> for some fixed N. Yes I know that's a dumb way to write an allocator
> for a big-system implementation but it's fairly common for small ones.
>
> It seems to me that the running time to allocate N cells is O(N**2)
> because you run the collector O(N) times during the allocation, and
> each collection costs O(N) on average.
That depends on the percentage of live cells at GC time. A copying GC
will GC when one space is full. It then copies all live data over to
the other space. The cost is proportional to the size of the live
data, and the time until the next GC is proportional to the size of
the dead (reclaimed) data. If the heap (one space) has S cells and x%
of the heap is live at GC, the cost of one GC is S*x/100 and the
number of allocations until next GC is (100-x)*S/100. So to allocate
N cells (where N>>S), you need N/((100-x)*S/100) = 100*N/(100-x)/S GCs
each at a cost of S*x/100, so the total cost is 100*N/(100-x)/S *
S*x/100 = N*x/(100-x). So the cost is O(N) for any fixed live
percentage. But the price increases dramatically with the percentage
of live data at GC time. This is why most allocators increase the
heap size if a GC reclaims less than, say, 50% of the heap.
If the percentage of live data is not constant, you can get quadratic
allocation cost. But the calculation is not as simple as you indicate
above.
Torben
| |
|
|
|
|
|