Discussion: Best Ref-counting algorithms?
Christoffer Lernö
2009-07-12 20:41:59 UTC
Hi

I'm looking into GC using ref-counting.

Does anyone know what passes for state-of-the-art when it comes to ref-
counting algorithms?
I've read papers by Lins 2003 "Lazy Cyclic Reference Counting",
Bacon / Rajan 2001 "Concurrent Cycle Collection in Reference Counted
Systems" and a few others.

If one would like to implement a fast GC using refcounting, should I
implement something like in those papers, or are there better
algorithms out there?

/Christoffer
Hans Aberg
2009-07-13 11:16:12 UTC
Post by Christoffer Lernö
I'm looking into GC using ref-counting.
Does anyone know what passes for state-of-the-art when it comes to ref-
counting algorithms?
I've read papers by Lins 2003 "Lazy Cyclic Reference Counting",
Bacon / Rajan 2001 "Concurrent Cycle Collection in Reference Counted
Systems" and a few others.
If one would like to implement a fast GC using refcounting, should I
implement something like in those papers, or are there better
algorithms out there?
It depends on what resources you want to collect, and what
implementation language you are choosing. If resources must be released
at a specific point in time, like open files that should be closed in a
C++-style destructor, that may force ordinary reference counts, as it may
be hard to detect the point in time at which the last reference goes
away.

A quick scan of the first paper above gives the impression that it
uses a mark-scan from the root set. If the implementation language is
C++, the problem is that there is no good way to find this root set
(but perhaps in the next revision); the language does not
cooperate. But if one can find it, any type of GC would do.

The approaches might be combined: reference counts for the
destructors, and another GC to collect the memory resources (with
finalizers).
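
To sketch that split in C (hypothetical names; the file is closed
deterministically by the count, while the memory of the handle could just
as well be left to a collector with a finalizer):

#include <stdio.h>
#include <stdlib.h>

typedef struct rc_file {
  long  count;
  FILE *fp;
} rc_file;

rc_file *rc_file_open(const char *path, const char *mode) {
  rc_file *h = malloc(sizeof *h);
  if (h == NULL) return NULL;
  h->fp = fopen(path, mode);   /* may be NULL; release handles that */
  h->count = 1;
  return h;
}

rc_file *rc_file_retain(rc_file *h) { h->count++; return h; }

void rc_file_release(rc_file *h) {
  if (--h->count == 0) {
    if (h->fp) fclose(h->fp);  /* released at a known point in time      */
    free(h);                   /* or leave the memory to the tracing GC  */
  }
}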

Hans
Christoffer Lernö
2009-07-13 13:42:10 UTC
Post by Hans Aberg
It depends on what resources you want to collect, and what
implementation language you are choosing.
I'm playing around with a ref-counting based (memory) GC for a
language I have yet to construct. I'm obviously interested in making
such a GC as fast and efficient as possible. It might be acceptable
that it can't be run concurrently in a separate thread, so if there is
a particularly fast solution for that scenario, it would be interesting
as well (even though typically you'd want it thread-safe for use in a
language).

/Christoffer
Hans Aberg
2009-07-14 18:02:38 UTC
Post by Christoffer Lernö
Post by Hans Aberg
It depends on what resources you want to collect, and what
implementation language you are choosing.
I'm playing around with a ref-counting based (memory) GC for a
language I have yet to construct.
But what language are you planning to use for the implementation? If you
choose C++, then it is hard to find the root set. Suppose you create
objects for use in your language; then global objects and those on the
stack should be registered somehow for the tracing, but those on the
heap should not, as they are live when they can be traced from the others.

Therefore, one candidate for C++ might be a conservative GC, which is
implemented in a layer between the platform and the language, trying to
find out a conservative estimate of dead pointers. That is, not
necessarily collecting all.

I'm looking at the C++ draft "n2798.pdf", ch. 20.8, and it does not seem
to provide a means to find the root set. There are some "smart pointer"
classes, which seem to be just ordinary reference counts.

Anyway, even if you do not plan to use C++, perhaps it gives some input.

Hans
Christoffer Lernö
2009-07-15 06:43:26 UTC
Post by Hans Aberg
Post by Christoffer Lernö
Post by Hans Aberg
It depends on what resources you want to collect, and what
implementation language you are choosing.
I'm playing around with a ref-counting based (memory) GC for a
language I have yet to construct.
But what language are you planning to use for the implementation? If you
choose C++, then it is hard to find the root set. Suppose you create
objects for use in your language; then global objects and those on the
stack should be registered somehow for the tracing, but those on the
heap should not, as they are live when they can be traced from the others.
I was thinking of using C for implementation, but does that matter? As
long as I implement the language in a non-GC'ed language, I will have
to take care of writing the GC myself (or, for C, use the Boehm
conservative GC).

/C
BGB / cr88192
2009-07-16 02:38:22 UTC
Post by Christoffer Lernö
Post by Hans Aberg
Post by Christoffer Lernö
Post by Hans Aberg
It depends on what resources you want to collect, and what
implementation language you are choosing.
I'm playing around with a ref-counting based (memory) GC for a
language I have yet to construct.
But what language are you planning to use for the implementation? If you
choose C++, then it is hard to find the root set. Suppose you create
objects for use in your language; then global objects and those on the
stack should be registered somehow for the tracing, but those on the
heap should not, as they are live when they can be traced from the others.
I was thinking of using C for implementation, but does that matter? As
long as I implement the language in a non-GC'ed language, I will have
to take care of writing the GC myself (or, for C, use the Boehm
conservative GC).
yes.

Boehm in general seems like a good option, although I guess I don't use it
because its feature set is not exactly the same as mine, in particular,
because AFAIK it lacks per-object type-info (in my GC, pretty much every
piece of allocated memory has a type...).
Post by Christoffer Lernö
/C
Hans Aberg
2009-07-17 11:05:31 UTC
Post by Christoffer Lernö
Post by Hans Aberg
But what language are you planning to use for the implementation? If you
choose C++, then it is hard to find the root set. Suppose you create
objects for use in your language; then global objects and those on the
stack should be registered somehow for the tracing, but those on the
heap should not, as they are live when they can be traced from the others.
I was thinking of using C for implementation, but does that matter? As
long as I implement the language in a non-GC'ed language, I will have
to take care of writing the GC myself (or, for C, use the Boehm
conservative GC).
If you want to make a tracing GC in C++, then you will have to keep
track of where elements are located by hand.

For example, suppose you have a class Object which keeps track of a
polymorphic pointer, and a

class S {
  Object x, y;
  ...
};

Then in

T f(...) {
  Object w = new S(...);
  ...
}

the w belongs to the root set, as it is allocated on the function
stack, whereas S::x and S::y are on the heap, as they are allocated via
the "new S".

So, at least for dynamic languages, this is a problem.

Hans
BGB / cr88192
2009-07-15 18:09:07 UTC
Post by Hans Aberg
Post by Christoffer Lernö
Post by Hans Aberg
It depends on what resources you want to collect, and what
implementation language you are choosing.
I'm playing around with a ref-counting based (memory) GC for a
language I have yet to construct.
But what language are you planning to use for the implementation? If you
choose C++, then it is hard to find the root set. Suppose you create
objects for use in your language; then global objects and those on the
stack should be registered somehow for the tracing, but those on the
heap should not, as they are live when they can be traced from the others.
for C and C++ and implementing custom languages, it is not really that much
of a problem in practice.

The main reason is that, since the person is writing the compiler
and/or interpreter, they can directly keep track of whatever
references may exist within their little VM-managed language, making
precise GC and ref-counting fairly easy (especially if, as has been
suggested, a single-threaded implementation is possible...).

I guess it may be noted that there is another possible strategy, namely that
one "could" try to implement their own scheduler (rather than using the OS
scheduler), such that the GC would be able to manage threads by itself (for
example, it could pause all its managed threads during GC).

as-is in my case, I use the OS scheduler, but otherwise create-threads/do
TLS/... via a GC-provided API mostly so that the GC knows about them.

in my case, this also wraps the threading mechanism, so that I can use the
same threads API/... on both Windows and Linux.


now, the great problems are if:
- the language is "C-like" (AKA: with raw pointers and the like);
- one tries to have fine-grained integration with native code;
- multithreading becomes involved (this quickly turns a simple GC into a big
mess);
- the VM has an extensive C or C++ based utility library (mostly a problem as
then lots of code may end up having to use the GC from C, which can become
very painful and error-prone, as one may end up invariably mishandling
memory references "here and there", which can be disastrous with precise GC
and refcounting);
...

actually, in practice "external integration" has been the main killer of
precise GC's in my case, where typically one will make a language or
interpreter with the intent of either operating it from C-land, or using it
to interface with or control stuff in C-land.

however, the choice of GC strategy becomes a major problem, as it
effectively creates a fragile wall with the outside world, and the effort
required both to maintain this wall, and to allow interfacing through this
wall, can often become terribly awkward.


so, more often, I have ended up falling back to conservative GC, if
anything, because it makes interfacing with the outside world less
problematic...

however, ref-counting+conservative GC may be worthwhile, as it may still
allow ref-counting without many of the complexities and costs of precise GC
(AKA: elaborate root management...).
BGB / cr88192
2009-07-13 16:21:46 UTC
Post by Hans Aberg
Post by Christoffer Lernö
I'm looking into GC using ref-counting.
<snip>
Post by Hans Aberg
It depends on what resources you want to collect, and what
implementation language you are choosing. If resources must be released
at a specific point in time, like open files that should be closed in a
C++-style destructor, that may force ordinary reference counts, as it may
be hard to detect the point in time at which the last reference goes
away.
agreed.
long ago, resource-management problems of this sort were a real pain in my
case...
Post by Hans Aberg
A quick scan on the first paper above gives the impression that it
uses a mark-scan from the root set. If the implementation language is
C++, the problem is that there is no good way to find this root set
(but perhaps in the next revision); the language does not
cooperate. But if one can find it, any type of GC would do.
not necessarily "any" type of GC.

in particular, for more subtle reasons, I am not "particularly" sure I would
want to see reference-counting mixed with a conservative GC (thinking about
it, it could work though, so long as one keeps the ref-counts correct and is
careful when mixing ref-counted with non-refcounted memory...).

the problem then with precise GC (and especially ref-counting) is that it
is a pain to use without explicit compiler support (such as in C). to make
it safe (and especially thread-safe), one ends up having to wrap nearly
everything in macros, manage roots, ... and it gets very tedious...

eliminating the precise GC aspect would at least make it less painful, as it
would eliminate the need to manage roots, but would still leave a whole lot
needing to be done via macros (such as assignments, ...).

if added to a GC, it would probably make sense to add it as an optional (and
explicit) feature (such as, when allocating an object, it is declared as
having a ref-count). it would also be necessary to be very careful where
ref-counted objects are used (such as, not freely putting them in memory
which could be destroyed while failing to update the ref-counts).

non-ref-counted objects would simply ignore ref-counting operations (they
would either be outside the refcount system, or initially created with a
count of "many").


then again, the GC could have a default conservative "destroy handler":
when destroying an object, for anything in it that looks like a reference,
assume it is a reference and update the count.

the problem here though, is that with a conservative GC there may be "false
positives", or, in this case, adjusting the ref-counts for objects which
were only being referenced by accident.

it is actually safer to not bother, such that if an object is destroyed and
the counts get out of sync, at least greater values are better than lesser
ones (as then, the object is left for the GC, and not freed prematurely).


then there is the conservative GC part, as is usual for conservative GC:
- look for pointers in the data and bss sections;
- look for pointers in the stack(s);
...


I could do this in a later revision of my existing GC, where I would drop
the thus far unused "precise" mode, allowing me to reuse the all-important
bits for conservative refcounting.

alternatively, I could leave "precise mode" intact, but have the relevant
bits be located in the object header (vs the cell-bitmap, which only has 8
bits in which to manage the entire allocator and GC state, and which also
contains the ref-count for precise objects, where the same field is used
both for a ref-count and to identify the basic GC type).

putting the info in the obj-header would avoid me having to mess with the
core allocator machinery, and I currently have a "not very important" 16-bit
field, with which to mooch off a few bits for a refcount (while still
leaving the object marked as "conservative" in the bitmap). (note that the
object header is 8 bytes, with 32-bits as the size, and effectively limiting
objects to 1GB, where I may later use the top-2 bits to flag different
header layouts, such as possibly a header with a 64-bit size, and another
mini-header for small objects).

or such...
Post by Hans Aberg
The approaches might be combined: reference counts for the
destructors, and another GC to collect the memory resources (with
finalizers).
agreed, a good summary...
Torben Ægidius Mogensen
2009-07-14 07:45:56 UTC
Post by Christoffer Lernö
If one would like to implement a fast GC using refcounting, should I
implement something like in those papers, or are there better
algorithms out there?
If you want to implement a fast GC, you should not use refcounting. The
main problem with refcounting is that an assignment to a pointer
variable (which would otherwise just be a register-to-register transfer)
requires two memory reads, two memory writes and a conditional jump, so
it stresses the memory system far more than most other GC algorithms.
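
Spelled out in C, the counted store looks roughly like the following sketch
(the count field and free_obj() are only illustrative):

#include <stdlib.h>

typedef struct obj {
  long count;
  /* ... payload ... */
} obj;

static void free_obj(obj *o) { free(o); }

/* "*dst = src" with reference counting: a read and a write of each count
   plus a conditional branch for the possible free, where a plain pointer
   copy would be a single register move. */
static void rc_assign(obj **dst, obj *src) {
  if (src != NULL)
    src->count++;                        /* read + write                     */
  obj *old = *dst;
  if (old != NULL && --old->count == 0)  /* read + write + conditional jump  */
    free_obj(old);
  *dst = src;
}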

Refcounts should, IMO, only be used in the following circumstances:

- Operations on heap pointers are very rare, so even a slow GC doesn't
matter in the overall picture.

- You need hard real-time guarantees.

- The heap is shared between processes/processors with no central
control.

So refcounts are fine for such things as shared libraries (you don't
very often add or delete references to these), but for a functional or
OO language where you have many small heap structures with short
lifetimes, it will give massive overhead.

Torben
Quinn Tyler Jackson
2009-07-14 14:49:54 UTC
Post by Torben Ægidius Mogensen
- Operations on heap pointers are very rare, so even a slow GC doesn't
matter in the overall picture.
- You need hard real-time guarantees.
- The heap is shared between processes/processors with no central
control.
So refcounts are fine for such things as shared libraries (you don't
very often add or delete references to these), but for a functional or
OO language where you have many small heap structures with short
lifetimes, it will give massive overhead.
Reference counting is used extensively in the Meta-S parsing engine,
but I don't really think of it as a "garbage collection" mechanism in
the contexts in which it is used. It gives huge performance gains (not losses)
in those contexts, even though there are many references (through
pointers).

The copy-on-write semantics of RC, for instance, are an ideal
complement to parse memoization, since memoization of a parse requires
the repeated storing and fetching of intermediate parse trees, and RC
effectively reduces this operation to wrap-and-count rather than a
full, deep copy of what can be thousands upon thousands of small
objects. In conjunction with segmented memory allocation, memory
fragmentation over long runs where many small objects are created and
destroyed is also reduced considerably. The wrappers, being fixed in
size even when their underlying data may be of varying length (such as
in the case of string objects), are very effectively managed by
segmented memory allocation, in pools where all segments are of
identical length.

Which is to say, objects that are pretty much state-constant after
construction (such as parse tree nodes) but are often copied and often
referenced are good candidates for RC, I have found. Reference
counting used in this way is effectively a form of implicit caching
rather than garbage collection.
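
In rough C terms (not actual Meta-S code, and with invented names), the
wrap-and-count idea amounts to little more than this:

#include <stdlib.h>

/* Fixed-size, refcounted wrapper around an immutable node. "Copying" a
   memoized parse tree is then just a count bump instead of a deep copy;
   since nodes never change after construction, sharing is always safe. */
typedef struct node {
  long         count;
  const char  *label;           /* illustrative payload */
  struct node *left, *right;
} node;

static node *node_share(node *n) {   /* the cheap "copy" */
  if (n) n->count++;
  return n;
}

static void node_drop(node *n) {
  if (n && --n->count == 0) {
    node_drop(n->left);
    node_drop(n->right);
    free(n);
  }
}

static node *node_new(const char *label, node *l, node *r) {
  node *n = malloc(sizeof *n);
  if (!n) return NULL;
  n->count = 1;
  n->label = label;
  n->left  = node_share(l);
  n->right = node_share(r);
  return n;
}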

I have also found that RC also greatly reduces memory complexity of
large grammars. For instance, huge tries (think hundreds of thousands
of entries), when wrapped in RC, no matter where referenced, can
behave as if owned by the code that needs them (thereby avoiding
explicit sharing), but are only held in memory once. The complexity of
the client code is reduced because it does not have to manage sharing
in any way and can behave as if it owns the trie (or whatever object
is RC'd), but the memory allocation, being lazy, improves speed and
reduces footprint overall.

Just my experience, others' mileage may vary.

--
Quinn Tyler Jackson
Christoffer Lernö
2009-07-14 15:41:07 UTC
Post by Torben Ægidius Mogensen
Post by Christoffer Lernö
If one would like to implement a fast GC using refcounting, should I
implement something like in those papers, or are there better
algorithms out there?
If you want to implement a fast GC, you should not use refcounting. The
main problem with refcounting is that an assignment to a pointer
variable (which would otherwise just be a register-to-register transfer)
requires two memory reads, two memory writes and a conditional jump, so
it stresses the memory system far more than most other GC algorithms.
Would it not be possible to use escape analysis and deferred ref-
counting to improve that by quite a bit?

I was also thinking about Objective-C pre-2.0 with manual ref-
counting. My impression was always that updating the ref-count
happened rather rarely.
If the only time you really need to change the ref-count is when you
are assigning to the member variables of other objects, will the cost
really be that great in comparison to other factors? Can you perhaps
also make it faster by deferring the ref-count or loading it into a
register?
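
(By deferring I mean something along the lines of Deutsch-Bobrow deferred
counting - roughly the sketch below, with invented names, where assignments
to locals are not counted at all and decrements are parked in a zero-count
table until a reconciliation step consults the stack:)

#include <stddef.h>
#include <stdlib.h>

typedef struct obj { long count; int in_zct; } obj;

extern int stack_references(obj *o);  /* assumed: conservative scan of the stack */

#define ZCT_MAX 1024
static obj   *zct[ZCT_MAX];           /* zero-count table */
static size_t zct_len;

/* Only heap-to-heap stores are counted; locals copy pointers for free.
   An object whose count drops to zero is parked, not freed, because the
   stack may still reference it. */
static void heap_store(obj **slot, obj *src) {
  if (src) src->count++;
  obj *old = *slot;
  if (old && --old->count == 0 && !old->in_zct && zct_len < ZCT_MAX) {
    old->in_zct = 1;
    zct[zct_len++] = old;     /* if the table is full this sketch just leaks */
  }
  *slot = src;
}

/* Run periodically, e.g. at allocation time: free parked objects that the
   stack no longer reaches and that were not revived by a heap store. */
static void zct_reconcile(void) {
  size_t kept = 0;
  for (size_t i = 0; i < zct_len; i++) {
    if (zct[i]->count != 0) {
      zct[i]->in_zct = 0;            /* revived: a heap reference came back */
      continue;
    }
    if (stack_references(zct[i]))
      zct[kept++] = zct[i];          /* still reachable from the stack      */
    else
      free(zct[i]);
  }
  zct_len = kept;
}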

I guess I am a bit fascinated by the real-time guarantees (more or
less) offered by a ref-count based GC.
GPS
2009-07-14 18:31:41 UTC
Post by Torben Ægidius Mogensen
Post by Christoffer Lernö
If one would like to implement a fast GC using refcounting, should I
implement something like in those papers, or are there better
algorithms out there?
If you want to implement a fast GC, you should not use refcounting. The
main problem with refcounting is that an assignment to a pointer
variable (which would otherwise just be a register-to-register transfer)
requires two memory reads, two memory writes and a conditional jump, so
it stresses the memory system far more than most other GC algorithms.
- Operations on heap pointers are very rare, so even a slow GC doesn't
matter in the overall picture.
- You need hard real-time guarantees.
- The heap is shared between processes/processors with no central
control.
So refcounts are fine for such things as shared libraries (you don't
very often add or delete references to these), but for a functional or
OO language where you have many small heap structures with short
lifetimes, it will give massive overhead.
I think that there may be some other aspects of this problem to consider, or
expand on, to give a more complete description.

Reference counting is often more efficient than other forms of GC,
especially on modern computer systems, though this depends on
a variety of factors.

If a GC results in page transfers to/from the disk, then it will drastically
slow down the program, and system.

If a GC follows the roots, any pages that are stored in virtual memory on a
disk will be transferred back into RAM, so as the program progresses it may
experience significant pauses from transfers at unexpected times.

Reference counting does not require scanning all roots. It is a fine-
grained approach to the problem, but as many have noted, it's much more
costly and difficult in general to handle circular references, and things of
that nature.

Reference counting (RC) in and of itself does not eliminate leaks. RC may
require testing constructors and destructors in a loop to verify that leaks
do not occur in some situations.

Reference counting usually requires more work from the programmer. In my
own work, I have found that a shadow of doubt forms about all of my usage
of reference counting, and thus I do not feel confident about the code
until it has been tested, and even then, I have doubts.

Reference counting can require more memory than other types of GC, because a
container is usually used that contains a refcount, and a union of some
primitive storage for type(s). This means that accessing data requires more
indirection, and allocation of the container is required. One alternative
for some situations is to pass register-sized quantities directly, and rely
on a container or object structure for more complex types.
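
A typical shape for such a container in C would be something like this
sketch (the tag names are invented):

#include <stdint.h>

/* One word for the count, a tag, and a union of the primitive
   representations: every access goes through this box, which is the extra
   indirection and allocation being described. Register-sized values can be
   passed directly and only promoted to a box when they escape. */
typedef enum { T_INT, T_REAL, T_PTR, T_STR } box_tag;

typedef struct box {
  long    refcount;
  box_tag tag;
  union {
    intmax_t    i;
    double      d;
    void       *p;
    const char *s;
  } u;
} box;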

With a mark-and-sweep GC you can use a tag bit, with some extra indirection,
and thus may avoid the need for a container.

Without reference counting it is not uncommon for another type of GC to
leak. Most operating systems use a C API, and when using threads and the
native API of the system, there can be unexpected races or unexpected
false references. Premature reuse of memory is often just as bad as a
leak, and both occur in most systems I know of that have used a form of
GC.

Time is a difficult aspect for all allocators to clearly define and solve.
Every allocator that exists today has a weakness, whether it is a disk
allocator, or a general memory allocator. The optimal patterns for reducing
fragmentation, metadata overhead, and having the optimal speed, are not set
in stone. It seems that dynamic algorithms that adapt to the situation
might be a solution, but even those have a cost, and thus we do what works
best for now, with our current understanding.

As some Smalltalk developers IIRC have said (paraphrase) "operating systems
are for people that don't know how to design a programming language." That
might fit a LISP developer too.

It's generally much easier to build a system from the ground up with GC,
than to adapt another system to work with it. On the other hand though,
device drivers, and the other constraints of modern systems impose their
will upon us.

BTW Knuth's TAOCP gives a reference to a paper about a solution for a Scheme
implementation that uses reference counting, and handles circular patterns.

-GPS
Hans-Peter Diettrich
2009-07-15 01:13:33 UTC
Post by GPS
If a GC results in page transfers to/from the disk, then it will drastically
slow down the program, and system.
Good point!
Post by GPS
Reference counting usually requires more work from the programmer. In my
own work, I have found a shadow of doubt forms about all of my usage of
reference counting, and thus I do not feel confident about code, until it
has been tested, and even then, I have doubts.
For years I have been happy with the Delphi RC model, including copy-on-write.
There may exist very rare cases for manual intervention, though.
Post by GPS
Reference counting can require more memory than other types of GC, because a
container is usually used that contains a refcount, and a union of some
primitive storage for type(s). This means that accessing data requires more
indirection, and allocation of the container is required.
I see no need for additional containers. Collectable elements already
are heap objects, with management information about allocated size and
links to other heap elements. The RC'ed elements can hold all additional
information in an extended version of that heap management structure,
possibly hidden from the user.

IMO the only required overhead consists of layout descriptors for all data
structures (including subroutine stack layout), which contain references
to ref-counted elements. Not really different from other GC models.
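
Such a descriptor need not be more than this (a sketch; the pair type is
only an example):

#include <stddef.h>

/* Per-type layout: where the ref-counted references live inside an
   instance. A subroutine stack frame can be described the same way. */
typedef struct layout {
  size_t        size;         /* instance size in bytes            */
  size_t        nrefs;        /* number of reference-holding slots */
  const size_t *ref_offsets;  /* byte offsets of those slots       */
} layout;

/* Example instance type and its descriptor. */
typedef struct pair {
  long  refcount;
  void *first;
  void *second;
} pair;

static const size_t pair_offsets[] = {
  offsetof(pair, first),
  offsetof(pair, second)
};

static const layout pair_layout = { sizeof(pair), 2, pair_offsets };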
Post by GPS
One alternative
for some situations is to pass register-sized quantities directly, and rely
on a container or object structure for more complex types.
Can you give a practical example? When such "objects" are passed along
*by value*, I see no need for counting references in such a context.
Post by GPS
With a mark-and-sweep GC you can use a tag bit, with some extra indirection,
and thus may avoid the need for a container.
As mentioned above, how would such a GC handle gaps resulting from freed
memory blocks, and how would it determine the type (layout) of all the
heap elements?

IMO such overhead is independent of a specific GC model. In real life
it won't make a noticeable difference in the memory footprint, whether a
GC model adds a single mark bit to every heap element, or a whole count
word.

DoDi
Hans Aberg
2009-07-17 11:09:40 UTC
Post by Hans-Peter Diettrich
Post by GPS
If a GC results in page transfers to/from the disk, then it will drastically
slow down the program, and system.
Good point!
I felt so too.

Does this not suggest that at least a tracing GC should be moved into
the OS? It should have some way to mark allocated swapped out memory
unused without swapping back in.

Hans
[Seems to me that it would be easy enough to provide system calls that allow
adequate control from user mode. VM/370 had a user interface to the pager in
about 1970 to let an OS running in a virtual machine avoid double paging.
-John]
glen herrmannsfeldt
2009-07-17 19:32:19 UTC
Hans Aberg wrote:
(snip)

< Does this not suggest that at least a tracing GC should be moved into
< the OS? It should have some way to mark allocated swapped out memory
< unused without swapping back in.

I have watched Win2K do that. When you exit a program using a lot
of memory, it has to page it back while deallocating the memory.

< [Seems to me that it would be easy enough to provide system calls
< that allow adequate control from user mode. VM/370 had a user
< interface to the pager in about 1970 to let an OS running in a
< virtual machine avoid double paging. -John]

I always thought of it more as a modification to the guest OS,
but I suppose it does require host support.

You can avoid double paging by giving the full 16M to the guest.
The problem, then, is that a multitasking guest has to wait while
any task is paging. As I understand it, control is returned
to the guest even though some pages are not yet available.
Other tasks can then be run if the needed pages are available.
[Yes, it presented virtual interrupts to the guest OS to say that
a page was unavailable, and later to say it had become available. It
was up to the guest to avoid touching unavailable pages. -John]


There is an interesting and maybe related feature of many current
systems. Many now do not actually allocate pages when requested, but
wait until the allocated memory is modified. All page table entries
for newly allocated memory point to a single page filled with zeros,
with the write protect bit active. Any write to the page allocates a
real page and updates the page tables as needed.

As I understand it, the problem with this method comes when the system
is actually short on memory, but there is no way to indicate that to
the program. I tried this one earlier, after reading a post in
another newsgroup related to pointers (but not to virtual storage).

#include <stdio.h>
#include <stdlib.h>
#define X 64
int main() {
  int i = 0;                         /* count of successful allocations */
  while (malloc(X*1024*1024-8)) i++;
  printf("%dM allocated\n", i*X);
  return 0;
}

You may find that it is given more virtual storage than available
swap space, and it might be that malloc() never returns (void*)0.

-- glen
[This is getting a bit far afield from compilers. Some years ago I spun
off a separate mailing list for GC discussions, which isn't very active
but has over 700 subscribers.

To subscribe: send a message containing subscribe to gclist-***@lists.iecc.com.
-John]
Hans-Peter Diettrich
2009-07-18 01:49:59 UTC
Post by glen herrmannsfeldt
There is an interesting and maybe related feature of many current
systems. Many now do not actually allocate pages when requested, but
wait until the allocated memory is modified.
That reminds me of sparse files, which are also stored only to the
extent they are really used.
Post by glen herrmannsfeldt
As I understand it, the problem with this method comes when the system
is actually short on memory, but there is no way to indicate that to
the program.
What is "short on memory"?

Short of *physical* memory is the usual state of general purpose (not
realtime) multi-tasking systems, reached when the memory requirements of
the system itself, and of all started processes, together exceed the
physical memory capacity. But that's not a critical state with paged
RAM, because only a fraction of the total virtual memory has to reside
in physical memory, at any given time.

Every process can run with a handful of RAM pages, containing the
currently executing instruction and the related data. The availability
of more pages depends on the overall load and scheduling strategies of
the system.

In such an environment I don't see a need for any indication of a
temporary shortage of resources, in particular because the processes have
almost no chance to react in a timely and meaningful way.

IMO the only situation where some information about the machine state
would be really useful is during the determination of the "best" size of
buffers (caches), which is meant to reduce the number of system I/O calls.
When these buffers are oversized, the reduced rate of system calls can
be overcompensated by an increased swapping rate of the system. Then
every process could resize its buffers according to the current
*performance* of the overall system. But what's a *useful* formula for
the determination of the actually "best" buffer size? Except for
extremely oversized buffers, near the free virtual address space of a
process, I doubt that the system or a process can *measure* the runtime
impact of a reduction or enlargement of some buffer.

Even when a process implements a pool of buffers, which can be flushed
when a performance degradation suggests doing so, the effect IMO would
not be noticeable. Either the cached information is required for the
actual processing, in which case it doesn't matter whether it's read back
into RAM programmatically or by the system swapping code; or the
information that is not currently required will be swapped out of RAM as
soon as the system feels a need to do so.

Did I miss something?
Post by glen herrmannsfeldt
[This is getting a bit farafield from compilers. Some years ago I spun
off a separate mailing list for GC discussions, which isn't very active
but has over 700 subscribers.
-John]
You're right, memory de/allocation strategies are more an OS than a
compiler or language topic. I only wanted to add some possibly more
important considerations about memory usage, apart from supporting the
laziness of coders by offering them a GC system ;-)

DoDi
glen herrmannsfeldt
2009-07-18 17:30:54 UTC
Hans-Peter Diettrich wrote:
(after I wrote)

<> There is an interesting and maybe related feature of many current
<> systems. Many now do not actually allocate pages when requested, but
<> wait until the allocated memory is modified.
(snip)

< What is "short on memory"?

< Short of *physical* memory is the usual state of general purpose

No, they overallocate virtual memory.

I ran the program I posted, and finally killed it when it was
over 2TB virtual (and 64K real). I don't have anywhere near
2TB of swap space on that machine.

I believe this is true for any somewhat recent Linux version.

I am not sure if run-time library issues should have their own
newsgroup. Memory allocation details are reasonably part of
a run time library discussion, but maybe not compilers.

Anyone working with compilers on current systems, though, should
at least know about this problem.

-- glen
Torben Ægidius Mogensen
2009-07-15 07:52:26 UTC
Post by GPS
Post by Torben Ægidius Mogensen
Post by Christoffer Lernö
If one would like to implement a fast GC using refcounting, should I
implement something like in those papers, or are there better
algorithms out there?
If you want to implement a fast GC, you should not use refcounting.
Reference counting is often more efficient in general than other GC,
especially with modern computer systems, and this statement is
dependent on a variety of factors.
If a GC results in page transfers to/from the disk, then it will
drastically slow down the program, and system.
If a GC follows the roots, any pages that are stored in virtual
memory on a disk will be transferred back into RAM, so as the
program progresses it may experience significant pauses from
transfers at unexpected times.
Reference counting does not require scanning all roots.
The root set is typically fairly small compared to the heap, at least
in languages that use heap allocation a lot (such as OO languages and
functional languages), so having to scan the full root set is not a
major overhead. And with generational GC you can often do minor
collections while staying inside the cache.
Post by GPS
Reference counting (RC) in and of itself does not eliminate leaks.
Nor does any other automatic reclamation method -- liveness is
undecidable, so any automatic method must be conservative and leave
dead stuff around.
Post by GPS
Reference counting can require more memory than other types of GC,
because a container is usually used that contains a refcount, and a
union of some primitive storage for type(s).
While the counter is extra space, I don't think it is significant.
One machine word is sufficient, so it will only be significant for
very small structures. But the fact that objects stay in place after
allocation and are freed to a free-list means that you either need
uniformly sized objects (like in LISP) or will get fragmentation.
This problem is shared with all non-compacting GCs.
Post by GPS
Without reference counting it is not uncommon for another type of GC to
leak. Most operating systems use a C API, and when using threads, and the
native API of the system, there can be races at times that are unexpected,
or unexpected false references. Premature reuse of memory is often just as
bad as a leak, and they both occur with most systems I know of that have
used a form of GC.
Automatic GC should prevent premature reuse, otherwise it is buggy.

Torben
BGB / cr88192
2009-07-15 18:22:21 UTC
Post by Torben Ægidius Mogensen
Post by Christoffer Lernö
If one would like to implement a fast GC using refcounting, should I
implement something like in those papers, or are there better
algorithms out there?
If you want to implement a fast GC, you should not use refcounting. The
main problem with refcounting is that an assignment to a pointer
variable (which would otherwise just be a register-to-register transfer)
requires two memory reads, two memory writes and a conditional jump, so
it stresses the memory system far more than most other GC algorithms.
I have some experience in compiling for these cases, and there are
special-case optimizations by which large numbers of reference-count
adjustments can be eliminated.

similarly, most code is not likely performance-bound by how quickly it can
assign pointers...
Post by Torben Ægidius Mogensen
- Operations on heap pointers are very rare, so even a slow GC doesn't
matter in the overall picture.
- You need hard real-time guarantees.
- The heap is shared between processes/processors with no central
control.
So refcounts are fine for such things as shared libraries (you don't
very often add or delete references to these), but for a functional or
OO language where you have many small heap structures with short
lifetimes, it will give massive overhead.
you mean vs filling the heap with garbage and waiting for a GC cycle?...

IME, GC+refcounts, in practice, gives far better net performance than GC by
itself (mostly as it eliminates a much bigger cost: the whole GC
process), and has the advantage of avoiding awkward and annoying pauses
typically caused by a GC (these are not just for "hard real-time"
requirements, as these delays can be very noticeable and annoying even in
soft real-time apps, where the GC may run and momentarily affect the "flow"
of the whole system...).


the main cost of ref-counting is actually that of writing code to use it
(absent explicit compiler support), and not any such conceptual performance
cost...
George Neuner
2009-07-14 18:27:20 UTC
On Sun, 12 Jul 2009 13:41:59 -0700 (PDT), Christoffer Lernö
Post by Christoffer Lernö
I'm looking into GC using ref-counting.
Does anyone know what passes for state-of-the-art when it comes to ref-
counting algorithms?
You've already found it.
Post by Christoffer Lernö
I've read papers by Lins 2003 "Lazy Cyclic Reference Counting",
Bacon / Rajan 2001 "Concurrent Cycle Collection in Reference Counted
Systems" and a few others.
If one would like to implement a fast GC using refcounting, should I
implement something like in those papers, or are there better
algorithms out there?
Reference counting is really primitive GC technology and you're better
off not starting down that path unless your intent is to provide a
library module that can't make assumptions about its environment.

I haven't read Bacon's paper, but you'll note that Lins uses a marking
scan as a backup to identify and break data cycles for the reference
counter. Although the marking scan need not be run unless memory is
tight, it significantly complicates the overall implementation.

In fact, the marking scanner is nearly half the implementation of a
more capable mark-sweep or mark-compact collector ... neither of which
will have the reference counter's problems with circular data
structures.

If you're dead set on reference counting, you should investigate 1-bit
reference counting. I don't have URLs to give you, but they should be
easy to find. The programming language is designed so that most data
are not shared and then you only need a simple flag to say whether the
item is shared and so needs special handling.

George
Christoffer Lernö
2009-07-15 07:05:12 UTC
Post by George Neuner
On Sun, 12 Jul 2009 13:41:59 -0700 (PDT), Christoffer Lernö
Post by Christoffer Lernö
I've read papers by Lins 2003 "Lazy Cyclic Reference Counting",
Bacon / Rajan 2001 "Concurrent Cycle Collection in Reference Counted
Systems" and a few others.
If one would like to implement a fast GC using refcounting, should I
implement something like in those papers, or are there better
algorithms out there?
Reference counting is really primitive GC technology and you're better
off not starting down that path unless your intent is to provide a
library module that can't make assumptions about its environment.
I haven't read Bacon's paper, but you'll note that Lins uses a marking
scan as a backup to identify and break data cycles for the reference
counter. Although the marking scan need not be run unless memory is
tight, it significantly complicates the overall implementation.
I implemented the simple non-concurrent version of the algorithm as
described in Bacon's paper, and I did not find it all that complex; the
concurrent version is naturally more involved, though.
Post by George Neuner
In fact, the marking scanner is nearly half the implementation of a
more capable mark-sweep or mark-compact collector ... neither of which
will have the reference counter's problems with circular data
structures.
Yes, but what I find interesting is that the scanning in the case of
ref-counting is on a small subset of all objects. For more advanced
tracing collectors, this is also the case. There is a nice paper on
the duality between ref-counting and tracing collectors by Bacon,
Cheng and Rajan.

Bacon also claims that their collector got really good results when it
comes to GC cost, quite competitive with tracing collectors.
Post by George Neuner
If you're dead set on reference counting, you should investigate 1-bit
reference counting. I don't have URL's to give you but they should be
easy to find. The programming language is designed so that most data
are not shared and then you only need a simple flag to say whether the
item is shared and so needs special handling.
Yes, I think I remember reading a bit on that last year.

I guess I simply have to implement a few different collectors (tracing
and ref-counted) and see which type I like best.

/C
George Neuner
2009-07-16 16:35:26 UTC
On Wed, 15 Jul 2009 00:05:12 -0700 (PDT), Christoffer Lernö
Post by Christoffer Lernö
Post by George Neuner
On Sun, 12 Jul 2009 13:41:59 -0700 (PDT), Christoffer Lernö
Post by Christoffer Lernö
If one would like to implement a fast GC using refcounting, should I
implement something like in those papers, or are there better
algorithms out there?
I haven't read Bacon's paper, but you'll note that Lins uses a marking
scan as a backup to identify and break data cycles for the reference
counter. Although the marking scan need not be run unless memory is
tight, it significantly complicates the overall implementation.
I implemented the simple non-concurrent version of the algorithm as
described in Bacon's paper and I did not find it all that complex, the
concurrent version is naturally more involved though.
Marking is certainly not difficult - not even when done concurrently,
but tuning it for good performance is tricky. In particular, how you
handle the queue - as a stack or as a FIFO - and how you handle out of
space conditions on the queue are critical.
Post by Christoffer Lernö
... what I find interesting is that the scanning in the case of
ref-counting is on a small subset of all objects. For more advanced
tracing collectors, this is also the case. There is a nice paper on
the duality between ref-counting and tracing collectors by Bacon,
Cheng and Rajan.
You can only safely restrict scanning if you can segregate objects
that are shared (or could be). Most programming languages don't allow
for that distinction up front, so the compiler or runtime environment
has to handle it behind the scenes.

I previously mentioned 1-bit counting which is one method of handling
the segregated sharing issue. Copying a pointer requires a check of
the object to see whether it is unique or shared. You can freely copy
a pointer to a shared object, but if the object is not shared you
first move it to a separate "shared object" heap, mark it shared
(obviously), and save the new address to both the source and
destination of the pointer copy. Then you can free the original
unshared object.
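
In rough C, the pointer-copy check looks something like this sketch
(shared_heap_move() is an assumed helper that copies the object into the
shared heap):

#include <stdbool.h>
#include <stdlib.h>

typedef struct obj {
  bool   shared;   /* the one-bit "reference count" */
  size_t size;
  /* ... payload ... */
} obj;

extern obj *shared_heap_move(obj *o);  /* assumed: copy o into the shared heap */

/* Copy the pointer in *src into *dst. Shared objects may be aliased freely;
   a unique object is first promoted to the shared heap, both slots are
   repointed at the copy, and the unique original is reclaimed at once. */
static void copy_ptr(obj **dst, obj **src) {
  obj *o = *src;
  if (o != NULL && !o->shared) {
    obj *s = shared_heap_move(o);
    s->shared = true;
    *src = s;
    free(o);
    o = s;
  }
  *dst = o;
}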
Post by Christoffer Lernö
Bacon also claims that their collector got really good results when it
comes to GC cost, quite competitive with tracing collectors.
I have no doubt in Bacon's claims for the applications and conditions
under which he tested.

All collectors have their own "sweet spot" of environment conditions
under which they work very well ... reference counters are no
exception.

For general purpose programming, most studies have shown that tracing
or copying collectors have lower average performance impact on the
mutator(s). Reference counters are known to work better in
constrained memory situations and are pretty much the only effective
way to implement distributed collectors where it's impossible to use
pointers due to disjoint memories (though better ways are being
actively researched).
Post by Christoffer Lernö
I guess I simply have to implement a few different collectors (tracing
and ref-counted) and see which type I like best.
If you're after very high performance (a hobby of mine due to a stint
in real time programming), you should look into Baker's treadmill
collector. Treadmills are expensive space-wise, but they have fixed
processing costs and worst case behavior that is completely
predictable (a critical requirement for RT).

Every GC design has weak spots and most have worst case performance
that is extremely bad (treadmills are an exception to bad worst case
performance, but their weak spot is fixed size allocation blocks).

The trick is to anticipate the use - which is all but impossible if
the GC is intended to support general programming - and design for
either the best average case or the worst possible case. Most GC
systems today are hybrids designed for the best average case.

George
Christoffer Lernö
2009-07-17 18:14:27 UTC
Post by George Neuner
On Wed, 15 Jul 2009 00:05:12 -0700 (PDT), Christoffer Lernö
Post by Christoffer Lernö
Post by George Neuner
On Sun, 12 Jul 2009 13:41:59 -0700 (PDT), Christoffer Lernö
Post by Christoffer Lernö
If one would like to implement a fast GC using refcounting, should I
implement something like in those papers, or are there better
algorithms out there?
I haven't read Bacon's paper, but you'll note that Lins uses a marking
scan as a backup to identify and break data cycles for the reference
counter. Although the marking scan need not be run unless memory is
tight, it significantly complicates the overall implementation.
I implemented the simple non-concurrent version of the algorithm as
described in Bacon's paper and I did not find it all that complex, the
concurrent version is naturally more involved though.
Marking is certainly not difficult - not even when done concurrently,
but tuning it for good performance is tricky. In particular, how you
handle the queue - as a stack or as a FIFO - and how you handle out of
space conditions on the queue are critical.
My implementation used a bounded array that would GC all roots
automatically when the array is full. I noticed that in my experimental
microbenchmark I got the best results when having around 100 entries
in the array - increasing or decreasing this number would increase the
GC-time. (My benchmark would create groups of 4 objects with cyclic
dependency and let the cycle collection run automatically). I suspect
that the optimal number will differ quite a bit depending on the type
of allocation that is done.
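
(Roughly, the scheme is just the following - a sketch with invented names,
not my actual code, where collect_cycles() stands in for the Bacon-style
trial-deletion scan:)

#define ROOT_MAX 100  /* ~100 entries gave the best results in my microbenchmark */

typedef struct obj obj;

extern void collect_cycles(obj **roots, int nroots);  /* assumed cycle scan */

static obj *roots[ROOT_MAX];
static int  nroots;

/* An object whose count was decremented but did not reach zero is a
   possible cycle root: buffer it, and run the cycle collector when the
   buffer fills up. */
static void add_possible_root(obj *o) {
  if (nroots == ROOT_MAX) {
    collect_cycles(roots, nroots);
    nroots = 0;
  }
  roots[nroots++] = o;
}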
Post by George Neuner
Post by Christoffer Lernö
... what I find interesting is that the scanning in the case of
ref-counting is on a small subset of all objects. For more advanced
tracing collectors, this is also the case. There is a nice paper on
the duality between ref-counting and tracing collectors by Bacon,
Cheng and Rajan.
You can only safely restrict scanning if you can segregate objects
that are shared (or could be). Most programming languages don't allow
for that distinction up front, so the compiler or runtime environment
has to handle it behind the scenes.
You are talking about restricting scanning for a tracing GC?
Post by George Neuner
Post by Christoffer Lernö
Bacon also claims that their collector got really good results when it
comes to GC cost, quite competitive with tracing collectors.
I have no doubt in Bacon's claims for the applications and conditions
under which he tested.
Most of it comes from the implementation of the Jalapeno Java VM. From
what I can gather there were quite a lot of different GCs implemented and
benchmarked on that VM.

http://www.research.ibm.com/people/d/dfb/papers.html
Post by George Neuner
Post by Christoffer Lernö
I guess I simply have to implement a few different collectors (tracing
and ref-counted) and see which type I like best.
If you're after very high performance (a hobby of mine due to a stint
in real time programming), you should look into Baker's treadmill
collector. Treadmills are expensive space-wise, but they have fixed
processing costs and worst case behavior that is completely
predictable (a critical requirement for RT).
I saw that treadmills are supposedly covered in Lins' book mentioned
earlier (which I have ordered); I'll do some googling to find out more too.
But if you have any direct links to papers, I'd be happy if you would
share them.
Post by George Neuner
Every GC design has weak spots and most have worst case performance
that is extremely bad (treadmills are an exception to bad worst case
performance, but their weak spot is fixed size allocation blocks).
The trick is to anticipate the use - which is all but impossible if
the GC is intended to support general programming - and design for
either the best average case or the worst possible case. Most GC
systems today are hybrids designed for the best average case.
My problem with tracing GCs is that once the heap grows in excess of a
few hundred MB, most exhibit noticeable pauses. I want a GC that is
suitable for game programming, and I'd much rather have a steady lower
throughput that is predictable, than one that occasionally pauses but
with better average performance.

For a game, you'd want 30 or 60 FPS, meaning that the total time
between frames is ~33 ms or ~17 ms respectively.

It's far preferable that the GC steals 5ms every frame rather
than 100ms every 600 frames.

(And that would be a rather well-behaved GC - I've experienced Java
applications freezing for tens of seconds on my new MacBook Pro when
collecting around 100MB out of 250MB.)

So anything that has a constant cost is much preferable to something
that causes you to pay for GC in large chunks of time.
George Neuner
2009-07-18 02:33:54 UTC
On Fri, 17 Jul 2009 11:14:27 -0700 (PDT), Christoffer Lernö
Post by Christoffer Lernö
Post by George Neuner
On Wed, 15 Jul 2009 00:05:12 -0700 (PDT), Christoffer Lernö
Post by Christoffer Lernö
Post by George Neuner
On Sun, 12 Jul 2009 13:41:59 -0700 (PDT), Christoffer Lernö
... what I find interesting is that the scanning in the case of
ref-counting is on a small subset of all objects. For more advanced
tracing collectors, this is also the case. There is a nice paper on
the duality between ref-counting and tracing collectors by Bacon,
Cheng and Rajan.
You can only safely restrict scanning if you can segregate objects
that are shared (or could be). Most programming languages don't allow
for that distinction up front, so the compiler or runtime environment
has to handle it behind the scenes.
You are talking about restricting scanning for a tracing GC?
No, I'm talking about segregating shared objects in a ref-counter GC
so that only the shared object heap needs to be scanned for breaking
cycles.
Post by Christoffer Lernö
Post by George Neuner
Post by Christoffer Lernö
Bacon also claims that their collector got really good results when it
comes to GC cost, quite competitive with tracing collectors.
I have no doubt in Bacon's claims for the applications and conditions
under which he tested.
Most comes from the implementation of the Jalapeno java VM. From what
I can gather there was quite a lot different GCs implemented and
benchmarked on the VM.
http://www.research.ibm.com/people/d/dfb/papers.html
Post by George Neuner
Post by Christoffer Lernö
I guess I simply have to implement a few different collectors (tracing
and ref-counted) and see which type I like best.
If you're after very high performance (a hobby of mine due to a stint
in real time programming), you should look into Baker's treadmill
collector. Treadmills are expensive space-wise, but they have fixed
processing costs and worst case behavior that is completely
predictable (a critical requirement for RT).
I saw that Treadmills was supposedly covered in Lins book mentioned
earlier (which I ordered), I'll do some googling to find out more too.
But if you have any direct links to papers I'd be happy if you would
share them.
Baker's classic paper is at
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.5878

Be aware that it describes only the basic structures and operation ...
it does not offer hints as to how to adapt the system for any
particular purpose. [The same can be said for most GC papers. Most
successful GC implementations are hybrids.]

Lins' book is the bible of GC. Don't be concerned by the old
publication date (early 90's IIRC) - there are no truly "new" ideas in
GC, only new combinations of basic ideas that are all covered in that
book.
Post by Christoffer Lernö
Post by George Neuner
Every GC design has weak spots and most have worst case performance
that is extremely bad (treadmills are an exception to bad worst case
performance, but their weak spot is fixed size allocation blocks).
The trick is to anticipate the use - which is all but impossible if
the GC is intended to support general programming - and design for
either the best average case or the worst possible case. Most GC
systems today are hybrids designed for the best average case.
My problem with tracing GCs is that once the heap grows in excess of a
few hundred MB, most exhibit noticeable pauses. I want a GC that is
suitable for game programming, and I'd much rather have a steady lower
throughput that is predictable, than one that occasionally pauses but
with better average performance.
That's not a failing of tracing GC, it's a failing of the particular
implementation. There are lazy tracers that have near zero impact on
the mutator - the downside is that they need a (helluva) lot of extra
memory to make sure the mutator doesn't starve.
see http://portal.acm.org/citation.cfm?id=286878
[don't know if this one is available elsewhere - I can send it to you
if need be.]

Generational GC solves a lot of the pause complaints because most
collection effort is concentrated on the nursery area which is
typically 1 to a few megabytes or on the first generation which is
typically 10 to 20 megabytes. Older generations are collected less
and less frequently.
Post by Christoffer Lernö
For a game, you'd want 30 or 60 FPS, meaning that the total time
between frames are ~33 ms or ~17 ms respectively.
It's far preferable that the GC steals 5ms every frame rather
than 100ms every 600 frames.
(And that would be a rather well-behaved GC - I've experienced Java
applications freezing for tens of seconds on my new MacBook Pro when
collecting around 100MB out of 250MB.)
Not that this will solve your problem, but the server version of the
JVM has much better GC behavior.
Post by Christoffer Lernö
So anything that has a constant cost is much preferable to something
that causes you to pay for GC in large chunks of time.
Reference counting in general does _NOT_ have constant costs ... RC's
costs depend greatly on whether linked structures are eagerly or
lazily freed. Sweeping a block for pointers and freeing the
associated blocks depends on the size of the allocation - you can
bound concurrent lazy sweeping, but you can't reuse a block until it
has been completely swept so if you allow variably sized allocations
there is always lurking a pathological case where a large array of
pointers needs to be swept before reuse.

Treadmills, because they deal only with fixed sized blocks, have
constant operation costs and predictable behavior. They are suitable
for use in real time programming if the worst case operation meets the
system's timing constraints. Because they are non-moving collectors,
tracing can be done in parallel with the mutator. Treadmills only
need to interrupt the mutator during sweeping, which can be done lazily
and involves only strictly bounded, constant cost operations.
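
For orientation, the core structure is nothing more than a cyclic
doubly-linked list of fixed-size cells partitioned by four roving
pointers - roughly this sketch:

/* All cells sit on one cyclic doubly-linked list, divided by four pointers
   into free, condemned (white), to-be-scanned (grey) and scanned (black)
   segments. Allocation, greying and blackening are all just unlink/relink
   operations on this list, which is where the constant per-operation cost
   comes from. */
typedef struct cell {
  struct cell *prev, *next;
  void        *payload;   /* fixed-size slot */
} cell;

typedef struct treadmill {
  cell *free;    /* next free cell to allocate from         */
  cell *bottom;  /* one end of the condemned (white) region */
  cell *top;     /* other end of the white region           */
  cell *scan;    /* next grey cell to trace                 */
} treadmill;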

George
Christoffer Lernö
2009-07-18 22:06:26 UTC
Post by George Neuner
On Fri, 17 Jul 2009 11:14:27 -0700 (PDT), Christoffer Lernö
Post by Christoffer Lernö
Post by George Neuner
You can only safely restrict scanning if you can segregate objects
that are shared (or could be). Most programming languages don't allow
for that distinction up front, so the compiler or runtime environment
has to handle it behind the scenes.
You are talking about restricting scanning for a tracing GC?
No, I'm talking about segregating shared objects in a ref-counter GC
so that only the shared object heap needs to be scanned for breaking
cycles.
So the idea is to let the ref-counting handle the non-shared objects,
and then the shared set (which should be significantly smaller) is
handled by a tracing GC?
Post by George Neuner
Post by Christoffer Lernö
Post by George Neuner
If you're after very high performance (a hobby of mine due to a stint
in real time programming), you should look into Baker's treadmill
collector. Treadmills are expensive space-wise, but they have fixed
processing costs and worst case behavior that is completely
predictable (a critical requirement for RT).
I saw that Treadmills was supposedly covered in Lins book mentioned
earlier (which I ordered), I'll do some googling to find out more too.
But if you have any direct links to papers I'd be happy if you would
share them.
Baker's classic paper is at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.5878
Since this is a tracing GC, will not a pure version of a Treadmill
take longer to complete a round depending on how large the heap is?

I admit to still having a rather vague understanding of various
tracing GCs, but don't the vanilla tracing GCs necessarily work slower
the larger the heap is?

Especially if we have a large object tree that gets scanned repeatedly
but never needs to be collected.

My understanding is that the idea with generational GCs is to move
those structures to scan them much less often than majority of objects
that are very short-lived.

On the other hand, perhaps your suggestion is exactly that: use 1-bit
refcounts to separate objects, and then slowly run a treadmill GC on
the shared objects.

Still, the fact that large, unchanging objects needs to be scanned is
a bit annoying as it increases costs the larger the heap is.

The cyclic tracer in Bacon's paper has the same problem as long as
references to these objects are added and removed:

A typical case where this is an issue is where you have a Server
object that has references to most other parts of the system. This
server object creates connection objects that are added when new
connections are made, and removed when they disconnect.

If the connection objects keep a reference to the Server object, we'd
need to scan the whole server object and its references - which might be
almost the entire heap! - every time we want to destroy a circular
reference to a connection.
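In code the cycle is just something like this (made-up names):

/* Made-up types, only to picture the cycle: */
typedef struct Connection Connection;

typedef struct Server {
    Connection **connections;     /* Server -> every live Connection   */
    int nconnections;
    /* ... plus references to most other parts of the system ...       */
} Server;

struct Connection {
    Server *server;               /* Connection -> Server: the cycle   */
    /* ... */
};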
Post by George Neuner
Post by Christoffer Lernö
Post by George Neuner
Every GC design has weak spots and most have worst case performance
that is extremely bad (treadmills are an exception to bad worst case
performance, but their weak spot is fixed size allocation blocks).
The trick is to anticipate the use - which is all but impossible if
the GC is intended to support general programming - and design for
either the best average case or the worst possible case. Most GC
systems today are hybrids designed for the best average case.
My problem with tracing GCs is that once the heap grows in excess of a
few hundred MB, most exhibit noticeable pauses. I want a GC that is
suitable for game programming, and I'd much rather have a steady lower
throughput that is predictable, than one that occasionally pauses but
with better average performance.
That's not a failing of tracing GC, it's a failing of the particular
implementation. There are lazy tracers that have near zero impact on
the mutator - the downside is that they need a (helluva) lot of extra
memory to make sure the mutator doesn't starve.
But that's also a problem. Obviously any GC would need to be able to
keep up with the garbage it needs to collect.
Post by George Neuner
Generational GC solves a lot of the pause complaints because most
collection effort is concentrated on the nursery area which is
typically 1 to a few megabytes or on the first generation which is
typically 10 to 20 megabytes. Older generations are collected less
and less frequently.
Isn't there the risk that, unless you accept a pause when the older
generation really IS collected, you'd need the older-generation
collector not to stop the world when it runs?

I mean, the problem with the IDE with the long pauses was basically
that it seemed to put off GC for 30 minutes or so, but then
compensate by collecting for a much longer period when it finally
invoked the GC.

Although perhaps not that extreme, that's what happens in several of
the java GCs as well. And where things REALLY break down is when a GC
decides to do a trace when it has swapped into virtual memory. I've
seen processes freeze for over 30 minutes trying to do a full GC when
partially swapped into virtual memory.
Post by George Neuner
Post by Christoffer Lernö
For a game, you'd want 30 or 60 FPS, meaning that the total time
between frames is ~33 ms or ~17 ms respectively.
It's far preferable that the GC steals 5 ms every frame
than 100 ms every 600 frames.
(And that would be a rather well-behaved GC - I've experienced Java
applications freezing for tens of seconds on my new MacBook Pro when
collecting around 100 MB out of 250 MB.)
Not that this will solve your problem, but the server version of the
JVM has much better GC behavior.
I've noticed a bit of improvement, but I still have seen the
occasional hiccups of many hundred ms.
Post by George Neuner
Post by Christoffer Lernö
So anything that has a constant cost is much preferable to one that
causes you to pay chunks of time for GC.
Reference counting in general does _NOT_ have constant costs ... RC's
costs depend greatly on whether linked structures are eagerly or
lazily freed.
You're right, of course. But being able to know approximately when you
might be releasing a large number of objects is very attractive if you
are working on time-critical code.

Let's say you have code like this:

while(1)
{
updateLogic();
updateBuffer();
sleep(timeToFrame());
switchBuffer();
}

Then you want to be pretty damn sure that the GC won't kick in after
sleeping 5 ms out of 6 and run its 10 ms collection.

Just suddenly braindumping here, but would it not be worthwhile to be
able to control garbage collection? In java you can sort of force a
GC, but that's pretty much it.

It would be nice to say "start automatic GC scheduling" or "wait with
GC" or "run GC for 10ms starting from now".

Any ideas why such controls are absent from Java and several other
GC'ed languages?

/C
George Neuner
2009-07-21 06:13:29 UTC
Permalink
I think John is getting tired of this thread - GC is more a runtime
issue than a compiler one (except maybe for region-based GC which is
explicitly compiler directed).

I'm happy to continue this discussion by email if you want, but I do
suggest that you take the time to read Lins book because it seems like
you have some confusion about how the various forms of GC operate.
[You're also welcome to move it to gclist. See
http://lists.services.net/cgi-bin/mj_wwwusr/domain=lists.iecc.com?func=lists-full-long&extra=gclist
-John]

On Sat, 18 Jul 2009 15:06:26 -0700 (PDT), Christoffer Lernö
Post by Christoffer Lernö
So the idea is to let the ref counting work for the non-shared and
then the shared set (which should be significantly smaller) is handled
by a tracing GC?
That's one way to do it. Another I'll describe in a moment.

You're making a semantic mistake in equating "tracing" with "tracing
GC". Tracing is just a means to identify the set of live data - the
actual work of reclaiming garbage blocks is generically known as
"sweeping" and it can different forms. For most non-moving
collectors, tracing is just the first step in performing a collection.
OTOH, there are copying collectors for which tracing is the entirety
of the collection process. [For anyone about to interject here, I am
considering that Mark-Compact is a copying collector.]


For a ref-counter, the "interesting" set - cycles - is a subset of the
garbage. Garbage normally has no roots and is mainly disjoint ... so
it can't really be traced. So you first have to separate the garbage
by tracing and marking the live data. Then you scan the heap looking
for unmarked (garbage) chains of blocks that are circularly linked.

When a cycle is identified, the scanner breaks it by nulling the
pointer back to the head block and reducing the head block's counter.
At the end of scanning, the newly linear garbage can be freed normally
by the ref-counter. You can't do it immediately upon breaking a cycle
because the same blocks may be members of multiple cycles. The
scanner needs to remember the head blocks of the cycles it finds and
send all the chains to the ref-counter after the heap scan is
complete.
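Compressed into code, the two phases look roughly like this (the object
layout and helper names are mine, not Lins' or Bacon's):

typedef struct Obj {
    int rc;                  /* ordinary reference count */
    int marked;              /* set during the live trace */
    struct Obj **fields;
    int nfields;
} Obj;

/* Phase 1: trace from the roots and mark everything live. */
static void mark_live(Obj *o) {
    if (!o || o->marked) return;
    o->marked = 1;
    for (int i = 0; i < o->nfields; i++)
        mark_live(o->fields[i]);
}

/* Phase 2 helper: break one detected cycle among unmarked objects by
   nulling the back-pointer to its head and dropping the head's count.
   The now-linear chain is only handed to the normal ref-counter once
   the whole heap scan is finished, since blocks may belong to several
   cycles at once. */
static void break_cycle(Obj *head, Obj *tail, int back_field) {
    tail->fields[back_field] = NULL;
    head->rc--;
}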
Post by Christoffer Lernö
Baker's classic paper [on Treadmills] is at
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.5878
Since this is a tracing GC, will not a pure version of a Treadmill
take longer to complete a round depending on how large the heap is?
I admit to still having a rather vague understanding of various
tracing GCs, but don't the vanilla tracing GCs necessarily work slower
the larger the heap is?
Tracing in a Mark-Sweep collector touches only live data, so it
depends on utilization rather than total heap size ... it's the
sweeping phase that depends on heap size.

Mark-Compact and Cheney-style copying collectors touch only live data.
Post by Christoffer Lernö
Especially if we have a large object tree that gets scanned repeatedly
but never needs to be collected.
My understanding is that the idea with generational GCs is to move
those structures to scan them much less often than majority of objects
that are very short-lived.
Yes. The heap is subdivided into smaller areas which are handled
separately.
Post by Christoffer Lernö
On the other hand, perhaps your suggestion is exactly that: use 1-bit
refcounts to separate objects, and then slowly run a treadmill GC on
the shared objects.
My suggestion was that shared objects be segregated and handled
separately. Any method that will handle cycles can be used to manage
the shared data.

I personally would not choose to use a ref-counter at all in a system
that requires high performance ... there is too much variability in
reclamation regardless of whether it is done aggressively or lazily.

AFAIK, Treadmills are the only non-moving collectors for which, like
Cheney copiers, tracing is the entirety of the collection process. And
they are the _only_ collectors for which time and cycle costs are
predictable.
Post by Christoffer Lernö
Still, the fact that large, unchanging objects needs to be scanned is
a bit annoying as it increases costs the larger the heap is.
Tracing can always be done in parallel with the operation of the
mutator. In a non-moving collector, sweeping can be too. They cost
very little if you have a multi-processor.
Post by Christoffer Lernö
A typical case where this is an issue is where you have a Server
object that has references to most other parts of the system. This
server object creates connection objects that are added when new
connections are made, and removed when they disconnect.
If the connection objects keep a reference to the Server object, we'd
need to scan the whole server object and its references - which might be
almost the entire heap! - every time we want to destroy a circular
reference to a connection.
That's what generational GC is for.
Post by Christoffer Lernö
Obviously any GC would need to be able to keep up with the garbage it
needs to collect.
True ... but this represents another possible misunderstanding about
how GC works. Only ref-counters and Mark-Sweep collectors actually
"collect garbage". Mark-Compact and Cheney-style copying collectors
completely _ignore_ garbage: they are affected only by the amount of
live data and not by heap size or by the rate of garbage accumulation.
Post by Christoffer Lernö
Generational GC solves a lot of the pause complaints because most
collection effort is concentrated on the nursery area ...
... Older generations are collected less and less frequently.
Isn't there the risk that, unless you accept a pause when the older
generation really IS collected, you'd need the older-generation
collector not to stop the world when it runs?
That depends on how the system is structured and is way too involved
to answer here. The short answer is: you can't design a system that
won't pause ... but you can design a system that won't pause
noticeably for a given definition of "noticeable".
Post by Christoffer Lernö
... being able to know approximately when you might be releasing a
large number of objects is very attractive if
you are working on time-critical code.
Yes. I have a real-time programming background so I understand this
desire completely. The knee-jerk fix for this is to use stack or
region allocation. Practical region-based GC is still a research
project. See http://www.it-c.dk/research/mlkit/papers.html
Post by Christoffer Lernö
It would be nice to say "start automatic GC scheduling" or "wait with
GC" or "run GC for 10ms starting from now".
Any ideas why such controls are absent from Java and several other
GC'ed languages?
Because experience has shown that most programmers will eventually
hang themselves if you give them rope.

Most GC'd languages have as a goal the desire to protect programmers
from themselves. The average programmer today is only a little better
than poor. Language features like GC and monitor-based mutual
exclusion (Java's synchronized objects) are no longer conveniences for
good programmers but necessities without which the average programmers
cannot work effectively.

George
Christoffer Lernö
2009-07-22 09:31:06 UTC
Permalink
Post by George Neuner
I think John is getting tired of this thread - GC is more a runtime
issue than a compiler one (except maybe for region-based GC which is
explicitly compiler directed).
(I'll take the more general discussion to e-mail, but perhaps we can
continue the part that deals with compiler-directed GC)

Speaking about about regions, I stumbled over Cyclone by pure accident
(and by implication also read up on linear types).

Since I'm trying to do an imperative OO language (yet another one), I
feel it's sort of a fight: an OO language is by its nature a natural
fit for heap allocation, whereas what one constantly dreams about is
things being as easy as stack allocation, which gives both extremely
fast allocation and perfect garbage collection.

Cyclone sort of shows how it is possible to get at least a little
closer, while actually giving you a bit more to play with than C.

Even before reading about Cyclone I was playing with the idea of being
able to define something similar, for instance being able to
explicitly select allocation region, where you then could define
multiple heaps and GC them separately. (Perhaps even having ways of
merging them, moving from one to another etc)

That's not quite the same as regions in Cyclone, but perhaps it might
be 'good enough'.

I was thinking about something like this:

Heap newHeap = Heap.allocate( ... arguments ...);

newHeap.push();

... all heap-allocations now default to newHeap ...

newHeap.pop();

... heap allocations now on default heap.

newHeap.collect(); // Collects the heap
Heap.current().merge(newHeap); // Merges the heaps.


Another scheme I was thinking of would be to _always_ initially
(dynamically) allocate on the stack, and then copy to the heap the
objects that still have references when exiting a function.

This would obviously cost a lot if most allocations need to be on the
heap, but if objects are mostly transient, could it be worth it?

One would obviously need to mark the object to know when it has been
assigned to a non-stack object, or a stack object that has been
marked.


This would then be sort of an alternative to escape analysis which I
suppose must be rather conservative. For that reason I suspect it is
even harder to do the proper analysis in a dynamically typed language
as it is not possible during compile time to determine the exact
methods that will be called.

Instead the pessimistic assumption must be that any local variable
that is sent as an argument to a method must be allocated on the heap
(as it might possibly be retained further down the calling chain).

If one instead would use a marking (i.e. 1 bit refcounting) scheme on
the stack allocated objects, one would perhaps get a less conservative
estimation, but obviously paid with the cost of having to copy stack
objects to the heap.
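Roughly what I'm picturing, in C-ish pseudocode (every name here is
invented):

/* Invented sketch of the idea: bump-allocate objects in the frame,
   set a 1-bit 'escaped' flag from the write barrier, and promote
   flagged objects to the heap when the frame is torn down. */
#include <string.h>

typedef struct Hdr {
    int escaped;              /* the 1-bit "refcount"                  */
    size_t size;              /* total object size, header included    */
} Hdr;

void *stack_alloc(size_t n);  /* bump allocation in the current frame  */
void *heap_alloc(size_t n);

/* Run on every pointer store: storing into a heap object, a global,
   or an already-escaped stack object marks the source as escaped. */
static void write_barrier(Hdr *src, int dest_is_heap_or_escaped) {
    if (src && dest_is_heap_or_escaped)
        src->escaped = 1;
}

/* On function exit: copy out the survivors, let the rest die with the
   frame. Fixing up pointers into the old copies is the hard part and
   is glossed over here. */
static void *promote_if_escaped(Hdr *o) {
    if (!o->escaped)
        return NULL;
    void *copy = heap_alloc(o->size);
    memcpy(copy, o, o->size);
    return copy;
}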


/Christoffer
George Neuner
2009-07-25 06:46:43 UTC
Permalink
On Wed, 22 Jul 2009 02:31:06 -0700 (PDT), Christoffer Lernö
Post by Christoffer Lernö
Even before reading about Cyclone I was playing with the idea of being
able to define something similar [to lexically scoped heap regions],
for instance being able to explicitly select allocation region, where
you then could define multiple heaps and GC them separately. (Perhaps
even having ways of merging them, moving from one to another etc)
That's not quite the same as regions in Cyclone, but perhaps it might
be 'good enough'.
Cyclone's compiler does not perform region analysis (per Tofte) to
direct allocation of program objects to particular region heaps. In
Cyclone, regions are named entities and allocation of program objects
from them is under programmer control.

Tofte's regions are hidden runtime entities known only to the compiler
- the decision to allocate an object from a particular region is made
automatically by the compiler using a combination of escape and static
lifetime analysis.

Tofte's regions are lexically scoped, but because the language he
used, ML, provides 1st class functions and closures, his regions have
indefinite extent and behave more like Cyclone's "dynamic" regions
than like its lexical regions.
Post by Christoffer Lernö
Another scheme I was thinking of would be to _always_ initially
(dynamically) allocate on the stack, and then copy to the heap the
objects that still have references when exiting a function.
This would obviously cost a lot if most allocations need to be on the
heap, but if objects are mostly transient, could it be worth it?
It's always worth avoiding such a copy if possible. It would be much
more useful to perform escape analysis on your objects to determine
whether they can be stack allocated.
Post by Christoffer Lernö
This would then be sort of an alternative to escape analysis which I
suppose must be rather conservative. For that reason I suspect it is
even harder to do the proper analysis in a dynamically typed language
as it is not possible during compile time to determine the exact
methods that will be called.
Escape analysis is purely an issue of lexical scoping and whether the
object may outlive the scope in which it is defined. What form of
typing the language uses is not relevant ... you are only looking for
the object to leave the control of the scope chain.

It is not really any harder to do escape analysis for Lisp than it is
for C - it just takes longer because Lisp allows nested functions and
C doesn't.
Post by Christoffer Lernö
Instead the pessimistic assumption must be that any local variable
that is sent as an argument to a method must be allocated on the heap
(as it might possibly be retained further down the calling chain).
No. Technically, only objects which have identity can escape. For
this purpose "identity" means "location" and further means that
potential escapees are limited to actual data structures (closure
creation) or to pointers to local data structures.

However, a pointer to a local data structure is only a problem if it
if is somehow passed upward to an enclosing scope or sideways to a
disjoint scope which preserves it (which may be indeterminate). In
either case you would need to heap allocate the data structure to be
safe. If neither situation applies, the data structure can be stack
allocated.

George
Christoffer Lernö
2009-07-27 18:18:56 UTC
Permalink
Post by George Neuner
On Wed, 22 Jul 2009 02:31:06 -0700 (PDT), Christoffer Lernö
Post by Christoffer Lernö
Even before reading about Cyclone I was playing with the idea of being
able to define something similar [to lexically scoped heap regions],
for instance being able to explicitly select allocation region, where
you then could define multiple heaps and GC them separately. (Perhaps
even having ways of merging them, moving from one to another etc)
That's not quite the same as regions in Cyclone, but perhaps it might
be 'good enough'.
Cyclone's compiler does not perform region analysis (per Tofte) to
direct allocation of program objects to particular region heaps. In
Cyclone, regions are named entities and allocation of program objects
from them is under programmer control.
From a performance perspective (I mentioned my wish for predictable GC
performance before), would it make sense to enable programmer
controlled regions rather than pure compiler directed ones?
Post by George Neuner
Post by Christoffer Lernö
This would then be sort of an alternative to escape analysis which I
suppose must be rather conservative. For that reason I suspect it is
even harder to do the proper analysis in a dynamically typed language
as it is not possible during compile time to determine the exact
methods that will be called.
Escape analysis is purely an issue of lexical scoping and whether the
object may outlive the scope in which it is defined. What form of
typing the language uses is not relevant ... you are only looking for
the object to leave the control of the scope chain.
In the case of an OO language with dynamic dispatch, then if I'm not
mistaken it's not really possible to tell what happens to any of the
arguments of a method invocation (including the object itself) at
compile time.

It's only after the fact (when returning from the current scope) that
one is in a position to determine whether the object will outlive the
current scope or not.

That's why I suggested allocating everything on the stack, as I
suppose that most allocation could safely be done on the stack
(assuming constructor allocation being inlined to the parent scope).

Am I missing something that could enable me to do proper escape
analysis and put allocations on the stack?


/Christoffer
George Neuner
2009-07-30 04:28:38 UTC
Permalink
On Mon, 27 Jul 2009 11:18:56 -0700 (PDT), Christoffer Lernö
Post by Christoffer Lernö
Post by George Neuner
On Wed, 22 Jul 2009 02:31:06 -0700 (PDT), Christoffer Lernö
From a performance perspective (I mentioned my wish for predictable GC
performance before), would it make sense to enable programmer
controlled regions rather than pure compiler directed ones?
Programmer controlled regions = manual memory management. There is
nothing wrong with it (and Mark-Release regions are an extremely
efficient way to do it), but it cannot be construed to be "automatic"
or "GC" any more than is deleting a member array in a C++ object's
destructor.
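For reference, a Mark-Release region is little more than a bump
allocator with a saved watermark (sketch only; names are mine):

#include <stddef.h>

typedef struct Region {
    char *base, *top, *limit;        /* one contiguous chunk           */
} Region;

static void *region_alloc(Region *r, size_t n) {
    n = (n + 7) & ~(size_t)7;        /* keep allocations aligned       */
    if (r->top + n > r->limit)
        return NULL;                 /* region exhausted               */
    void *p = r->top;
    r->top += n;                     /* no per-object free at all      */
    return p;
}

static char *region_mark(Region *r)             { return r->top; }
static void  region_release(Region *r, char *m) { r->top = m;    }
/* release reclaims, in O(1), everything allocated since the mark */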
Post by Christoffer Lernö
Post by George Neuner
Escape analysis is purely an issue of lexical scoping and whether the
object may outlive the scope in which it is defined. What form of
typing the language uses is not relevant ... you are only looking for
the object to leave the control of the scope chain.
In the case of an OO language with dynamic dispatch, then if I'm not
mistaken it's not really possible to tell what happens to any of the
arguments of a method invocation (including the object itself) at
compile time.
It's only after the fact (when returning from the current scope) that
one is in a position to determine whether the object will outlive the
current scope or not.
Even though the control path is not determined until run time, it is
still statically specified (even in Lisp where you can add/subtract
methods at run time). Although you don't know which control path will
be taken, you still know that (depending on what the language allows)
only actual OO objects, local structures or pointers to them can
possibly escape. The analysis considers the set of potential escapees
passed to the static method call. The actual types involved don't
matter, the analysis is only concerned with names.

If you can do whole program or compilation unit analysis, you may find
through call chain analysis that escape is impossible even though
escape analysis says it's possible. Some functional languages do this
(under high optimization settings) to minimize unnecessary heap
allocation.

Post by Christoffer Lernö
That's why I suggested allocating everything on the stack, as I
suppose that most allocation could safely be done on the stack
(assuming constructor allocation being inlined to the parent scope).
The problem with stack allocating first and then copying is that the
copy must be made *before* the pointer potentially escapes. This is
because once the object leaves the scope chain, anything might happen
to it - including hidden access by an asynchronous mutator.

You also have to consider the nature of the objects such as whether a
shallow copy is sufficient or a deep copy is necessary and whether you
can actually perform the needed operations automatically.

Beyond that, once the heap copy exists the stack copy logically dies,
which means that the local code must only reference the heap copy from
that point on.

The escape analysis tells you indirectly which objects are safe to
stack allocate by telling you which objects must be heap allocated to
be safe. Escape analysis is conservative in that it doesn't know
whether the object does escape ... it only can tell that it might.
Post by Christoffer Lernö
Am I missing something that could enable me to do proper escape
analysis and put allocations on the stack?
I think at this point you understand the actual analysis and the
potential consequences adequately. What you need to consider now is
how using it impacts your design.

George
Christoffer Lernö
2009-08-02 14:54:06 UTC
Permalink
Post by George Neuner
On Mon, 27 Jul 2009 11:18:56 -0700 (PDT), Christoffer Lernö
Post by Christoffer Lernö
In the case of an OO language with dynamic dispatch, then if I'm not
mistaken it's not really possible to tell what happens to any of the
arguments of a method invocation (including the object itself) at
compile time.
It's only after the fact (when returning from the current scope) that
one is in a position to determine whether the object will outlive the
current scope or not.
Even though the control path is not determined until run time, it is
still statically specified (even in Lisp where you can add/subtract
methods at run time). Although you don't know which control path will
be taken, you still know that (depending on what the language allows)
only actual OO objects, local structures or pointers to them can
possibly escape. The analysis considers the set of potential escapees
passed to the static method call. The actual types involved don't
matter, the analysis is only concerned with names.
If I borrow an example from some Objective-C sample code:

- (void)awakeFromNib {
NSArray *aspects = [self aspects];
NSUInteger i, c = [aspects count];

NSString *path;
NSData *rtfData = nil;

// Create the NSTextStorage and load the default text file
path = [[NSBundle bundleForClass:[self class]]
pathForResource:@"README" ofType:@"rtf"];
rtfData = [NSData dataWithContentsOfFile:path];
_textStorage = [[NSTextStorage allocWithZone:[self zone]]
initWithRTF:rtfData documentAttributes:NULL];

// Load the popup.
[aspectPopUpButton removeAllItems];
for (i=0; i<c; i++) {
[aspectPopUpButton addItemWithTitle:[[aspects objectAtIndex:i]
aspectName]];
}

// Set up an initial aspect to view
if (c > 0) {
[self swapInAspectAtIndex:0];
} else {
// No aspects!
[aspectPopUpButton addItemWithTitle:NSLocalizedString(@"No Aspect", @"Item title for aspect popup if no aspects exist.")];
}
}

Can we say anything at all about the non-primitives here?

We create the objects NSString path and NSData rtfData, but both of
those are arguments to methods whose implementations are
determined at runtime, which means we cannot determine whether the
extent of these objects needs to outlive the current scope.

In fact, I cannot think of any non-trivial object usage where we can
assert that escape is impossible, since even

SomeObject *someObject = [[SomeObject alloc] init];

could possibly already cause "someObject" to be registered to an
object outside the scope of the current function.

That's why I meant that there is an issue with dynamic dispatch.


/Christoffer
George Neuner
2009-08-03 06:35:25 UTC
Permalink
On Sun, 2 Aug 2009 07:54:06 -0700 (PDT), Christoffer Lernö
Post by Christoffer Lernö
Post by George Neuner
On Mon, 27 Jul 2009 11:18:56 -0700 (PDT), Christoffer Lernö
Post by Christoffer Lernö
In the case of an OO language with dynamic dispatch, then if I'm not
mistaken it's not really possible to tell what happens to any of the
arguments of a method invocation (including the object itself) at
compile time.
It's only after the fact (when returning from the current scope) that
one is in a position to determine whether the object will outlive the
current scope or not.
Even though the control path is not determined until run time, it is
still statically specified (even in Lisp where you can add/subtract
methods at run time). Although you don't know which control path will
be taken, you still know that (depending on what the language allows)
only actual OO objects, local structures or pointers to them can
possibly escape. The analysis considers the set of potential escapees
passed to the static method call. The actual types involved don't
matter, the analysis is only concerned with names.
snip long example
Post by Christoffer Lernö
Can we say anything at all about the non-primitives here?
No.
Post by Christoffer Lernö
We create the objects NSString path and NSData rtfData, but both of
those are arguments to methods whose implementations are
determined at runtime, which means we cannot determine whether the
extent of these objects needs to outlive the current scope.
In fact, I cannot think of any non-trivial object usage where we can
assert that escape is impossible, since even
SomeObject *someObject = [[SomeObject alloc] init];
could possibly already cause "someObject" to be registered to an
object outside the scope of the current function.
That's why I meant that there is an issue with dynamic dispatch.
It is not a problem of dynamic dispatch so much as a problem with the
implementation in Objective C. Dynamic dispatch takes 3 main forms:
messaging, class virtual functions and non-class generic functions.

Objective C objects have one entry point, a compiler generated
variadic (vararg) function which compares the message name string to a
list of recognized values and dispatches to the appropriate method.
Each method function then parses any additional arguments it needs
from the vararg list. Because of this structure there is no way to
tell which method is being called as it depends on the contents of the
character string. Even knowing which object/class is being referenced
does not help much. Smalltalk suffered from a similar defect although
its dispatch mechanism worked a bit differently.

With virtual functions (Java, C++, etc.) or with generic functions
(Lisp, etc.), the compiler can statically narrow the set of
possibilities based on the arguments and may be able to narrow the set
to only one function.


In languages like C each function is a disjoint scope - there is no
static scope chain because nested functions are not permitted. But it
is still not the case that every pointer used as a function argument
escapes the function's scope.

When you have no static scope chain, you have to follow the dynamic
scope chain by value propagation and call chain analysis ... ie. you
have to follow the values as they are passed from function to
function. If you can reach the leaves of the call tree without the
pointer escaping, then, by transitivity, it doesn't escape the
originating function's scope. Use of function pointers, of course,
screws with this analysis.

This can be done under separate compilation. For example, a function
can be analyzed statically to see which values are generated, which
are killed, which are passed to another function and which are stored
into a non-local data structure (gen-kill is normally done anyway and
the others can be added easily). The compiler can store such
inputs/outputs meta-information about the function and use it later
without needing access to the function's code.
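Such a summary can be tiny - something along these lines (an invented
encoding, not any particular compiler's format):

/* One record per pointer parameter of the analyzed function. */
enum ParamFate {
    FATE_UNUSED   = 0,
    FATE_READ     = 1 << 0,  /* dereferenced only                      */
    FATE_PASSED   = 1 << 1,  /* forwarded to a callee: chase further   */
    FATE_STORED   = 1 << 2,  /* written into a non-local structure:    */
                             /* the argument must be assumed to escape */
    FATE_RETURNED = 1 << 3   /* handed back up to the caller           */
};

typedef struct FuncSummary {
    const char *name;        /* mangled or otherwise unique            */
    unsigned    nparams;
    unsigned    fate[16];    /* ParamFate bitmask per pointer param    */
} FuncSummary;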

Of course, none of this helps if you don't have the source or the
meta-information - such as with a canned library. However, in that
case, call chain analysis will fail to reach the leaves of the call
tree and so will have to assume the value(s) in question escape(s).


The point is that most languages are amenable to some kind of escape
analysis. Just because the language has dynamic dispatch is not a
reason to ignore it a priori.

George
Hans-Peter Diettrich
2009-08-07 00:11:10 UTC
Permalink
George Neuner wrote:

Thanks for your detailed explanations [snipped].
Post by George Neuner
Of course, none of this helps if you don't have the source or the
meta-information - such as with a canned library. However, in that
case, call chain analysis will fail to reach the leaves of the call
tree and so will have to assume the value(s) in question escape(s).
I've been thinking about pointers and references over the last few days.
Dynamic runtime behaviour may deserve similarly dynamic support for
tracking/limiting the use of pointers. As long as (canned) libraries
have attributed entry points, created by the compiler when the library
is built, the same compiler can trust that information.

In other cases it can be helpful, at least, to distinguish between
managed and unmanaged code (as in .NET), so that the compiler can know
what a called function is allowed to do with the passed arguments. The
key point is cooperation of the involved tools (compiler, librarian,
linker...) and their use of (binary) file formats with room for required
attributes. The C++ approach with mangled names is interesting, but it
has both pros and cons.

DoDi

Larry
2009-07-17 13:14:09 UTC
Permalink
Post by Christoffer Lernö
Post by George Neuner
On Sun, 12 Jul 2009 13:41:59 -0700 (PDT), Christoffer Lernö
Post by Christoffer Lernö
I've read papers by Lins 2003 "Lazy Cyclic Reference Counting",
Bacon / Rajan 2001 "Concurrent Cycle Collection in Reference Counted
Systems" and a few others.
[snip]
Post by Christoffer Lernö
I implemented the simple non-concurrent version of the algorithm as
described in Bacon's paper and I did not find it all that complex, the
concurrent version is naturally more involved though.
How did you implement the children function in figure 2 on page 6 of
the Bacon 2001 article available at:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.132.2464&rep=rep1&type=url&i=0

? More specifically, how would you calculate children for:

boost::fusion::vector<T1,T2,...,Tn>

for any types, T1,T2,..., Tn, where boost::fusion is the template
described here:

http://www.boost.org/doc/libs/1_39_0/libs/fusion/doc/html/fusion/container/vector.html

? In general, how would you compute children for any fusion container,
or for that matter any structure of the general form:

template<typename T1, typename T2, ...>
struct Container: public Super<T1,T2,...>
{
field1<T1,T2,...> member1;
field2<T1,T2,...> member2;
...
fieldm<T1,T2,...> memberm;
};

?

-regards,
Larry
Christoffer Lernö
2009-07-18 07:12:34 UTC
Permalink
Post by Larry
Post by Christoffer Lernö
I implemented the simple non-concurrent version of the algorithm as
described in Bacon's paper and I did not find it all that complex, the
concurrent version is naturally more involved though.
How did you implement the children function in figure 2 on page 6 of
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.132.2464&rep...
pe=url&i=0
boost::fusion::vector<T1,T2,...,Tn>
As I wrote elsewhere, this is for a new language, so for this case it
is simply a matter of going through all member variables of the
object. Arrays/MutableArrays are built-in objects where implementation
will assure that all entries in the array behave as member variables
in this regard.

/C
BGB / cr88192
2009-07-15 19:04:55 UTC
Permalink
Post by George Neuner
On Sun, 12 Jul 2009 13:41:59 -0700 (PDT), Christoffer Lernö
<snip>
Post by George Neuner
Reference counting is really primitive GC technology and you're better
off not starting down that path unless your intent is to provide a
library module that can't make assumptions about its environment.
I disagree.
Post by George Neuner
I haven't read Bacon's paper, but you'll note that Lins uses a marking
scan as a backup to identify and break data cycles for the reference
counter. Although the marking scan need not be run unless memory is
tight, it significantly complicates the overall implementation.
In fact, the marking scanner is nearly half the implementation of a
more capable mark-sweep or mark-compact collector ... neither of which
will have the reference counter's problems with circular data
structures.
yeah.

my recommendation is to not complicate the ref-counting itself with
complicated cycle-breaking technology...


instead, a strategy which I have used, and which a number of other
"industrial strength" VMs use (the Python VM, and AFAIK SpiderMonkey, and
others...), is to combine reference counting with something like a
mark/sweep or generational GC.

this way, the 2 systems work in parallel, with the ref-counting cleaning up
in the majority of cases, and falling back to the mark/sweep or generational
GC to deal with the situation of the heap having become full (often, with
all the cycles/... missed by the ref-counter).
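the shape of it is basically this (my own sketch, not CPython's or
SpiderMonkey's actual code):

/* Eager ref-counting up front, tracing collector as the backstop. */
#include <stddef.h>

void *try_alloc(size_t n);     /* returns NULL when the heap is full    */
void  rc_release(void *obj);   /* eager decrement; frees at zero        */
void  trace_and_sweep(void);   /* backup collector: reclaims the cycles
                                  (and anything else) the counts missed */

void *gc_alloc(size_t n) {
    void *p = try_alloc(n);
    if (!p) {                  /* counts handle the common case; the    */
        trace_and_sweep();     /* tracer runs only when the heap fills  */
        p = try_alloc(n);
    }
    return p;
}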

the main advantage then is that the GC only runs a fraction as often, which
very quickly pays for itself with languages which spew garbage "like there
is no tomorrow" (where in a burst of activity one may find their heap filled
within a matter of seconds...).


of course, for other reasons, I have generally been moving to much more
memory and garbage-conservative implementation approaches, mostly trying to
find ways to avoid GC whenever possible (granted... one reason for this
being that, in general, my GC is not the most reliable GC around, and so
avoiding the GC running also helps avoid much of the risk of crashes...).

partly my GC's reliability issues though are because it is multithreaded,
and lacks a proper write barrier (I have a SW write barrier, but, ermm...
most of my code does not use it...).

it was actually a single threaded GC, but was made multithreaded through the
use of a few not-so-clean hacks a few years back...

as is, it almost makes sense to "re-engineer" the thing, AKA, to start with
a clean design and mostly new code, but while likely keeping the API intact.

(decided to leave out my elaborations on the idea...).
Post by George Neuner
If you're dead set on reference counting, you should investigate 1-bit
reference counting. I don't have URL's to give you but they should be
easy to find. The programming language is designed so that most data
are not shared and then you only need a simple flag to say whether the
item is shared and so needs special handling.
this sounds interesting, but I don't know much about this approach...
BartC
2009-07-16 09:54:56 UTC
Permalink
Post by BGB / cr88192
Post by George Neuner
On Sun, 12 Jul 2009 13:41:59 -0700 (PDT), Christoffer Lernö
If you're dead set on reference counting, you should investigate 1-bit
reference counting. I don't have URL's to give you but they should be
easy to find. The programming language is designed so that most data
are not shared and then you only need a simple flag to say whether the
item is shared and so needs special handling.
this sounds interesting, but I don't know much about this approach...
It sounds scarily like what I was once using: a language where
variables never shared data, except when used in expressions and
function calls, when a copy bit was set so that only shallow copies of
any value could be pushed around instead.

With strings, and arrays of arbitrary complexity, being handled
automatically with implicit pointers and the copy bit used to tell it
when to free, duplicate, or do nothing, this worked extremely well
with no need for GC:

a:=(10,20,30,40,50) #create some array
a:=0 #now the array magically disappears and a is an int

It doesn't work well however when the language allows *explicit* pointers:

a:=(10,20,30,40,50) #an array again
p:=&a[2] #set some explicit pointers to elements in a
q:=&a[3]
a:=0

Now the array disappears again, but p and q are left pointing at phantom
elements. If this 1-bit trick can help with this, I'd be interested too, as
full GC seems an overkill. At this moment it's not too much of a problem,
because explicit pointers aren't used so much, and when they are they are
just managed manually.

--
Bart
George Neuner
2009-07-17 20:00:17 UTC
Permalink
Post by BartC
Post by BGB / cr88192
Post by George Neuner
On Sun, 12 Jul 2009 13:41:59 -0700 (PDT), Christoffer Lernö
If you're dead set on reference counting, you should investigate 1-bit
reference counting. I don't have URL's to give you but they should be
easy to find. The programming language is designed so that most data
are not shared and then you only need a simple flag to say whether the
item is shared and so needs special handling.
this sounds interesting, but I don't know much about this approach...
It sounds scarily like what I was once using: a language where
variables never shared data, except when used in expressions and
function calls, when a copy bit was set so that only shallow copies of
any value could be pushed around instead.
With strings, and arrays of arbitrary complexity, being handled
automatically with implicit pointers and the copy bit used to tell it
when to free, duplicate, or do nothing, this worked extremely well
a:=(10,20,30,40,50) #create some array
a:=0 #now the array magically disappears and a is an int
a:=(10,20,30,40,50) #an array again
p:=&a[2] #set some explicit pointers to elements in a
q:=&a[3]
a:=0
Now the array disappears again, but p and q are left pointing at phantom
elements. If this 1-bit trick can help with this, I'd be interested too, as
full GC seems an overkill.
It might be some help, but you still need some way to identify the
base address of the array so you can deal with its allocation block,
check bounds, etc.

One way to handle this with 1-bit RC is to allocate the array
descriptor and data block separately. You manage the small descriptor
and special case the GC to also release the data block when the
descriptor is released. This arrangement also makes extensible arrays
easy.

It does mean that array references are double indirected - ie. the
"pointers" are really "handles", which may have some performance
issues. It isn't a problem for loops and such where you'd compute and
cache the address anyway, but for isolated references it can be an
issue.
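A sketch of the split descriptor/data arrangement (names illustrative):

#include <stdlib.h>

typedef struct ArrayDesc {
    size_t length;
    size_t elem_size;
    void  *data;                       /* separately allocated block    */
} ArrayDesc;

ArrayDesc *array_new(size_t length, size_t elem_size) {
    ArrayDesc *d = malloc(sizeof *d);
    if (d) {
        d->length = length;
        d->elem_size = elem_size;
        d->data = calloc(length, elem_size);
    }
    return d;
}

/* The GC special case: releasing the descriptor releases the data.
   Because user pointers are really handles to the descriptor, the
   data block can be realloc'd (extensible arrays) without touching
   them - the price is the extra indirection on every access.       */
void array_free(ArrayDesc *d) {
    if (d) { free(d->data); free(d); }
}

#define ARRAY_AT(d, type, i)  (((type *)(d)->data)[(i)])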


One thing I forgot to mention previously is that the 1-bit "share"
flag can usually be kept as an immediate tag in the pointers
themselves. That makes checking/setting it simple but to use the
pointer you may have to strip the tag or factor its value into your
index calculations.
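In C terms the tag handling is just bit twiddling on word-aligned
pointers (illustrative macros; dereferencing always goes through the
strip step, which is the cost mentioned above):

#include <stdint.h>

#define SHARE_BIT        ((uintptr_t)1)
#define IS_SHARED(p)     (((uintptr_t)(p)) &  SHARE_BIT)
#define MARK_SHARED(p)   ((void *)(((uintptr_t)(p)) |  SHARE_BIT))
#define STRIP_TAG(p)     ((void *)(((uintptr_t)(p)) & ~SHARE_BIT))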

George
Torben Ægidius Mogensen
2009-07-16 07:39:51 UTC
Permalink
Post by George Neuner
If you're dead set on reference counting, you should investigate 1-bit
reference counting. I don't have URL's to give you but they should be
easy to find. The programming language is designed so that most data
are not shared and then you only need a simple flag to say whether the
item is shared and so needs special handling.
You can get most of the benefit of 1-bit reference counts by using
linear-type inference. Basically, at compile time you find objects that
you can statically guarantee will have only one use _and_ (most of the
time) code the reuse directly in the compiled code instead of putting
the objects on a free list for later reuse. Non-linear objects are
GC'ed normally.

Torben
Gene
2009-07-16 05:01:29 UTC
Permalink
Post by Christoffer Lernö
I'm looking into GC using ref-counting.
Does anyone know what passes for state-of-the-art when it comes to ref-
counting algorithms?
I've read papers by Lins 2003 "Lazy Cyclic Reference Counting",
Bacon / Rajan 2001 "Concurrent Cycle Collection in Reference Counted
Systems" and a few others.
If one would like to implement a fast GC using refcounting, should I
implement something like in those papers, or are there better
algorithms out there?
To put this work in context, try
http://www.amazon.com/Garbage-Collection-Algorithms-Automatic-Management/dp/0471941484,
where Lins is a co-author. I've never seen a more insightful
treatment of the whole topic of garbage collection. The book does
point out that the overhead of reference counting usually exceeds that
of mark-and-sweep or copying collectors.