One of CHICKEN's coolest features has to be its unique approach to garbage collection. When someone asked about implementation details (hi, Arthur!), I knew this would make for an interesting blog post. This post is going to be long and technical, so hold on to your hats! Don't worry if you don't get through this in one sitting.
Prerequisites
There's a whole lot of stuff that we'll need to explain before we get to the actual garbage collector. CHICKEN's garbage collection (GC) strategy is deeply intertwined with its compilation strategy, so I'll start by explaining the basics of that, before we can continue (pun intended) with the actual GC stuff.
A short introduction to continuation-passing style
The essence of CHICKEN's design is a simple yet brilliant idea by Henry Baker, described in his paper CONS Should Not CONS Its Arguments, Part II: Cheney on the M.T.A.. The paper is pretty terse, but it's well-written, so I recommend you check it out before reading on. If you grok everything in it, you probably won't get much out of my blog post and you can stop reading now. If you don't grok it, it's probably a good idea to re-read it again later.
Baker's approach assumes a Scheme to C compiler which uses continuation-passing style (CPS) as an internal representation. This is the quintessential internal representation of Scheme programs, going all the way back to the first proper Scheme compiler, RABBIT.
Guy L. Steele (RABBIT's author) did not use CPS to make garbage collection easier. In fact, RABBIT had no GC of its own, as it relied on MacLISP as a target language (which compiled to PDP-10 machine code and had its own garbage collector). Instead, continuations allowed for efficient implementation of nested procedure calls. It eliminated the need for a stack to keep track of this nesting by simply returning the "next thing to do" to a driver loop which took care of invoking it. This made it possible to write down iterative algorithms as a recursive function without causing a stack overflow.
Let's consider a silly program which sums up all the numbers in a list, and shows the result multiplied by two:
(define (calculate-sum lst result) (if (null? lst) result (calculate-sum (cdr lst) (+ result (car lst))))) (define (show-sum lst) (print-number (* 2 (calculate-sum lst 0)))) (show-sum '(1 2 3))
A naive compilation to C would look something like this (brutally simplified):
void entry_point() { toplevel(); exit(0); /* Assume exit(1) is explicitly called elsewhere in case of error. */ } void toplevel() { /* SCM_make_list() & SCM_fx() allocate memory. "fx" stands for "fixnum". */ SCM_obj *lst = SCM_make_list(3, SCM_fx(1), SCM_fx(2), SCM_fx(3)); show_sum(lst); } SCM_obj* show_sum(SCM_obj *lst) { SCM_obj result = calculate_sum(lst, SCM_fx(0)); /* SCM_fx_times() allocates memory. */ return SCM_print_number(SCM_fx_times(SCM_fx(2), result)); } SCM_obj* calculate_sum(SCM_obj *lst, SCM_obj *result) { if (lst == SCM_NIL) { /* Optimised */ return result; } else { /* SCM_fx_plus() allocates memory. */ SCM_obj *tmp = SCM_cdr(lst); SCM_obj *tmp2 = SCM_fx_plus(result, SCM_car(lst)); return calculate_sum(tmp, tmp2); /* Recur */ } } SCM_obj *SCM_print_number(SCM_obj *data) { printf("%d\n", SCM_fx_to_integer(data)); return SCM_VOID; }
This particular implementation probably can't use a copying garbage collector like CHICKEN uses, because the SCM_obj pointers which store the Scheme objects' locations would all become invalid. But let's ignore that for now.
Due to the recursive call in calculate_sum(), the stack just keeps growing, and eventually we'll get a stack overflow if the list is too long. Steele argued that this is a silly limitation which results in the proliferation of special-purpose "iteration" constructs found in most languages. Also, he was convinced that this just cramps the programmer's style: we shouldn't have to think about implementation details like the stack size. In his time people often used goto instead of function calls as a performance hack. This annoyed him enough to write a rant about it, which should be required reading for all would-be language designers!
Anyway, a compiler can transparently convert our Scheme program into CPS, which would look something like this after translation to C:
/* Set up initial continuation & toplevel call. */ void entry_point() { SCM_cont *cont = SCM_make_cont(1, &toplevel, SCM_exit_continuation); SCM_call *call = SCM_make_call(0, cont); SCM_driver_loop(call); } void SCM_driver_loop(SCM_call *call) { /* The trampoline to which every function returns its continuation. */ while(true) call = SCM_perform_continuation_call(call); } SCM_call *toplevel(SCM_cont *cont) { SCM_cont *next = SCM_make_cont(1, &show_sum, cont); SCM_obj *lst = SCM_make_list(3, SCM_fx(1), SCM_fx(2), SCM_fx(3)); return SCM_make_call(1, next, lst); } SCM_call *show_sum(SCM_cont *cont, SCM_obj *lst) { SCM_cont *next = SCM_make_cont(1, &show_sum_continued, cont); SCM_cont *now = SCM_make_cont(2, &calculate_sum, next); return SCM_make_call(2, now, lst, SCM_fx(0)); } SCM_call *calculate_sum(SCM_cont *cont, SCM_obj *lst, SCM_obj *result) { if (lst == SCM_NIL) { /* Optimised */ return SCM_make_call(1, cont, result); } else { SCM_obj *tmp = SCM_cdr(lst); SCM_obj *tmp2 = SCM_fx_plus(result, SCM_car(lst)); SCM_cont *now = SCM_make_cont(2, &calculate_sum, cont); return SCM_make_call(2, now, tmp, tmp2); /* "Recur" */ } } SCM_call *show_sum_continued(SCM_cont *cont, SCM_obj *result) { SCM_cont *now = SCM_make_cont(1, &SCM_print_number, cont); SCM_obj *tmp = SCM_fx_times(SCM_fx(2), result); return SCM_make_call(1, now, tmp); } SCM_call *SCM_print_number(SCM_cont *cont, SCM_obj *data) { printf("%d\n", SCM_fx_to_integer(data)); return SCM_make_call(1, cont, SCM_VOID); }
In the above code, there are two new data types: SCM_cont and SCM_call.
An SCM_cont represents a Scheme continuation as a C function's address, the number of arguments which it expects (minus one) and another continuation, which indicates where to continue after the C function has finished. This sounds recursive, but as you can see the very first continuation created by entry_point() is a specially prepared one which will cause the process to exit.
An SCM_call is returned to the driver loop by every generated C function: this holds a continuation and the arguments with which to invoke it. SCM_perform_continuation_call() extracts the SCM_cont from the SCM_call and invokes its C function with its continuation and the arguments from the SCM_call. We won't dwell on the details of its implementation now, but assume this is some magic which just works.
You'll also note that the primitives SCM_car(), SCM_cdr(), SCM_fx_plus() and SCM_fx_times() do not accept a continuation. This is a typical optimisation: some primitives can be inlined by the compiler. However, this is not required: you can make them accept a continuation as well, at the cost of further splintering the C functions into small sections; the calculate_sum() function would be split up into 4 separate functions if we did that.
Anyway, going back to the big picture we can see that this continuation-based approach consumes a more or less constant amount of stack space, because each function returns to driver_loop. Baker's fundamental insight was that the stack is there anyway (and it will be used by C), and if we don't need it for tracking function call nesting, why not use it for something else? He proposed to allocate all newly created objects on the stack. Because the stack would hopefully fit the CPU's cache in its entirety, this could give quite a performance benefit.
Generational collection
To understand why keeping new data together on the stack can be faster, it's important to know that most objects are quite short-lived. Most algorithms involve intermediate values, which are accessed quite a bit during a calculation but are no longer needed afterwards. These values need to be stored somewhere in memory. Normally you would store them together with all other objects in the main heap, which may cause fragmentation of said heap. Fragmentation means that memory references may cross page boundaries. This is slow, because it will clear out the CPU's memory cache and may even require swapping it in, if the machine is low on memory.
On top of that, generating a lot of intermediate values means generating a lot of garbage, which will trigger many GCs during which a lot of these temporary objects will be cleaned up. However, during these GCs, the remaining longer-lived objects must also be analysed before it can be decided they can stick around.
This is rather wasteful, and it turns out we can avoid doing so much work by categorising objects by age. Objects that have just been created belong to the first generation and are stored in their own space (called the nursery - I'm not kidding!), while those that have survived several GC events belong to older generations, which each have their own space reserved for them. By keeping different generations separated, you do not have to examine long-lived objects of older generations (which are unlikely to be collected) when collecting garbage in a younger generation. This can save us a lot of wasted time.
Managing data on the stack
The Cheney on the M.T.A. algorithm as used by CHICKEN involves only two generations; one generation consisting of newly created objects and the other generation consisting of older objects. In this algorithm, new objects get immediately promoted (or tenured) to the old generation after a GC of the nursery (or stack). Such a GC is called a minor GC, whereas a GC of the heap is called a major GC.
This minor GC is where the novelty lies: objects are allocated on the stack. You might wonder how that can possibly work, considering the lengths I just went through to explain how CPS conversion gets rid of the stack. Besides, by returning to the trampoline function whenever a new continuation is invoked, anything you'd store on the stack would need to get purged (that's how the C calling convention works).
That's right! The way to make this work is pretty counter-intuitive: we go all the way back to the first Scheme to C conversion I showed you and make it even worse. Whenever we want to invoke a continuation, we just call its function. That means that the example program we started out with would compile to this:
/* Same as before */ void entry_point() { SCM_cont *cont = SCM_make_cont(1, &toplevel, SCM_exit_continuation); SCM_call *call = SCM_make_call(0, cont); SCM_driver_loop(call); } SCM_call *saved_cont_call; /* Set by SCM_save_call, read by driver_loop */ jmp_buf empty_stack_state; /* Set by driver_loop, read by minor_GC */ void SCM_driver_loop(SCM_call *call) { /* Save registers (including stack depth and address in this function) */ if (setjmp(empty_stack_state)) call = saved_cont_call; /* Got here via longjmp()? Use stored call */ SCM_perform_continuation_call(call); } void SCM_minor_GC() { /* ... Copy live data from stack to heap, which is a minor GC. Described later. ... */ longjmp(empty_stack_state); /* Restore registers (jump back to driver_loop) */ } void toplevel(SCM_cont *cont) { if (!fits_on_stack(SCM_CONT_SIZE(0) + SCM_CALL_SIZE(1) + SCM_FIXNUM_SIZE * 3 + SCM_PAIR_SIZE * 3)) { SCM_save_call(0, &toplevel, cont); /* Mutates saved_cont_call */ SCM_minor_GC(); /* Will re-invoke this function from the start */ } else { /* The below stuff will all fit on the stack, as calculated in the if() */ SCM_cont *next = SCM_make_cont(1, &show_sum, cont); SCM_obj *lst = SCM_make_list(3, SCM_fx(1), SCM_fx(2), SCM_fx(3)); SCM_call *call = SCM_make_call(1, next, lst); SCM_perform_continuation_call(call); } } void show_sum(SCM_cont *cont, SCM_obj *lst) { if (!fits_on_stack(SCM_CONT_SIZE(0) * 2 + SCM_CALL_SIZE(2) + SCM_FIXNUM_SIZE)) { SCM_save_call(1, &show_sum, cont, lst); SCM_minor_GC(); } else { SCM_cont *next = SCM_make_cont(1, &show_sum_continued, cont); SCM_cont *now = SCM_make_cont(2, &calculate_sum, next); SCM_call *call = SCM_make_call(2, now, lst, SCM_fx(0)); SCM_perform_continuation_call(call); } } void calculate_sum(SCM_cont *cont, SCM_obj *lst, SCM_obj *result) { /* This calculation is overly pessimistic as it counts both arms of the if(), but this is acceptable */ if (!fits_on_stack(SCM_CALL_SIZE(1) + SCM_FIXNUM_SIZE + SCM_CONT_SIZE(1) + SCM_CALL_SIZE(2))) { SCM_save_call(2, &calculate_sum, cont, lst, result); SCM_minor_GC(); } else { if (lst == SCM_NIL) { SCM_call *call = SCM_make_call(1, cont, result); SCM_perform_continuation_call(call); } else { SCM_obj *tmp = SCM_cdr(lst); SCM_obj *tmp2 = SCM_fx_plus(result, SCM_car(lst)); SCM_cont *now = SCM_make_cont(2, &calculate_sum, cont); SCM_call *call = SCM_make_call(2, now, tmp, tmp2); SCM_perform_continuation_call(call); /* "Recur" */ } } } void show_sum_continued(SCM_cont *cont, SCM_obj *result) { if (!fits_on_stack(SCM_CONT_SIZE(1) + SCM_CALL_SIZE(1) + SCM_FIXNUM_SIZE)) { SCM_save_call(1, &show_sum_continued, cont, result); SCM_minor_GC(); } else { SCM_cont *now = SCM_make_cont(1, &SCM_print_number, cont); SCM_obj *tmp = SCM_fx_times(SCM_fx(2), result); SCM_call *call = SCM_make_call(1, now, tmp); SCM_perform_continuation_call(call); } } void SCM_print_number(SCM_cont *cont, SCM_obj *data) { if (!fits_on_stack(SCM_CALL_SIZE(1))) { SCM_save_call(1, &show_sum_continued, cont, data); SCM_minor_GC(); } else { printf("%d\n", SCM_fx_to_integer(data)); SCM_call *call = SCM_make_call(1, cont, SCM_VOID); SCM_perform_continuation_call(call); } }
Whew! This program is quite a bit longer, but it isn't that different from the second program I showed you. The main change is that none of the continuation functions return anything. In fact, these functions, like Charlie in the M.T.A. song, never return. In the earlier version every function ended with a return statement, now they end with an invocation of SCM_perform_continuation_call().
To make things worse, allocating functions now also use alloca() to place objects on the stack. That means that the stack just keeps filling up like the first compilation I showed you, so we're back to where we started! However, this program is a lot longer due to one important thing: At the start of each continuation function we first check to see if there's enough space left on the stack to accommodate the objects this function will allocate.
If there's not enough space, we re-create the SCM_call with which this continuation function was invoked using SCM_save_call(). This differs from SCM_make_call() in that it will not allocate on the stack, but will use a separate area to set aside the call object. The pointer to that area is stored in saved_cont_call.
SCM_save_call() can't allocate on the stack for a few reasons: The first and most obvious reason is that the saved call wouldn't fit on the stack because we just concluded it is already full. Second, the arguments to the call must be kept around even when the stack is blown away after the GC has finished. Third, this stored call contains the "tip" of the iceberg of live data from which the GC will start its trace. This is described in the next section.
After the minor GC has finished, we can jump back to the trampoline again. We use the setjmp() and longjmp() functions for that. When the first call to SCM_driver_loop() is made, it will call setjmp() to save all the CPU's registers to a buffer. This includes the stack and instruction pointers. Then, when the minor GC finishes, it calls longjmp() to restore those registers. Because the stack and instruction pointer are restored, this means execution "restarts" at the place in driver_loop() where setjmp() was invoked. The setjmp() then returns again, but now with a nonzero value (it was zero the first time). The return value is checked and the call is fetched from the special save area to get back to where we were just before we performed the GC.
This is half the magic, so please make sure you understand this part!
The minor GC
The long story above served to set up all the context you need to know to dive into the GC itself, so let's take a closer look at it.
Picking the "live" data from the stack
As we've seen, the GC is invoked when the stack has completely filled up. At this point, the stack is a complete mess: it has many stack frames from all the function calls that happened between the previous GC and now. These stack frames consist of return addresses for the C function calls (which we're not even using), stack-allocated C data (which we don't need) and somewhere among that mess there are some Scheme objects. These objects themselves also belong to two separate categories: the "garbage" and the data that's still being used and needs to be kept around (the so-called live data). How on earth are we going to pick only the interesting bits from that mess?
Like I said before, the saved call contains the "tip of the iceberg" of live data. It turns out this is all we need to get at every single object which is reachable to the program. All you need to do is follow the pointers to the arguments and the continuation stored in the call. For each of these objects, you copy them to the heap and if they are compound objects you follow all the pointers to the objects stored within them, and so on. Let's take a look at a graphical representation of how this works. In the picture below I show the situation where a GC is triggered just after the second invocation of calculate-sum (i.e., the first recursive call of itself, with the list '(2 3)):
After the initial shock from seeing this cosmic horror has worn off, let's take a closer look. It's like a box of tangled cords: if you take the time to carefully untangle them, it's easy, but if you try to do it all at once, it'll leave you overwhelmed. Luckily, I'm going to talk you through it. (by the way: this is an SVG so you can zoom in on details as far as you like using your browser's zooming functionality).
Let's start with the big picture: On the left you see the stack, on the right the heap after copying and in the bottom centre there's a small area of statically allocated objects, which are not subject to GC. To get your bearings, check the left margin of the diagram. I have attempted to visualise C stack frames by writing each function's name above a line leading to the bottom of its frame.
Let's look at the most recently called function, at the top of the stack. This is the function which initiated the minor GC: the second call to calculate_sum(). The shaded area shows the pointers set aside by SCM_save_call(), which form the tip of the iceberg of live data. More on that later.
The next frame belongs to the first call to calculate_sum(). It has allocated a few things on the stack. The topmost element on the stack is the last thing that's allocated due to the way the stack grows upwards in this picture. This is a pointer to an SCM_call object, marked with "[call]", which is the name of the variable which is stored there. If you go back to the implementation of calculate_sum(), you can see that the last thing it does is allocate an SCM_call, and store its pointer in call. The object itself just precedes the variable on the stack, and is marked with a thick white border to group together the machine words from which it is composed. From bottom to top, these are:
- A tag which indicates that this is a call containing 2 arguments,
- a pointer to an SCM_cont object (taken from the now variable),
- a pointer to an SCM_obj object (the cdr of lst, taken from tmp) and
- a pointer to an SCM_obj object (a fixnum, taken from tmp2).
Other compound objects are indicated in the same way.
You'll also have noticed the green, white and dashed arcs with arrow tips. These connect pointers to their target addresses. The dashed ones on the right hand side of the stack indicate pointers that are used for local variables in C functions or SCM_call objects. These pointers are unimportant to the garbage collector. The ones on the left hand side of the stack are pointers from Scheme objects to other Scheme objects. These are important to the GC. The topmost pointer inside the call object we just looked at has a big dashed curve all the way down to the cdr of lst, and the one below it points at the value of result, which is the fixnum 1.
If you look further down the stack, you'll see the show_sum procedure which doesn't really allocate much: an SCM_call, the initial intermediate result (fixnum 0), and two continuations (next and now in the C code). The bulk of the allocation happens in toplevel, which contains the call to show_sum and allocates a list structure. This is on the stack in reverse order: first the pair X = (3 . ()), then the pair Y = (2 . <X>) and the pair Z = (1 . <Y>). The () is stored as SCM_NIL in the static area, to which the cdr of the bottom-most pair object on the stack points, which is represented by a long green line which swoops down to the static area.
Copying the live data to the heap
The green lines represent links from the saved call to the live data which we need to copy. You can consider the colour green "contagious": imagine everything is white initially, except for the saved call. Then, each line starting at the pointers of the call are painted green. The target object to which a line leads is also painted green. Then, we recursively follow lines from pointers in that object and paint those green, etc. The objects that were already in the heap or the static area are not traversed, so they stay white.
When an object is painted green, it is also copied to the heap, which is represented by a yellow line. The object is then overwritten by a special object which indicates that this object has been moved to the heap. This special object contains a forwarding pointer which indicates the new location of the object. This is useful when you have two objects which point to the same other object, like for example in this code:
(let ((a (list 3 2 1))
      (b (cons 4 a))
      (c (cons 4 a)))
  ...)
Here you have two lists (4 3 2 1) which share a common tail. If both lists are live at the moment of GC, we don't want to copy this tail twice, because that would result in it being split into two distinct objects. Then, a set-car! on a might only be reflected in b but not c, for example. The forwarding pointers prevent this from happening by simply adjusting a copied object's constituent objects to point to their new locations. Finally, after all data has been copied, all the newly copied objects are checked again for references to objects which may have been relocated after the object was copied.
The precise algorithm that performs this operation is very clever. It requires only two pointers and a while loop, but it still handles cyclic data structures correctly. The idea is that you do the copying I described above in a breadth-first way: you only copy the objects stored in the saved call (without touching their pointers). Next, you loop from the start of the heap to the end, looking at each object in turn (initially, those are the objects we just copied). For these objects, you check their components, and see whether they exist in the heap or in the stack. If they exist in the stack, you copy them over to the end of the heap (again, without touching their pointers). Because they are appended to the heap, the end pointer gets moved to the end of the last object, so the while loop will also take the newly copied objects into consideration. When you reach the end of the heap, you're done. In C, that would look something like this:
SCM_obj *slot; int i, bytes_copied; char *scan_start = heap_start; for(i = 0; i < saved_object_count(saved_call); ++i) { obj = get_saved_object(saved_call, i); /* copy_object() is called "mark()" in CHICKEN. It also set up a forwarding pointer at the original location */ bytes_copied = copy_object(obj, heap_end); heap_end += bytes_copied; } while(scan_start < heap_end) { obj = (SCM_obj *)scan_start; for(i = 0; i < object_size(obj); ++i) { slot = get_slot(obj, i); /* Nothing needs to be done if it's in the heap or static area */ if (exists_in_stack(slot)) { if (is_forwarding_ptr(slot)) { set_slot(obj, i, forwarding_ptr_target(slot)); } else { bytes_copied = copy_object(slot, heap_end); set_slot(obj, i, heap_end); heap_end += bytes_copied; } } } scan_start += object_size(obj); }
This algorithm is the heart of our garbage collector. You can find it in runtime.c in the CHICKEN sources in C_reclaim(), under the rescan: label. The algorithm was invented in 1970 by C.J. Cheney, and is still used in the most "state of the art" implementations. Now you know why Henry Baker's paper is called "Cheney on the M.T.A." :)
After the data has been copied to the heap, the longjmp() in minor_GC() causes everything on the stack to be blown away. Then, the top stack frame is recreated from the saved call. This is illustrated below:
Everything in the shaded red area below the stack frame for driver_loop() is now unreachable because there are no more pointers from live data pointing into this region of the stack. Any live Scheme objects allocated here would have been copied to the heap, and all pointers which pointed there relayed to this new copy. Unfortunately, this stale copy of the data will permanently stick around on the stack, which means this data is forever irreclaimable. This means it is important that the entry point should consume as little stack space as possible.
The major GC
You might be wondering how garbage on the heap is collected. That's what the major GC is for. CHICKEN initially only allocates a small heap area. The heap consists of two halves: a fromspace and a tospace. The fromspace is the heap as we've seen it so far: in normal usage, this is the part that's used. The tospace is always empty.
When a minor GC is copying data from the stack to the fromspace, it may cause the fromspace to fill up. That's when a major GC is triggered: the data in the fromspace is copied to the tospace using Cheney's algorithm. Afterwards, the areas are flipped: the old fromspace is now called tospace and the old tospace is now called fromspace.
During a major GC, we have a slightly larger set of live data. It is not just the data from the saved call, because that's only the stuff directly used by the currently running continuation. We also need to consider global variables and literal objects compiled into the program, for example. These sorts of objects are also considered live data. Aside from this, a major collection is performed the same way as a minor collection.
The smart reader might have noticed a small problem here: what if the amount of garbage cleaned up is less than the data on the stack? Then, the stack data can't be copied to the new heap because it simply is too small. Well, this is when a third GC mode is triggered: a reallocating GC. This causes a new heap to be allocated, twice as big as the current heap. This is also split in from- and tospace. Then, Cheney's algorithm is performed on the old heap's fromspace, using one half of the new heap as tospace. When it's finished, the new tospace is called fromspace, and the other half of the new heap is called tospace. Then, the old heap is de-allocated.
Some practical notes
The above situation is a pretty rough sketch of the way the GC works in CHICKEN. Many details have been omitted, and the actual implementation is extremely hairy. Below I'll briefly mention how a few important things are implemented.
Object representation
You might have noticed that the stack grows pretty quickly in the CPS-converted C code I showed you. That's because the SCM_obj representation requires allocating every object and storing a pointer to it, so that's a minimum of two machine words per object.
CHICKEN avoids this overhead for small, often-used objects like characters, booleans and fixnums. It ensures all allocated objects are word-aligned, so all pointers to objects have their lower bits set to zero. This means you can easily see whether something is a pointer to an object or something else.
All objects in CHICKEN are represented by a C_word type, which is the size of a machine word. So-called immediate values are stored directly inside the machine word, with nonzero lower bits. Non-immediate values are cast to a pointer type to a C structure which contains the type tag and bits like I did in the example.
Calls are not represented by objects in CHICKEN. Instead, the C function is simply invoked directly from the continuation's caller. Continuations are represented as any other object. For didactic reasons, I used a separate C type to distinguish it from SCM_obj, but in Scheme continuations can be reified as first-class objects, so they shouldn't be represented in a fundamentally different way.
Closures
You might be wondering how closures are implemented, because this hasn't been discussed at all. The answer is pretty simple: in the example code, a SCM_call object stored a plain C function's address. Instead, we could store a closure instead: this is a new type of object which holds a C function plus its local variables. Each C function receives this closure as an extra argument (in the CHICKEN sources this is called self). When it needs to access a closed-over value, it can be accessed from the closure object.
Mutations
Another major oversight is the assumption that objects can only point from the stack into the heap. If Scheme was a purely functional language, this would be entirely accurate: new objects can refer to old objects, but there is no way that a preexisting object can be made to refer to a newly created object. For that, you need to support mutation.
But Scheme does support mutation! So what happens when you use vector-set! to store a newly created, stack-allocated value in an old, heap-allocated vector? If we used the above algorithm, the newly created element would either be part of the live set and get copied, but the vector's pointer would not be updated, or it wouldn't be part of the live set and the object would be lost in the stack reset.
The answer to this problem is also pretty simple: we add a so-called write barrier. Whenever a value is written to an object, it is remembered. Then, when performing a GC, these remembered values are considered to be part of the live set, just like the addresses in the saved call. This is also the reason CHICKEN always shows the number of mutations when you're asking for GC statistics: mutation may slow down a program because GCs might take longer.
Stack size
How does CHICKEN know when the stack is filled up? It turns out that there is no portable way to detect how big the stack is, or whether it has a limit at all!
CHICKEN works around this simply by limiting its stack to a predetermined size. On 64-bit systems, this is 1MB, on 32-bit systems it's 256KB. There is also no portable way of obtaining the address of the stack itself. On some systems, it uses a small bit of assembly code to check the stack pointer. On other systems, it falls back on alloca(), allocating a trivial amount of data. The address of the allocated data is the current value of the stack pointer.
When initialising the runtime, just before the entry point is called, the stack's address is taken to determine the stack's bottom address. The top address is checked in the continuation functions, and the difference between the two is the current stack size.
A small rant
While doing the background research for this post, I wanted to read Cheney's original paper. It was very frustrating to find so many references to it, which all lead to a a paywall on the ACM website.
I think it's absurd that the ACM charges $15 for a paper which is over forty years old, and only two measly pages long. What sane person would plunk down 15 bucks to read 2 pages, especially if it is possibly outdated, or not even the information they're looking for?
The ACM's motto is "Advancing Computing as a Science & Profession", but I don't see how putting essential papers behind a paywall is advancing the profession, especially considering how many innovations now happen as unpaid efforts in the open source/free software corner of the world. Putting such papers behind a paywall robs the industry from a sorely-needed historical perspective, and it stifles innovation by forcing us to keep reinventing the wheel.
Some might argue that the ACM needs to charge money to be able to host high-quality papers and maintain its high quality standard, but I don't buy it. You only need to look at USENIX, which is a similar association. They provide complete and perpetual access to all conference proceedings, and the authors maintain full rights to their work. The ACM, instead, has now come up with a new "protection" racket, requiring authors to give full control of their rights to the ACM, or pay for the privilege of keeping the rights on their own work, which is between $1,100 and $1,700 per article.
On a more positive note, authors are given permission to post drafts of their papers on their own website or through their "Author-izer" service. Unfortunately, this service only works when the link is followed directly from the domain on which the author's website is located (through the Referer header). This is not how the web works: it breaks links posted in e-mail as well as search engines.
Secondly, the ACM are also allowing their special interest groups to provide full access to conference papers of the most recent conference. However, this doesn't seem to be encouraged in any way, and only a few SIGs seem to do this.
Luckily, I found a copy of the Cheney paper on some course website. So do yourself a favour and get it before it's taken down :(
Update: If you are also concerned about this, please take a small moment to add your name to this petition.
Update 2: I've become aware of a web site called Sci-Hub that makes papers freely available to all. It bypasses paywalls through shared full-access accounts. Sadly, this is technically still illegal in many countries and some of its domains have been seized in attempts at censoring them.
Further reading
Garbage collection is a fascinating subject whose roots can be traced back all the way to the origins of LISP. There's plenty of information to be found:
- http://www.memorymanagement.org/ is a great one-stop reference.
- Felix Winkelmann, CHICKEN's original author, has presented about Scheme implementation techniques at FrOSCon, which included a bit about CPS conversion and garbage collection. The slides are also available for download.
- The original "Lambda papers" are a must-read if you want to study the beginnings of Scheme.
- Speaking about historical information: Matthias Felleisen teaches a course on the History of Programming Languages, which includes some notes on the history of garbage collection, and about concurrent garbage collection as well.
- I found a few surveys of garbage collection techniques: By Wilson, by Chase and by Zhong. The Zhong paper has missing images, unfortunately. The original (with images!) can be found as a zipped MS Word document.