The 3 most recent posts (archive)
What to expect from CHICKEN 6 Posted on 2024-11-18
NOTE: This is a guest post by Felix Winkelmann, the founder and one of the current maintainers of CHICKEN Scheme.
Introduction
This article is about the changes in the next major version of CHICKEN Scheme.
The current version is 5.4.0. The next major version, 6.0.0, is mostly implemented. Currently we are adding final touches before preparing the next release. If you are already familiar with CHICKEN, this article will help you get a more detailed view of what has been done. If you are not into CHICKEN or even Scheme, this article may still give you an insight into what's involved in the development and maintenance of a programming language project. You may also find it interesting to know how we address issues like portability and backwards-compatibility. There are also some juicy details on implementation techniques.
No previous knowledge of CHICKEN Scheme is required, but it will help if you are familiar with common terms used in programming language design and Lisp-y languages.
Versions
CHICKEN uses the usual major.minor.patch versioning scheme, where we bump major versions only for significant changes that break compatibility with older versions. CHICKEN has a relatively large community for a non-mainstream language and has lots of contributed extension libraries called "eggs". Breaking backwards compatibility in non-major versions adds work for users who just want to keep their CHICKEN up to date and expect any eggs they're using to keep working.
The process of avoiding breakage can sometimes become challenging. We may, for example, provide two different implementations for one and the same thing, to allow adoption of the new feature while keeping old code working. Typically this happens when the new feature is safer, easier to use, faster or more standards compliant than the old. We also remove features in stages: first we mark a feature as deprecated, and only delete it later. This allows users to upgrade their code gradually.
On major version changes, we create a new mirror of the egg "repository", where most of the extensions are stored. Externally contributed eggs can choose when and how to publish an extension for specific versions. This is described in a previous post on this blog.
Major version changes also usually bump the "binary version". This is a suffix to the shared runtime library name to avoid intermixing tools and programs written for one version with the libraries for another.
We started CHICKEN major version 6 to introduce new features that were dearly required but were impossible to add without changing external interfaces, in particular what modules exist and what they export. Specifically, we needed full support for UNICODE strings and compliance with the R7RS (small) language standard.
Both features were already available in a rudimentary manner as external libraries. They were not fully integrated, a fact that showed in various places and could lead to subtle bugs, as core and extension code made different assumptions about string representation and the behaviour of standard procedures. The same approach was used for integrating the "full numeric tower" (the full set of numeric types, including rationals, big integers and complex numbers), which was formerly a library and was moved into the core system with the 5.0.0 release.
We'll now describe the most noteworthy changes made during this major version transition.
UNICODE
A major shortcoming of CHICKEN up to now was that it didn't support full UNICODE strings. All string data was assumed to consist of 8-bit characters in the ASCII or Latin-1 range. Internationalisation of source code and applications is unavoidable and libraries and operating systems have moved towards UTF-8 as a standard character encoding. So we saw no choice but to find a reasonably efficient way of using the full range of possible characters in all parts of the system.
There is an open-ended space of design choices regarding how to efficiently implement UNICODE strings. The Damocles Sword of unanticipated performance degradation constantly looms over the head of the language implementer, so finding the ideal solution in terms of memory use, speed and simplicity is not easy. Some implementations go to great lengths by providing complex storage schemes or multiple representations depending on a string's content. In the end, we think a simple representation serves everybody better by being easier to understand and maintain.
It is also not entirely certain that (say) a dynamic multi-representation approach pays off sufficiently. Too many factors come into play, like string usage patterns in applications, memory management and implementation overhead. Data exchange at the boundaries with the operating system and foreign code also has to be taken into account. You want to avoid unnecessary conversions and copying, especially because CHICKEN is geared towards easy interfacing to external libraries and OS functionality.
We decided to use a UTF-8 representation internally. Many operating systems and applications already support UTF-8, which removes the need for costly conversions. UTF-8 is also backwards compatible with ASCII and keeps storage requirements at a minimum.
Since UTF-8 is a multi-byte encoding, character lookup in a string is of linear complexity: characters have to be scanned when addressing a specific index in a string. To avoid repeatedly scanning the same section of a string when iterating over it, we use a simple cache slot in the string that holds the byte-offset/code-point-index pair of the last access.
String representation
A quick recap regarding how strings are currently stored in CHICKEN is in order. If you are interested, there's also an older post on this blog with a more detailed explanation of the data representation used by CHICKEN.
Characters are stored as immediates with the special bit pattern 1110 (type code 111):
This gives us enough bits (even on 32-bit platforms) to hold all code points in the Basic Multilingual Plane (code points in the range 32 - 0x1fffff). CHICKEN 5 already supported characters in that range, but strings were still restricted to having 8-bit elements.
Up to CHICKEN 5 a string was represented by a single byteblock:
In CHICKEN 6 we add an indirection:
As you can see, the string itself is a normal block containing a pointer to a byteblock holding the UTF-8 byte sequence and a trailing zero byte. The "Count" slot is a fixnum with the number of code points in the string (the length). "Offset" and "Index" act as a cache for the byte-offset and code point-index of the last access. They are reset to zero if the string changes its length due to a character's width changing at an offset before the cached point.
An indirection is necessary regardless of the other details of the representation: as UTF-8 is a multi-byte encoding, destructively modifying characters in a string may grow or shrink the total byte-sequence. The string must still keep its identity and be indistinguishable from the original string (in terms of the primitive eq? procedure).
One obvious optimisation is possible for strings where the length ("Count") equals the length of the character data (encoded in the header of the byteblock) minus one, accounting for the trailing zero byte: such a string contains only single-byte characters, so we can simply index by byte offset.
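To make this concrete, here is a rough sketch in Scheme of what the indexing logic does. The real code is part of the C runtime, and every accessor below (string-count, byteblock-length, cached-index and so on) is a made-up name for this illustration.

;; Sketch only: map a code point index to a byte offset, using the
;; offset/index cache and the all-single-byte fast path described above.
(define (utf8-index->offset str index)
  (if (= (string-count str) (- (byteblock-length str) 1))
      index   ; every character is one byte: the index *is* the byte offset
      (let* ((usable?   (<= (cached-index str) index))
             (start-i   (if usable? (cached-index str) 0))
             (start-off (if usable? (cached-offset str) 0)))
        (let loop ((i start-i) (off start-off))
          (if (= i index)
              (begin
                (set-cached-index! str i)     ; remember this position
                (set-cached-offset! str off)  ; for the next access
                off)
              (loop (+ i 1)
                    (+ off (utf8-char-width (byteblock-byte str off)))))))))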
Alternative representations are possible. I believe the following is used in Chibi Scheme: keep a list of "chunks", i.e. multiple byte-offset/code point-index pairs per string. You can traverse this list to obtain the first chunk of string data containing the index you want to access.
In CHICKEN this would probably be represented thus:
This way we can find the offset for the index right at or before the location addressed. This reduces the amount of scanning to the part inside a chunk pointed to by the offset/index pair.
Such an approach of maintaining the offset/index list is more complex and keeping it up to date is more effort than the simple offset/index cache. Should the performance impact of the simpler representation turn out to be too large, the alternative approach may still be an option to try.
This leads me to the subject of benchmarks. We do have a benchmark suite holding a number of micro-benchmarks and CHICKEN 6 has been compared to previous versions using this suite. Nevertheless, the results of micro-benchmarking have to be taken with a grain of salt.
The behaviour of real-world applications may differ considerably depending on the memory consumption, memory address patterns and the type and amount of data processed. Reasoning about performance is especially difficult due to the complex caching mechanisms of contemporary hardware and operating systems. Therefore we will try to use the simplest approach that has a good chance of providing sufficient performance while keeping the implementation effort and complexity at a minimum.
Dealing with the outside world
There's another point we have to address when considering strings that are received from external sources like OS APIs, from files and from the foreign function interface. What to do when encountering invalid UTF-8 byte sequences, for example when reading file or directory names? Signal an error? Convert to some other representation or mark the location in the byte sequence using some special pattern?
We decided to basically ignore the problem. Strings are accepted whether valid or not. Only when decoding a string do we distinguish between valid and invalid sequences. When indexing and scanning a string's byte sequence, we return the invalid byte as a "surrogate pair end" code point that has the value 0xDCxx. This approach allows us to store the byte in the lower 8 bits of the code point. When inserting a character that has such a value into a string, we simply copy the byte. As I understand it, this is the approach used by the "surrogateescape" error handler in PEP 383. However, we do this masking every time we decode an unexpected byte (and do the inverse when encoding).
Say we received a string from an external source containing the following byte sequence:
This is invalid UTF-8. Extracting character by character with the described method would yield the following code point values:
Inserting such a surrogate pair end code point will inject the value of the lower byte. For example, calling (string-set! str i #\xdcfe), where str is the above invalid UTF-8 encoded string and i is the index of the invalid byte, would yield the following byte sequence:
The advantage is that we can simply accept and pass strings containing invalid sequences as if nothing happened. We don't need to do any conversions and checks. We can also access items in the character sequence and store them back without having to worry about the validity of the sequence with respect to the UTF-8 encoding. The process is "lossless".
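As a small illustration of the escaping rule (the helper names and the byte value below are chosen for this article and are not part of the CHICKEN API):

(import (chicken bitwise))

;; An invalid byte B is exposed as the code point #xDC00 + B, and such a
;; code point is written back as the raw byte B. CHICKEN allows characters
;; in this range precisely to support the escaping scheme.
(define (escape-byte b)   (integer->char (bitwise-ior #xdc00 b)))
(define (unescape-char c) (bitwise-and (char->integer c) #xff))

(escape-byte #xfe)        ; => #\xdcfe
(unescape-char #\xdcfe)   ; => 254, i.e. #xfe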
The disadvantage is that we have to perform an additional check when encoding or decoding UTF-8 sequences. Again we decided to reduce copying and transcoding overhead and rather have complete transparency regardless of the source of the string. Furthermore no validation is performed for incoming strings. The R7RS standard procedure utf8->string, which converts bytevectors into strings, does perform validation, though, to at least ensure that strings created from binary data are always correctly encoded.
A final issue is UNICODE handling on Windows platforms. There, the OS API entry-points that are UNICODE aware use UTF-16 for strings like filenames or the values of environment variables. On Windows there is no choice but to encode and decode from UTF-8 to UTF-16 and back when interfacing with the OS.
Port encodings
When accessing files, we still want to be able to read and write data in an 8-bit encoding or in binary, and we may also want to support additional encodings, even though these are currently not provided by default. So we need to associate an "encoding" property with "ports". Ports are the Scheme objects that provide access to files and other streams of data, like traffic received from a network. The encoding is selected when opening a port, using additional arguments to the standard procedures that create ports for textual input and output, like open-input-file/open-output-file.
Ports internally hold a table of "port methods", roughly similar to how object-oriented programming systems attach behaviour to data objects of a particular class. The port methods in former versions of CHICKEN included the low-level counterpart of the standard operations peek-char, read-char and read-string for input ports and write-char/write-string (among others) for output ports. The major change here is to replace the string I/O methods with methods that act upon bytevectors.
An internal mechanism delegates the operations for encoding and decoding or scanning for the next character in a stream to an internal hook procedure. Additional encodings can be registered by using a built-in procedure to extend the hook. Currently supported are binary, UTF-8 and Latin-1 encodings. Support for a larger set of encodings can be added through extensions and thus can be developed separately from the core system. Port encodings can be accessed using port-encoding. They can also be changed using the SRFI-17 setter for port-encoding, because encodings sometimes need to be changed while a port is still open.
R7RS does not define whether binary I/O standard procedures are allowed to operate on textual ports and vice versa. In CHICKEN we do not restrict the set of operations depending on the port type, so you can read and write bytevectors to and from a textual port and the other way round. We think this is more practical and intuitive - it makes more sense to read and write binary data as bytevectors and have dedicated operations for this.
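As a sketch of how this might look in practice (the #:encoding argument to open-input-file is an assumption made for illustration, and the module exporting port-encoding is left unspecified; the encoding switch via the SRFI-17 setter and the mixing of textual and binary operations are as described above):

;; Sketch only -- the way the encoding is selected when opening the port
;; is assumed, not the final API.
(import (scheme base))

(define in (open-input-file "data.txt" #:encoding 'latin-1))
(port-encoding in)                  ; => latin-1
(set! (port-encoding in) 'utf-8)    ; change the encoding of the open port
(read-char in)                      ; textual read, decoded as UTF-8
(read-bytevector 16 in)             ; binary read from the same textual port
(close-input-port in)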
R7RS support
The second big change for CHICKEN 6 is proper R7RS (small) compliance. Like with the UTF-8 support this was previously provided by an extension library, but using it needed some minor extra steps to set up. Now, all syntactic definitions of the (scheme base) module are available by default (most notably define-library) without requiring any further action.
Deciding what is available by default in a compilation unit or interpretation environment is a bit of a problem: to make it easy to get going when experimenting or scripting, we defaulted to having all standard R5RS procedures and macros available in the base environment of the interpreter (csi), together with the imports from the (chicken base) module. Compiled code was slightly more restricted but defaulted also to R5RS.
In CHICKEN 5 the scheme module referred to the R5RS standard procedures. To avoid breaking too much existing code this is still the case. So now, scheme is an alias for the R7RS (scheme r5rs) library module that exports all bindings of the former language standard. But to avoid duplicating the set of exported identifiers over several core modules, certain functionality has been moved from (chicken base) to (scheme base) and is thus not available in the default toplevel environment.
To make a long story short, the switch makes it necessary to access certain standard bindings like open-input-string by importing additional modules like (scheme base). This is not necessarily a bad thing, as it incentivises the user to write code in a more standard compliant way. But it all feels a bit clunky and may have been a suboptimal choice. Note that just adding an (import (scheme base)) is usually enough to make existing code run. We will see how that works out.
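For example, code that uses open-input-string now needs the extra import:

(import (scheme base))   ; provides open-input-string, read-line, ...

(define p (open-input-string "hello, world"))
(read-line p)            ; => "hello, world"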
All required (scheme ...) modules are implemented and can be imported using their full functionality. Two notable changes that influence backwards compatibility are generative record types and hexadecimal escape sequences in string literals.
Formerly record types defined with define-record-type were non-generative: re-defining a record type with the same name, even if the type definition is completely different, would not create a new type. Instances of the former definition would also be of the newly defined type, provided the type name is the same. Now every definition of a record type creates a completely new type, regardless of the name. This is of course more correct and much safer, as instances created with the old definition are no longer silently treated as instances of the new, possibly incompatible one.
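For example (a minimal illustration of the new, generative behaviour):

(import (scheme base))

(define-record-type point
  (make-point x y)
  point?
  (x point-x)
  (y point-y))

(define p (make-point 1 2))

;; Re-defining the type, even with an identical definition, creates a new type:
(define-record-type point
  (make-point x y)
  point?
  (x point-x)
  (y point-y))

(point? p)   ; => #f in CHICKEN 6: p is an instance of the old, distinct type
             ;    (in CHICKEN 5 this returned #t, since definitions were non-generative)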
A second (and much more annoying) change is that R7RS requires hex escape sequences in string literals to be terminated by a semicolon. Unfortunately the change is incompatible with the convention used in most programming languages, including many existing Lisp and Scheme implementations.
What in CHICKEN 5 would have looked like this:
"the color \x1b[31mRED\x1b[0m"
in CHICKEN 6 (and R7RS) must now be (note the semicolon):
"the color \x1b;[31mRED\x1b;[0m"
The motivation here was probably to avoid having dedicated escape sequences for values that require more than 2 hex digits (e.g. \uNNNN). The change is particularly bad from a compatibility point of view. All string literals that contain the old style of escape sequences must be changed. To keep code working in both CHICKEN 5 and 6 you can use the 4-digit \uNNNN escape sequence which is still valid in all versions.
Foreign Function Interface changes
CHICKEN has a very easy-to-use foreign function interface, which mostly derives from the fact that the compiler generates C. The alternative approach is a binary FFI that uses native code to interface with C code, like libffi, which must reproduce a lot of ABI details to safely interface with C libraries: things like struct alignment and padding, how arguments of various lengths are passed on the stack, etc.
The most notable FFI-related change for CHICKEN 6 is that it allows passing and returning C structs and unions to and from foreign code by value. The contents are copied in and out of bytevectors and wrapped in a block. The components of the struct or union can not be directly manipulated in Scheme but can be passed on to other foreign functions. Additionally, for completeness, you can now also directly pass C99 complex numbers. Note that complex numbers in the FFI are always passed as inexact (i.e., floating-point), as opposed to Scheme complex numbers that may have exact (integer or even rational) real and imaginary components.
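As a sketch of what this enables (foreign-declare and foreign-lambda are existing CHICKEN forms, but the (struct "point") type specifier used below for by-value passing is an assumption about the new feature, not confirmed syntax):

(import (chicken foreign))

(foreign-declare "
#include <math.h>
struct point { double x, y; };
static struct point make_point(double x, double y) {
  struct point p = { x, y };
  return p;
}
static double point_norm(struct point p) {
  return sqrt(p.x * p.x + p.y * p.y);
}
")

;; Hypothetical type specifier for by-value struct passing:
(define make-point
  (foreign-lambda (struct "point") "make_point" double double))

(define point-norm
  (foreign-lambda double "point_norm" (struct "point")))

(point-norm (make-point 3.0 4.0))   ; => 5.0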
Platform support and build system
Here two major changes have been introduced. First, we now have a proper configuration ("configure") script. Formerly, all build parameters were passed as extra arguments to make(1), like this:
make PREFIX=$HOME/.local ...
This required that for every make(1) invocation, the same set of parameters had to be given to avoid inconsistencies. A configuration script written in portable POSIX sh(1) notation is now provided to perform the configuration step once before building. It also follows the de facto convention used in many GNU programs, where the usual incantation is:
./configure; make; make install
Note that we don't use the dreaded "autotools" (autoconf, automake and libtool), which have arcane syntax, are very brittle and produce more problems than they are trying to solve. They were originally designed to port code to now-dead or obscure UNIX derivatives, yet fail to provide a straightforward and easy-to-use configuration tool for modern systems (Linux/BSD, Mac, Windows, mostly). Our configure script is hand-written instead of auto-generated, and only checks for the platform differences that are relevant to the platforms that CHICKEN actually supports.
The old style of passing variables to make(1) should still work, but is deprecated.
The second change in the build system is that we cleaned up the confusion about toolchains on Windows systems. There is a bunch of half-maintained flavours of GNU-based C development environments for Windows systems ("MinGW", "MinGW-w64", "MSYS2", etc.) and it is a constant source of pain to support and test all these variants.
There is now one official "blessed" toolchain that we support. Specifically, we recommend Chris Wellons' excellent w64devkit. It contains compilers, build tools and enough of a POSIX environment for building CHICKEN on Windows. We also require a POSIX sh(1) (contained in w64devkit) and have dropped all support for building on a non-POSIX shell, i.e. cmd.exe. This simplifies the build and package management tools considerably. It also ensures we have fewer environments to test.
Building for Windows Subsystem for Linux (WSL) and Cygwin is of course still supported, but those use the UNIX build setup and need no or very little specific platform support.
Minor changes
Quite a number of minor changes have taken place that either increase the robustness or standards compliance of the system.
syntax-error is now a macro, as required by R7RS. Previously there was a syntax-error procedure in (chicken syntax) with the same argument signature. So where the error was previously raised at runtime, it is now raised at expansion time. This is something to take note of when porting code to CHICKEN 6. The invocation is still the same, so the difference can at least be identified easily and corrected; (scheme base) needs to be imported, of course.
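A small illustration (the macro comes from (scheme base), as noted):

(import (scheme base))

(define-syntax swap!
  (syntax-rules ()
    ((_ a b)
     (let ((tmp a)) (set! a b) (set! b tmp)))
    ((_ args ...)
     (syntax-error "swap! expects exactly two variable names"))))

;; (swap! x)   ; in CHICKEN 6 this is reported while the macro is expanded,
;;             ; not when the expanded code eventually runs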
The csc compiler driver now passes arguments to subprocesses properly via execve(2) rather than system(3), which reduces shell quoting headaches.
The chicken-install package ("egg") manager locks the build cache directory to avoid conflicts when running multiple installations in parallel. Also, a custom-config feature has been added for the places in the package description (.egg) file that specify compile and link options: these options can now be provided by external tools like pkg-config, allowing more portable configuration of the native libraries that packages use. The configuration script is expected to be Scheme code. Other eggs can extend the set of possible customisation facilities by providing library code to access them.
The feathers debugger has been removed from the core system and re-packaged as an egg, allowing separate development or replacement. It always was a bit of a proof-of-concept thing that lacks the robustness and flexibility of a real debugger. Some users found it helpful, so we keep it for the time being.
Future directions
Every major release is a chance to fix long-standing problems with the codebase and address bad design decisions. CHICKEN is now nearly 25 years old and we have had many major overhauls of the system. Sometimes these caused a lot of pain, but still we always try to improve things and hopefully make it more enjoyable and practical for our users. There are places in the code that are messy, too complex, or that require cleanup or a rewrite, always sitting there waiting to be addressed. On the other hand, CHICKEN has been relatively stable compared to many other language implementations and has a priceless community of users that helps us improve it. Our users never stop reminding us of what could be better, where the shortcomings are, where things are hard to use or inefficient.
So the final goal is and will always be to make it more robust and easier to use. Performance improvements are always good but are secondary. Strong standards compliance is a requirement, unless it interferes with practicality. We also try to avoid dependencies in the core system at all costs. This eases porting and avoids friction that is very often introduced by inadequate tools or overdesigned development environments.
Some moody ramblings about Scheme standards
The switch to seamless support for R7RS (small) was overdue for quite a while. R7RS is by now the generally accepted Scheme standard that implementations try to follow. How the further development of R7RS (large) will turn out remains to be seen, but I'm not very confident that it will result in anything more than just a shapeless agglomeration of features. The current direction seems to be oriented towards creating something that includes everything but the "kitchen-sink" - too ambitious and too much driven by the urge to produce something big and comprehensive.
What always distinguished Scheme from other languages and Lisp dialects was its elegance and minimalism. This resulted in a large number of experimental implementations, the investigation of various directions of programming language semantics and in highly interesting advanced implementation techniques that are still in use today. The expressiveness and the small number of core concepts made it also perfect for teaching computing. It still is great for teaching, even if the tendency to address perceived requirements of the "market" replaces the academic use of Scheme with languages that belong more to the "mainstream". This strikes me as strange, as if learning multiple languages and studying different programming paradigms couldn't help in obtaining a broader view of software development; or as if one is somehow wasting time by exploring the world of programming in a non-mainstream language.
Repeatedly, the small size and limited scope of Scheme have driven a number of enthusiasts dedicated to broader use in industry to demand "bigness". They dream of a comprehensive standard that will support everything and please everyone, make it easy to write portable and non-trivial applications in Scheme, and have them run on a large number of implementations. With this comes the tacit expectation that implementers will just follow such a large standard and implement it faithfully.
But what made Scheme successful (and beautiful) was the small size, the small number of powerful concepts, the minimalism, the joy in experimentation. Trying to create the Comprehensive Mother of all Schemes is in my opinion a waste of time. In fact, it makes Scheme less relevant. Large languages inevitably die - the warts and inconsistencies that crop up when you try to design too much up front will just get more blatant as the language ages and will annoy users, constrain implementers and make people search for alternatives.
You can already write serious programs and use a large number of libraries in Scheme: just install CHICKEN, Guile or (even) Racket and get going. Part of the code will not be portable across implementations, that's unavoidable, but to use dynamic languages effectively and successfully, some experience with and a certain understanding of the design of the underlying compiler or interpreter is necessary anyway.
Acknowledgements
I would like to thank my employer, bevuta IT GmbH, which sponsored part of the effort of preparing a new major release of CHICKEN.
Adding weak references to CHICKEN Posted on 2023-06-27
Recently, a user was asking on IRC why CHICKEN doesn't give proper line numbers for error call traces in the interpreter. This is indeed rather odd, because the compiler gives numbered traces just fine.
After a brief discussion, we figured that this was probably because line numbers are stored in a "database" (hash table) which maps the source expression to a line number. Because you can keep calling (eval) with fresh input, the number of expressions evaluated is potentially infinite.
This would lead to unbounded growth of the line number database, eventually eating up all available memory.
We'd like to fix this. A great solution for this problem would be weak references. A weak reference is a reference to a value that does not cause the garbage collector to hold on to that value. If other things still refer to the value, it is preserved and the weak reference is maintained. If only weak references refer to a value, it may be collected.
The line number database could then consist of such weak references to source expressions. If an expression is no longer held onto, it can be collected and the line number eventually removed. This would turn the database from a regular hash table into a weak hash table.
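As a rough sketch of the idea (weak-cons is the procedure introduced later in this post; the real line number database is internal to the interpreter and does not literally look like this):

;; Conceptual sketch only: each entry holds its source expression weakly,
;; so the table does not keep evaluated expressions alive.
(define line-numbers
  (list (weak-cons '(+ 1 2) 42)            ; expression -> line 42
        (weak-cons '(launch-missiles) 57)))
;; Once nothing else references an expression, a later GC breaks the weak
;; reference and the corresponding entry can be dropped from the table.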
This requires hooking into the garbage collector so that it knows about these references and can drop them. Since our collector is a copying collector, addresses of objects change over time, and a weak reference needs to be updated if something is still hanging onto it, without itself causing the weakly held object to stick around.
A quick recap of garbage collection
To explain how we changed our garbage collector, I will start with a quick high-level recap of how the garbage collection works. I'll explain just enough to cover what is needed to understand weak references. For a more in-depth look at CHICKEN's garbage collection, see my post on the GC.
First, we have to realise that CHICKEN stores all its values in a machine word. It distinguishes between "immediate" values and non-immediate or "block" values. Immediate values can be either fixnums (bit 0 is set) or other "small" values (bit 1 is set). Non-immediates are recognised because the lower two bits are zero, which is very convenient, as a pointer to a word-aligned address happens to have the lower two bits cleared as well!
In other words, non-immediate values are simply C pointers, whereas the immediate values are encoded long integers. So, non-immediates are allocated in memory and represented by a pointer to them. At the pointer's address we find a header which encodes what exactly is stored there, and (optionally) the data, where we store the contents of compound objects.
typedef struct {
  C_word header;
  C_word data[]; /* Variable-length array: header determines length */
} C_SCHEME_BLOCK;
In typical Scheme objects, the data consists of slots. For example, the car and the cdr of a pair are slots, the elements of a vector are slots, etc. Each slot contains a Scheme value. In a graphical representation, a block object looks like this:
The data representation will be important to understand how we implement weak pairs, and to follow the garbage collection process.
If you want more details, see my post about data representation in CHICKEN.
A tale of two spaces
CHICKEN divides the heap up in two halves. While the program is running, only one half is in use. When this half fills up, we start the garbage collection process. We trace all the data that's still considered "live" (aka GC "roots") and copy it over to the other half.
Let's look at an example:
(cons 5 '())                       ; unreferenced, may be collected
(define x (cons (vector 42) '()))  ; sticks around
The half that was full is called the fromspace and the half that starts out empty is called the tospace. Initially, the tospace is empty, and the data is only in the fromspace:
Then, we move the objects, one at a time, starting at the roots. So in the example, the pair that we named "x" gets moved first to tospace, while its slots still point into fromspace:
The changes are in yellow. After tracing the roots, we scan the contents of the objects we just copied and copy over the objects in fromspace that they still point to:
Finally, we're done. Any remaining objects in fromspace are garbage and can be ignored, effectively "clearing" fromspace, and we flip the two spaces so that fromspace becomes tospace and vice versa:
Onward! I mean, forward!
This was a very simple case. A more complex case is when there are several objects with mutual references. We must somehow keep track of which objects got moved where, so that we don't copy the same object more than once.
Another example:
(define shr (vector 42))
(define x (cons shr 1))
(define y (cons shr 2))
This would look like the following when visualized:
As I've explained, roots are copied first, so let's say shr gets copied first:
As you can see, this introduces something new. There's a forwarding pointer stored at the address where we originally found the header of the object known as shr. This is done so that we don't have to traverse all the objects pointing to shr (which could even be cyclic!). Instead, whenever we find an object that holds an address where there now is a forwarding pointer, we dereference the pointer and change the object to point to the target address.
So, if x is copied next, we get the following picture:
We actually *always* leave behind a forwarding pointer when an object is copied, because the GC does not "know" whether anything references the object. I skipped that in the initial examples to keep it simple. And finally, y can be copied:
Now you can see, the forwarding pointers are still there in fromspace, but nothing references them anymore. The GC is now done. Of course, fromspace and tospace will be flipped (but I'm not showing that here).
How weak pairs work in the GC
Let's make a few changes to our example:
(define zero (weak-cons (vector 0) 0)) ;; new - car may be collected
(weak-cons (vector 9) 9)               ;; new - may be collected
(define shr (vector 42))
(define x (cons shr 1))
(define y (weak-cons shr 2))           ;; changed to be weak
(set! shr #f)                          ;; new - dropping root for shr
(define z (weak-cons (vector #f) 3))   ;; new - car may be collected
Now, y is changed to be a weak pair holding shr in its car, while x is still a normal pair with the same object in its car. The shr variable is also cleared, so that only x holds onto shr's old value strongly. We've added a few more weak pairs while we're at it.
This weak-cons procedure does not exist in CHICKEN 5.3.0; this is what we'll be adding. It creates a pair whose car field is a weak reference. The cdr field is a regular reference. The reason for having the cdr be a regular reference is that this allows us to build up lists of items which may be collected. The "spine" of the list will remain intact, but the list entries will be cleared (see below).
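For example (a small sketch; #!bwp is the "broken weak pair" marker explained a little further on):

;; A list whose spine is held strongly but whose entries are held weakly.
(define entries
  (weak-cons (vector 'a)
             (weak-cons (vector 'b) '())))

;; If nothing else references the two vectors, a later garbage collection
;; clears the car slots while the spine stays intact:
;;
;;   entries   =>  (#!bwp #!bwp)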
Let's take a look at our new initial situation:
The GC will start with the live objects, as usual. Let's start at the top, copying zero:
Since zero is a weak pair, we won't traverse the car slot, keeping the vector it points to in fromspace. This is intentional, as the GC treats weak pairs differently; only the cdr slot gets scanned, while the car gets skipped and left as a "dangling" pointer into fromspace.
We'll fix that later, as we don't want to keep this situation after the garbage collection process is done.
Normally we'd pick up shr next, but that is not a root anymore. Let's continue with x instead:
Then, as we scan over the contents of x, we encounter what used to be shr and move it, updating the pointer in x:
Next, we move y:
You'll notice that because it's a weak pair, its car slot does not get updated to the new location of what used to be shr, even though that vector has already been copied. As mentioned before, we'll fix this up later.
Finally, z gets moved:
I'm sure you've noticed that now we're left with a big spaghetti bowl of pointers, some of which are still pointing into fromspace. So let's investigate how we can fix this mess.
Fixing up the dangling weak pairs, naively
It just so happens that CHICKEN internally already has support for weak pairs. Those are pairs whose car field holds objects that may be collected. These are used solely in the symbol table, so that symbols aren't maintained indefinitely. This is important because it stops symbol table stuffing attacks. Those are a form of Denial of Service (DoS) attack that forces the system to eat up all memory.
In this current implementation, the only weak pairs in existence are those in the symbol table's hash bucket chains. So, the solution there is very simple: traverse all the symbol table buckets and update weak pointers.
So we traverse the slots like during normal GC's mark operation, but with two exceptions:
- When we mark the slot, we don't cause the contents to be copied over into tospace, but only chase any forwarding pointers.
- When we encounter a pointer into fromspace, we replace the slot with a special "broken weak pair" marker.
That would yield this final picture:
As you can see, all the weak pairs have their car slots updated, even the ones that we could consider as garbage, like the two at the top. Those get the special value #!bwp, or "broken weak pointer" to indicate that the value originally stored there is no longer live.
Smarter fixing up of weak pairs
Traversing all the weak pairs is fine when we know every single one in the system. When only hash table buckets in the symbol table may be weak, that's the case. But now that we want to expose weak pairs to the user, what do we do?
The simplest solution would be to add a table or list of all the weak pairs as they're constructed, or as the GC encounters them. This has a few problems:
- It is wasteful of memory, as the extra pointers in this table would be redundant.
- If done during garbage collection, we'll have to dynamically allocate during a GC, when memory is already tight (and it's slow).
- We'd end up traversing potentially loads and loads of dead weak pairs that would be themselves collected, as in the example above.
Alternatively, you could traverse the entire heap again after GC and mark the weak pairs. In fact, that's what Chez Scheme does. However, Chez has separate heaps for different types of objects (known as a BIBOP scheme). This has the advantage of traversing only the live weak pairs, instead of every single live object.
In CHICKEN, we'd be traversing every single object, as we store everything on a single heap. If the heap is large, this would be wasteful. But at least we wouldn't have to traverse dead weak pairs!
When looking into this problem, I decided to study the Schemes which have weak pointers. Those were Chez Scheme, Chibi Scheme and MIT Scheme. Of those, only MIT Scheme has a Cheney-style copying garbage collector with a single heap like CHICKEN does. The sources to MIT Scheme are well-documented, so I quickly found how they do it, and boy is it clever!
MIT Scheme's solution:
- Traverses only the still-live weak pairs,
- introduces only a single word of memory overhead,
- and is brilliantly simple once you get how it works.
Garbage collecting weak pairs, MIT style
Let's take a closer look at that end state of that last GC run. Remember that a forwarding pointer overwrites the original header of the object that used to live there. But the memory of the slots is still sitting there, allocated but unused!
The great idea in MIT Scheme is that we can "recycle" that unused memory and store a linked list of still-live weak pairs in that space. The extra word of overhead I mentioned before acts as the head of that list. Let's look at how that looks during the GC, by replaying that last run, but with MIT Scheme's algorithm for tracking weak pairs:
The initial state is the same, except we have the additional weak pair chain head. The GC again starts at the top, copying zero, same as before, but with one addition. When the forwarding pointer gets created, we notice that it's a weak pair. This causes the weak pair chain head to get modified so it points to the new forwarding pointer. The original value of the chain head (the empty list marker) gets stored where the old weak pair's car slot used to be.
Again, we continue with x, which is a regular pair, so nothing interesting happens to the space behind the forwarding pointer:
Then, as we scan over the contents of x, we again encounter what used to be shr and move it, updating the pointer in x. The forwarding pointer for shr is also not put in the chain because shr is not a weak pair:
Next, we move y. Because it is a weak pair, we again link up its forwarding pointer into the chain that we're building up. This means the pointer in the weak chain head (which pointed to zero's original address) gets put in the old car field of y, and the head itself now points to y's forwarding pointer:
Finally, z gets moved. Since it is also a weak pair, when making the forwarding pointer we also link it up into the chain:
Finally, we only need to traverse the weak pairs by following the trail of pointers starting at the head of the live weak pair chain (a pseudocode sketch of this walk follows the list below). For each of the forwarding pointers in the chain:
- Follow the forwarding pointer to its corresponding live weak pair,
- take that pair's car,
- check if the car refers to a forwarding pointer. If it does:
- update the car to the forwarding pointer's destination;
- otherwise (the object the car pointed to was not copied and is garbage), replace the car with the broken weak pair marker #!bwp, just as in the naive approach.
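In pseudocode, the walk looks roughly like this (every helper name below is invented for illustration; the real code is C inside the garbage collector):

(let loop ((link weak-pair-chain-head))
  (unless (end-of-chain? link)
    (let* ((pair    (forwarding-target link))     ; the live copy in tospace
           (old-car (weak-pair-car pair))         ; still points into fromspace
           (next    (stashed-chain-link link)))   ; next entry, stored in the old car slot
      (if (forwarding-pointer? old-car)
          (weak-pair-car-set! pair (forwarding-target old-car))  ; object survived
          (weak-pair-car-set! pair #!bwp))                       ; object died
      (loop next))))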
I'm not going to do this one by one as it's pretty obvious. I will show you the result of this walk:
You can see that we've only updated three out of four weak pairs - exactly the number of weak pairs that got copied during this GC and are still considered live. Because only weak pairs that exist in tospace have to be updated, we are doing the absolute minimum work necessary.
The nice thing about this is, that even if you produce lots of garbage weak pairs, they don't have any additional impact on GC performance. Only those weak pairs that survive garbage collection will be scanned and updated.
When implementing this improvement I noticed that our performance on benchmarks improved a little bit, even though they don't use weak pairs directly. This was all due to only scanning copied weak pairs and not having to scan the entire symbol table on every GC.
With the support for user-facing weak pairs in place, supporting line numbers in the interpreter was relatively easy. This means that debugging your code will be easier starting with the upcoming 5.4 release of CHICKEN!
Further reading
- At the Village CHICKENs event, I presented my findings about how to add weak references. Slides are here.
- Back in 2014, Ruby added symbol GC to stop symbol table stuffing attacks. It also improved performance because the GC had to do less work copying stale symbols. The flurry of interest in symbol table stuffing attacks in Ruby was also my inspiration for fixing and improving the symbol GC in CHICKEN.
- The manual to MIT Scheme has a section on weak pairs, and the manual to Chez Scheme has its own, slightly different API.
- MultiScheme: A Parallel Processing System Based on MIT Scheme by James S. Miller. This is the paper that introduces the weak-cons operation and explains in more detail how the MIT Scheme GC was modified, in the same way we modified CHICKEN's GC.
- Don't stop the BIBOP: Flexible and Efficient Storage Management for Dynamically Typed Languages by R. Kent Dybvig, David Eby, and Carl Bruggeman. This paper explains how Chez Scheme's allocation scheme works. We've toyed with the idea of implementing a BIBOP allocator and GC in CHICKEN as well, but it'd be a large undertaking without knowing for sure that it would improve performance.
- Data representations in PDP-10 MacLISP by Guy L. Steele describes the original BIBOP allocation scheme.
Clojure from a Schemer's perspective Posted on 2021-03-03
Recently I joined bevuta IT, where I am now working on a big project written in Clojure. I'm very fortunate to be working in a Lisp for my day job!
As I've mostly worked with Scheme and have used other Lisps here and there, I would like to share my perspective on the language.
Overall design
From a first view, it is pretty clear that Clojure has been designed from scratch by (mostly) one person who is experienced with Lisps and as a language designer. It is quite clean and has a clear vision. Most of the standard library has a very consistent API. It's also nice that it's a Lisp-1, which obviously appeals to me as a Schemer.
My favourite aspect of the language is that everything is designed with a functional-first mindset. This means I can program in the same functional style as I tend to do in Scheme. Actually, it's even more functional, because for example its maps (what would be hash tables in Scheme) are much less clunky to deal with. In Scheme, SRFI-69 hash tables are quite imperative, with hash-table-set! and hash-table-update! being the ways to insert new entries, which of course mutate the existing object. Similarly, Clojure vectors can easily be extended (on either end!) functionally.
The underlying design of Clojure's data structures must be different. It needs to efficiently support functional updates; you don't want to fully copy a hash table or vector whenever you add a new entry. I am not sure how efficient everything is, because the system I'm working on isn't in production yet. A quick look at the code implies that various data structures are used under the hood for what looks like one data structure in the language. That's a lot of complexity! I'm not sure that's a tradeoff I'd be happy to make. It makes it harder to reason about performance. You might just be using a completely different underlying data structure than expected, depending on which operations you've performed.
(non) Lispiness
To a seasoned Lisp or Scheme programmer, Clojure can appear positively bizarre. For example, while there is a cons function, there are no cons cells, and car and cdr don't exist. Instead, it has first and rest, which are definitely saner names for a language designed from scratch. It has "persistent lists", which are immutable lists, but in most day to day programming you will not even be using lists, as weird as that sounds!
Symbols and keywords
One thing that really surprised me is that symbols are not interned. This means that two symbols which are constructed on the fly, or when read from the same REPL, are not identical (as in eq or eq?) to one another:
user> (= 'foo 'foo)
true
user> (identical? 'foo 'foo)
false
Keywords seem to fulfil most "symbolic programming" use cases. For example, they're almost always used as "keys" in maps or when specifying options for functions. Keywords are interned:
user> (= :foo :foo)
true
user> (identical? :foo :foo)
true
Code is still (mostly) expressed as lists of symbols, though. When you're writing macros you'll deal with them a lot. But in "regular" code you will deal more with keywords, maps and vectors than lists and symbols.
Numeric tower
A favorite gotcha of mine is that integers are not automatically promoted to bignums like in most Lisps that support bignums. If you need bignums, you have to use special-purpose operators like +' and -':
user> (* (bit-shift-left 1 62) 2)
Execution error (ArithmeticException) at user/eval51159 (REPL:263).
integer overflow
user> (*' (bit-shift-left 1 62) 2)
9223372036854775808N
user> (* (bit-shift-left 1 62) 2N) ; regular * supports BigInt inputs, though
9223372036854775808N
user> (* 1N 1) ; but small BigInts aren't normalized to Java Longs
1N
This could lead to better performance at the cost of more headaches when dealing with the accidental large numbers in code that was not prepared for them.
What about rationals, you ask? Well, those are just treated as "the unusual, slow case". So even though they do normalize to regular integers when simplifying, operations on those always return BigInts:
user> (+ 1/2 1/4)
3/4
user> (+ 1/2 1/2)
1N
user> (/ 1 2) ; division is the odd one out
1/2
user> (/ 4 2) ; it doesn't just punt and always produce bignums, either:
2
The sad part is, bitwise operators do not support bignums, at all:
user> (bit-shift-right 9223372036854775808N 62)
Execution error (IllegalArgumentException) at user/eval51167 (REPL:273).
bit operation not supported for: class clojure.lang.BigInt
user> (bit-shift-right' 9223372036854775808N 62) ; does not exist
Syntax error compiling at (*cider-repl test:localhost:46543(clj)*:276:7).
Unable to resolve symbol: bit-shift-right' in this context
There's one benefit to all of this: if you know the types of something going into numeric operators, you will typically know the type that comes out, because there is no automatic coercion. Like I mentioned, this may provide a performance benefit, but it also simplifies reasoning about types. Unfortunately, this does not work as well as you would hope because division may change the type, depending on whether the result divides cleanly or not.
Syntax
For many Lispers, this is the elephant in the room. Clojure certainly qualifies as a Lisp, but it is much heavier on syntax than most other Lisps. Let's look at a small contrived example:
(let [foo-value (+ 1 2)
bar-value (* 3 4)]
{:foo foo-value
:bar bar-value})
This is a let just like in Common Lisp or Scheme. The bindings are put inside square brackets, which is literal syntax for vectors. Inside this vector, key-value pairs are interleaved, like in a Common Lisp property list.
The lack of extra sets of "grouping" parentheses is a bit jarring at first, but you get used to it rather quickly. I still mess up occasionally when I accidentally get an odd number of entries in a binding vector. Now, the {:foo foo-value :bar bar-value} syntax is a map, which acts like a hash table (more on that below).
There doesn't seem to be a good rationale about why vectors are used instead of regular lists, though. What I do really like is that all the binding forms (even function signatures!) support destructuring. The syntax for destructuring maps is a bit ugly, but having it available is super convenient.
What I regard as a design mistake is the fact that Clojure allows for optional commas in lists and function calls. Commas are just whitespace to the reader. For example:
(= [1, 2, 3, 4] [1 2 3 4]) => true
(= '(1, 2, 3, 4) '(1 2 3 4)) => true
(= {:foo 1, :bar 2, :qux 3} {:foo 1 :bar 2 :qux 3}) => true
(= (foo 1, 2, 3, 4) (foo 1 2 3 4)) => true
;; A bit silly:
(= [,,,,,,1,,,2,3,4,,,,,,] [1 2 3 4]) => true
Maybe this is to make up for removing the extra grouping parentheses in let, cond and map literal syntax? With commas you can add back some clarity about which items belong together. Rarely anybody uses commas in real code, though. And since it's optional it doesn't make much sense.
This has an annoying ripple effect on quasiquotation. Due to this decision, a different character has to be used for unquote, because the comma was already taken:
`(1 2 ~(+ 1 2)) => (1 2 3)
`(1 2 ~@(list 3 4)) => (1 2 3 4)
This might seem like a small issue, but it is an unnecessary and stupid distraction.
Minimalism
One of the main reasons I enjoy Scheme so much is its goal of minimalism. This is achieved through elegant building blocks. This is embodied by the Prime Clingerism:
Programming languages should be designed not by piling feature on top of feature, but by removing the weaknesses and restrictions that make additional features appear necessary.
Let's check the size of the clojure.core library. It clocks in at 640 identifiers (v1.10.1), which is a lot more than R5RS Scheme's 218 identifiers. It's not an entirely fair comparison as Scheme without SRFI-1 or SRFI-43 or an FFI has much less functionality as well. Therefore, I think Clojure's core library is fairly small but not exactly an exercise in minimalism.
Clojure reduces its API size considerably by having a "sequence abstraction". This is similar to Common Lisp's sequences: you can call map, filter or length on any sequence-type object: lists, vectors, strings and even maps (which are treated as key/value pairs). However, it is less hacky than in Common Lisp because for example with map you don't need to specify which kind of sequence you want to get back. I get the impression that in Common Lisp this abstraction is not very prominent or used often but in Clojure everything uses sequences. What I also liked is that sequences can be lazy, which removes the need for special operators as well.
If you compare this to Scheme, you have special-purpose procedures for every concrete type: length, vector-length, string-length etc. And there's no vector-map in the standard, so you need vector-map from SRFI 43. Lazy lists are a separate type with its own set of specialized operators. And so on and so forth. Using concrete types everywhere provides for less abstract and confusing code and the performance characteristics of an algorithm tend to be clearer, but it also leads to a massive growth in library size.
After a while I really started noticing mistakes that make additional features appear necessary: for example, there's a special macro called loop to make tail recursive calls. This uses a special symbol, recur, to call back into the loop. In Scheme, you would do that with a named let where you can choose your own identifier to recur. It's also not possible to nest such Clojure loops, because the identifier is hardcoded. So, this called for adding another feature, which is currently in proposal. Speaking of recur, it is also used for tail recursive self-calls. It relies on the programmer rather than the compiler to mark calls as tail recursive. I find this a bit of a cop-out, especially in a language that is so heavily functional. Especially since this doesn't work for mutually tail-recursive functions. The official way to do those is even more of a crutch.
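For comparison, the Scheme idiom mentioned above looks like this; because each loop has its own name, inner and outer loops can recurse independently:

;; Named-let loops in Scheme: the programmer picks the loop identifiers,
;; so nested loops don't clash (something recur cannot express directly).
(define (multiplication-table n)
  (let outer ((i 1) (rows '()))
    (if (> i n)
        (reverse rows)
        (let inner ((j 1) (row '()))
          (if (> j n)
              (outer (+ i 1) (cons (reverse row) rows))
              (inner (+ j 1) (cons (* i j) row)))))))

(multiplication-table 3)   ; => ((1 2 3) (2 4 6) (3 6 9))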
I find the special syntax for one-off lambdas #(foo %) just as misguided as SRFI 26 (cut and cute). You often end up needing to tweak the code in such a way that you have to transform the lambda to a proper fn. And just like cut, it doesn't save that many characters anyway and makes the code less readable.
The -> macro is a clever hack which allows you to "thread" values through expressions. It implicitly adds the value as the first argument to the first form, the result of that form as the first argument of the next, etc. Because the core library is quite well-designed, this works 90% of the time. Then the other 10% you need ->> which does the same but adds the implicit argument at the end of the forms. And that's not always enough either, so they decided to add a generic version called as-> which binds the value to a name so you can put it at any place in the forms. These macros also don't compose well. For example, sometimes you need a let in a -> chain to have a temporary binding. That doesn't work because you can't randomly insert forms into let, so you have to split things up again.
And as I note below, the minimalism is kind of "fake" because some essentials simply aren't provided; you have to rely on Java for that.
Java integration
Clojure was originally designed as a "hosted language", so it leverages the JVM. It does this admirably well; Java classes can be seamlessly invoked through Clojure, without any ceremony:
user> (java.util.UUID/randomUUID)
#uuid "bb788bae-5099-4a64-9c37-f6219d40a47f"
;; alternatively:
user> (import 'java.util.UUID)
java.util.UUID
user> (UUID/randomUUID)
#uuid "0bfd2092-14e1-4b88-a465-18698943ea4e"
The downside is that the above is the way to generate a random UUID. So even though uuids have literal syntax in Clojure (as #uuid "..."), there is no Lispy API for them in the Clojure standard library. This can be pretty frustrating, especially in the beginning. There's no clear indication where to look; sometimes you'll be poring over Java language docs for random stuff you thought would have a Clojure interface (like, say, creating temporary files or dealing with byte arrays). At those moments, you're basically programming Java with parentheses.
Having said that, there will often be community-provided nicer APIs for many of those things, but then you need to decide whether it's worth adding an extra dependency just for slightly nicer syntax.
Development style
REPL-driven development
Speaking of Java, one thing that constantly bothers me is the slow startup times of the REPL. In my current project, it takes almost 30 seconds to boot up a development REPL. Half a minute!
Luckily, there's great Slime-like Emacs integration with CIDER. Basically, the only sane way to do iterative development is to connect to a REPL as the first thing you do and then keep sending your code to it.
Now, this may sound weird from a Scheme programmer, but I never fully bought into the REPL style of developing. Sure, I experiment all the time in the REPL to try out a new API design or to quickly iterate on some function I'm writing. But my general development style tends more towards the "save and then run the test suite from an xterm". Relying solely on the REPL just "feels" jarring to me. I also constantly run into issues where re-evaluating a buffer doesn't get rid of global state that was built up on a previous run. When this happens, I'm testing an old version of some function without realising it. Keeping track of the "live" state versus the textual code I'm looking at is a total mind fuck for me. I don't understand how others can do this.
Another thing I seem to constantly do is write some code, have the tests go all green, only to see the CI crash on some cyclic dependency in my namespaces. The REPL does not always see those, because reloading a buffer with a namespace declaration works just fine when you loaded the imported namespaces before, even though they refer to the namespace being re-evaluated.
One thing I really find very nice when you're using CIDER is that everything (and I do mean everything) from Clojure is just a "jump to source" away. Most of the builtin functions seem to be written in Clojure itself. For example, if you want to know how map is implemented, you can just press M-. to see it.
Maps and keywords for everything
One thing you'll really notice is that in idiomatic Clojure code, maps are used for everything. A map is a functionally updateable hash table. It looks like this:
{:key-1 "value 1" :key-2 "value 2"}
This lends itself to a very dynamic style of programming, very much like you would in (dare I say it?) PHP. A bit of a strange comparison, but PHP also makes dealing with arrays (which double as maps in a weird way) extremely ergonomic. There, missing nested keys are automatically created on the fly and because of a strange quirk in its developmental history, arrays are the only objects which are passed by value. This means you can program in a referentially transparent way, while still mutating them inside functions at will. Not exactly the same mechanism, but the end effect on programming style feels very similar: you reach for them whenever you want to bunch some stuff together. It is the go-to data structure when you need flexibility.
In other Lisps you'd use alists (or plists, or SRFI-69 hash tables) for this, but they don't deal so well with nested maps and the library is not as convenient. For example, you can easily select, drop and rename keys in a map:
(-> {:key-1 "value 1" :key-2 "value 2"}
    (set/rename-keys {:key-1 :key})
    (dissoc :key-2)
    (assoc :foo "bar"))
=> {:key "value 1" :foo "bar"}
This -> notation took me a while to get used to by the way, and I'm still not entirely comfortable with it. I explained how it works above. It's a macro for "threading" expressions. In Scheme, you'd probably use a let* for this, or something. In Clojure that would look like this:
(let [map {:key-1 "value 1" :key-2 "value 2"}
      map (set/rename-keys map {:key-1 :key})
      map (dissoc map :key-2)
      map (assoc map :foo "bar")]
  map)
=> {:key "value 1" :foo "bar"}
As you can see, the version with -> is much more convenient and less repetitive. Unfortunately, it doesn't compose that well (duh, it's a macro), but because of the way the standard library is designed it is more useful than it would seem at first glance.
Anyway, the way maps are typically used everywhere in a project means that there's a lot less "structure" to your data structures. It is extremely convenient to use maps, even though there are also things like records and protocols. Because of their convenience, you'll end up using maps for everything. As I've noticed in my refactorings, when you change the structure of maps, a lot of code is going to break without a clear indication of where it went wrong.
This is made extra painful by "nil punning". For example, when you look up something in a map that doesn't exist, nil is returned. In Clojure, many operations (like first or rest) on nil just return nil instead of raising an error. So, when you think you are looking up something in a map, but the "map" is actually nil, it will not give an error, but it will return nil.
Now like I said, sometimes you may get an error on nil. It's a bit unclear which operations are nil-punning and which will give a proper error. So when you finally get a nil error, you will have a hell of a time trying to trace back where this nil got generated, as that may have been several function calls ago. This is an example where I really like the strictness of Scheme as compared to some other Lisps, as nil-punning is traditionally a dynamic Lisp thing; it's not unique to Clojure.
Multimethods with keywords
Initially, I was quite impressed by the way multimethods work; they're super simple and clean, yet powerful. First, you declare the multimethod and a "decision procedure", which returns a value that can be compared:
(defmulti say-hi :kind)

(defmethod say-hi :default [animal]
  (println (:name animal) "says hello"))

(defmethod say-hi :duck [animal]
  (println (:name animal) "says quack"))

(defmethod say-hi :dog [animal]
  (println (:name animal) "says woof"))

(say-hi {:name "Daffy" :kind :duck}) => "Daffy says quack"
(say-hi {:name "Pluto" :kind :dog}) => "Pluto says woof"
(say-hi {:name "Peter" :kind :human}) => "Peter says hello"
Using multimethods takes some care and taste, because it splits up your logic. So instead of having one place where you have decisions made with an if or cond tree, you have a function call and then depending on how the multimethod was defined, a different function will be called. This is basically what makes C++ so difficult to deal with in large projects: when people use function overloading, it can get really messy. You need to figure out which of the many things called "say-hi" is actually called in a situation, before you can dive into that implementation.
Compared to the insane amount of customizability that e.g. CLOS offers you, the design restraint shown in Clojure multimethods was nice to see, but then I realised this simplicity can be completely defeated by building hierarchies. That is, Clojure allows you to define a hierarchy on keywords. This was a huge wtf for me, because to me, keywords are just static entities that are unrelated to each other.
When you realise how Clojure keywords can be namespaced, it makes slightly more sense: this gives them some separation.
A keyword can appear in "bare" form like :foo. This is a globally scoped keyword that belongs to no particular code. It's definitely not smart to hang a hierarchy onto such a keyword, and you're also better off not adding any "meta attributes" to them.
The other form is ::foo, which puts the keyword in the current namespace; it is shorthand for :more-magic.net/foo if you are in the more-magic.net namespace.
Conclusion
All in all, Clojure is a well-designed language with neat features and it's certainly a lot better than most other JVM languages. There are things in it that I wish Scheme had, and it's certainly functional and modern. As a general programming language, I just can't get over the JVM and all its Java trappings, which is just not my cup of tea.
Apart from the JVM, there are some gratuitous departures from traditional Lisps, especially the "rich syntax" and the extreme reliance on and overloading of keywords and maps.
As always, such things are a matter of taste, so take my opinion with a large grain of salt.