
Recently, a user was asking on IRC why CHICKEN doesn't give proper line numbers for error call traces in the interpreter. This is indeed rather odd, because the compiler gives numbered traces just fine.

After a brief discussion, we figured that this was probably because line numbers are stored in a "database" (hash table) which maps the source expression to a line number. Because you can keep calling (eval) with fresh input, the number of expressions evaluated is potentially infinite.

This would lead to unbounded growth of the line number database, eventually eating up all available memory.

We'd like to fix this. A great solution for this problem would be weak references. A weak reference is a reference to a value that does not cause the garbage collector to hold on to that value. If other things still refer to the value, it is preserved and the weak reference is maintained. If only weak references refer to a value, it may be collected.

The line number database could then consist of such weak references to source expressions. If an expression is no longer held onto, it can be collected and the line number eventually removed. This would turn the database from a regular hash table into a weak hash table.
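To make that concrete, here is a hypothetical sketch of what a single entry and its cleanup could look like. It uses the weak-cons primitive and the #!bwp ("broken weak pointer") marker that are both introduced later in this post; the bwp-object? predicate and the helper functions are made-up names for illustration:

  ;; One database entry: the car (the source expression) is a weak
  ;; reference, the cdr (the line number) is a strong one.
  (define (make-line-entry expr line) (weak-cons expr line))

  ;; After a GC, entries whose expression was collected have #!bwp in
  ;; their car, so they can be pruned from their (strong) bucket list.
  (define (prune-bucket bucket)
    (cond ((null? bucket) '())
          ((bwp-object? (car (car bucket))) (prune-bucket (cdr bucket)))
          (else (cons (car bucket) (prune-bucket (cdr bucket))))))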

This requires hooking into the garbage collector so that it knows about these references and can drop them. Since our collector is a copying collector, addresses of objects change over time, and a weak reference needs to be updated if something is still hanging onto it, without itself causing the weakly held object to stick around.

A quick recap of garbage collection

To explain how we changed our garbage collector, I will start with a quick high-level recap of how garbage collection works. I'll explain just enough to cover what is needed to understand weak references. For a more in-depth look at CHICKEN's garbage collection, see my post on the GC.

First, we have to realise that CHICKEN stores all its values in a machine word. It distinguishes between "immediate" values and non-immediate or "block" values. Immediate values can be either fixnums (bit 0 is set) or other "small" values (bit 1 is set). Non-immediates are recognised because the lower two bits are zero, which is very convenient, as a pointer to a word-aligned address happens to have the lower two bits cleared as well!
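In C, those checks boil down to simple bit tests. A sketch, with illustrative macro names (not the ones CHICKEN's headers actually use):

  /* Illustrative only; CHICKEN's real tag-checking macros differ. */
  #define IS_FIXNUM(x)     ((x) & 1)           /* bit 0 set */
  #define IS_IMMEDIATE(x)  ((x) & 3)           /* bit 0 or bit 1 set */
  #define IS_BLOCK(x)      (((x) & 3) == 0)    /* word-aligned pointer */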

In other words, non-immediate values are simply C pointers, whereas the immediate values are encoded long integers. So, non-immediates are allocated in memory and represented by a pointer to them. At the pointer's address we find a header which encodes what exactly is stored there, and (optionally) the data, where we store the contents of compound objects.

  typedef struct
  {
    C_word header;
    C_word data[];    /* Variable-length array: header determines length */
  } C_SCHEME_BLOCK;

In typical Scheme objects, the data consists of slots. For example, the car and the cdr of a pair are slots, the elements of a vector are slots, etc. Each slot contains a Scheme value. In a graphical representation, a block object looks like this:

The data representation will be important for understanding how we implement weak pairs, and for following the garbage collection process.

If you want more details, see my post about data representation in CHICKEN.

A tale of two spaces

CHICKEN divides the heap into two halves. While the program is running, only one half is in use. When this half fills up, we start the garbage collection process. We trace all the data that's still considered "live" (aka GC "roots") and copy it over to the other half.

Let's look at an example:

  (cons 5 '())  ; unreferenced, may be collected
  (define x (cons (vector 42) '())) ; sticks around

The half that was full is called the fromspace and the half that starts out empty is called the tospace. Initially, the tospace is empty, and the data is only in the fromspace:

Then, we move the objects, one at a time, starting at the roots. So in the example, the pair that we named "x" gets moved first to tospace, while its slots still point into fromspace:

The changes are in yellow. After tracing the roots, we trace all the contents of the objects in fromspace and copy over the pointed-to objects:

Finally, we're done. Any remaining objects in fromspace are garbage and can be ignored, effectively "clearing" fromspace, and we flip the two spaces so that fromspace becomes tospace and vice versa:

Onward! I mean, forward!

This was a very simple case. A more complex case is when there are several objects with mutual references. We must somehow keep track of which objects got moved where, so that we don't copy the same object more than once.

Another example:

  (define shr (vector 42))
  (define x (cons shr 1))
  (define y (cons shr 2))

This would look like the following when visualized:

As I've explained, roots are copied first, so let's say shr gets copied first:

As you can see, this introduces something new. There's a forwarding pointer stored at the address where we originally found the header of the object known as shr. This is done so that we don't have to traverse all the objects pointing to shr (which could even be cyclic!). Instead, whenever we find an object that holds an address where there now is a forwarding pointer, we dereference the pointer and change the object to point to the target address.

So, if x is copied next, we get the following picture:

We actually *always* leave behind a forwarding pointer when an object is copied, because the GC does not "know" whether anything references the object. I skipped that in the initial examples to keep it simple. And finally, y can be copied:

Now you can see, the forwarding pointers are still there in fromspace, but nothing references them anymore. The GC is now done. Of course, fromspace and tospace will be flipped (but I'm not showing that here).

How weak pairs work in the GC

Let's make a few changes to our example:

  (define zero (weak-cons (vector 0) 0))  ;; new - car may be collected
  (weak-cons (vector 9) 9) ;; new - may be collected
  (define shr (vector 42))
  (define x (cons shr 1))
  (define y (weak-cons shr 2)) ;; changed to be weak
  (set! shr #f) ;; new - dropping root for shr
  (define z (weak-cons (vector #f) 3)) ;; new - car may be collected

Now, y is changed to be a weak pair holding shr in its car, while x is still a normal pair with the same object in its car. The shr variable is also cleared, so that only x holds onto shr's old value strongly. We've added a few more weak pairs while we're at it.

This weak-cons procedure does not exist in CHICKEN 5.3.0; this is what we'll be adding. It creates a pair whose car field is a weak reference. The cdr field is a regular reference. The reason for having the cdr be a regular reference is that this allows us to build up lists of items which may be collected. The "spine" of the list will remain intact, but the list entries will be cleared (see below).
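For example, assuming nothing else references the two vectors below, a garbage collection leaves the spine intact but clears the entries, replacing them with the special "broken weak pointer" value #!bwp that we'll meet below:

  (define items (weak-cons (vector 'a) (weak-cons (vector 'b) '())))
  ;; ...after a garbage collection:
  ;; (car items)       => #!bwp
  ;; (car (cdr items)) => #!bwp
  ;; the cdr chain (the spine) is still fully intact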

Let's take a look at our new initial situation:

The GC will start with the live objects, as usual. Let's start at the top, copying zero:

Since zero is a weak pair, we won't traverse the car slot, keeping the vector it points to in fromspace. This is intentional, as the GC treats weak pairs differently; only the cdr slot gets scanned, while the car gets skipped and left as a "dangling" pointer into fromspace.

We'll fix that later, as we don't want to keep this situation after the garbage collection process is done.

Normally we'd pick up shr next, but that is not a root anymore. Let's continue with x instead:

Then, as we scan over the contents of x, we encounter what used to be shr and move it, updating the pointer in x:

Next, we move y:

You'll notice that because it's a weak pair, its car slot does not get updated to the new location of what used to be shr, even though that vector has already been copied. As mentioned before, we'll fix this up later.

Finally, z gets moved:

I'm sure you've noticed that now we're left with a big spaghetti bowl of pointers, some of which are still pointing into fromspace. So let's investigate how we can fix this mess.

Fixing up the dangling weak pairs, naively

It just so happens that CHICKEN internally already has support for weak pairs: pairs whose car field holds an object that may be collected. These are used solely in the symbol table, so that unreferenced symbols aren't maintained indefinitely. This is important because it stops symbol table stuffing attacks, a form of denial of service (DoS) that forces the system to eat up all available memory.

In this current implementation, the only weak pairs in existence are those in the symbol table's hash bucket chains. So, the solution there is very simple: traverse all the symbol table buckets and update weak pointers.

So we traverse the slots as we would during the GC's normal mark operation, but with two exceptions:

  • When we mark the slot, we don't cause the contents to be copied over into tospace, but only chase any forwarding pointers.
  • When we encounter a pointer into fromspace, we replace the slot with a special "broken weak pair" marker (see the sketch after this list).
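In C, that per-slot fixup could look roughly like this. This is only a sketch with made-up helper names, not CHICKEN's actual runtime code:

  /* Sketch: fix up one weak car slot after the copying phase. */
  static C_word update_weak_slot(C_word slot)
  {
    if(is_block_value(slot) && in_fromspace(slot)) {
      C_word h = ((C_SCHEME_BLOCK *)slot)->header;
      if(is_forwarding_pointer(h))
        return forwarding_target(h);  /* survived: chase the pointer */
      else
        return BROKEN_WEAK_POINTER;   /* dead: becomes #!bwp */
    }
    return slot;  /* immediate, or already pointing into tospace */
  }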

That would yield this final picture:

As you can see, all the weak pairs have their car slots updated, even the ones that we could consider as garbage, like the two at the top. Those get the special value #!bwp, or "broken weak pointer" to indicate that the value originally stored there is no longer live.

Smarter fixing up of weak pairs

Traversing all the weak pairs is fine when we know every single one in the system. When only hash table buckets in the symbol table may be weak, that's the case. But now that we want to expose weak pairs to the user, what do we do?

The simplest solution would be to add a table or list of all the weak pairs as they're constructed, or as the GC encounters them. This has a few problems:

  • It is wasteful of memory, as the extra pointers in this table would be redundant.
  • If done during garbage collection, we'll have to dynamically allocate during a GC, when memory is already tight (and it's slow).
  • We'd end up traversing potentially loads and loads of dead weak pairs that would themselves be collected, as in the example above.

Alternatively, you could traverse the entire heap again after GC and mark the weak pairs. In fact, that's what Chez Scheme does. However, Chez has separate heaps for different types of objects (known as a BIBOP scheme). This has the advantage of traversing only the live weak pairs, instead of every single live object.

In CHICKEN, we'd be traversing every single object, as we store everything on a single heap. If the heap is large, this would be wasteful. But at least we wouldn't have to traverse dead weak pairs!

When looking into this problem, I decided to study the Schemes which have weak pointers. Those were Chez Scheme, Chibi Scheme and MIT Scheme. Of those, only MIT Scheme has a Cheney-style copying garbage collector with a single heap like CHICKEN does. The sources to MIT Scheme are well-documented, so I quickly found how they do it, and boy is it clever!

MIT Scheme's solution:

  • Traverses only the still-live weak pairs,
  • introduces only a single word of memory overhead,
  • and is brilliantly simple once you get how it works.

Garbage collecting weak pairs, MIT style

Let's take a closer look at the end state of that last GC run. Remember that a forwarding pointer overwrites the original header of the object that used to live there. But the memory of the slots is still sitting there, allocated but unused!

The great idea in MIT Scheme is that we can "recycle" that unused memory and store a linked list of still-live weak pairs in that space. The extra word of overhead I mentioned before acts as the head of that list. Let's see how that works during the GC, by replaying that last run, but with MIT Scheme's algorithm for tracking weak pairs:

The initial state is the same, except we have the additional weak pair chain head. The GC again starts at the top, copying zero, same as before, but with one addition. When the forwarding pointer gets created, we notice that it's a weak pair. This causes the weak pair chain head to get modified so it points to the new forwarding pointer. The original value of the chain head (the empty list marker) gets stored where the old weak pair's car slot used to be.

Again, we continue with x, which is a regular pair, so nothing interesting happens to the space behind the forwarding pointer:

Then, as we scan over the contents of x, we again encounter what used to be shr and move it, updating the pointer in x. The forwarding pointer for shr is also not put in the chain because shr is not a weak pair:

Next, we move y. Because it is a weak pair, we again link up its forwarding pointer into the chain that we're building up. This means the pointer in the weak chain head (which pointed to zero's original address) gets put in the old car field of y, and the head itself now points to y's forwarding pointer:

Finally, z gets moved. Since it is also a weak pair, when making the forwarding pointer we also link it up into the chain:

With the copying done, we only need to traverse the weak pairs by following the trail of pointers starting at the head of the live weak pair chain (there's a C sketch of this walk after the steps). For each of the forwarding pointers in the chain:

  • Follow the forwarding pointer to its corresponding live weak pair,
  • take that pair's car,
  • and if that car itself refers to a forwarding pointer, update the car to the forwarding pointer's destination.
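In C, that walk could look roughly like this (again a sketch with made-up helper names). The clever bit is that the next link of the chain is read from the stale car cell sitting right behind each forwarding pointer:

  /* Sketch: walk the chain of live weak pairs after copying is done. */
  C_word fp = weak_chain_head;
  while(fp != END_OF_CHAIN) {
    C_word pair = forwarding_target(fp);  /* the live pair in tospace */
    C_word next = stale_car_cell(fp);     /* next link, stored in the
                                             old car cell behind fp */
    C_word car_ = car_slot(pair);
    if(is_block_value(car_) && in_fromspace(car_)) {
      C_word h = ((C_SCHEME_BLOCK *)car_)->header;
      if(is_forwarding_pointer(h))
        set_car_slot(pair, forwarding_target(h));  /* survivor: update */
      else
        set_car_slot(pair, BROKEN_WEAK_POINTER);   /* dead: #!bwp */
    }
    fp = next;
  }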

I'm not going to do this one by one as it's pretty obvious. I will show you the result of this walk:

You can see that we've only updated three out of four weak pairs - exactly the number of weak pairs that got copied during this GC and are still considered live. Because only weak pairs that exist in tospace have to be updated, we are doing the absolute minimum work necessary.

The nice thing about this is that even if you produce lots of garbage weak pairs, they don't have any additional impact on GC performance. Only those weak pairs that survive garbage collection will be scanned and updated.

When implementing this improvement I noticed that our performance on benchmarks improved a little bit, even though they don't use weak pairs directly. This was all due to only scanning copied weak pairs and not having to scan the entire symbol table on every GC.

With the support for user-facing weak pairs in place, supporting line numbers in the interpreter was relatively easy. This means that debugging your code will be easier starting with the upcoming 5.4 release of CHICKEN!

Recently I joined bevuta IT, where I am now working on a big project written in Clojure. I'm very fortunate to be working in a Lisp for my day job!

As I've mostly worked with Scheme and have used other Lisps here and there, I would like to share my perspective on the language.

Overall design

At first glance, it is pretty clear that Clojure has been designed from scratch by (mostly) one person who is experienced both with Lisps and with language design. It is quite clean and has a clear vision. Most of the standard library has a very consistent API. It's also nice that it's a Lisp-1, which obviously appeals to me as a Schemer.

My favourite aspect of the language is that everything is designed with a functional-first mindset. This means I can program in the same functional style as I tend to do in Scheme. Actually, it's even more functional, because for example its maps (what would be hash tables in Scheme) are much less clunky to deal with. In Scheme, SRFI-69 hash tables are quite imperative, with hash-table-set! and hash-table-update! being the ways to insert new entries, which of course mutate the existing object. Similarly, Clojure vectors can easily be extended (on either end!) functionally.
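For example, all of these return a new collection and leave the original untouched:

(assoc {:a 1} :b 2) => {:a 1, :b 2}
(update {:a 1} :a inc) => {:a 2}
(conj [1 2 3] 4) => [1 2 3 4]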

The underlying design of Clojure's data structures must be different. It needs to efficiently support functional updates; you don't want to fully copy a hash table or vector whenever you add a new entry. I am not sure how efficient everything is, because the system I'm working on isn't in production yet. A quick look at the code implies that various data structures are used under the hood for what looks like one data structure in the language. That's a lot of complexity! I'm not sure that's a tradeoff I'd be happy to make. It makes it harder to reason about performance. You might just be using a completely different underlying data structure than expected, depending on which operations you've performed.

(non) Lispiness

To a seasoned Lisp or Scheme programmer, Clojure can appear positively bizarre. For example, while there is a cons function, there are no cons cells, and car and cdr don't exist. Instead, it has first and rest, which are definitely saner names for a language designed from scratch. It has "persistent lists", which are immutable lists, but in most day to day programming you will not even be using lists, as weird as that sounds!
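A small demonstration:

(first [1 2 3]) => 1
(rest [1 2 3]) => (2 3)
(cons 0 [1 2 3]) => (0 1 2 3) ; works, but returns a seq, not a pair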

Symbols and keywords

One thing that really surprised me is that symbols are not interned. This means that two symbols constructed on the fly, or even read in the same REPL session, are not identical (as in eq or eq?) to one another:

user> (= 'foo 'foo)
true
user> (identical? 'foo 'foo)
false

Keywords seem to fulfil most "symbolic programming" use cases. For example, they're almost always used as "keys" in maps or when specifying options for functions. Keywords are interned:

user> (= :foo :foo)
true
user> (identical? :foo :foo)
true

Code is still (mostly) expressed as lists of symbols, though. When you're writing macros you'll deal with them a lot. But in "regular" code you will deal more with keywords, maps and vectors than lists and symbols.

Numeric tower

A favorite gotcha of mine is that integers are not automatically promoted to bignums like in most Lisps that support bignums. If you need bignums, you have to use special-purpose operators like +' and -':

user> (* (bit-shift-left 1 62) 2)
Execution error (ArithmeticException) at user/eval51159 (REPL:263).
integer overflow
user> (*' (bit-shift-left 1 62) 2)
9223372036854775808N

user> (* (bit-shift-left 1 62) 2N) ; regular * supports BigInt inputs, though
9223372036854775808N
user> (* 1N 1) ; but small BigInts aren't normalized to Java Longs
1N

This could lead to better performance, at the cost of more headaches when accidental large numbers show up in code that was not prepared for them.

What about rationals, you ask? Well, those are just treated as "the unusual, slow case". So even though they do normalize to regular integers when simplifying, operations on those always return BigInts:

user> (+ 1/2 1/4)
3/4
user> (+ 1/2 1/2)
1N
user> (/ 1 2) ; division is the odd one out
1/2
user> (/ 4 2) ; it doesn't just punt and always produce bignums, either:
2

The sad part is, bitwise operators do not support bignums, at all:

user> (bit-shift-right 9223372036854775808N 62)
Execution error (IllegalArgumentException) at user/eval51167 (REPL:273).
bit operation not supported for: class clojure.lang.BigInt
user> (bit-shift-right' 9223372036854775808N 62) ; does not exist
Syntax error compiling at (*cider-repl test:localhost:46543(clj)*:276:7).
Unable to resolve symbol: bit-shift-right' in this context

There's one benefit to all of this: if you know the types of something going into numeric operators, you will typically know the type that comes out, because there is no automatic coercion. Like I mentioned, this may provide a performance benefit, but it also simplifies reasoning about types. Unfortunately, this does not work as well as you would hope because division may change the type, depending on whether the result divides cleanly or not.

Syntax

For many Lispers, this is the elephant in the room. Clojure certainly qualifies as a Lisp, but it is much heavier on syntax than most other Lisps. Let's look at a small contrived example:

(let [foo-value (+ 1 2)
      bar-value (* 3 4)]
  {:foo foo-value
   :bar bar-value})

This is a let just like in Common Lisp or Scheme. The bindings are put inside square brackets, which is literal syntax for vectors. Inside this vector, names and values are interleaved, like in a Common Lisp property list.

The lack of extra sets of "grouping" parentheses is a bit jarring at first, but you get used to it rather quickly. I still mess up occasionally when I accidentally get an odd number of entries in a binding vector. Now, the {:foo foo-value :bar bar-value} syntax is a map, which acts like a hash table (more on that below).

There doesn't seem to be a good rationale for why vectors are used here instead of regular lists, though. What I do really like is that all the binding forms (even function signatures!) support destructuring. The syntax for destructuring maps is a bit ugly, but having it available is super convenient.
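For instance, map destructuring works right in a function signature (the names here are just for illustration):

(defn describe [{:keys [name kind]}]
  (str name " is a " kind))

(describe {:name "Daffy" :kind "duck"}) => "Daffy is a duck"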

What I regard as a design mistake is the fact that Clojure allows for optional commas in lists and function calls. Commas are just whitespace to the reader. For example:

(= [1, 2, 3, 4] [1 2 3 4]) => true
(= '(1, 2, 3, 4) '(1 2 3 4)) => true
(= {:foo 1, :bar 2, :qux 3} {:foo 1 :bar 2 :qux 3}) => true
(= (foo 1, 2, 3, 4) (foo 1 2 3 4)) => true
;; A bit silly:
(= [,,,,,,1,,,2,3,4,,,,,,] [1 2 3 4]) => true

Maybe this is to make up for removing the extra grouping parentheses in let, cond and map literal syntax? With commas you can add back some clarity about which items belong together. Hardly anybody uses commas in real code, though. And since they're optional, you can't rely on them anyway.

This has an annoying ripple effect on quasiquotation. Due to this decision, a different character has to be used for unquote, because the comma was already taken:

`(1 2 ~(+ 1 2)) => (1 2 3)
`(1 2 ~@(list 3 4)) => (1 2 3 4)

This might seem like a small issue, but it is an unnecessary and stupid distraction.

Minimalism

One of the main reasons I enjoy Scheme so much is its goal of minimalism, achieved through elegant building blocks. This is embodied by the Prime Clingerism:

  Programming languages should be designed not by piling feature on
  top of feature, but by removing the weaknesses and restrictions
  that make additional features appear necessary.

Let's check the size of the clojure.core library. It clocks in at 640 identifiers (v1.10.1), which is a lot more than R5RS Scheme's 218 identifiers. It's not an entirely fair comparison as Scheme without SRFI-1 or SRFI-43 or an FFI has much less functionality as well. Therefore, I think Clojure's core library is fairly small but not exactly an exercise in minimalism.

Clojure reduces its API size considerably by having a "sequence abstraction". This is similar to Common Lisp's sequences: you can call map, filter or count on any sequence-type object: lists, vectors, strings and even maps (which are treated as key/value pairs). However, it is less hacky than in Common Lisp because, for example, with map you don't need to specify which kind of sequence you want to get back. I get the impression that in Common Lisp this abstraction is not very prominent or used often, but in Clojure everything uses sequences. What I also liked is that sequences can be lazy, which removes the need for special operators as well.
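A few examples of the sequence abstraction in action:

(map inc [1 2 3]) => (2 3 4)      ; a vector in, a seq out
(first {:a 1 :b 2}) => [:a 1]     ; maps seq as key/value pairs
(count "hello") => 5              ; strings are sequences too
(take 3 (map #(* % %) (range))) => (0 1 4) ; lazy and potentially infinite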

If you compare this to Scheme, you have special-purpose procedures for every concrete type: length, vector-length, string-length etc. And there's no vector-map in the standard; for that you need SRFI 43. Lazy lists are a separate type with their own set of specialized operators. And so on and so forth. Using concrete types everywhere makes for less abstract and confusing code, and the performance characteristics of an algorithm tend to be clearer, but it also leads to a massive growth in library size.

After a while I really started noticing mistakes that make additional features appear necessary. For example, there's a special macro called loop for writing tail recursive loops, which uses recur to call back into the loop. In Scheme, you would do that with a named let, where you can choose your own identifier to recur on. Because the identifier is hardcoded in Clojure, it's also impossible to recur to an outer loop from a nested one. So this called for adding another feature, which is currently in proposal. Speaking of recur, it is also used for tail recursive self-calls. It relies on the programmer rather than the compiler to mark calls as tail recursive. I find this a bit of a cop-out, especially in a language that is so heavily functional, and especially since it doesn't work for mutually tail-recursive functions. The official way to do those is even more of a crutch.
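To illustrate, here's a simple sum with loop and recur, with the Scheme named-let equivalent in a comment for comparison:

(loop [i 0, acc 0]
  (if (= i 10)
    acc
    (recur (inc i) (+ acc i)))) => 45

;; The Scheme equivalent, where the name "sum" is yours to choose:
;; (let sum ((i 0) (acc 0))
;;   (if (= i 10) acc (sum (+ i 1) (+ acc i))))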

I find the special syntax for one-off lambdas #(foo %) just as misguided as SRFI 26 (cut and cute). You often end up needing to tweak the code in such a way that you have to transform the lambda to a proper fn. And just like cut, it doesn't save that many characters anyway and makes the code less readable.
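For example, these two are equivalent, but only the fn version can grow an extra binding or a second expression without being rewritten:

(map #(* % %) [1 2 3]) => (1 4 9)
(map (fn [x] (* x x)) [1 2 3]) => (1 4 9)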

The -> macro is a clever hack which allows you to "thread" values through expressions. It implicitly adds the value as the first argument to the first form, the result of that form as the first argument of the next, etc. Because the core library is quite well-designed, this works 90% of the time. Then the other 10% of the time you need ->>, which does the same but adds the implicit argument at the end of each form. And that's not always enough either, so they decided to add a generic version called as->, which binds the value to a name so you can put it at any place in the forms. These macros also don't compose well. For example, sometimes you need a let in a -> chain to have a temporary binding. That doesn't work, because you can't randomly insert forms into let, so you have to split things up again.
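A quick illustration of all three:

(-> 5 (+ 1) (* 2)) => 12            ; expands to (* (+ 5 1) 2)
(->> [1 2 3] (map inc) (reduce +)) => 9
(as-> 5 x (+ x 1) (/ 12 x)) => 2    ; x can sit anywhere in each form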

And as I note below, the minimalism is kind of "fake" because some essentials simply aren't provided; you have to rely on Java for that.

Java integration

Clojure was originally designed as a "hosted language", so it leverages the JVM. It does this admirably well; Java classes can be seamlessly invoked through Clojure, without any ceremony:

user> (java.util.UUID/randomUUID)
#uuid "bb788bae-5099-4a64-9c37-f6219d40a47f"

;; alternatively:
user> (import 'java.util.UUID)
java.util.UUID
user> (UUID/randomUUID)
#uuid "0bfd2092-14e1-4b88-a465-18698943ea4e"

The downside is that the above is the way to generate a random UUID. So even though uuids have literal syntax in Clojure (as #uuid "..."), there is no Lispy API for them in the Clojure standard library. This can be pretty frustrating, especially in the beginning. There's no clear indication where to look; sometimes you'll be poring over Java language docs for random stuff you thought would have a Clojure interface (like, say, creating temporary files or dealing with byte arrays). At those moments, you're basically programming Java with parentheses.

Having said that, there will often be community-provided nicer APIs for many of those things, but then you need to decide whether a slightly nicer interface is worth an extra dependency.

Development style

REPL-driven development

Speaking of Java, one thing that constantly bothers me is the slow startup times of the REPL. In my current project, it takes almost 30 seconds to boot up a development REPL. Half a minute!

Luckily, there's great Slime-like Emacs integration with CIDER. Basically, the only sane way to do iterative development is to connect to a REPL as the first thing you do, and then send your code to it all the time.

Now, this may sound weird from a Scheme programmer, but I never fully bought into the REPL style of developing. Sure, I experiment all the time in the REPL to try out a new API design or to quickly iterate on some function I'm writing. But my general development style tends more towards the "save and then run the test suite from an xterm". Relying solely on the REPL just "feels" jarring to me. I also constantly run into issues where re-evaluating a buffer doesn't get rid of global state that was built up on a previous run. When this happens, I'm testing an old version of some function without realising it. Keeping track of the "live" state versus the textual code I'm looking at is a total mind fuck for me. I don't understand how others can do this.

Another thing I seem to constantly do is write some code, see the tests go all green, only to have CI crash on some cyclic dependency in my namespaces. The REPL does not always catch those, because reloading a buffer with a namespace declaration works just fine if the imported namespaces were loaded earlier, even when they refer back to the namespace being re-evaluated.

One thing I really find very nice when you're using CIDER is that everything (and I do mean everything) from Clojure is just a "jump to source" away. Most of the builtin functions seem to be written in Clojure itself. For example, if you want to know how map is implemented, you can just press M-. to see it.

Maps and keywords for everything

One thing you'll really notice is that in idiomatic Clojure code, maps are used for everything. A map is a functionally updateable hash table. It looks like this:

{:key-1 "value 1"
 :key-2 "value 2"}

This lends itself to a very dynamic style of programming, very much like you would use in (dare I say it?) PHP. A bit of a strange comparison, but PHP also makes dealing with arrays (which double as maps in a weird way) extremely ergonomic. There, missing nested keys are automatically created on the fly, and because of a strange quirk in its developmental history, arrays are the only objects which are passed by value. This means you can program in a referentially transparent way, while still mutating them inside functions at will. Not exactly the same mechanism, but the end effect on programming style feels very similar: you reach for them whenever you want to bunch some stuff together. It is the go-to data structure when you need flexibility.

In other Lisps you'd use alists (or plists, or SRFI-69 hash tables) for this, but they don't deal so well with nested maps and the library is not as convenient. For example, you can easily select, drop and rename keys in a map:

(-> {:key-1 "value 1" :key-2 "value 2"}
    (set/rename-keys {:key-1 :key})
    (dissoc :key-2)
    (assoc :foo "bar")) => {:key "value 1" :foo "bar"}

This -> notation took me a while to get used to, by the way, and I'm still not entirely comfortable with it. I explained how it works above: it's a macro for "threading" expressions. In Scheme, you'd probably use a let* for this, or something. In Clojure that would look like this:

(let [map {:key-1 "value 1" :key-2 "value 2"}
      map (set/rename-keys map {:key-1 :key})
      map (dissoc map :key-2)
      map (assoc map :foo "bar")]
  map) => {:key "value 1" :foo "bar"}

As you can see, the version with -> is much more convenient and less repetitive. Unfortunately, it doesn't compose that well (duh, it's a macro), but because of the way the standard library is designed it is more useful than it would seem at first glance.

Anyway, the way maps are typically used everywhere in a project means that there's a lot less "structure" to your data structures. It is extremely convenient to use maps, even though there are also things like records and protocols. Because of their convenience, you'll end up using maps for everything. As I've noticed in my refactorings, when you change the structure of maps, a lot of code is going to break without a clear indication of where it went wrong.

This is made extra painful by "nil punning". For example, when you look up something in a map that doesn't exist, nil is returned. In Clojure, many operations (like first or rest) on nil just return nil instead of raising an error. So, when you think you are looking up something in a map, but the "map" is actually nil, it will not give an error, but it will return nil.
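For example:

user> (:street (:address nil)) ; both lookups quietly return nil
nil
user> (first nil)
nil
user> (+ 1 nil) ; arithmetic, on the other hand, does throw
Execution error (NullPointerException) at user/eval1 (REPL:1).
null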

Now, like I said, some operations do raise an error on nil. It's a bit unclear which operations are nil-punning and which will give a proper error. So when you finally do get a nil error, you will have a hell of a time trying to trace back where the nil got generated, as that may have been several function calls ago. This is an example where I really like the strictness of Scheme as compared to some other Lisps; nil-punning is traditionally a dynamic Lisp thing, and not unique to Clojure.

Multimethods with keywords

Initially, I was quite impressed by the way multimethods work; they're super simple and clean, yet powerful. First, you declare the multimethod and a "decision procedure", which returns a value that can be compared:

(defmulti say-hi :kind)

(defmethod say-hi :default [animal]
  (println (:name animal) "says hello"))

(defmethod say-hi :duck [animal]
  (println (:name animal) "says quack"))

(defmethod say-hi :dog [animal]
  (println (:name animal) "says woof"))

(say-hi {:name "Daffy" :kind :duck})  => "Daffy says quack"
(say-hi {:name "Pluto" :kind :dog})   => "Pluto says woof"
(say-hi {:name "Peter" :kind :human}) => "Peter says hello"

Using multimethods takes some care and taste, because it splits up your logic. So instead of having one place where you have decisions made with an if or cond tree, you have a function call and then depending on how the multimethod was defined, a different function will be called. This is basically what makes C++ so difficult to deal with in large projects: when people use function overloading, it can get really messy. You need to figure out which of the many things called "say-hi" is actually called in a situation, before you can dive into that implementation.

Compared to the insane amount of customizability that e.g. CLOS offers you, the design restraint shown in Clojure multimethods was nice to see, but then I realised this simplicity can be completely defeated by building hierarchies. That is, Clojure allows you to define a hierarchy on keywords. This was a huge wtf for me, because to me, keywords are just static entities that are unrelated to each other.

When you realise how Clojure keywords can be namespaced, it makes slightly more sense: this gives them some separation.

A keyword can appear in "bare" form like :foo. This is a globally scoped keyword that belongs to no particular code. It's definitely not smart to hang a hierarchy onto such a keyword, and you're also better off not adding any "meta attributes" to them.

The other form is ::foo, which puts the keyword in the current namespace: it is shorthand for :more-magic.net/foo if you are in the more-magic.net namespace.
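With namespaced keywords, building and querying a hierarchy looks like this; a multimethod dispatching on ::bird would then also match ::duck:

user> (derive ::duck ::bird)
nil
user> (isa? ::duck ::bird)
true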

Conclusion

All in all, Clojure is a well-designed language with neat features, and it's certainly a lot better than most other JVM languages. There are things in it that I wish Scheme had, and it's certainly functional and modern. As a general programming language, though, I just can't get over the JVM and all its Java trappings; that's simply not my cup of tea.

Apart from the JVM, there are some gratuitous departures from traditional Lisps, especially the "rich syntax" and the extreme reliance on (and overloading of) keywords and maps.

As always, such things are a matter of taste, so take my opinion with a large grain of salt.


As you may know, I co-maintain the uri-generic egg together with Ivan Raikov. We had just been working on fixing a bug and porting it to CHICKEN 5 when I stumbled across the WHATWG URL specification, an evolution of RFC 3986. I found it hard to believe they dropped the formal grammar from the RFC, so I checked the issue queue and found a closed ticket from 2015.

They replaced the BNF with a series of steps which is several pages long and overly concerned with implementation-specific details.

It really got to me that such an important and basic part of the web stack is so informally specified. So I wrote an appeal to them to restore a formal grammar in this ticket. I think the reasons are worth being spread more widely, so I'm reproducing it here on my blog.

My request

I would like to offer my opinion from an implementor's perspective and hopefully convince the WG to restore a formal grammar. Let me start by providing some background on where I'm coming from. Feel free to skip this next section.

My background

I am the co-maintainer of the uri-generic egg for CHICKEN Scheme. This implementation attempts to follow RFC 3986 to the letter, and this has resulted in what IMO is a very high-quality implementation (at least as far as parsing is concerned; URL construction still has some known issues). Oftentimes when we ran into issues, we compared it with other implementations. It turns out that many of these are lacking in some way or another. I think the main reason is that they're not attempting to really implement the formal grammar (even if they claim to be RFC compliant), while we do. We even have a growing repository of alternative implementations using different parser generators, which all pass the same test suite! (Feel free to now call me a smug Lisp/Scheme weenie :) )

I wasn't aware of the WHATWG spec until I saw it mentioned in a libcurl post. It piqued my interest because I'm always looking for more test cases. The web platform test suite looks like a big, juicy set to start using in our egg's tests. I'd also consider implementing the WHATWG spec if this increases compatibility with other implementations.

What I expect from a spec

As an implementor, I routinely check the RFC's ABNF as a guide to determine what a valid URL should look like. If someone finds a certain URL our implementation doesn't parse, or if it parses an URL that it shouldn't, the first thing I do is go back to the ABNF in the RFC to verify the behaviour. It is compact, to the point and, for a trained eye, it is trivial to quickly determine if a parser should accept a given (sub)string or not.
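For example, here are the first few rules of RFC 3986's collected ABNF:

  URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

  hier-part   = "//" authority path-abempty
              / path-absolute
              / path-rootless
              / path-empty

  scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )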

The collected ABNF of RFC 3986 is a brief three screenfuls. In contrast, the algorithm in the WHATWG spec is roughly eighteen screenfuls. It is an overly detailed and nonstandard way of defining a grammar. This makes it harder to determine which language is accepted by this algorithm. It also makes it hard for me to determine what the changes are compared to the RFC. Implementing the WHATWG spec would (for me) involve a complete rewrite.

The specification is so focused on the mechanics of a specific manual parsing technique that it almost precludes parser generators or other implementations. Parser generators have a long tradition in theory and practice, and can generate efficient language recognisers. Even today, it is an active research field; PEG grammars for example have been "discovered" as recently as 2004.

The way I think about it is that the purpose of this spec is to define what a URL "officially" looks like. So, as an implementor, I don't understand the hesitation to supply a formal grammar. Not having one will likely result in different people interpreting the spec differently. This results in _less_ interoperability, which defeats the point of a spec.

Other reasons why I think a formal grammar is important

Finally, I would like to emphasise the importance of parsers based on formal grammars over ad hoc ones for security reasons. Let's say you have a pipeline of multiple processors which use different URL parsers. For example, you might have an HTML parser on a comment form which cleans URLs by dropping JavaScript and data URLs, among other things, or a mail client which blocks intranet or file system-local URLs before invoking an HTML viewer. If these are all ad hoc "informal" parsers that try to "fix" syntactically invalid URLs, it is nigh-impossible to verify that filtering them for "safe" URLs is correct. That's because it's impossible to decide which language is really accepted by an ad hoc implementation. An implementation further down the stack might interpret a URL (radically) differently from one up the stack, and you have a nice little exploit in the making.

If you're not convinced by my measly attempts at explaining this idea, please watch the talk "The Science of Insecurity". Meredith Patterson states the case much more eloquently than I ever could. This talk was an absolute eye-opener for me.

With this context, it baffled me to read the statement that "there are several large parts of the spec that cannot be captured by any kind of grammar". This is literally equivalent to saying "we can't know if an URL will be valid without evaluating the algorithm". This means you cheerfully drag the halting problem into what should be a simple, straightforward notation (come on, URLs aren't that ill-defined!). As far as I can tell, the RFC defines a regular grammar. The decision to go from a regular to an unrestricted grammar should not be taken lightly!

