Today I want to talk to you about the Carp compiler. After my last blog post spent some time on the front page of various internet forums and incited healthy and not so healthy discussions, my inbox was filled with requests for clarification. In general, people seem to be fairly interested in how the compiler works; miscommunication—possibly on my part—led to a confused picture of how it functions in some corners of the internet, so I would like to give you some general information about it.
A quick disclaimer before we start: I’m not the original author of the language, nor am I its principal maintainer. The mere fact that people read my blog post might have made them jump to conclusions, but I do not want to usurp anyone else’s work, and so here is a shout-out to the brilliant and kind-hearted Swede Erik Svedäng, whom you can usually find on the Gitter channel, which is our primary means of communication these days. He is the author of Carp and the person I bug with my questions. If you are interested in learning more about Carp or just want to hang around and chat, I suggest you go there.
Also, if this is too much for you, that’s alright! I plan on writing more byte-sized updates from now on, but first I feel the need to get the overview out of the way. Everything is moving relatively quickly in Carpland at the moment, and what you’re about to read will probably require an update in the near future anyway.
Now for the main event: let’s talk about the compiler, from front to back.
The parser
The first stop in any compiler is the parser; Lisp parsers are
generally straightforward, and I don’t want to spend too much time on
ours. Because Haskell is the implementation language of
the compiler, we use Parsec as our primary parser framework. This means
that we don’t go the more traditional route of using a
Bison/Yacc-compatible BNF grammar but rather use a parser combinator
framework. I generally prefer parser combinators, although admittedly
BNF grammars make for a nice specification of a language. Parsec spits
out an enriched AST—the regular AST is called Obj, the
extended version XObj—that stores a little bit of
additional information, such as where the node is coming from (line,
column, and file information), memory management info that we will talk
about later, and a unique identifier that we will need in the
emitter.
The only thing that I consider of note here is that the reader macros
@ and &, which you can read about in my
previous blog post, are defined in the parser. Carp does not have an
extensible reader macro system and is thus closer to Clojure than to
Common Lisp.
The Eval module
This is where it gets a little icky. The single largest module in the
Carp codebase is the Eval module. It is around a thousand
lines of code, which isn’t terrible. It’s pretty long for a Haskell
module, but it’s not the largest I’ve seen. What’s bad about it is that
it does a lot of things and has no clear boundaries around its goals. A
nice way to put it would be that it’s “the heart of the compiler”, but
I’d argue that it’s just in dire need of a refactor.
Once invoked, it will perform a fold over the parsed source code in
AST form to generate and execute Commands. Commands are the concept that
makes Carp’s REPL so different1, more of a
view into the compiler than an interactive expression evaluator. There
are two commands that will be generated at this time:
ReplEval and ListOfCallbacks.
ReplEval is where it’s at, so I will start with
ListOfCallbacks. It is a simple data structure that holds a
list of commands that will be performed in sequence, enabling us to
chain shortcuts. You can see that in action when typing, for instance,
:bx at the REPL, which is a shorthand for writing
:b and :x in sequence, or (build)
and (run). It is thus just a convenient wrapper for us at
the REPL.
As you might have guessed at this point, this leaves
ReplEval with the thankless job of doing literally
everything else. It will be passed into the eval
function, which performs some of the things that you might expect an
interpreter to do, but some are suspiciously absent—this is just the
first pass after all, and we are not trying to interpret the source code
so much as to make it palatable for our backend.
ev(a|i)l
I’m not going to cover all the things that the evaluator is doing,
but I’ll try to give you a brief overview of a function-abomination that
is 450 lines of code long, and spends most of its time in various
case statements. That’s not quite as horrifying as
CPython’s infamous switch, but it’s bad enough for me to
rant—sorry Erik, but I know you feel similarly.
Disclaimer: this section is fairly mean to Carp. I love the language, but I try to be open about its current imperfect state. This is not a marketing campaign.
First, the evaluator will check what general flavour of
Obj it has received. If it’s a list, it will call the local
function evalList, which does most of the work. The other
two flavors it cares about are symbols and arrays, which get their own
special functions; if it receives something else, it will just parrot it
back to the caller; this is one of the REPL’s “features” that will
confuse you the first time you run it, and we should probably do
something about it.
If we have a symbol in front of us, evalSymbol will try
to look it up, and if it fails give you an error. If it succeeds, it
will return the value, but the REPL prints… nothing. Let’s try not to
think about that one too much. It makes sense if not used on the
top-level, where we actually might need bindings, for example in dynamic
functions—we’ll talk about those in the section about macro
evaluation.
Arrays are weird as well, at least on the top-level. They are
homogeneous by design, because, you know, they’re C arrays. If you work
on the REPL, however, the type checker will not take effect, so we can
do things such as writing [1 "hi" \2], and Carp won’t think
twice about it. That’s all well and good, except it really isn’t in the
REPL sense. For macros, again, this makes sense, as we will see
later.
Alright, I’ve put it off for long enough: it is time to talk about
evalList.
It uses two more data structures that we’ll need to cover to
understand much of what follows. The first is the environment
Env and the second the similar, but not quite the same,
type environment TypeEnv. The type environment registers
and saves types and is needed for type inference and checking. The
regular environment is used for values and functions.
Let’s quickly go through a few of the behaviours of the function,
just to give you an intuition for how it operates: if
evalList encounters a function or value definition, it will
expand macros in the form, try to infer its type, save the definition,
and move on. If it encounters a type definition, it will save this type
in its type environment and generate getters, setters, updaters,
str, and copy for this type and register those
as well. If it encounters external type or function definitions that are
implemented in C and exposed through register or
register-type, it will do the same as for regular values,
but take your definitions at face value and not try to check
anything2.
This should give you a little insight into how this function generally works; it is, however, also responsible for macro expansion and activating the type checker as needed. Let’s talk about both of those before entering the land of C.
Macro evaluation
First off the bad news: Carp’s macro system is not currently
hygienic3. It also does not have any
gensym capabilities yet. Generally, it is a system of its
own, closer to Clojure than anything else, but with a few twists and
embellishments that aren’t found anywhere else, and some things that are
still sorely lacking.
There are two general concepts that macro evaluation has to deal
with: macros and dynamic functions. Macros are fairly straightforward:
they are simple rewrite rules that, given one or more forms as input,
return an output form. The magic happens in expand, where a
single form gets rewritten. I don’t want to dwell on this for too long,
just know that it calls eval whenever it finds a macro,
which calls apply, which calls eval again, and
it all makes sense in the end. The first call to eval will
defer work to apply, which takes care of generating a new
environment in which the parameter names are bound to the actual given
arguments, and it will call eval with the body and that
environment. This is akin to what an evaluation loop might do in an
interpreter when it encounters a function.
Dynamic functions are similar, but they actually evaluate their arguments before being applied. That simple idea makes them similar to normal functions and useful when applied recursively, and for proper metaprogramming. Macros are thus usually only useful for simple rewrites or as entry points to dynamic functions.
Type inference and checking
We’ve been working with a dynamically typed AST thus far, but this won’t do for compilation. Instead, we’re now going to infer our types, based on only the type information that we have from the standard functions and any annotations the user might have left behind. As I said before, this happens after macro evaluation, which means that macros know nothing about the types of the AST.
The type system relies on a simple unification solver to infer the missing types. Generic types are still valid at this point. The inference engine just annotates the objects it receives and, perhaps surprisingly, takes care of memory management information. It also returns any dependencies the inferred function might have from Concretization. If those last two tasks don’t make any sense to you, don’t worry, I’ll go over them in the following sections.
The annotation procedure relies entirely on generating and then
solving the constraints. Generating constraints is simple, and it takes
all that we know from our function types. Let me give you a simple
example: given the statement (= a b), we know that the type
of a must be the same as b, because equality
checking doesn’t coerce its types in Carp; it is of type
(λ [a a] Bool). We will generate this constraint for our
solver. Should either one of the two variables be used in a non-generic
setting somewhere we can solve the constraints, and assign both of the
variables the same type—which in this particular case must happen during
initialization anyway. This simple yet powerful mechanism turns out to
be surprisingly useful.
Should the solver fail or find any typed
holes4, the engine will produce an error.
Otherwise the solver will produce a TypeMappings object,
which is just a mapping from generic to concrete types. The inference
engine then assigns those types to the values, and goes on to
Concretization and memory management.
Concretization
Concretization is a fairly large module again, but not nearly as big
as the evaluator. It will take care of emitting multiple versions of the
same polymorphic function with different concrete types, one for every
usage in our code. If you define a function sum that takes
a list of numbers and sums all of them, for instance, and then use it
with Double and Int types, it will both
generate the appropriate functions and symbols for the specialized code,
as well as change the code to use the specialized names instead.
This is where the generation of dependencies kicks in; dependencies are returned from type inference if additional specialized functions need to be generated.
This leads us to the last item on our list: memory management.
Memory management
The ManageMemory module is responsible for finding out
if and when any destructors/deleters are needed. Like the type
inference, concretizer, and evaluator, it is yet another tree walking
function5.
Now, I haven’t actually worked on this module all that much yet either, so you’ll have to make do with what I observed and what I found out by having Erik explain its magic to me. Pardon me if this is not as instructive as you’d hoped.
To find out where we need to add deleters we first need to figure out which variables pertain to which scope. To achieve this, we walk through the AST, doing something that Erik calls “abstract interpretation”. We take the set of variables that were defined before the scope and the set of variables that are defined at the end of the scope, and take the difference. This difference is what was created inside of the scope and needs to be deleted, so we add deleters for them—unless they’re unmanaged types, such as numbers, booleans, or characters.
There are three special cases we need to handle there, though: function calls, loops, and conditionals. Function calls take ownership of the variables passed into them, so we can remove those variables from our set, unless they were passed as references. If a variable is passed as a reference, we instead have to make sure it is still in the set; otherwise its value has already been given away, and we produce an error.
while loops are interesting because we visit their
conditions and bodies twice to simulate looping; this looks
algorithmically crude, but it seems to work
fine.6
Conditions were the most surprisingly complex to me, because they produce two interdependent scopes, one for each clause. Those need to be diffed to make sure that we clean up properly in all cases.
When we’ve found out what we need to delete where, we put this
information directly into our XObj for use by the
emitter—which brings me to our most fun section: we get to emit some
C!
Emitting C
While Erik would like the compiler to use an LLVM backend at some point, I personally am a fan of the C backend. This is partly because it emits fairly legible code and, all things considered, it is easier for me to read C than assembly or LLVM IR, and partly because it doesn’t lock us into the LLVM framework.7 But enough of the future, let’s talk about now!
The emitted C code is all put into one file called
out/main.c. This file is split into three acts, like any
good piece of dramaturgy: first come the types, a dramatis personae of
our play; then the declarations, like an overture, serving as an
exposition of what is about to happen; and then the definitions show us
what they got. All the file’s a stage, and all the definitions are mere
players. Let me talk about types and declarations first, and then we
will have a word about definitions.
Types & declarations
For both types and declarations, the process—and indeed the
function—we are using is the same: first we sort the bindings in the
environment, ranking them based on their dependencies, then we will
convert them to declarations. The only difference is what kind of
environment we are operating on. For declarations this will be the
regular environment Env, and for types the type environment
TypeEnv.
Sorting the bindings is relatively easy; all of the objects have a base score, which is a bit higher for modules than for values. Type definitions and aliases are ranked based on their dependency depth, i.e. how many other types they have references to internally and how those, in turn, score. We then sort them starting with the lowest and proceed to convert them to declarations.
There are some intricacies to the way declarations are generated—we don’t generate anything for generic functions, for instance, because we already specialized them—but the basic idea is that for regular definitions we generate function heads for functions and left-hand sides of assignments for value definitions.
Types are a little more complex, because we generate both the struct
and the typedef at the same time. There, we will go through
the members one by one, get the type we inferred for it, convert it to
C, and mangle the name of the member. Generally, whenever we emit a Lisp
symbol in C, we will first mangle the name. Mangling is in this case
just a simple string replacement that will replace certain characters
that are not allowed in C symbols with their all-caps name, surrounded
by underscores. This means, for instance, that the omnipresent dash
- will become _MINUS_.
That is all there is to emitting declarations and types. Let’s move on to definitions!
Definitions
Emitting definitions is arguably the most important part of the compiler, and it makes sense that it’s thus also the most complex8. Conceptually, the emitter is fairly simple: we take all the bindings in the environment, fold over them, weed out the ones that are not important to us, like external definitions and generic types, and emit the code for each binding.
In the following I will walk you through a few of the most important
pieces of machinery, but by no means all of them. A complete definition
can be found in the function toC of the Emit module.
One magic detail that makes this work is that the recursive emitter always returns the name of the variable it bound the last statement to, so that the callers know what was last assigned in case they need it.
Modules
Modules are conceptually simple: they just add another level of recursion, because they are their own environment and we thus have to do the whole process again.
Literals
Literals are also straightforward; most of the primitive literals we have can be compiled to C without any changes. The only literals that need some extra attention are strings and arrays.
For strings, we always emit two fresh variables and bind the string
literal of type char* and its reference of type
char** to those. This way we can have statically allocated
strings in our binary, in a straightforward way. It might not be the
best way going forward, but thus far has proved to be fairly
reliable.
Arrays are the most complex of the literals. This is due to two reasons: they are not just C arrays—I lied a little above—and they can contain any kind of statement in their definition.
Carp arrays are a two-member struct of data and length. When we emit an array literal, we first emit a literal struct initializer; that’s trivial because the length is known at compile time. We then go through the list of array contents and emit a statement that assigns the contents of the array at the index to the statement at that index.
(let [x [1 2 (+ 3 4)]]
;...
)
That means that the array definition depicted in Figure 1 would compile to the C code depicted in Figure 2. In the end, it comes down to a fairly straightforward mechanism. The temporary variables reduce the legibility a little bit, but it’s still relatively easy to look at the compiled output and see what’s happening.
Array _11 = {.len=3, .data=CARP_MALLOC(sizeof(int)*3)};
((int*)_11.data)[0] = 1;
((int*)_11.data)[1] = 2;
int _10 = Int__PLUS_(3, 4);
((int*)_11.data)[2] = _10;
Array__int x = _11;
Symbols
The next item on our list is symbols. They are simple as well, because all we have to do is take their name and path—the path being the list of modules they are defined in—mangle and concatenate everything, and we have a perfectly valid C symbol.
Control Flow
In Carp, the only control flow primitives baked into the language
proper are if and while. Both of these are
present in the language we are compiling to, so it shouldn’t be too hard
for us to translate from Carp to C.
With regard to if, we have the additional restriction
that it is always a two-branch form, due to the type checker. This makes
emitting easy and regular: we create a fresh variable, and append it to
the source code unless the type inference engine told us that the return
type of this if expression is Unit in
Carp-speak or void in C. We then visit the condition and
assign it to a result variable—that is because the condition is less
restricted than in C, possibly resulting in multiple C statements—put
the result variable into the head of the if, visit and
generate code for both clauses of the expression, and, if the return
type isn’t void, assign the return values of the clause
bodies to the fresh variable we created. This works because, as stated
above, visiting the body will tell us what we need to assign.
while is a little bit more complex; because the head
will possibly be executed multiple times, we put a declaration and
initialization at the top, like in the if, and then
reassign it at the bottom of the loop, running the same expression
again.
All of this requires a fair bit of machinery, but is in itself quite simple; as with most of the parts of the Carp compiler it can be a little tricky to keep it all in your head because it’s highly recursive.
Functions and Variables
You probably expect this to be the hairiest part of the emitter, but it is actually quite straightforward; it builds on the basic blocks we’ve already defined. Of course it is the most recursive, though.
Let’s start with function definitions. We first need to get the
declaration, but we already know how to do that from earlier; we then
visit the body, see whether our return type is void, and if
not, add a return statement that returns whatever our body did, again
using the name of the last variable that visit
returned.
After this surprisingly brief definition, let’s look at variables.
Normal def-style variables are easy: we have the type,
name, and assignment expression, so all we need to do is convert the
type to its C equivalent, mangle the name, and take the result of a
quick visit of the expression. Then we stick all of that in a row and we
have our value. Please note that this obviously is more restricted,
because it doesn’t support multiple statement assignments; this is
alright because def is only used for top-level definitions
and C is equally limited in having expressions on the top level.
Our way of creating local variables is let. Carp’s
version of let is comparable to let* in Scheme, which means
that subsequent bindings can use prior bindings. That makes things
simple. All we truly need to do is create a bunch of bindings one after
the other inside a {} scope, visit the body of the
expression, and assign it to a fresh variable that we again created
prior, unless it is void, in which case we can skip that part.
It is amazing that these simple building blocks are all that we need to have definitions, but that’s how it is. In a few dozen paragraphs, I was able to describe the code emitter, skipping over some parts, but not shying away from telling you about the more complex parts either. I hope that I won’t ever stop being amazed by this.
Compiling C
Once we’ve arrived at a C file, we can compile it using one of the C
compilers on the target machine. Although Carp aims to support all C
compilers and doesn’t use any fancy or non-portable C
features, its default compiler is
clang. Since early December, this can be changed by putting
(project-set! "compiler" "my-compiler") somewhere either
into the source or the compiling REPL session. My personal favorite
behavior would be if Carp honored the CC environment variable,
but that’s thus far not the case.
Carp produces a fat binary from a single C file, which means that, to
my knowledge, there is currently no mechanism to include another C
source file. Libraries can be linked to by adding a libflag
or cflag, also with the project-set! command.
Most binaries I personally produce currently have no dependencies except
libm, which Carp itself depends on, and only use
header-only C dependencies.
That means that, by default, Carp artifacts are more like Go’s fat binaries that include everything, but this is not the only way to do things. You could just link to all kinds of libraries willy-nilly if you wanted to; Carp gives you this kind of freedom. But I’d suggest only linking to libraries if it’s necessary, for instance if you wrap a big C library like OpenSSL; the default should always be largely self-contained binaries. But then again, that’s only because this is one of my favorite features of Go, and I’m biased. Carp is not, and you have the power to disagree.
Whatever the process, in the end a binary will be produced and land
in the out directory, conveniently named
a.out.
Fin
We talked about a lot of stuff today. While Carp certainly is a complex piece of machinery, it works most of its magic in a surprisingly straightforward and concise way. The whole compiler is just under 5000 lines of code long, and you can definitely read through it yourself if you want or need to. Its simplicity comes from both a relatively simple—meaner spirits than I might even call it limited—language design and the absence of—at this point surely premature—optimizations.
I hope I’ve whetted your appetite to jump into Carp, or even to look around its compiler internals. If you need advice or just want to chat, you can find all of us on Gitter; alternatively, Erik is on Twitter and my e-mail address should be floating around here somewhere!
See you around next time, when we talk about how to use Carp’s C interface!
Footnotes
1. Some on the internet would even go so far as to call it worse names, like “useless”, but I’d beg to differ. You’ll get a more familiar REPL soon enough.
2. While trusting the user to be correct when defining their type signatures can lead to errors at C compile time, and we try to minimize those, the alternative would be to parse the actual included C, and that sounds like an even more horrendous task to me. YMMV.
3. I would love to fix this when I’ve got both the time and the understanding of the current system that is required to build on top of it. I don’t think I’m quite there yet. Drop me a message if you want to team up!
4. Typed holes are a concept borrowed from other type-inferred languages. It enables the user to replace any statement with a special symbol. The constraint solver will then fail and tell you which type it inferred for the hole. This is useful when debugging type errors.
5. You’d think that all of this AST traversal makes for a slow compiler, but in my experience the Carp compiler is reasonably fast—then again, no huge projects exist to benchmark Carp’s performance at the time of writing.
6. For the people reading the code: there is an
interesting bit of fiddling to work around Haskell’s laziness in the
while branch of the walk function.
7. Also, I’ve found the Haskell bindings to LLVM unbearable, and I’m not alone: when I gave an introductory workshop on LLVM while at the Recurse Center, I told people that using it might not be the best idea. One of the people there did not heed my advice, and he emerged six weeks later, battle-hardened and weary.
8. Of course those properties aren’t always correlated, but very often I experience that the most important feature seems to have the most edge cases—or maybe that is just because I’ve thought about it more deeply than the rest.