A walkthrough of writing a basic compiler with LLVM. No prior experience assumed.
In a short space of time, I was able to go from zero C++ knowledge, and no experience with LLVM, to a fully-fledged compiler. It’s a lot of fun, let me show you how!
Our compiler will accept programs written in BF. This is a classic toy language for compilers, and there is even a BF compiler in LLVM’s examples directory! In this post, I’ll explain the process of writing it.
A first LLVM program
The LLVM toolchain is built around programs written in LLVM IR. First, we’ll write a basic LLVM IR program that just exits.
Since LLVM tools already use LLVM IR, we can simply use the
clang
compiler to show us what a basic program looks like. Here’s a
simple C program:
We run this through Clang:
This creates a file forty_two.ll
which looks like this:
Our first LLVM program! We can use lli
to run it:
Writing a skeleton
We’ll compile BF commands to sequences of LLVM IR instructions.
However, those instruction will need to be inside a main
function so
LLVM knows our entry point. We will also need to allocate and
initialise our memory cells and cell index.
Again, we can simply write the equivalent C and look at the LLVM IR instructions Clang generates. Here’s the skeleton we’ll use:
Hand-compiling >
>
is the easiest BF command to compile. Let’s fire up a text editor
and write out the LLVM IR equivalent.
If your BF is rusty, >
simply increments the cell index.
To check this code is correct, we want to run it. We can simply
wrap it in our skeleton
and run with it with lli
to see what happens. Implementing <
is
now straightforwards, and we write a
test program for <
too.
Hand-compiling +
BF’s +
command increments the current cell. This
requires us to dereference the current cell, calculate new value, then
store it. In C, this would look like:
LLVM provides the getelementptr instruction to calculate the pointer. The translation then looks like this:
Again, we test this by
wrapping it in our skeleton,
and do the same for -
.
I/O
BF has two I/O commands: ,
reads from stdin into a cell, and .
writes from a cell onto stdout. We need to call C functions for this: putchar
and getchar
.
We need to declare these functions, as we did with malloc
earlier:
To implement ,
we call getchar
, truncate it to a char, and write it
to the current cell.
.
is the reverse: we read the cell, sign extend it, then call putchar
.
Loops
LLVM IR instructions are organised into basic blocks; sequences of instructions that don’t contain jumps. At the end of each basic block, you must either jump to another basic block or return.
To compile [ x ] y
, we need to inspect the current cell,
then either jump to x
, which is our loop body, or to y
. At the end of x
,
we need to jump back to the start.
Note that this is recursive, x
may also contain loops. There’s a
sample loop program here.
Putting it all together
We’ve done the hard bit! All that’s left is to use the LLVM API generate these instructions. Each IR instruction has a corresponding C++ object that you can instantiate and add to your basic block.
LLVM’s API also has the convenient concept of an
IRBuilder.
The IRBuilder
class provides a create method for every IR instruction, making IR
generation easy.
Here’s the C++ for generating LLVM IR for >
. The
excellent LLVM tutorial
includes instructions for compiling a basic C++ program that uses LLVM APIs.
Compiling the other BF commands is a simple translation of our hand-compiled snippets. You can view the full working compiler here.
Emitting machine code
Finally, our compiler is only emitting LLVM IR. A proper compiler
emits machine code. We can use llc
to convert to an object file,
then link it to produce an executable.
And that’s all there is to it!
Addendum: Optimising
There’s lots that can be done to produce faster executable from BF programs. However, LLVM is already smart enough to compile simple loop-free BF programs to optimal LLVM IR!
This produces: