Almost everyone’s written a BF implementation these days. In fact, BF is now so widespread that LLVM and libjit both include tutorials for a basic implementation!
However, what would an industrial strength compiler look like? I set out to write a highly optimising BF compiler using Rust and LLVM. bfc was the result, and it includes a whole range of optimisations. I believe some are completely novel. Let’s take a look.
Baseline Performance
I’ve written a basic BF interpreter in C, so we’ll use that as our reference point for BF performance. Let’s compare bfc, compiling without optimisations, with this interpreter.
Clearly we’re off to a good start, even without optimisations. However, interpreters exist that are much faster than this dumb implementation, so let’s look at performance improvements for our compiled output.
Combining Increments
There are several peephole optimisations possible with BF. The most widely implemented technique is combining increments. Consider the BF program ++, which increments the current cell twice. Our baseline compiler would convert this to:
cell_val := load(cell_ptr)
cell_val := cell_val + 1
store(cell_ptr, cell_val)
cell_val := load(cell_ptr)
cell_val := cell_val + 1
store(cell_ptr, cell_val)
However, the underlying hardware is capable of adding numbers other than one! We expand our intermediate representation (IR) so that increments can be of any size, allowing us to compile this to:
cell_val := load(cell_ptr)
cell_val := cell_val + 2
store(cell_ptr, cell_val)
We can do the same thing with pointers too. Rather than compiling >> to this:
cell_ptr := cell_ptr + 1
cell_ptr := cell_ptr + 1
We compile it to this:
cell_ptr := cell_ptr + 2
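Here's a minimal sketch of what such a pass could look like in Rust. The Instruction type and function names are illustrative rather than bfc's actual internals (the later sketches reuse this same type):

#[derive(Debug, PartialEq, Clone)]
enum Instruction {
    Increment(i8),            // net effect of a run of + and -
    PointerIncrement(isize),  // net effect of a run of > and <
    Set(u8),                  // e.g. the result of a clear loop
    Read,                     // ,
    Write,                    // .
    Loop(Vec<Instruction>),   // [ ... ]
}

fn combine_increments(instrs: Vec<Instruction>) -> Vec<Instruction> {
    use self::Instruction::*;
    let mut result: Vec<Instruction> = Vec::new();
    for instr in instrs {
        match (result.pop(), instr) {
            // Fold adjacent cell increments into one (wrapping,
            // since cells are bytes).
            (Some(Increment(a)), Increment(b)) => {
                result.push(Increment(a.wrapping_add(b)))
            }
            // Fold adjacent pointer movements into one.
            (Some(PointerIncrement(a)), PointerIncrement(b)) => {
                result.push(PointerIncrement(a + b))
            }
            (prev, instr) => {
                if let Some(prev) = prev {
                    result.push(prev);
                }
                result.push(instr);
            }
        }
    }
    result
}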
Clear Loops
A common BF idiom is [-], which zeroes the current cell. bfc’s optimiser knows about this, and it’s smart enough to compile it to:
store(cell_ptr, 0)
Setting a constant value is great because it opens up new opportunities for collapsing instructions. We can go a step further and fold in any increments that follow, so [-]+++ can be compiled to:
store(cell_ptr, 3)
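As a sketch, reusing the hypothetical Instruction type from above (again illustrative, not bfc’s actual code):

fn simplify_clear_loops(instrs: Vec<Instruction>) -> Vec<Instruction> {
    use self::Instruction::*;
    let mut result: Vec<Instruction> = Vec::new();
    for instr in instrs {
        match instr {
            // [-] zeroes the current cell ([+] does too, as cells wrap).
            Loop(ref body) if *body == [Increment(-1)] || *body == [Increment(1)] => {
                result.push(Set(0))
            }
            // Fold a following increment into the Set: [-]+++ becomes Set(3).
            Increment(n) if matches!(result.last(), Some(Set(_))) => {
                if let Some(Set(v)) = result.pop() {
                    result.push(Set(v.wrapping_add(n as u8)));
                }
            }
            other => result.push(other),
        }
    }
    result
}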
Multiply Loops
BF doesn’t provide a multiply instruction, so programs have to use a loop. [->++<] is equivalent to:
cell#1 := cell#1 + 2 * cell#0
cell#0 := 0
Our optimiser can also detect multiply loops and generate efficient code. This also captures useful patterns like [->+<] (move cell 0 to cell 1) and [->+>+<<] (move cell 0 to both cell 1 and cell 2).
As a result, we compile [->++<] to:
cell0_val := load(cell_ptr)
cell1_old_val := load(cell_ptr + 1)
cell1_new_val := cell1_old_val + 2 * cell0_val
store(cell_ptr + 1, cell1_new_val)
store(cell_ptr, 0)
Loops are far more expensive than multiplication, so this improves performance considerably.
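Detecting a multiply loop amounts to checking the loop body for a fixed shape: a decrement of the current cell, increments at other offsets, and a net pointer movement of zero. A simplified sketch over the same hypothetical IR, handling only a single target cell (and assuming increments have already been combined, so ++ appears as Increment(2)):

// Returns Some((offset, factor)) when the body has the shape of
// [->>..+n<<..], meaning cell[ptr + offset] += factor * cell[ptr]
// followed by cell[ptr] = 0.
fn detect_multiply_loop(body: &[Instruction]) -> Option<(isize, i8)> {
    use self::Instruction::*;
    match body {
        [Increment(-1), PointerIncrement(off), Increment(factor), PointerIncrement(back)]
            if *off != 0 && *off == -*back =>
        {
            Some((*off, *factor))
        }
        _ => None,
    }
}

A fuller pass would also accept several target cells, as in [->+>+<<], but the shape check is the same idea.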
Dead Code Elimination
It’s common to generate BF code with other tools, and the generated code often contains redundant instructions. It’s also common to add an introductory comment at the beginning of a BF program:
[Since we know that cells are initialised to zero,
we can write whatever we want here because this
loop is dead. We can use any of +-<>.,[] safely,
as long as our brackets are matched.]
actual code here!
bfc aggressively removes dead code. For example, cells in BF are initialised to zero, so a loop at the beginning of a program is dead:
[dead]not-dead
We also know that if a loop terminates, the current cell is zero. As a result, consecutive loops are dead:
foo[bar][dead][also dead]
Redundant Code Elimination
bfc can also remove instructions if it can prove they have no effect.
We already know that [-]+ is setting the current cell to a constant value. Repeated instances of this will clobber the previous value. For example, this code is assigning to the same cell three times:
[-]+++[-]++[-]+
We can replace this with just [-]+.
I/O can also clobber values, making previous instructions redundant. , reads a byte from stdin and stores it in the current cell. As a result, the following snippets can all be reduced to just ,:
+,
-,
[-]+,
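A sketch of this clobber detection over the same hypothetical IR (assuming the clear-loop pass has already turned [-]+ into Set(1)):

fn remove_clobbered_by_read(instrs: Vec<Instruction>) -> Vec<Instruction> {
    use self::Instruction::*;
    let mut result: Vec<Instruction> = Vec::new();
    for instr in instrs {
        if matches!(instr, Read) {
            // Anything that only changed the current cell's value is
            // about to be overwritten by the read.
            while matches!(result.last(), Some(Increment(_)) | Some(Set(_))) {
                result.pop();
            }
        }
        result.push(instr);
    }
    result
}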
Instructions at the end of a program can also be redundant. The only possible side effects in a BF program are reading, writing and infinite loops. If we can prove the final instructions have no observable side effects, we remove them. For example:
foo.<[-]+>
This can be reduced to just foo. since the trailing <[-]+> has no observable effect.
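A sketch of that final pass: strip trailing instructions from the end of the program as long as they are pure. We must stop at the first Read, Write or Loop, since writing and reading are observable and a loop might never terminate:

fn remove_pure_tail(mut instrs: Vec<Instruction>) -> Vec<Instruction> {
    use self::Instruction::*;
    while matches!(
        instrs.last(),
        Some(Increment(_)) | Some(PointerIncrement(_)) | Some(Set(_))
    ) {
        instrs.pop();
    }
    instrs
}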
Astute readers will observe that it’s also possible for BF programs to crash if they access cells outside the permitted range (cells 0 to 29,999). For example, <+.
bfc considers out-of-range cell access to be undefined. In many cases these programs will segfault, but in the case of <+ the access will simply be optimised away by this redundant code elimination pass.
Bounds Analysis
One novel optimisation that bfc offers is bounds analysis. In BF, programs have 30,000 cells available, initialised to 0. However, most programs don’t use the full 30,000.
bfc uses static analysis to work out the highest cell that a program could possibly reach. When we can prove a program only uses a smaller range of cells, we can allocate less memory and spend less time zero-initialising it.
For example, this program never reaches beyond cell #2:
>><<
If loops have a net cell movement of zero, then we’re still limiting ourselves to a fixed number of cells. This program never reaches beyond cell #2, regardless of how many loop iterations are executed:
[>><<]
We apply this analysis recursively, so [><[<>]] is still bounded in the cells it can reach.
Of course, we can’t always find a lower bound. For example, this program may use an arbitrary number of cells:
[,>]
In these cases, bfc provides 30,000 cells for the program to use.
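A sketch of the analysis over the same hypothetical IR. It returns the furthest cell offset a program can reach, or None when nothing better than the full 30,000 cells can be proved (names are illustrative):

fn max_cell_offset(instrs: &[Instruction]) -> Option<isize> {
    use self::Instruction::*;
    let mut pos: isize = 0; // current offset from our starting cell
    let mut max: isize = 0; // furthest offset seen so far
    for instr in instrs {
        match instr {
            PointerIncrement(n) => {
                pos += n;
                max = max.max(pos);
            }
            // A loop is only bounded if its body has a net pointer
            // movement of zero; then every iteration stays in range.
            Loop(body) => {
                if net_movement(body)? != 0 {
                    return None; // e.g. [,>] could go anywhere
                }
                max = max.max(pos + max_cell_offset(body)?);
            }
            _ => {}
        }
    }
    Some(max)
}

// Net pointer movement of an instruction sequence, or None if unknowable.
fn net_movement(instrs: &[Instruction]) -> Option<isize> {
    use self::Instruction::*;
    let mut net: isize = 0;
    for instr in instrs {
        match instr {
            PointerIncrement(n) => net += n,
            // A balanced nested loop contributes zero net movement.
            Loop(body) => {
                if net_movement(body)? != 0 {
                    return None;
                }
            }
            _ => {}
        }
    }
    Some(net)
}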
LLVM Optimisations
LLVM offers a suite of optimisations that we can run, and bfc uses all of them by default (essentially llc -O3 foo.ll). Surprisingly, they don’t always offer a performance benefit.
I suspect this is due to the LLVM IR generated by bfc. bfc generates a single large function with few external function calls, so many LLVM optimisation passes aren’t applicable.
Testing Optimisations
With all these different optimisations, how can we know that we have combined them in the optimal order? There’s a great Rust implementation of Quickcheck which we use to verify the ordering. Quickcheck generates random BF programs and reports if this test function ever returns false:
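A sketch of the property being checked, chaining the passes from the earlier sketches. Here interpret (which runs a program on empty input and returns its output) is assumed, as is a quickcheck Arbitrary instance so random Instruction values can be generated:

// The property Quickcheck exercises: optimisation must never change
// a program's output.
fn optimise(instrs: Vec<Instruction>) -> Vec<Instruction> {
    remove_pure_tail(remove_clobbered_by_read(simplify_clear_loops(
        combine_increments(instrs),
    )))
}

fn optimisation_preserves_output(instrs: Vec<Instruction>) -> bool {
    interpret(&optimise(instrs.clone())) == interpret(&instrs)
}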
This proved to be a fantastic way of finding corner cases in the optimisation code. After various attempts at ordering optimisations, Quickcheck found a BF program of the form +>+-<-. This showed that bfc had to run its optimisations repeatedly to ensure it exploited all the available opportunities.
Speculative Execution
Many BF programs don’t take any inputs at all. For programs that do take input, there’s an initialisation phase that sets cells to certain values first.
bfc exploits this by executing as much as it can at compile time. We run a fixed number of steps (to avoid problems with infinite loops) and terminate if we encounter a , (a read instruction).
As a result, we compile the classic BF hello world to:
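Every instruction runs at compile time, so no loops or cell updates survive. Sketched in the same notation as the snippets above (bfc’s real output is LLVM IR that writes these bytes), the compiled program is essentially just its output:

write("Hello World!\n")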
bfc performs speculative execution after peephole optimisations and dead code removal, to maximise how much we can execute before hitting the execution step limit.
Even if we cannot speculatively execute the whole program, partial execution is useful. We can discard any instructions executed at compile time. We can also initialise cells to the values reached during speculative execution.
For example, consider the program +>+>+>++>,. bfc compiles this to something like the following (sketched in the notation used above; the increments all run at compile time, leaving only the read and write):
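cell#0, cell#1, cell#2, cell#3 := 1, 1, 1, 2   (initial values computed at compile time)
cell_ptr := cell_ptr + 4
cell_val := read()
store(cell_ptr, cell_val)
write(cell_val)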
Future Work
There are a number of other optimising BF projects (notable implementations include 1, 2, 3 and 4) and bfc has benefited from seeing their ideas.
bfc still has scope for further optimisations: we don’t apply the ‘scan loops’ or ‘operation offsets’ optimisations discussed by Mats Linander. We only detect multiplication by constant values, and don’t detect division at all.
The bounds detection pass is also very pessimistic. It’s currently limited to loops with a net cell movement of zero. As a result, [>] is treated as unbounded, even though this loop cannot access cells which haven’t previously been modified.
Speculative execution is not currently smart enough to partly execute a loop. This means BF programs with a big outer loop don’t yet benefit from this optimisation, because we eventually hit a , or reach the step limit before the loop terminates.
Finally, bfc provides neither profile-guided optimisation nor adaptive optimisation (in the style of HotSpot or PyPy). bfc also assumes a 32-bit word size.
Closing Thoughts
BF is small enough to implement in a short space of time, but it’s a real language with programs you can play with. It’s a fantastic testbed for compiler techniques.
Rust proved to be a great implementation language for bfc. The Rust compiler prevents many bugs, and warns about many others, which helps tremendously. Its FFI makes interfacing with LLVM very straightforward, and makes it easy to separate ‘code that could segfault’ from the rest of the codebase.
Rust’s pattern matching also makes bfc optimisations very readable. Here’s a sketch of the dead loop removal in that style (simplified, and reusing the hypothetical Instruction type from the earlier sketches):
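fn remove_dead_loops(instrs: Vec<Instruction>) -> Vec<Instruction> {
    use self::Instruction::*;
    let mut result: Vec<Instruction> = Vec::new();
    for instr in instrs {
        let dead = matches!(
            (result.last(), &instr),
            // Cells start at zero, so a loop before any other
            // instruction never runs.
            (None, Loop(_)) |
            // When a loop exits, the current cell is zero, so a loop
            // directly after another loop never runs either.
            (Some(Loop(_)), Loop(_))
        );
        if !dead {
            result.push(instr);
        }
    }
    result
}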
Writing an optimising compiler can seem like a daunting project. Building a compiler can take years. By limiting ourselves to BF, we’ve been able to explore many of the interesting problems in compiler engineering:
- Halting problem (speculative execution)
- Compilation performance (speculative execution is expensive)
- The limits of static analysis (cell bounds analysis)
- Ordering of optimisation passes
- Designing a good intermediate representation (bfc uses its own IR and lowers to LLVM IR)
- Writing effective tests
Not bad for a side project!
Want to find out more? bfc is available on GitHub, and you can also view the benchmarks used for the graphs.