This blog tries to summarize all the choices and paths you could take to implement your next programming language, more specifically the frontend for your language.
There are a lot of factors that will influence your choices. Maybe you have a favorite host language that you would like to use for implementing your language, maybe your language is dynamically or statically typed, maybe you are designing your own plugin system that needs a small DSL, or maybe you have a custom CPU that needs a programming language, and so on.
This blog is by no means complete; it is simply a summary of everything that I have found while trying to implement my own programming language, so please also do some research of your own. This blog might be helpful for others to get started in planning their next language.
If you have read anything about writing your own compiler or interpreter, the first two steps are obvious. You would have to write your own lexer and parser. The lexer takes your language’s source code as input and spits out a list of tokens or words. The parser then takes this list of tokens and constructs a tree-like structure called the abstract syntax tree.
If you have already decided the host language you are going to use to implement your language, you can write the lexer and parser right away without worrying about the rest of the implementation.
You may have the choice to generate your lexer and parser from a grammar file using tools such as bison or antlr, or you can hand write a recursive descent parser. I would recommend handwriting the parser: writing one is very simple, you can craft better error messages, and you won’t make major changes to it once it is written.
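To make this concrete, here is a sketch of a tiny handwritten lexer and recursive descent parser. Everything here is made up for illustration: the token pattern, the tuple-based AST shape, and the two-rule grammar (`expr := term (('+' | '-') term)*`, `term := NUMBER`) are assumptions, not a prescription.

```python
import re

# Hypothetical token pattern: integers, or any single non-space character.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")

def lex(src):
    """Split source text into a list of (kind, value) tokens."""
    tokens = []
    for number, op in TOKEN_RE.findall(src):
        tokens.append(("NUMBER", int(number)) if number else ("OP", op))
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expr(self):
        # One function per grammar rule: this is the essence of recursive
        # descent, and why crafting good error messages by hand is easy.
        node = self.term()
        while self.peek() and self.peek()[1] in "+-":
            op = self.tokens[self.pos][1]
            self.pos += 1
            node = (op, node, self.term())
        return node

    def term(self):
        kind, value = self.tokens[self.pos]
        if kind != "NUMBER":
            raise SyntaxError(f"expected a number, got {value!r}")
        self.pos += 1
        return ("num", value)

tree = Parser(lex("1 + 2 - 3")).expr()
# tree is ('-', ('+', ('num', 1), ('num', 2)), ('num', 3))
```

A real parser would track line and column numbers on each token so errors can point at the offending source location.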
After the syntax tree, there are many paths you can take to implement your language, and this is where it gets more interesting. You can follow any path in the below illustration starting from the syntax tree until an end is reached.
Let’s clarify some of the terms used in this illustration:
A tree interpreter executes the program by walking over the syntax tree, visiting each node, then its children, and so on. The tree interpreter halts when it finds an error or the program has finished. This is really only used for extremely simple languages and small DSLs.
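A tree interpreter can be sketched in a few lines. The tuple-based AST shape here (`('num', n)` for literals, `('+', l, r)` and `('-', l, r)` for operators) is a made-up example, not a standard:

```python
# Walk the tree recursively: evaluate children first, then apply the node.
def interpret(node):
    kind = node[0]
    if kind == "num":
        return node[1]
    left, right = interpret(node[1]), interpret(node[2])
    if kind == "+":
        return left + right
    if kind == "-":
        return left - right
    raise RuntimeError(f"unknown node kind {kind!r}")  # halt on error

tree = ("-", ("+", ("num", 1), ("num", 2)), ("num", 3))
print(interpret(tree))  # 0
```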
The semantic analyzer and type checker walk over the syntax tree and perform type checking and semantic checks. They throw errors such as “Integer type not subscriptable”, “Object member does not exist”, “Type integer cannot be called”, or “Cannot assign string to integer”, unlike parser errors, which look like “Missing comma” or “Missing semicolon”.
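A minimal sketch of such a checker, again over a made-up tuple AST where an `assign` node carries the variable’s declared type:

```python
# Compute the type of an expression; raise on semantic errors.
def type_of(expr, env):
    kind = expr[0]
    if kind == "int":
        return "integer"
    if kind == "str":
        return "string"
    if kind == "var":
        return env[expr[1]]              # type of a previously declared variable
    if kind == "index":
        base = type_of(expr[1], env)
        if base != "list":
            raise TypeError(f"{base.capitalize()} type not subscriptable")
        return "integer"
    raise TypeError(f"unknown expression {kind!r}")

# Check an ('assign', name, declared_type, value_expr) statement.
def check_assign(stmt, env):
    _, name, declared, value = stmt
    actual = type_of(value, env)
    if actual != declared:
        raise TypeError(f"Cannot assign {actual} to {declared}")
    env[name] = declared

env = {}
check_assign(("assign", "x", "integer", ("int", 1)), env)   # ok
try:
    check_assign(("assign", "x", "integer", ("str", "hi")), env)
except TypeError as e:
    print(e)  # Cannot assign string to integer
```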
The code generator walks over the syntax tree and generates intermediate representation (IR) or bytecode directly. The code generator and the semantic analyzer could be combined.
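For example, a code generator emitting bytecode for a toy stack machine is just a post-order tree walk; the tuple AST and the PUSH/ADD/SUB instruction set here are both invented for illustration:

```python
# Post-order walk: emit code for the children first so their results
# are on the stack, then emit the operator instruction.
def codegen(node, out):
    kind = node[0]
    if kind == "num":
        out.append(("PUSH", node[1]))
    else:
        codegen(node[1], out)
        codegen(node[2], out)
        out.append(("ADD",) if kind == "+" else ("SUB",))
    return out

code = codegen(("-", ("+", ("num", 1), ("num", 2)), ("num", 3)), [])
# [('PUSH', 1), ('PUSH', 2), ('ADD',), ('PUSH', 3), ('SUB',)]
```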
An intermediate representation is really another micro language, specifically designed to be representable as data structures and types within the compiler backend for internal optimizations. The IR might have a readable textual format for pretty printing and debugging, but it is not meant to be written or read by humans. The IR will also lack features of a real language, such as nested expressions, since IR is meant to be generated. You can take a look at QBE’s IR as a good example.
A compiler backend accepts IR as input, performs optimizations, and outputs either bytecode for VMs or machine instructions for CPUs. I will use the word runtime to refer to both VMs and CPUs in the rest of the blog. The IR language is defined by the compiler backend, and the compiler backend library will also provide helper and builder functions to generate the IR.
A translator converts one IR to another IR. You can design your own IR language, then translate it to another IR and use an off-the-shelf compiler backend instead of writing your own backend for the IR you have designed. There are some valid reasons to do this, which are covered later in the blog.
A transpiler accepts either a syntax tree or IR as input and generates equivalent code in another high-level language meant for humans, such as C or JS.
The simplest path you can take is to write a lexer, a parser, and a tree interpreter. The most complex path is to write a lexer, parser, semantic analyzer and type checker, design a custom IR, write your own compiler backend, and design a bytecode VM or a custom CPU. You don’t need to follow just one path; you can implement several. One may design a custom IR and use an off-the-shelf compiler backend like LLVM via a translator by default, but transpile to C for runtimes not supported by the default backend (like JS in the browser, or microcontrollers like Arduino).
The rest of the blog will go through a series of questions, options and information to help navigate the implementation paths.
If your language is dynamically typed, or you would like to embed your language within other programs, it might be relevant to write your own bytecode VM and have the code generator generate the bytecode directly from the syntax tree. If you have a custom CPU and you are trying to implement a language for it, you could write a bytecode VM in assembly and have the compiler output bytecode for it.
Unless you are planning to implement a JIT, run optimizations on the output bytecode, or write a custom compiler backend, using a bytecode VM has a huge performance cost. You will also have to provide an interface for calling C functions if you want to write some basic libraries for your language.
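The dispatch loop below makes that cost concrete: every single instruction pays for a fetch, a decode branch, and stack traffic. The three-instruction bytecode (PUSH/ADD/SUB) is a made-up toy, not any real VM’s instruction set:

```python
# Minimal stack-based VM: fetch an instruction, branch on its opcode,
# manipulate the operand stack. This loop is the interpreter overhead.
def run(code):
    stack = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        if op[0] == "PUSH":
            stack.append(op[1])
        elif op[0] == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op[0] == "SUB":
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
        pc += 1
    return stack.pop()

print(run([("PUSH", 1), ("PUSH", 2), ("ADD",), ("PUSH", 3), ("SUB",)]))  # 0
```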
If your language is statically typed and you would like to embed your language in other languages, you might want to look into using something like wasmtime as your runtime and using WebAssembly first before deciding to write your own VM.
Transpiling your language to something like C will give you support for a lot of platforms out of the box, because C is everywhere and you can easily leverage C libraries. You might not have a choice if your target runtime is some obscure CPU or a microcontroller that only has support for C or C++. Transpiling is also very easy to get started with: you probably already know your target language, like C, to some extent, whereas a compiler backend involves some amount of learning curve and setup. It will also be easier to self-host your language, because you don’t need to write and maintain a wrapper for the compiler backend.
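At its simplest, a transpiler is a tree walk that emits target-language text instead of evaluating or emitting bytecode. The tuple AST here is a made-up example; a real transpiler would also emit declarations, statements, and a runtime support header:

```python
# Emit equivalent C source text for a hypothetical expression AST.
def to_c(node):
    kind = node[0]
    if kind == "num":
        return str(node[1])
    left, right = to_c(node[1]), to_c(node[2])
    return f"({left} {kind} {right})"  # parenthesize to preserve precedence

body = to_c(("-", ("+", ("num", 1), ("num", 2)), ("num", 3)))
c_source = (
    "#include <stdio.h>\n"
    f'int main(void) {{ printf("%d\\n", {body}); return 0; }}\n'
)
# c_source is now a complete C program that a C compiler can build.
```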
If you have decided to use a particular host language to implement your language but it does not have any well-maintained compiler backend libraries, then your only options are transpiling or writing your own bytecode VM. You could generate the IR as a text file and use the backend’s CLI, but that has somewhat the same limitations as transpiling.
One of the limitations of transpiling is slower compile times. You are running two compilers, and lexing, parsing, and code generation have to happen twice: once for your language and once for the target language. You are also dependent on having a C compiler installed on the user’s computer.
You also might not be able to represent some types nicely in the debugger, like a tagged union. You can choose some other language like Zig if that is important to you. You can also choose other target languages like Rust, C++, or JS if that suits your needs.
If the limitations don’t bother you, then transpiling is probably the best option.
A very good reason for using a backend is if you would like to tune an existing backend and write your own optimizations for your language. If you want faster compile times, you might also want to use a backend.
Here are some options if you are looking for a compiler backend:
There are probably a lot of other backends that I did not mention here.
If you are writing your own compiler backend or you are planning to write your own backend in the future but don’t have the time right now, you will have to design a custom IR.
Sometimes it might be difficult to lower the syntax tree to the target IR, so you might have to create another, higher-level IR to make this process easier. Many languages have multiple IRs, with each level of IR moving away from the source code and getting closer to machine instructions.
Having the code generator emit a custom or high-level IR gives you the option to use multiple compiler backends and transpilers to support different platforms; it is easier to write multiple translators than to write multiple code generators. It also makes it easier to switch compiler backends, or to switch from a transpiler to a compiler backend, in the future.
You might want to do some optimizations specific to your language, so you create a custom IR, run some language specific optimizations on the custom IR, translate it to another IR and use an off the shelf compiler backend for generic optimizations (this is not shown in the illustration).
Having a custom IR will never hurt, so design one if you are not sure and your language is going to be maintained for a long time.
If your language has collections like lists, strings, and dictionaries, you will need memory management to deal with heap memory and allocations.
If you are transpiling to C, you can use the hboehm GC library. You will just have to replace calls to malloc with the allocation function from this GC library. This GC library has support for only some desktop OS runtimes, so if your language is for some custom CPU or microcontroller, this might not be an option.
You can also implement automatic reference counting, but this can’t handle cyclic references, so you will have to provide weak pointer types. Reference counting can be implemented easily for any runtime.
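The mechanism is simple enough to sketch in a few lines; in a real implementation your compiler would insert the retain/release calls automatically at each point a reference is copied or dropped. The class and function names here are made up:

```python
# Toy reference counting: each heap object carries a count, and freeing
# happens as soon as the count reaches zero.
class Obj:
    def __init__(self, payload):
        self.payload = payload
        self.refcount = 1          # the creating reference

freed = []                         # stand-in for actually releasing memory

def retain(obj):
    obj.refcount += 1

def release(obj):
    obj.refcount -= 1
    if obj.refcount == 0:
        freed.append(obj.payload)

o = Obj("list")
retain(o)       # a second reference is created
release(o)      # one reference goes away
release(o)      # last reference goes away -> freed immediately
print(freed)    # ['list']
```

Two objects that retain each other never reach a count of zero, which is exactly the cycle problem that forces you to offer weak pointers.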
If you are transpiling to JS, C++, or Rust, you can offload the memory management to the target language: the runtime’s GC for JS, smart pointers like shared_ptr for C++, and Rc in Rust.
You can also choose not to implement memory management at all, like C, if you like.
If your language targets WebAssembly, you already have interoperability with JS and a bunch of other languages thanks to wasmtime and other similar WebAssembly runtimes.
You can also generate Python wrappers from your high-level or custom IR using ctypes. This is very useful because you can leverage Python to write tests, so you don’t have to come up with your own testing framework.
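A ctypes wrapper boils down to loading a shared object and declaring each function’s C signature. To keep this runnable without your compiler’s output, the sketch loads the system C library and calls strlen (assuming a Unix-like OS); your generated wrappers would load your own .so and declare your own functions instead:

```python
import ctypes
import ctypes.util

# Load a shared library; for your language this would be the shared
# object produced by your compiler, not libc.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature so ctypes converts arguments correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello"))  # 5
```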
It is very important to have interoperability with C if you are planning to have a basic standard library that makes calls to the OS. For languages that transpile to C or compile to an executable, you just need to provide a mechanism to declare external functions and the linker will take care of the rest. For a bytecode interpreter, you will need to provide some sort of API to load and call functions from shared object files, similar to the ctypes API in Python.