Personally, I love how working on compilers requires a combination of both practical and theoretical knowledge.
In this project, we will be building a JIT (Just-In-Time) compiler for a very small subset of C that I nick named μC to gain confidence in recursive descent parsing and generating machine code programmatically. I tried to keep the compiler as basic as possible because I believe it's important to start with the simplest solution before moving on to more complex ones. Building a very simple compiler will show us some of the problems we are going to encounter with more advanced compilers.
The full source code for the project is here.
As a side note, I am currently building a more complicated compiler that uses the Single Static Assignment form and also does proper-ish register allocation, I will try to write about some of the algorithms that I implemented when I finish it, but it still needs a lot of work.
Mandatory warning: If you just want to create your own programming language and don't care too much about making something from scratch, You are better off just writing a LLVM Frontend. LLVM Can generate really beautifully optimized machine code. Instead of the horribly inefficient machine code we will generate. The section on recursive descent parsing would still be useful for you though.
The basic anatomy of a compiler (optional)
We don't really have to worry about any of this, this is just to give you an idea, may be slightly inaccurate.
- Tokenization & Parsing (we can consider tokenization and parsing as a single step).
- Intermediate Representation Construction The intermediate representation is how the compiler represents the program internally.
- Usually Some Optimizations Where the compiler makes changes to the program to make it run faster without changing the defined behavior of the program(hopefully :D).
- Legalization Usually some of the architecture specific quirks needs to be handled before instruction selection.
- Instruction Selection Where the compiler selects which architecture specific instructions to use to represent the program.
- Register Allocation Usually there is no limit on how many variables we can have in a program, however there are limited number of hardware registers available. This is the stage where we store some variables in memory loading them back to registers when they are needed.
- Instruction Encoding Encode all instructions into a byte stream getting the program ready to be read by the cpu.
- Static Linking We don't know addresses of all functions before encoding the program, Since the offsets of functions depend on the sizes of instructions which which we usually don't know in advance.
- JIT execution, executable file or module creation
Hopefully, I will go over each of these steps in the future but now let's take a look at how the our most basic compiler is going to look like.
Anatomy of our simple compiler
We can combine most of the steps above into a single step, and skip a few of them (at the cost of quality of the code generated of course)
- Parsing & Instruction Selection & encoding We will combine parsing and code generation into a single step.
- Simple Linking We will have a very simple linking step at the end.
- JIT execution It's easier to do jit execution instead of creating an executable file.
This is it !
The simplifications of our compiler:
- Horribly inefficient machine code generation.
- No types, every type is a 64-bit integer, floating point numbers and structs are not supported.
- Every value is stored in the stack, registers are not utilized properly.
- No Intermediate Representation or Abstract Syntax Tree. Code generation combined with parsing.
Modern compilers are often millions of lines of code compared to our ~1000 lines.
Tokenization(Lexing)
It's useful that we think of the code as a series of tokens. Tokens are like words. We essentially assign each token a class like; identifier, number, left parenthesis, right parenthesis, operator, etc. This helps us understand the contents of code (which is just text) and makes it easier for us to parse it later.
Essentially the tokenizer takes in text in the form of:
And outputs:
That's easier to understand than {'a', 's', 'd', ' ', '(', ')', ...} right?
It makes sense to have tokenizer as a separate component if it involves complex logic. But since my tokenizer only needs one character to determine token type in most cases, I opted to not have a separate tokenizer and just combined tokenization and parsing.
To give an example on how we can combine tokenization and parsing:
It's as simple as that !
Recursive descent parsing
Most modern compilers (clang, gcc, etc) don't use parser generators and instead use handwritten(🧿) recursive descent parsers. I think once it "clicks" in your mind it be trivial to write any recursive descent parser you want!
For example, let's imagine that a function can be defined like this in our language:
We can parse it like this:
Well, that was simple enough right? If not, keep in mind that you can go over the finished code with a debugger, I am sure that will help.
Operator Precedence Parsing (and very cool parsing animation demo)
I think the most of difficult part of parsing is to get operator precedence parsing right.
123 + 10 * 5 + 120 / 2 should be parsed as 123 + (10 * 5) + (120 / 2) not as ((123 + 10) * 5) + 120) / 2.
I made a very cool demo that demonstrates operator precedence parsing, make sure you try it (I spent way too much time on it, so pls try it lol).
In the demo above examine how combining different operators result in different parse trees, try an expression with parentheses. Please note that only simple expressions with numbers are supported.
Simple operator precedence parsing
Let's start by writing a function to only parse multiplication and division operations. Since both multiplication and division have the same precedence, we just have to parse it from left to right. We will return if we ever see an operator that is not the correct precedence level.
Also, we have to consider that a number by itself is a valid expression(like "123"), we need to be able to handle that case as well.
Now let's handle parsing additions and subtractions. We need to be able to handle "10 * 20 + 10".
Using what you have learned so far, you can try to create a calculator program that takes in a expression like '33 / (10 + 23)' calculates the result. If you want, you can see the example calculator program I wrote to demonstrate the C dialect we are creating here
Generalized operator precedence parsing and compilation
As you might have seen, the parse_multiply and parse_add functions share too much in common, and since we will have multiple levels of operator precedences, our code would get really ugly real quick. We can generalize the binary operator parsing to make it very easy to add new operators.
The parse_atom function parses an integer, string, or expression in parenthesis. For int and string values, the parse_atom function emits assembly to store the constant value in the stack and returns the stack position corresponding to the position where the value is stored.
As described in the comments compile_operator_t is a function pointer that takes in stack positions of input values and generates assembly to do the operation, and returns the stack slot where the result value is stored. I will show the implementation of these functions later.
Please note that I will try to port the code to arm whenever I have time, but I wanted to start with x86. Follow me on Twitter if you are interested.
The x86 architecture has its roots in 1970s, and a modern x86 CPU can still execute a 16-bit operating system that was built in the 1980s. Being so old and so backward compatible does have some caveats. Encoding x86 instructions is kinda difficult and frankly it took me a long time to understand.
Intel's Architecture Software Developer Manual describes the x86 architecture in detail, including how to encode each instruction. I strongly suggest that you take a look because I won't go over how each instruction we use is encoded.
Let's take a look at how to encode a simple ADD instruction. The ADD instruction is listed under Chapter 3 of Volume 2 of the Architecture Software Developer Manual. We can see that there is a table.
04 ib | ADD AL, imm8 |
05 iw | ADD AX, imm16 |
... | ... |
01 /r | ADD r/m16, r16 |
01 /r | ADD r/m32, r32 |
REX.W + 01 /r | ADD r/m64, r64 |
... | ... |
I suggest that you take look at Chapter 2 of Architecture Software Developer Manual which describes the instruction format and also the first part of Chapter 3 which describes how to interpret the instruction listings.
We are specifically interested in the ADD r/m64, r64 variation of the instruction. r/m64 means that one of the operands is ModRM which I will explain later. And r64 means that the other operand is a 64-bit register.
And although we can guess that 01 means 0x01 in hex what does /r mean? "/r" means that the second register operand is encoded as a part of the ModRM byte.
Also, there seem to be 3 different instructions that use the same encoding.
As I said before, a modern x86 processor can run a 16-bit operating system. Meaning that the processor has a 16-bit mode. And when you use the 01 /r encoding the processor will interpret that as a 16-bit instruction if it's running in 16-bit mode. The 16-bit version is still usable in 64-bit mode however, you need to use a specific prefix for it.
Next, what does the REX.W in REX.W + 01 /r mean? Differently from the 16-bit mode, instructions are 32-bit by default in 64-bit mode, and we need to use the REX prefix in order to access the 64-bit data type version of the instruction. I believe this is for backward compatibility (you can run a 32-bit program in a 64-bit mode through special compatibility mode).
The REX prefix can be encoded with the function below.
NOTE: push_intXX() family of functions just emit a byte to the output stream. We assume they are already defined in the rest of the article.
As specified in the comment, we need to give 1 to the W field to make the instruction 64-bit.
What is ModRM.
ModRM is what makes x86 a CISC (Complex Instruction Set Computer) architecture. ModRM allows a single opcode to have different addressing modes. The / in the REX.W + 01 /r means that, this instruction uses the ModRM byte.
The ModRM byte has 3 parts. * mod Specifies which addressing mode we are using. * regop Either register for other side or a constant value for unary operations (/1 etc). * rm Register used for the addressing mode.
Please note that for unary instructions the encoding will be like /4 which means that regop must be 4. An example of this is shown with the encoding of the call r/m64 instruction.
The code below encodes the ModRM byte.
0 | [rm] | Value in memory addressed by register rm. |
1 | [rm + disp8] | Value in memory addressed by, registerrm plus an offset. |
2 | [rm + disp32] | Same as above but a 32-bit offset instead of 8-bit. |
3 | rm | Just value in the register. |
I said before that rm is the register and regop is sometimes used as a register but which value corresponds to which register? We can use the table bellow.
RAX | 0 | The accumulator. Some instructions require that one of the operands is in this register. |
RCX | 1 | |
RDX | 2 | |
RBX | 3 | Frame pointer, points to the start of the current frame. |
RSP | 4 | Stack pointer, points to the end of the current stack. Can't be used in ModRM. |
RBP | 5 | |
RSI | 6 | |
RDI | 7 | |
R8 | 8 | See note below. |
R9 | 9 | |
... | ... | |
R15 | 15 |
Important Note: For encoding registers R8-R15 you must use the REX prefix R and B fields. The reason we have to do this is because the registers R8 to R15 were added with the 64-bit mode. And the instruction encodings are same for 64 and 32 bit modes due to compatibility reasons.
From now on, I will assume that the registers are defined in a enum. Let's put it all together and try to encode ADD RCX, RDX
Before we move on let's check that our code is working properly, we can save the byte stream to a file and disassemble it using the objdump command.
Aaand the output is:
Yaay ! Let's try to encode ADD [RCX], RDX this will add RDX to the value in memory addressed by register rm.
Now let's try something a little bit more difficult, as I said before, we will be storing every variable in the stack. So let's try RBP (Frame pointer) relative addressing. This is the most important one because we will be primarily using this addressing mode.
ADD [RBP - offset], RCX
Most instructions only accept 32-bit constant values as operands, to use 64-bit constant values, we can use the MOVABS reg64, imm64 instruction to load the value to a register.
If you don't know which instruction to use, you can cheat a little bit and use godbolt.com to compare C code to assembly.
The processor will reject to execute code located in pages that are not marked as executable, and in modern operating systems, allocated memory is not marked as executable unless we specifically ask for it to be. This is a security feature to try to make it more difficult to build exploits. We need to ask the kernel specifically to map executable memory to our process.
This is how you do it linux:
We looked at parsing, and encoding x86 instructions now let's put everything together.
Setting up and cleaning stack frames
When we enter a function, we must first setup the stack frame and on exit we must remove it. Since we will be keeping all values in the stack we also need to allocate stack space.
One problem here is that we need to emit the code for stack frame allocation before we compile the rest of this function. We don't actually know the amount of space we need to allocate.
The solution is to hold a pointer to the point where the constant value lives in the output stream and fill it in later.
Implementing binary operators
The add_slots function that was referenced in the operator table is as follows. Remember that, this function will be called by compile_expression_. It takes in two stack offsets a and b and returns another stack offset containing the result value.
I won't show the code to encode all instructions, you can check the source code.
Calling functions
We can look at the instruction reference and find the encoding for the "call" instruction. We can see that there are several variations of the call instruction. I preferred to go with CALL r/m64 since we also want to support function pointers. We can see that the variation we chose is encoded as FF /2 meaning that the opcode is 0xFF and /2 means that there is a ModRM with the regop being 0x2.
Finding pointers of external functions
We want to be able to call LibC functions such as puts, printf, scanf, etc as well as other functions from other libraries. To find the address of an external function, we can use the dlsym function. For example:
Passing arguments
We now know the address of a function, but how will we pass arguments to it?
To know where to put the arguments, we need to look at the calling convention. Linux uses the System V Application Binary Interface. And for X86 we need to look at System V Application Binary Interface AMD64 Architecture Processor Supplement.
TLDR: For integer and pointer types we can use RDI, RSI, RDX, RCX, R8, R9 in that exact order, and the result value is used in the RAX register. Since we keep everything in stack, we don't have to worry about callee and caller saved registers.
Constructing a hello world program
Let's write a program that generates a function that calls the puts function and passes a hello world message as an argument.
Handling Control flow
To compile an if block, we compile the conditional expression, compare it's value to zero, and if the value is zero we jump over the code block where the contents of the if block is located. One small problem is that we don't know the length of the code block since we haven't even parsed it yet !
What we can do instead is temporary emit zero, and fill in the value later.
Handling a while loop is just the same, except that you have a jmp to the point where you evaluate the condition.
Static Linking
When handling the compilation of an if block we had to fill in the value of jump_offset_point later because we didn't know it's offset yet. This is called a forward reference. In this particular case since there is only one reference to the unknown address we just used a pointer. But there are cases where there could be arbitrary number of references to unknown positions in our program.
To give examples:
Forward declarations of local functions:
Return statments: To implement the return statement we must evaluate the returned expression, move the result to RAX and the jump to the end of the function (we can't just have a ret instruction because we need to clean the stack).
Break statments: To implement the break statement, we need to insert a jump to the position where the loops ends.
Since there could be an arbitrary number of references, we can't get away with a simple pointer this time. We need to keep a list of references to a particular unknown value, this is called a relocation list. At the end of compilation, we go over the list of relocations and fill in the real values.
Variable Handling
To implement variables, we just need to keep a map between stack slots and variable names. In the compile_atom function, to compile a variable, we just have to look it up in the variables array.
The contents in this article should be enough ammunition for you to build esoteric compilers of your own.
I suggest you checkout Fabrice Bellard's Obscured Tiny C Compiler which I drew some inspiration from. It can compile it's own source code but the code itself is very difficult to understand.
In this article I mainly focused on practical stuff. I strongly suggest that you read textbooks on compilers if you want to learn more about the more advanced or theoretical subjects. I enjoyed reading "Engineering a Compiler 2nd Edition by Keith D. Cooper (Author), Linda Torczon".
If you think there where any errors in the article, feel free to send me an email (you can find it in the index page).
- Add more language features (add else if, for loops etc) (easy).
- Port it to other architectures (hard).
- Instead of executing it jit, write the output into an elf file (medium).
- Add support for other types, such as floating point numbers. You need to figure out how encode floating point instructions, and extend the compiler to support different variable types. (hard/medium).
- Instead of emitting assembly directly, figure out how to emit LLVM IR instead (medium).