I’m building another programming language - purple-garden - this time less lispy, but still s-expr based. It’s not inherently functional, but has pure functions. I wanted to make a fast programming language that’s fun to implement and fun (for me) to program in. So far, this is the result.
Some virtual machines are stack based and implement function calls via the stack, others use registers for all function arguments. I decided to go with a hybrid approach: registers for function calls with a single argument, and a stack for functions with n>1 arguments. This will hopefully allow me to keep the vm implementation simple, optimize for the common case of a single argument and use the least amount of instructions possible.
Since some consider programming languages, and implementing one, modern day magic, I’ll introduce some basic concepts while going over the architecture of purple garden’s interpretation pipeline.
An input like (@println "Hello World") or (* 3.1415 (* 1.5 1.5)) would, for a bytecode interpreter, generally pass through the following stages:
- Tokenisation
- Parsing
- Compiling to Bytecode
- Executing Bytecode
For instance, (* 3.1415 (* 1.5 1.5)):
1. Tokenisation
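For (* 3.1415 (* 1.5 1.5)) the token stream could look roughly like this (the token names are a sketch of the idea, not necessarily purple garden’s exact ones):

```text
T_DELIMITOR_LEFT
T_ASTERISKS
T_DOUBLE(3.1415)
T_DELIMITOR_LEFT
T_ASTERISKS
T_DOUBLE(1.5)
T_DOUBLE(1.5)
T_DELIMITOR_RIGHT
T_DELIMITOR_RIGHT
```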
Tokenisation refers to, as the name implies, splitting the input up into tokens with meaning. In purple garden a token is simply a tagged value container. Generally it would also hold data related to its source location, like a line and a column.
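Boiled down to a minimal sketch (the shape of the idea, not the actual purple garden definitions):

```c
#include <stddef.h>

// possible tags for the tokens shown above
typedef enum {
  T_DELIMITOR_LEFT,
  T_DELIMITOR_RIGHT,
  T_ASTERISKS,
  T_DOUBLE,
  T_STRING,
} TokenType;

typedef struct {
  TokenType type; // the tag of this tagged value container
  size_t line;    // source location, mostly for error messages
  size_t column;
  union {         // only atoms carry a payload
    double floating;
    const char *string;
  };
} Token;
```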
2. Parsing
The parser uses the tokens produced by the tokenizer to build an abstract syntax tree (AST):
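For our example this tree could be dumped like so (using the N_ATOM[T_DOUBLE](1.5) notation that also shows up below; the exact dump format is illustrative):

```text
N_BIN[T_ASTERISKS](
  N_ATOM[T_DOUBLE](3.1415),
  N_BIN[T_ASTERISKS](
    N_ATOM[T_DOUBLE](1.5),
    N_ATOM[T_DOUBLE](1.5)
  )
)
```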
Now we know:
- the precedence of the s-exprs: 1.5*1.5 needs to be evaluated before multiplying the result with 3.1415
- how many elements each s-expr applies to
- which element belongs to which s-expr
3. Compiling to Bytecode
Bytecode is generally defined as a list of operators and their respective arguments encoded as bytes.
Since this vm is register based, the bytecode operates on the accumulator register (r0) and its operand. The compiler emits bytecode for the nodes of the AST from bottom to top:
Atoms
N_ATOM[T_DOUBLE](1.5) compiles to:
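Sketched (the disassembly format is illustrative):

```asm
LOAD 3 ; r0 = global pool[3], here Double(1.5)
```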
Where the argument 3 refers to the index into the global pool at which Double(1.5) is stored.
“Binary” operators
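Consider the inner multiplication from the example, as an AST (again in illustrative notation):

```text
N_BIN[T_ASTERISKS](
  N_ATOM[T_DOUBLE](1.5),
  N_ATOM[T_DOUBLE](1.5)
)
```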
The atom children will be compiled as explained above, the only change is moving the temporary value needed for the multiplication into another register:
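Roughly, with assumed global pool indices:

```asm
LOAD 3  ; r0 = Double(1.5)
STORE 1 ; park the lhs in r1, freeing r0
LOAD 3  ; r0 = Double(1.5) (assuming the pool deduplicates)
```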
The multiplication itself is done by MUL. The lhs is assumed to be in r0, the rhs lives in the register passed as the opcode’s argument:
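For the example:

```asm
MUL 1 ; r0 = r0 * r1
```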
This means the inner multiplication s-expr (* 1.5 1.5) compiles to the above and leaves the result in the accumulator register r0. Applying this to the root s-expr, we get:
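Sketched in full, again with assumed pool indices:

```asm
; (* 1.5 1.5)
LOAD 3  ; r0 = Double(1.5)
STORE 1 ; r1 = 1.5
LOAD 3  ; r0 = Double(1.5)
MUL 1   ; r0 = 1.5 * 1.5 = 2.25
; (* 3.1415 <result of the inner s-expr>)
STORE 1 ; r1 = 2.25
LOAD 4  ; r0 = Double(3.1415) (assumed pool index)
MUL 1   ; r0 = 3.1415 * 2.25
```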
4. Executing Bytecode
At a high level the bytecode virtual machine:
- indexes into bytecode array at vm->pc (program counter) for the current op code
- indexes into bytecode array at vm->pc+1 for the current argument
- executes the corresponding logic for the opcode and manipulates registers or other internal state
The whole pipeline uses two byte tuples: OP and ARG. This keeps the fetching, decoding and emitting lean and fast.
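A minimal sketch of such a loop, assuming a hypothetical Vm struct and opcode set (not the actual purple garden implementation, and with globals simplified to doubles):

```c
#include <stddef.h>
#include <stdint.h>

enum { OP_LOAD, OP_STORE, OP_MUL }; // hypothetical opcode numbering

typedef struct {
  size_t pc;           // program counter, indexes into bytecode
  double registers[8]; // r0 is the accumulator
  const double *globals;
  const uint8_t *bytecode;
  size_t bytecode_len;
} Vm;

void vm_run(Vm *vm) {
  // bounds checks beyond the loop condition omitted for brevity
  while (vm->pc + 1 < vm->bytecode_len) {
    uint8_t op  = vm->bytecode[vm->pc];     // current op code
    uint8_t arg = vm->bytecode[vm->pc + 1]; // its argument
    switch (op) {
    case OP_LOAD:  vm->registers[0] = vm->globals[arg];    break;
    case OP_STORE: vm->registers[arg] = vm->registers[0];  break;
    case OP_MUL:   vm->registers[0] *= vm->registers[arg]; break;
    }
    vm->pc += 2; // every instruction is a two byte OP+ARG tuple
  }
}
```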
Since some of the instruction meanings may not be obvious, here’s a table:
| Instruction | Meaning |
| ----------- | ------- |
| STORE <reg> | Moves the contents of r0 to reg |
| ARGS <amount> | Instructs the vm to expect amount arguments for calls and builtins |
| LEAVE | Leaves a scope |
| LOAD <global pool idx> | Loads a global into r0 |
| BUILTIN <hash> | Calls a builtin by the hash of its name |
| CALL <hash> | Calls a function by the hash of its name |
| VAR <reg> | Creates a variable, its name taken from r0 and its value from reg |
| LOADV <hash> | Loads a variable by its hash from the variable table into r0 |
| POP | Pops a value from the stack into r0 |
| PUSH | Pushes r0 onto the stack |
| PUSHG <global pool idx> | Pushes a global onto the stack |
PUSHG is an optimisation merging LOAD and PUSH, a common pattern in function call setup.
Since every good language needs a way to structure and reuse code, purple garden (pg) needs functions.
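For the examples below I’ll assume a definition syntax along these lines (the function names are made up and the exact pg syntax may differ):

```scheme
(fn no-args () (@println "no args"))
(fn single (a) (@println a))
(fn multiple (a b) (@println a b))
```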
And we just call them:
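```scheme
(no-args)
(single 1)
(multiple 1 2)
```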
Once a function is defined, the compiler emits bytecode for said function, including its body and the setup for all params. This differs for 0, 1 and n arguments:
No argument
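With the sketched syntax from above:

```scheme
(fn no-args () (@println "no args"))
```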
No argument, thus we can omit prep for argument passing:
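Roughly, with assumed pool indices and symbolic hashes:

```asm
LOAD 0         ; r0 = Str("no args") (assumed pool index)
BUILTIN <hash> ; @println, referenced by the hash of its name
LEAVE          ; leave the function's scope
```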
Single Argument
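Again with the sketched syntax:

```scheme
(fn single (a) (@println a))
```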
A singular argument is passed by storing a pointer to its value in r0; this makes calling functions with a single argument way faster than using the stack based method that enables n arguments:
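A sketch of the prep and body, with assumed pool indices and symbolic hashes:

```asm
STORE 1        ; the argument arrives in r0, move it into r1
LOAD 0         ; r0 = Str("a"), the parameter name
VAR 1          ; bind "a" to the value in r1
LOADV <hash>   ; body: load a back into r0
BUILTIN <hash> ; @println
LEAVE
```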
n Arguments
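And the n argument case:

```scheme
(fn multiple (a b) (@println a b))
```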
This is where the mix of registers and stack comes into play. The value for the last argument is always in r0, while any further arguments are stored on the stack:
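For (multiple 1 2) that distribution looks roughly like this:

```text
(multiple 1 2)
          ^ ^
          | '--> r0    (last argument)
          '----> stack (every argument before it)
```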
In the disassembled bytecode this results in simply one POP before commencing with the already known argument setup:
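Roughly (the exact binding order may differ):

```asm
STORE 1 ; bind the last argument, already in r0 ...
LOAD 0  ; r0 = Str("b")
VAR 1   ; ... as b
POP     ; the one POP: next argument from the stack into r0
STORE 1 ; then the already known setup ...
LOAD 1  ; r0 = Str("a")
VAR 1   ; ... binding a
; body and LEAVE follow as before
```

So much for the definition side - the callsites for the same three cases follow.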
No argument
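With the hypothetical no-args from above:

```scheme
(no-args)
```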
Since we don’t pass any arguments, the call is just OP_CALL and the bytecode index of the function definition, here 0:
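Which disassembles to just:

```asm
CALL 0 ; jump to the function definition at bytecode index 0
```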
Single Argument
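Calling the hypothetical single:

```scheme
(single 1)
```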
As introduced before, a single argument can just be used from r0; thus the callsite loads the global and invokes the call, no stack interaction necessary:
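Sketched, with assumed indices:

```asm
LOAD 0 ; r0 = Double(1) (assumed pool index)
CALL 0 ; invoke single (assumed bytecode index)
```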
n Arguments
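And the n argument callsite:

```scheme
(multiple 1 2)
```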
Like above, the last argument is always in r0, thus all other arguments need to be on the stack. This is implemented by PUSH (or PUSHG, if LOAD and PUSH would be emitted consecutively) at the callsite. For each argument except the last, a PUSH is required. The instruction before CALL or BUILTIN is ARGS, instructing the vm to pop the right amount of arguments off the stack.
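Sketched with assumed pool and bytecode indices:

```asm
PUSHG 0 ; push Double(1), LOAD and PUSH merged
LOAD 1  ; r0 = Double(2), the last argument stays in r0
ARGS 2  ; tell the vm how many arguments to expect
CALL 0  ; invoke multiple
```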
Warning
This is a microbenchmark done on a high noise & high load laptop system, so take everything here with a grain of salt.
TLDR: 25k function calls, each invoking a builtin (~50k calls total), run in ~4.2ms of vm runtime, which is ~15.5% of the total runtime. It also scales very well: 250k function calls (500k calls total) finish in ~13ms of vm runtime.