CodeSwitch bytecode and interpretation

Published on 2015-08-27
Tagged: codeswitch gypsum interpreter virtual-machines


I expect people will think the interpreter is the most interesting part of CodeSwitch, but it's actually the simplest. The interpreter is essentially a loop with a big switch-statement. In each iteration, it reads one instruction, switches on the opcode, branches to the appropriate case, then executes some code for that instruction.

For readers who don't follow this blog regularly, CodeSwitch is the virtual machine that executes Gypsum code. Gypsum is an experimental programming language I'm working on. Both can be found on GitHub.

Stack-based vs register-based interpreters

The CodeSwitch interpreter manages temporary values in a stack. Each instruction pops its operands off the stack (if it has any), computes a result (or performs some side-effect), then pushes the result (if there is one) back onto the stack. Local variables and parameters have permanent locations and are not pushed and popped.
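To make the loop-and-switch structure concrete, here is a minimal sketch of a stack interpreter in Python. It is not CodeSwitch's actual implementation: the opcode names and numbering, the (opcode, immediate) pair representation, and the run function are all made up for illustration.

```python
# Hypothetical opcodes for illustration; not CodeSwitch's real encoding.
PUSH, ADD, MUL, LDLOCAL, STLOCAL, RET = range(6)

def run(code, num_locals=0):
    """Execute a list of (opcode, immediate) pairs and return the final result."""
    stack = []                    # temporary values live here
    locals_ = [0] * num_locals    # fixed slots, never pushed or popped
    pc = 0
    while True:
        op, imm = code[pc]        # read one instruction
        pc += 1
        if op == PUSH:            # "switch" on the opcode
            stack.append(imm)
        elif op == ADD:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == MUL:
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        elif op == LDLOCAL:
            stack.append(locals_[imm])
        elif op == STLOCAL:
            locals_[imm] = stack.pop()
        elif op == RET:
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {op}")

# (2 + 3) * 4
program = [(PUSH, 2), (PUSH, 3), (ADD, None), (PUSH, 4), (MUL, None), (RET, None)]
result = run(program)
print(result)  # 20
```

Note that the arithmetic cases never name their operands; they always work on whatever is on top of the stack.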

Using a stack is one of two common strategies for implementing an interpreter. The other is to use registers: a fixed set of slots for storing local variables and temporary values. Any register can be accessed at any time, unlike a stack, where only the top values are accessible. Register-based interpreters tend to execute more efficiently than stack-based interpreters since there is less pushing, popping, loading, and storing, especially when local variables can be assigned to registers. Consequently, the same source expression may compile to fewer instructions for a register machine. In an interpreter, virtual registers generally cannot be mapped to hardware registers, because the interpreter has to select registers using indices from decoded instructions, and to my knowledge, no modern architecture provides a way to index hardware registers at run time. However, this mapping is very easy for JIT compilers.

The main advantage of a stack machine is that it's easy for the compiler to generate bytecode. When you compile an expression, you just compile each sub-expression (expecting results to be pushed onto the stack), then emit an instruction that pops all those values, does something with them, then pushes a new value. The Gypsum compiler works this way.
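Here is a rough sketch of that idea in Python. The tuple-based expression tree and the instruction names are assumptions chosen for the example; the real Gypsum compiler is more involved. The whole code generator is a post-order walk of the tree.

```python
def compile_expr(expr, out):
    """Append stack-machine instructions for expr to out and return out.

    expr is either an int literal or a tuple (op, left, right).
    These shapes and instruction names are hypothetical.
    """
    if isinstance(expr, int):
        out.append(("push", expr))
    else:
        op, left, right = expr
        compile_expr(left, out)    # leaves the left value on the stack
        compile_expr(right, out)   # leaves the right value on top of it
        out.append((op,))          # pops both, pushes the result
    return out

# (2 + 3) * 4
code = compile_expr(("mul", ("add", 2, 3), 4), [])
for instruction in code:
    print(instruction)
```

No decision about where values live is ever needed, which is exactly the work a register allocator would otherwise have to do.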

To generate bytecode for a register machine, you need a register allocator: an algorithm that assigns register numbers to values and variables. Register allocation algorithms are complicated, and if you're writing a compiler for a simple language, this can easily be the most complicated part of the compiler.

So the tradeoff is basically simplicity of implementation versus efficiency of execution. But this tradeoff only applies at the interpreter level. An optimizing compiler can easily take either kind of bytecode and generate efficient native code for it. So when you're JITing, it really doesn't matter.

The decision boils down to whether the compiler or the VM should have the complicated implementation. CodeSwitch is intended to be a cross-language VM with many different compilers targeting it. It doesn't have an optimizing JIT compiler yet, but it will some day. So the decision becomes obvious: it's better to have one complicated VM than to have many complicated compilers. That's why CodeSwitch has stack-oriented bytecode.

Bytecode format

A function in CodeSwitch is basically a sequence of bytecode instructions with a little metadata at the front (mostly type information). Each instruction consists of an opcode and zero or more immediate values. An opcode is a byte which indicates which operation should be performed. You can see a full list of opcodes in opcodes.yaml, and I'll explain most of them in the sections below. An immediate value is a number encoded directly into the instruction stream. Immediate values are used to represent constant values, function ids, method indices, instruction offsets, and several other things.

Most immediate values are 64-bit signed integers, and to save space they are usually encoded as variable bit numbers (VBNs). A VBN occupies a different number of bytes depending on its magnitude. Each byte of a VBN has seven significant bits; the high bit of each byte just indicates whether there are more bytes in the number. VBNs are decoded by concatenating all the significant bits (in little-endian order), then sign-extending the number to 64 bits. This encoding lets small numbers (which are more common) take less space than large numbers in the instruction stream.

Here are a few examples of VBNs:

Logical value         Encoded VBN
0                     0x00
1                     0x01
-1                    0x7f
1000                  0xe8 0x07
-1000                 0x98 0x78
0x0fffffff            0xff 0xff 0xff 0xff 0x00
0x8000000000000000    0x80 0x80 0x80 0x80 0x80 0x80 0x80 0x80 0x80 0x01
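The decoding rule can be sketched in a few lines of Python. This is an illustration written from the description above, not code taken from CodeSwitch; the function names are made up. The scheme is very similar to signed LEB128. The encoder here always emits the shortest form, so it is only one of possibly several valid encoders.

```python
def encode_vbn(value):
    """Encode a signed integer as a VBN (shortest form)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7                       # arithmetic shift keeps the sign
        sign_bit_set = bool(byte & 0x40)
        # Stop once the remaining bits are pure sign extension of this byte.
        if (value == 0 and not sign_bit_set) or (value == -1 and sign_bit_set):
            out.append(byte)              # high bit clear: last byte
            return bytes(out)
        out.append(byte | 0x80)           # high bit set: more bytes follow

def decode_vbn(data, pos=0):
    """Decode one VBN from data starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # little-endian groups of 7 bits
        shift += 7
        if not byte & 0x80:
            break
    if byte & 0x40:                       # sign-extend from the top data bit
        result -= 1 << shift
    result &= (1 << 64) - 1               # wrap to a signed 64-bit value
    if result >= 1 << 63:
        result -= 1 << 64
    return result, pos

print(encode_vbn(1000).hex(" "))          # e8 07
print(decode_vbn(bytes([0x98, 0x78])))    # (-1000, 2)
```

Returning the next position along with the value lets a caller decode several immediates in a row from the same instruction stream.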


Below is a list of nearly all the instructions the CodeSwitch virtual machine can execute. I will use some notation below to indicate instructions that have immediate values and instructions that pop operands from the stack. For example:

opc imm1, imm2

The instruction above has two immediate values (with the abbreviation imm) and does not push or pop anything from the stack.

opci64 pop1, pop2 →

The instruction above operates on 64-bit integers (indicated by the i64 suffix), pops two values from the stack (abbreviation pop) and pushes some result (indicated by the → symbol).

Families of instructions that perform the same operation on various types (for example, addi8, addi16, addi32, ...) are abbreviated with a * suffix (like add*).

Constant instructions

Constant instructions push a value on the stack without performing any computation. These are used to implement numeric literals and special values. The constant instructions are:

Arithmetic instructions

These instructions pop one or two values from the stack, perform some computation, then push the result back on the stack. These all correspond to the unary and binary operators you'd expect to find on any integer or floating point type.

Conversion instructions

CodeSwitch doesn't allow implicit conversions between numeric types. Conversions must be done explicitly with instructions.

Note that many of these conversions may be inexact. For example, if you convert the floating point number 1.5 to an integer, you'll lose the fractional part. This is not an error. If you need to check for inexact conversion, convert the number back and check for equality with the original value.
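The round-trip check looks like this in Python, using Python's own float and int as stand-ins for a 64-bit float and integer. The function name is made up, and the sketch assumes a finite input (Python's int() raises an exception for NaN and infinity).

```python
def f64_to_i64_checked(x):
    """Convert a finite float to an int and report whether it was exact."""
    converted = int(x)                # truncates toward zero
    exact = float(converted) == x     # convert back and compare
    return converted, exact

print(f64_to_i64_checked(1.5))   # (1, False)
print(f64_to_i64_checked(2.0))   # (2, True)
```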

Memory instructions

The stack is useful for temporary values, but you'll still want to access local variables and objects. CodeSwitch provides the following memory instructions to deal with this:

Memory instructions involving objects may throw NullPointerException if the object being loaded from or stored to is null. Load instructions may also throw UninitializedException if the variable or field being loaded hasn't been initialized yet. Only variables with non-nullable object types need to be initialized. Primitive types and nullable object types are implicitly initialized to 0, false, or null.
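As a rough model of these rules, here is a Python sketch. Everything in it is a stand-in: objects are plain dicts, the exception classes only borrow the names used above, and the UNINITIALIZED sentinel and the type names other than i64 and boolean are assumptions, not a description of how CodeSwitch represents values.

```python
class NullPointerException(Exception):
    pass

class UninitializedException(Exception):
    pass

UNINITIALIZED = object()   # hypothetical marker for "no value stored yet"

def default_value(type_name, nullable=False):
    """Initial contents of a variable or field of the given type."""
    if type_name in ("i8", "i16", "i32", "i64"):
        return 0
    if type_name == "boolean":
        return False
    # Object types: nullable ones start as null, others must be initialized.
    return None if nullable else UNINITIALIZED

def load_field(obj, name):
    """Model of a field load with both checks."""
    if obj is None:
        raise NullPointerException(name)
    value = obj[name]
    if value is UNINITIALIZED:
        raise UninitializedException(name)
    return value

point = {"x": default_value("i64"), "label": default_value("String")}
print(load_field(point, "x"))   # 0
```

Loading point's label field at this stage would raise UninitializedException, because a non-nullable object field has no sensible default.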

Control flow instructions

These instructions alter the flow of execution in a function by telling the interpreter which instruction should be executed next. They are used to implement things like if statements and while loops.

Most control flow instructions have immediate operands which are block indices. Each function has a block offset table, which is an array of offsets in the bytecode. Block indices are used to load block offsets from this table. Control flow instructions can cause execution to jump to one of these offsets.
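Here is a small sketch of how a block offset table can drive branching. The opcode names, their numbering, and the flat list layout are invented for the example and are not CodeSwitch's real instruction set. Branch immediates are block indices, and the interpreter turns them into bytecode offsets with a table lookup.

```python
# Hypothetical opcodes; each is followed by its immediates in the code list.
PUSH, BRANCH, BRANCHIF, RET = range(4)

def run(code, block_offsets, arg):
    """Run code with arg as the only value initially on the stack."""
    stack = [arg]
    pc = block_offsets[0]              # execution starts at block 0
    while True:
        op = code[pc]
        if op == PUSH:
            stack.append(code[pc + 1])
            pc += 2
        elif op == BRANCH:
            pc = block_offsets[code[pc + 1]]
        elif op == BRANCHIF:
            true_block, false_block = code[pc + 1], code[pc + 2]
            target = true_block if stack.pop() else false_block
            pc = block_offsets[target]
        elif op == RET:
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {op}")

code = [
    BRANCHIF, 1, 2,   # block 0 (offset 0): pick block 1 or block 2
    PUSH, 10,         # block 1 (offset 3)
    BRANCH, 3,
    PUSH, 20,         # block 2 (offset 7)
    BRANCH, 3,
    RET,              # block 3 (offset 11)
]
block_offsets = [0, 3, 7, 11]

print(run(code, block_offsets, True))    # 10
print(run(code, block_offsets, False))   # 20
```

One consequence of this design is that a branch can only land on an offset listed in the table, never in the middle of an instruction.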

Call instructions and type instructions

CodeSwitch provides several instructions for calling functions. When calling a function, we push all its arguments on the stack, then execute one of the call instructions. The function can access the arguments using ldlocal instructions. When the function returns, all arguments are popped, and the result is pushed.
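The calling convention can be modeled with a shared value stack, as in the sketch below. The function table, the opcode names other than ldlocal, and the way parameters are indexed (0 for the first argument) are assumptions made for the example; CodeSwitch's real indexing scheme and call instructions may differ.

```python
# Hypothetical opcodes for illustration.
PUSH, LDLOCAL, ADD, CALL, RET = range(5)

FUNCTIONS = {
    "add2": {"params": 2,
             "code": [(LDLOCAL, 0), (LDLOCAL, 1), (ADD, None), (RET, None)]},
    "main": {"params": 0,
             "code": [(PUSH, 40), (PUSH, 2), (CALL, "add2"), (RET, None)]},
}

def call(name, stack):
    """Run a function whose arguments are already on top of stack."""
    function = FUNCTIONS[name]
    base = len(stack) - function["params"]   # arguments stay where they are
    for op, imm in function["code"]:         # straight-line code only
        if op == PUSH:
            stack.append(imm)
        elif op == LDLOCAL:
            stack.append(stack[base + imm])  # read an argument in place
        elif op == ADD:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == CALL:
            call(imm, stack)
        elif op == RET:
            result = stack.pop()
            del stack[base:]                 # pop all arguments
            stack.append(result)             # push the result for the caller
            return

stack = []
call("main", stack)
print(stack)  # [42]
```

From the caller's point of view, a call behaves like any other instruction: it consumes its operands from the stack and leaves one result behind.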

If the function has type arguments, things get a little more complicated. The CodeSwitch bytecode contains all the information needed to figure out the type of any value on the stack. This is necessary for validation and for garbage collection; the garbage collector needs to be able to find and update pointers on the stack. Since the return type of a function may depend on its type arguments, the type arguments are encoded into the bytecode using type instructions.

Type instructions operate on a logical type stack, separate from the value stack. The interpreter doesn't actually do anything with this stack when it executes these instructions, which is why I use the word "logical". The instructions just act as annotations. When a call instruction is executed, types are popped from the logical type stack.

Note that the instructions above can only describe object types; there are no instructions to describe primitive types like i64 or boolean. Primitive types can't be used as type arguments, so instructions to describe them aren't necessary.

There are a few more instructions useful for type checking and casting.

Future work

At this point, CodeSwitch has a pretty complete instruction set which can be used to implement most static languages. However, there are no instructions yet to support dynamic languages. At the very least, I'll need dynamic load, store, and call instructions which accept property name strings. I'll also need a data type which can represent dynamic values. Maybe some kind of NaN boxing will work. I'll probably implement this when adding dynamic features to Gypsum.

Aside from that, the other task to be done is the JIT compiler. The interpreter will always be part of CodeSwitch, since the JIT compiler will need to run on top of it (I intend to write the JIT compiler in Gypsum), and something will need to run code while the JIT compiler is working in parallel. It will probably be a while before I get started on that, though.