CodeSwitch bytecode and interpretation

Published on 2015-08-27
Tagged: codeswitch gypsum interpreter virtual-machines

I expect people will think the interpreter is the most interesting part of CodeSwitch, but it's actually the simplest. The interpreter is essentially a loop with a big switch-statement. In each iteration, it reads one instruction, switches on the opcode, branches to the appropriate case, then executes some code for that instruction.

For readers who don't follow this blog regularly, CodeSwitch is the virtual machine that executes Gypsum code. Gypsum is an experimental programming language I'm working on. Both can be found on GitHub.

Stack-based vs register-based interpreters

The CodeSwitch interpreter manages temporary values in a stack. Each instruction pops its operands off the stack (if it has any), computes a result (or performs some side-effect), then pushes the result (if there is one) back onto the stack. Local variables and parameters have permanent locations and are not pushed and popped.

Using a stack is one of two common strategies for implementing an interpreter. The other is to use registers: a fixed set of slots for storing local variables and temporary values. Any register can be accessed at any time, unlike a stack, where only the top values are accessible. Register-based interpreters tend to have more efficient execution than stack-based interpreters since there is less pushing, popping, loading, and storing, especially when local variables can be assigned to registers. Consequently, the same source expression may compile to fewer instructions for a register machine. Virtual registers generally cannot be mapped to hardware registers in an interpreter because interpreters generally need to access registers using indices from decoded instructions, and to my knowledge, no modern architecture provides a mechanism to do that. However, this mapping is very easy for JIT compilers.

The main advantage of a stack machine is that it's easy for the compiler to generate bytecode. When you compile an expression, you just compile each sub-expression (expecting results to be pushed onto the stack), then emit an instruction that pops all those values, does something with them, then pushes a new value. The Gypsum compiler works this way.

To generate bytecode for a register machine, you need a register allocator: an algorithm that assigns register numbers to values and variables. Register allocation algorithms are complicated, and if you're writing a compiler for a simple language, this can easily be the most complicated part of the compiler.

So the tradeoff is basically simplicity of implementation versus efficiency of execution. But this tradeoff only applies at the interpreter level. An optimizing compiler can easily take either kind of bytecode and generate efficient native code for it. So when you're JITing, it really doesn't matter.

The decision boils down to whether the compiler or the VM should have the complicated implementation. CodeSwitch is intended to be a cross-language VM with many different compilers targeting it. It doesn't have an optimizing JIT compiler yet, but it will some day. So the decision becomes obvious: it's better to have one complicated VM than to have many complicated compilers. That's why CodeSwitch has stack-oriented bytecode.

Bytecode format

A function in CodeSwitch is basically a sequence bytecode instructions with a little metadata at the front (mostly type information). Each instruction consists of an opcode and zero or more immediate values. An opcode is a byte which indicates which operation should be performed. You can see a full list of opcodes in opcodes.yaml, and I'll explain most of them in the sections below. An immediate value is a number encoded directly into the instruction stream. Immediate values are used to represent constant values, function ids, method indices, instruction offsets, and several other things.

Immediate values are usually 64-bit signed integers, but to save space, they are usually encoded as variable bit numbers. A VBN is encoded with a different number of bytes, depending on its magnitude. Each byte of a VBN has seven significant bits; the high bit of each byte just indicates whether there are more bytes in the number. VBNs are decoded by concatenating all the significant bits (in little-endian order), then sign-extending the number to 64 bits. This encoding lets small numbers (which are more common) take less space than large numbers in the instruction stream.

Here are a few examples of VBNs:

Logical value	Encoded VBN
0	0x00
1	0x01
-1	0x7f
1000	0xe8 0x07
-1000	0x98 0x78
0x7FFFFFFFFFFFFFFF	0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00
0x8000000000000000	0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x01

Instructions

Below is a list of nearly all the instructions the CodeSwitch virtual machine can execute. I will use some notation below to indicate instructions that have immediate values and instructions that pop operands from the stack. For example:

opc imm1, imm2

The instruction above has two immediate values (with the abbreviation imm) and does not push or pop anything from the stack.

opci64 pop1, pop2 â†’

The instruction above operates on 64-bit integers (indicated by the i64 suffix), pops two values from the stack (abbreviation pop) and pushes some result (indicated by the â†’ symbol).

Families of instructions that perform the same operation on various types (for example, addi8, addi16, addi32, ...) are abbreviated with a * suffix (like add*).

Constant instructions

Constant instructions push a value on the stack without performing any computation. These are used to implement numeric literals and special values. The constant instructions are:

unit â†’ - the unit value. Used when a function or expression doesn't have any meaningful value to return.
true â†’, falseâ†’ - Boolean values
null â†’ - the null pointer
uninitialized â†’ - a special value used for local non-null pointer variables which aren't immediately defined.
i* â†’ - integer values (encoded in a VBN immediate value).
f* â†’ - floating point values (encoded raw, not using a VBN).
string â†’ - pushes a pointer to a string. Has an immediate value (encoded as a VBN) which is an index into the package's string list.

Arithmetic instructions

These instructions pop one or two values from the stack, perform some computation, then push the result back on the stack. These all correspond to the unary and binary operators you'd expect to find on any integer or floating point type.

add* pop1, pop2 â†’ - addition instructions for integers and floating point numbers of various widths.
sub* pop1, pop2 â†’ - subtraction.
mul* pop1, pop2 â†’ - signed multiplication.
div* pop1, pop2 â†’ - signed division.
lsl* pop1, pop2 â†’ - logical shift left.
lsr* pop1, pop2 â†’ - logical shift right (zero-extending the result).
asr* pop1, pop2 â†’ - arithmetic shift right (sign-extending the result).
and* pop1, pop2 â†’ - bitwise and for integers and Booleans.
or* pop1, pop2 â†’ - bitwise inclusive or.
xor* pop1, pop2 â†’ - bitwise exclusive or.
eq* pop1, pop2 â†’ - equality for integers, floating point numbers, and Booleans. The result is Boolean.
ne* pop1, pop2 â†’ - negated equality for integers, floating point numbers, and Booleans.
lt* pop1, pop2 â†’ - numeric less than.
le* pop1, pop2 â†’ - numeric less than or equal.
gt* pop1, pop2 â†’ - numeric greater than.
ge* pop1, pop2 â†’ - numeric greater than or equal.
eqp pop1, pop2 â†’ - reference equality for pointers.
nep pop1, pop2 â†’ - negated reference equality for pointers.
neg* pop1 â†’ - negation for integers and floating point numbers.
inv* pop1 â†’ - bitwise inversion for integers.
notb pop1 â†’ - negation for Booleans.

Conversion instructions

CodeSwitch doesn't allow implicit conversions between numeric types. Conversions must be done explicitly with instructions.

trunc* pop1 â†’ - truncates an integer or floating point value to a narrower type.
sexti16_8 pop1 â†’, sexti32_8, sexti64_8, sext32_16, sext64_16, sext64_32 - opcodes for sign-extending an integers to wider types. The first opcode sign-extends an 8-bit integer to a 16-bit integer.
zext* pop1 â†’ - zero-extends an integer to a wider type.
extf64 pop1 â†’ - widens a 32-bit floating point value to 64 bits.
fcvti32 pop1 â†’, fcvti64 - converts a 32-bit or 64-bit floating point value to an integer of the same width.
icvtf32 pop1 â†’, icvtf64 - converts a 32-bit or 64-bit integer to a floating point value of the same width.
ftoi32 pop1 â†’, ftoi64, itof32, itof64 - raw, bitwise conversions between integers and floating point numbers.

Note that many of these conversions may be inexact. For example, if you convert the floating point number 1.5 to an integer, you'll lose the fractional part. This is not an error. If you need to check for inexact conversion, convert the number back check for equality with the original value.

Memory instructions

The stack is useful for temporary values, but you'll still want to access local variables and objects. CodeSwitch provides the following memory instructions to deal with this:

ldlocal imm1 â†’ - loads a local variable or parameter and pushes its value on the stack. The immediate value indicates which variable to load. Positive values are parameters, and negative values are locals.
stlocal imm1, pop1 - pops the value on top of the stack and stores it in a local variable.
ldg imm1 â†’ - loads a global variable defined in the same package and pushes its value on the stack. The immediate indicates which global to load.
ldgf imm1, imm2 â†’ - loads a global variable defined in another package and pushes its value on the stack. imm1 indicates which package contains the global, imm2 indicates which global should be loaded. The f suffix indicates a foreign definition is being accessed.
stg imm1, pop1, stgf imm1, imm2, pop1 - pops a value from the stack and stores it in a global variable in this package (stg) or another (stgf).
ld8 imm1, pop1 â†’, ld16 imm1, pop1 â†’, ld32 imm1, pop1 â†’, ld64 imm1, pop1 â†’ - loads a primitive value from a field inside an object and pushes the value on the stack. The object pointer is popped off the stack. The immediate value indicates which field to load.
ldp imm1, pop1 â†’ - loads a pointer from field inside an object. Bytecode doesn't know how big pointers are, so this needs a separate instruction.
st8 imm1, pop1, pop2, st16 imm1, pop1, pop2, st32 imm1, pop1, pop2, st64 imm1, pop1, pop2 - stores a primitive value to a field inside an object. The value and the object pointer are popped from the stack. The immediate value indicates which field to write to.
stp imm1, pop1, pop2 - stores a pointer to a field inside an object. The garbage collector has a write barrier, so this involves some extra bookkeeping that isn't necessary for primitives.
allocobj imm1 â†’ - creates a new instance of a class in the same package and pushes a pointer to it on the stack. The immediate operand indicates which class to use.
allocobjf imm1, imm2 â†’ - creates a new instance of a class in another package and pushes a pointer to it on the stack. The first immediate operand indicates which package contains the class, and the second indicates which class.

Memory instructions involving objects may throw NullPointerException if the object being loaded from or stored to is null. Load instructions may also throw UninitializedException if the variable or field being loaded hasn't been initialized yet. Only variables with non-nullable object types need to be initialized. Primitive types and nullable object types are implicitly initialized to 0, false, or null.

Control flow instructions

These instructions alter the flow of execution in a function by telling the interpreter which instruction should be executed next. They are used to implement things like if statements and while loops.

Most control flow instructions have immediate operands which are block indices. Each function has a block offset table, which is an array of offsets in the bytecode. Block indices are used to load block offsets from this table. Control flow instructions can cause execution to jump to one of these offsets.

branch imm1 - jumps to another place in the function without altering the stack. The immediate value is a block index, indicating where to jump.
branchif imm1, imm2, pop1 - jumps to one of two locations (indicated by the immediate operands), depending on a Boolean value popped from the stack.
label imm1 â†’ - loads an offset from the function's offset table and pushes it on the stack. This is used with the branchl instruction to implement continuations.
branchl pop1 - branches to an offset popped from the stack that was pushed by a previous label instruction.
pushtry imm1, imm2 - enters a try-statement. The first immediate value is a block index to jump to. The second immediate value is the block index of the exception handler. This location is pushed onto a separate exception handler stack. If an exception is thrown before the corresponding poptry instruction is reached, execution will resume at the exception handler.
poptry imm1 - leaves a try-block by popping an exception handler from the handler stack. The immediate operand is a block index to jump to.
throw - throws an exception, popped from the top of the stack. If the exception handler stack is not empty, the handler on top is popped. Everything on the value stack will be popped down to where the exception handler was created, and execution will resume at the location indicated by the handler. The exception is pushed before execution resumes. If the handler stack is empty, a C++ exception will be thrown. Note that several other instructions may throw exceptions (e.g., NullPointerException); they operate in the same way.

Call instructions and type instructions

CodeSwitch provides a several instructions for calling functions. When calling a function, we push all its arguments on the stack, then execute one of the call instructions. The function can access the arguments using ldlocal instructions. When the function returns, all arguments are popped, and the result is pushed.

callg imm1 â†’ - calls a function in the same package. The immediate operand indicates which function to call.
callgf imm1, imm2 â†’ - calls a function in a different package. The first operand indicates which package contains the function, and the second operand indicates which function to call.
callv imm1, imm2 â†’ - call a virtual method on an object. The first immediate operand indicates how many arguments are being passed to the function. This is used to find the object which has the virtual method. The second immediate operand indicates which virtual method to call on that object.
ret - returns from the current function. The value on top of the stack is returned and pushed in the caller's stack frame.

If the function has type arguments, things get a little more complicated. The CodeSwitch bytecode contains all the information needed to figure out the type of any value on the stack. This is necessary for validation and for garbage collection; the garbage collector needs to be able to find and update pointers on the stack. Since the return type of a function may depend on its type arguments, the type arguments are encoded into the bytecode using type instructions.

Type instructions operate on a logical type stack, separate from the value stack. The interpreter doesn't actually do anything with this stack when it executes these instructions, which is why I use the word "logical". The instructions just act as annotations. When a call instruction is executed, types are popped from the logical type stack.

tycs imm1, tycsf imm1, imm2 - push a class type (from this package or another). Class types may have type arguments, so this may pop a number of types from the type stack. The s suffix indicates these instructions are static and don't affect the value stack.
tycd imm1, pop? â†’, tycdf imm1, imm2, pop? â†’ - same as above, but these instructions also push a real Type object on the value stack. The d suffix stands for dynamic. A variable number of values are popped, depending on how many type parameters the class has. These instructions are useful for checked casts and non-static type arguments.
tyvs imm1 - push a variable type (corresponding to a type parameter whose actual type is unknown) on the type stack. This is a static instruction, so the interpreter treats it as a no-op.
tyvd imm1 â†’, tyvdf imm1, imm2 â†’ - same as above, but these instructions also push a real Type object on the value stack. This is useful for describing type arguments of classes with static type parameters, since you won't know the real types at run-time. More on this in a future post.
tyflags imm1, tyflagd imm1, pop1 â†’ - sets flags for the type on top of the type stack. tyflagd also sets flags for a Type object on the top of the value stack. Useful for expressing nullable types.

Note that the instructions above can only describe object types; there are no instructions to describe primitive types like i64 or boolean. Primitive types can't be used as type arguments, so instructions to describe them aren't necessary.

There are a few more instructions useful for type checking and casting.

cast pop1, â†’ - static cast to a supertype. This is needed when storing into a local variable whose type is a supertype of the value being stored (for example, storing a String into an Object variable). This is effectively a type annotation for the local variable slot, since the type isn't specified anywhere else. The interpreter treats this as a no-op, but logically, a value is being popped from the stack, then immediately pushed with a more general type.
castc pop1, pop2 â†’ - checked cast to a subtype. The type being checked is popped from the value stack, so it must be constructed using the dynamic instructions above. The object being checked is left on the stack. If the check fails, a CastException is thrown.
castcbr imm1, imm2, pop1, pop2 â†’ - checked cast and branch. Same as above, but instead of throwing a CastException or not, this instruction branches to one block or another. This is useful for pattern matching.

Future work

At this point, CodeSwitch has a pretty complete instruction set which can be used to implement most static languages. However, there are no instructions yet to support dynamic languages. At the very least, I'll need dynamic load, store, and call instructions which accept property name strings. I'll also need a data type which can represent dynamic values. Maybe some kind of NaN boxing will work. I'll probably implement this when adding dynamic features to Gypsum.

Aside from that, the other task to be done is the JIT compiler. The interpreter will always be part of CodeSwitch, since the JIT compiler will need to run on top of it (I intend to write the JIT compiler in Gypsum), and something will need to run code while the JIT compiler is working in parallel. It will probably be a while before I get started on that, though.