A tour of V8: full compiler

Published on 2012-11-04
Edited on 2015-11-28
Tagged: javascript optimization v8 virtual-machines

This article has been translated into Serbo-Croatian.
Thanks to Anja Skrba of Web Hosting Geeks!
Also translated into Chinese. Thanks to liuyanghejerry!
And Polish! Thanks to Natasha Singh at couponmachine.in.
And now, Italian. Thanks to Calvina Bice.

Over the last five years, JavaScript performance has increased incredibly quickly, largely due to a transition from interpretation to JIT compilation in JavaScript virtual machines. This has greatly increased the usefulness of JavaScript and web apps in general. As a result, JavaScript is now the driving force of HTML5, the next wave of web technologies. One of the first JavaScript engines to generate and execute native code was V8, which is used in Google Chrome, the Android browser, WebOS, and other projects such as Node.js.

A little over a year ago, I joined a team at my company which optimizes V8 for our own ARM microarchitecture. Since that time, I've seen SunSpider performance double, and V8 benchmark performance increase by about 50%, due to contributions from both hardware and software.

V8 is a really interesting project to work on, but unfortunately, the documentation for it is a little sparse. In the next few articles, I'll provide a high level overview, which will hopefully be interesting for anyone curious about the internals of virtual machines or compilers.

High level architecture

V8 compiles all JavaScript to native code before executing it. There is no interpretation and no bytecode. Compilation is done one function at a time (as opposed to the trace-based compilation used by TraceMonkey, Firefox's old tracing JIT). Usually, functions aren't actually compiled until the first time they are called, so if you include a big library script, the VM won't waste time compiling its unused parts.

V8 actually uses two different JavaScript compilers: the full compiler and the optimizing compiler, Crankshaft. The full compiler (which is the focus of this article) is an unoptimizing compiler. Its job is to produce native code as quickly as possible, which is important for keeping page load times snappy. V8 compiles everything first with the full compiler, then uses a built-in profiler to select "hot" functions to be optimized by Crankshaft. Since V8 is mostly single-threaded (as of version 3.14), execution is paused while either compiler is running. Consequently, both compilers are designed to produce code quickly rather than spending a lot of time producing very efficient code. In future versions of V8, Crankshaft (or at least portions of it) will run in a separate thread, concurrently with JavaScript execution, enabling more expensive optimizations.

Why no bytecode?

Most VMs contain a bytecode interpreter, but this is notably absent from V8. You might wonder why the full compiler exists when it might be easier to compile to bytecode and execute that. It turns out that compiling to unoptimized native code isn't actually that much more expensive than compiling to bytecode. Consider the compilation processes for both:

Bytecode compilation:
  • Syntax analysis (parsing)
  • Scope analysis
  • Translate syntax tree to bytecode
Native compilation:
  • Syntax analysis (parsing)
  • Scope analysis
  • Translate syntax tree to native

In both cases, we need to parse the source code and produce an abstract syntax tree (AST). We need to perform scope analysis, which tells us whether each symbol refers to a local variable, context variable (used by closures), or global property. The translation step is the only part that is different. You can do very elaborate things here, but if you want the compiler to be as fast as possible, you pretty much need to do a direct translation: every syntax tree node would get translated to a fixed sequence of bytecodes or native instructions.
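To make the translation step concrete, here is a sketch of a direct, one-pass translation in JavaScript. The AST shape and output instructions are hypothetical, purely for illustration; V8's real internals look nothing like this. The point is only that each node type maps to a fixed output sequence:

```javascript
// Sketch of direct translation: each AST node type emits a fixed
// sequence. Node types and opcodes here are made up for illustration.
function translate(node, out) {
  switch (node.type) {
    case "NumberLiteral":
      out.push(["push_const", node.value]); // fixed sequence for a literal
      break;
    case "Add":
      translate(node.left, out);  // emit code for the left operand
      translate(node.right, out); // emit code for the right operand
      out.push(["add"]);          // fixed sequence for the operator
      break;
    default:
      throw new Error("unknown node type: " + node.type);
  }
  return out;
}

const ast = {
  type: "Add",
  left:  { type: "NumberLiteral", value: 1 },
  right: { type: "NumberLiteral", value: 2 },
};
const code = translate(ast, []);
// code: [["push_const", 1], ["push_const", 2], ["add"]]
```

Whether the fixed sequences are bytecodes or native instructions, the translation logic is the same; only the emitted form differs.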

Now consider how you would write an interpreter for bytecode. A naïve implementation would be a loop which fetches the next bytecode, enters a big switch statement, and executes a fixed sequence of instructions. There are various ways to improve on this, but they all boil down to the same structure.
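A minimal version of that loop, written in JavaScript over a made-up stack-machine instruction set (again, purely illustrative):

```javascript
// Naive bytecode interpreter: fetch the next instruction, dispatch
// through a big switch, execute a fixed sequence of work.
function run(program) {
  const stack = [];
  let pc = 0;
  while (pc < program.length) {
    const [op, arg] = program[pc++]; // fetch
    switch (op) {                    // dispatch
      case "push_const":
        stack.push(arg);
        break;
      case "add":
        stack.push(stack.pop() + stack.pop());
        break;
      default:
        throw new Error("unknown opcode: " + op);
    }
  }
  return stack.pop();
}

run([["push_const", 1], ["push_const", 2], ["add"]]); // evaluates 1 + 2
```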

Instead of generating bytecode and using an interpreter loop, what if we just emitted the appropriate fixed sequence of instructions for each operation? As it happens, this is exactly how the full compiler works. This removes the need for an interpreter, and simplifies transitions between unoptimized and optimized code.

In general, bytecode is useful in situations where you can do some of the compiler's work ahead of time. This isn't the case inside a web browser, so the full compiler makes more sense for V8.

Inline caches: accelerating unoptimized code

If you look at the ECMAScript spec, you'll find that most operations are absurdly complicated. Take the + operator, for instance. If both operands are numbers, it performs addition. If either operand is a string, it performs string concatenation. If the operands are something other than numbers or strings, some complicated (possibly user-defined) conversion to a primitive occurs before either numeric addition or string concatenation. Just by looking at the source code, there's no way to tell what instructions should be emitted. Property loads (such as o.x) are another potentially complicated operation. From looking at the code, you can't tell whether you're loading a normal property of the object itself (an "own" property), a property of a prototype object, a getter method, or some magic browser-defined callback. The property may not even exist. If you were to handle all possible cases in full compiled code, even this simple operation would require hundreds of instructions.
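You can see this dispatch-on-type behavior directly in JavaScript; the same + operator selects its behavior from the runtime types of its operands:

```javascript
// One operator, several behaviors, chosen at runtime:
1 + 2;                            // 3: number addition
"1" + 2;                          // "12": string concatenation
({} + "!");                       // "[object Object]!": ToPrimitive, then concatenation
1 + { valueOf() { return 40; } }; // 41: user-defined conversion, then addition
```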

Inline caches (ICs) provide an elegant solution to this problem. An inline cache is basically a function with multiple possible implementations (usually generated on the fly) which can be called to handle a specific operation. I previously wrote about polymorphic inline caches for function calls. V8 uses ICs for a much broader set of operations: the full compiler uses ICs to implement loads, stores, calls, binary, unary and comparison operators, as well as the ToBoolean implicit operation.

The implementation of an IC is called a stub. Stubs behave like functions in the sense that you call them, and they return, but they don't necessarily set up a stack frame and follow the full calling convention. Stubs are usually generated on the fly, but for common cases, they can be cached and reused by multiple ICs. The stub which implements an IC typically contains optimized code which handles the types of operands that particular IC has encountered in the past (which is why it's called a cache). If the stub encounters a case it's not prepared to handle, it "misses" and calls the C++ runtime code. The runtime handles the case, then generates a new stub which can handle the missed case (as well as previous cases). The call to the old stub in the full compiled code is rewritten to call the new stub, and execution resumes as if the stub had been called normally.
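The miss-and-patch cycle can be modeled in a few lines of JavaScript. This is only a toy model of the mechanism (a mutable stub slot patched by the runtime), not V8's actual implementation:

```javascript
// Toy model of an inline cache for a property load. The call site holds
// a mutable stub reference; a miss calls the "runtime", which installs
// a new stub covering the case it just saw, then resumes.
function makeLoadIC(property) {
  const site = { stub: uninitialized };
  function uninitialized(obj) {       // trivial stub: handles no cases
    return miss(obj);
  }
  function miss(obj) {                // stands in for the C++ runtime
    site.stub = function monomorphic(o) {
      // a real stub would check o's map against the expected map here
      return o[property];
    };
    return site.stub(obj);            // resume as if the stub had handled it
  }
  return obj => site.stub(obj);       // the rewritable call site
}

const loadX = makeLoadIC("x");
loadX({ x: 1 }); // first call misses; a monomorphic stub is installed
loadX({ x: 2 }); // later calls run the installed stub directly
```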

Let's take a simple example: a property load.

function f(o) {
  return o.x;
}

When the full compiler first generates code for this function, it will use an IC for the load. The IC starts in the uninitialized state, using a trivial stub which doesn't handle any cases. Here's how the full compiled code calls the stub.

  ;; full compiled call site
  ldr   r0, [fp, #+8]     ; load parameter "o" from stack
  ldr   r2, [pc, #+84]    ; load string "x" from constant pool
  ldr   ip, [pc, #+84]    ; load uninitialized stub from constant pool
  blx   ip                ; call the stub
  ...
  dd    0xabcdef01        ; address of stub loaded above
                          ; this gets replaced when the stub misses

(Sorry if you are not familiar with ARM assembly. Hopefully the comments make it clear what's happening.)

Here's the code for the uninitialized stub:

  ;; uninitialized stub
  ldr   ip,  [pc, #8]   ; load address of C++ runtime "miss" function
  bx    ip              ; tail call it
  ...

The first time this stub is called, it will "miss", and the runtime will generate code to handle whatever case actually caused the miss. In V8, the most common way to store a property is at a fixed offset within an object, so let's see an example of that. Every object has a pointer to a map, a mostly immutable structure that describes the object's layout. The in-object load stub checks the object's map against a known map (the one seen when the uninitialized stub missed) to quickly verify that the object has the desired property in the right position. This map check lets us avoid an expensive hash table lookup.

  ;; monomorphic in-object load stub
  tst   r0,   #1          ; test if receiver is an object
  beq   miss              ; miss if not
  ldr   r1,   [r0, #-1]   ; load object's map
  ldr   ip,   [pc, #+24]  ; load expected map
  cmp   r1,   ip          ; are they the same?
  bne   miss              ; miss if not
  ldr   r0,   [r0, #+11]  ; load the property
  bx    lr                ; return
miss:
  ldr   ip,   [pc, #+8]   ; load code to call the C++ runtime
  bx    ip                ; tail call it
  ...

As long as this call site only has to deal with in-object property loads, the load can be performed quickly with no additional code generation. Since the IC can only handle one case, it is in the monomorphic state. If another case comes up and the IC misses again, a more general megamorphic stub will be generated.
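Whether a load site stays monomorphic depends on the shapes (maps) of the objects that flow through it. A hypothetical illustration, assuming the map behavior described above:

```javascript
// Whether this load site stays monomorphic depends on receiver shape.
function getX(o) { return o.x; }

getX({ x: 1 });       // first shape seen: the IC becomes monomorphic
getX({ x: 2 });       // same shape, same map: fast path, no miss
getX({ x: 3, y: 4 }); // different shape, different map: the IC misses again
```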

To be continued...

As you can see, the full compiler fulfills its goal of quickly generating reasonably well-performing baseline code. Since the ICs handle all the complicated cases, full compiled code is very generic, which keeps the full compiler itself very simple. The ICs also make the code flexible enough to handle any case that comes along.

In the next article, we will look at how V8 represents objects in memory, allowing O(1) access in most cases without any structural specification from the programmer (such as a class declaration).