A tour of V8: Crankshaft, the optimizing compiler

Published on 2013-04-10
Edited on 2013-12-13
Tagged: javascript optimization v8 virtual-machines

This article has been translated into Chinese. Thanks to liuyanghejerry!

In the last two articles, we discussed V8's non-optimizing full-compiler and the object representation. Several years ago, the full compiler alone would generate relatively decent code from JavaScript, but performance demands are always increasing, and standards are higher now. Hence, Crankshaft was born.

Crankshaft is V8's optimizing compiler. If you'll recall, V8 actually has two compilers. The full-compiler generates unoptimized code as quickly as possible. This is ideal for code that only executes once, or a few times at most. Once full-compiled code has been running for a while, V8 identifies "hot" functions and recompiles them with Crankshaft, greatly improving performance.

You're getting warmer...

If you only look at source code, it's difficult to tell which functions will be most in need of optimization. V8 employs a runtime profiler which identifies hot functions as they execute.

When Crankshaft was first deployed, V8 used a stack sampling profiler, which ran in a separate thread. Every so often (1 ms on desktop, 5 ms on mobile), the profiling thread would wake up and send a SIGPROF signal to the execution thread. The signal handler on the execution thread would reset the stack limit (an address which normally marks the end of the stack). JITed code checks the stack limit whenever a function is called and on each iteration of a loop. If the stack pointer is past the stack limit, the JITed code calls the runtime to report the error. This provided the profiler with a useful mechanism to interrupt execution; the runtime would check if the stack overflow was due to SIGPROF and invoke the profiler if so. Once invoked, the profiler would sample the top few frames of the stack. If the profiler saw the same function several times, it would mark that function for optimization.

There were some significant drawbacks to the sampling profiler. Since sampling is mostly random, performance was non-deterministic. Although the profiler is statistically likely to identify the hottest functions, it might identify them in a different order or at different times on each page load. As you can imagine, benchmark scores varied wildly; certain tests in the V8 benchmark suite could vary by as much as 50% between runs. There was also a systematic bias against large functions without loops due to the way interruption worked.

V8 now uses a counter-based profiler. Every full-compiled function contains a counter which is decremented when the function returns or on each iteration of a loop. The number the counter is decremented by depends on the size of the function or loop, so there is less bias against large functions or loops. The profiler is invoked when the counter reaches zero. It identifies hot functions just as well as the sampling profiler (better, actually), and it does so in a deterministic fashion. There is also no profiling overhead for optimized code, since only unoptimized code contains counters.

When a function is "marked for optimization" by the profiler, its code pointer is rewritten to point to LazyRecompile, a built-in function which invokes the compiler. This causes the function to be optimized next time it is called.

Anatomy of Crankshaft

Crankshaft generates code in the following phases:

This structure is fairly typical of an optimizing compiler. However, there are some aspects which may be surprising. First, Crankshaft doesn't have any truly low-level representation. Hydrogen and Lithium instructions mostly correspond to high-level JavaScript operations. In some cases, dozens of native instructions may be emitted for each Lithium instruction. This leads to odd redundancies in the generated code. Second, Crankshaft lacks any form of instruction scheduling. This is not much of a problem on x86 CPUs, which tend to have good out-of-order execution and large issue windows, but it can cause problems on simpler RISC architectures like ARM.

These omissions are most likely due to Crankshaft's conspicuous position on the critical path of performance. JavaScript execution is halted while Crankshaft runs, which means Crankshaft must do its job quickly. Extra stages and optimizations may cost more time than they save. V8 developers are working on parallelizing Crankshaft, but that functionality hasn't been activated yet. The V8 heap doesn't support access by multiple threads, and without access to the heap, only the optimization stage can be executed in parallel. Making the heap thread-safe is a complicated task and would probably introduce substantial synchronization overhead.

Hydrogen and Lithium

V8 has a high level intermediate representation called Hydrogen, and a low level representation called Lithium. If you've worked with LLVM before, Hydrogen's structure may seem somewhat familiar. A function is represented as a control flow graph of basic blocks where each block contains a sequence of instructions in static single assignment (SSA) form. Each instruction has a list of operands and a list of uses, so you can think of a data flow graph being layered on top of the control flow graph. Each Hydrogen instruction represents a fairly high-level operation, such as an arithmetic operation, a property load or store, a function call, or a type check.

Most optimizations are performed on Hydrogen or while Hydrogen is being constructed. Here is a brief list of optimizations that Crankshaft performs.

Lithium is V8's low level, machine-specific representation. Actually, it's not very low level at all: each Hydrogen instruction is lowered into at most one Lithium instruction. Some Hydrogen instructions such as HConstant and HParameter just become Lithium operands. Other Hydrogen instructions are lowered to Lithium instructions, which are connected to each other by operands (as opposed to Hydrogen, where instructions are connected to each other directly). An operand can be a constant, a register, or a stack slot. The register allocator determines the type and location of each operand. Most instructions only support register operands, so the register allocator automatically inserts move, load, and store instructions between Lithium instructions as needed.

Native code is generated directly from Lithium instructions. Simple Lithium instructions may generate only one native instruction, but some operations may produce on the order of 10-20 instructions (on ARM; other platforms may vary).

It may surprise you how complicated JavaScript operations end up being, even in optimized code that only handles common cases. For instance, here are the instructions generated for adding a new property to an object.

;; HCheckNonSmi: check low bit to see if receiver is integer
;; Integers always have the low bit clear.
 40  tst r0, #1
 44  beq +188 -> 232

;; HCheckMaps: check receiver's map against known value
 48  ldr r9, [r0, #-1]
 52  movw ip, #41857             ;; object:  <Map(elements=3)>
 56  movt ip, #17120
 60  cmp r9, ip
 64  bne +172 -> 236

;; HCheckPrototypeMaps: check receiver's prototype chain in case
;; a read-only property with the same name has been added.
 68  movw r2, #38241             ;; object:  Cell for  <an Object>
 72  movt r2, #15696
 76  ldr r2, [r2, #+3]
 80  ldr r1, [r2, #-1]
 84  movw ip, #41937             ;; object:  <Map(elements=3)>
 88  movt ip, #17120
 92  cmp r1, ip
 96  bne +144 -> 240
100  movw r2, #46869             ;; object:  <an Object>
104  movt r2, #14674
108  ldr r1, [r2, #-1]
112  movw ip, #36057             ;; object:  <Map(elements=3)>
116  movt ip, #17120
120  cmp r1, ip
124  bne +116 -> 240

;; HConstant: this is the value we're storing
128  mov r1, #198

;; HStoreNamedField: everything below is the store operation
;; First, we load the transition map.
132  movw r9, #41977             ;; object:  <Map(elements=3)>
136  movt r9, #17120

;; Next, store it in the first word of the object.
140  str r9, [r0, #-1]

;; The garbage collector may need to be notified of the map change.
;; This is called a write barrier. It is necessary for incremental
;; marking to work.
144  sub r2, r0, #1
148  bfc r9, #0, #20
152  ldr r9, [r9, #+12]
156  tst r9, #4
160  beq +36 -> 196
164  mov r9, r0
168  bfc r9, #0, #20
172  ldr r9, [r9, #+12]
176  tst r9, #8
180  beq +16 -> 196
184  movw ip, #37056
188  movt ip, #10611
192  blx ip

;; Now we can finally store the value. Since we are only storing an
;; integer, we don't need to tell the garbage collector, but normally
;; we would need another write barrier here.
196  str r1, [r0, #+11]

On Stack Replacement

When we discussed the runtime profiler, I noted that the compiler is normally invoked the next time a function is called after the profiler marks it as "hot". When the compiler finishes, execution resumes at the beginning of the newly optimized code. This works well for most functions, but it doesn't help with functions that contain hot loops but are only called once.

function calledOnce() {
  for (var i = 0; i < 100000; i++)
    // ... complicated things in need of optimization here
}

These functions still need to be optimized, so a more complicated method of control transfer was devised. If the profiler is triggered by a function which has already been marked for optimization but has not been called again, the profiler will attempt on-stack replacement. The profiler invokes Crankshaft directly, and optimized code is generated immediately. After Crankshaft finishes, V8 constructs a new stack frame for the optimized code based on the contents of the unoptimized frame currently on the top of the stack. The old stack frame is popped, the new stack frame is pushed, and execution resumes in the middle of the optimized code.

This process requires some support from both compilers. The full compiler needs to generate some information about the location of each value in the stack frame. Crankshaft needs to generate information about where these values should be stored. Crankshaft also needs to generate a "landing pad", which loads values out of the stack frame into the correct registers.

Deoptimization

As mentioned above, Crankshaft generates code which is specialized to handle only the types of data previously encountered by the full-compiled code. Since Crankshaft code cannot handle every possible value, there needs to be a mechanism to gracefully fall back to full-compiled code when a type check fails or an arithmetic operation overflows. This mechanism is called deoptimization, and it is basically the opposite of on-stack replacement. However, there is a twist! Since Crankshaft supports inlining, and type checks may fail inside inlined functions, the deoptimizer may need to replace a stack frame for optimized code with multiple stack frames for unoptimized code.

To make the deoptimizer's job easy, Crankshaft generates deoptimization input data. Each point where deoptimization can occur has a sequence of commands associated with it. The commands are used by the deoptimizer to translate registers and stack slots in the optimized frame into values in full compiled frames. Each command indicates something should be fetched and pushed onto the stack. For example, "get the value in register r6, box it as a number, and push it into the next stack slot" would be a typical command. The deoptimizer simply looks up the correct commands, executes them in sequence, pops the optimized frame, then pushes the unoptimized frames.

Conclusion

Crankshaft is V8's secret performance sauce. Using the runtime profiler, V8 can identify which functions are most in need of optimization and spend time speeding up execution where it really counts. Since deoptimization allows V8 to fall back to unoptimized code at any time, Crankshaft can be speculative, generating code which only handles cases which are actually observed.

In the future, Crankshaft will likely be more parallelized. This will allow V8 to achieve even greater performance since more time can be spent optimizing individual functions and more functions can be optimized in the background.