A tour of V8: Crankshaft, the optimizing compiler
This article has been translated into Chinese. Thanks to liuyanghejerry!
In the last two articles, we discussed V8's non-optimizing full-compiler and the object representation. Several years ago, the full compiler alone would generate relatively decent code from JavaScript, but performance demands are always increasing, and standards are higher now. Hence, Crankshaft was born.
Crankshaft is V8's optimizing compiler. If you'll recall, V8 actually has two compilers. The full-compiler generates unoptimized code as quickly as possible. This is ideal for code that only executes once, or a few times at most. Once full-compiled code has been running for a while, V8 identifies "hot" functions and recompiles them with Crankshaft, greatly improving performance.
You're getting warmer...
If you only look at source code, it's difficult to tell which functions will be most in need of optimization. V8 employs a runtime profiler which identifies hot functions as they execute.
When Crankshaft was first deployed, V8 used a stack sampling profiler, which ran in a separate thread. Every so often (1 ms on desktop, 5 ms on mobile), the profiling thread would wake up and send a SIGPROF
signal to the execution thread. The signal handler on the execution thread would reset the stack limit (an address which normally marks the end of the stack). JITed code checks the stack limit whenever a function is called and on each iteration of a loop. If the stack pointer is past the stack limit, the JITed code calls the runtime to report the error. This provided the profiler with a useful mechanism to interrupt execution; the runtime would check if the stack overflow was due to SIGPROF
and invoke the profiler if so. Once invoked, the profiler would sample the top few frames of the stack. If the profiler saw the same function several times, it would mark that function for optimization.
There were some significant drawbacks to the sampling profiler. Since sampling is mostly random, performance was non-deterministic. Although the profiler is statistically likely to identify the hottest functions, it might identify them in a different order or at different times on each page load. As you can imagine, benchmark scores varied wildly; certain tests in the V8 benchmark suite could vary by as much as 50% between runs. There was also a systematic bias against large functions without loops due to the way interruption worked.
V8 now uses a counter-based profiler. Every full-compiled function contains a counter which is decremented when the function returns or on each iteration of a loop. The number the counter is decremented by depends on the size of the function or loop, so there is less bias against large functions or loops. The profiler is invoked when the counter reaches zero. It identifies hot functions just as well as the sampling profiler (better, actually), and it does so in a deterministic fashion. There is also no profiling overhead for optimized code, since only unoptimized code contains counters.
When a function is "marked for optimization" by the profiler, its code pointer is rewritten to point to LazyRecompile
, a built-in function which invokes the compiler. This causes the function to be optimized next time it is called.
Anatomy of Crankshaft
Crankshaft generates code in the following phases:
- Parsing: this phase translates the source code into an abstract syntax tree (AST). Crankshaft and the full-compiler share the same parser. V8 doesn't retain the AST (or any other intermediate representation) in memory between compilations because it takes up space, it's not frequently needed, and it's fairly easy to regenerate.
- Scope analysis: in this phase, V8 determines how variables are used and links them back to their definitions. Local variables, captured variables, and global variables are treated differently. This phase is also shared with the full compiler. Some usage patterns (like dynamically introducing variables with
eval
) may disqualify a function from optimization or inlining. - Graph generation: Crankshaft builds the Hydrogen control flow graph using the AST, scope info, and type feedback data extracted from the full-compiled code. Inlining also happens at this stage. Hydrogen is Crankshaft's high-level architecture-independent intermediate representation.
- Optimization: most optimizations are performed on the Hydrogen graph during this stage. This is the only stage of Crankshaft that can be done in parallel to the JavaScript execution thread. The parallel functionality is not enabled as of this writing though.
- Lowering: once optimization is complete, the Lithium graph is built out of the Hydrogen graph. Lithium is Crankshaft's low-level architecture-specific intermediate representation. Register allocation is performed at this stage.
- Code generation: in this final stage, Crankshaft emits a sequence of native instructions for each Lithium instruction. Meta-data, such as relocation info and deoptimization data, is also generated at this stage. Once complete, the JITed code is wrapped in a
Code
object, and execution can resume.
This structure is fairly typical of an optimizing compiler. However, there are some aspects which may be surprising. First, Crankshaft doesn't have any truly low-level representation. Hydrogen and Lithium instructions mostly correspond to high-level JavaScript operations. In some cases, dozens of native instructions may be emitted for each Lithium instruction. This leads to odd redundancies in the generated code. Second, Crankshaft lacks any form of instruction scheduling. This is not much of a problem on x86 CPUs, which tend to have good out-of-order execution and large issue windows, but it can cause problems on simpler RISC architectures like ARM.
These omissions are most likely due to Crankshaft's conspicuous position on the critical path of performance. JavaScript execution is halted while Crankshaft runs, which means Crankshaft must do its job quickly. Extra stages and optimizations may cost more time than they save. V8 developers are working on parallelizing Crankshaft, but that functionality hasn't been activated yet. The V8 heap doesn't support access by multiple threads, and without access to the heap, only the optimization stage can be executed in parallel. Making the heap thread-safe is a complicated task and would probably introduce substantial synchronization overhead.
Hydrogen and Lithium
V8 has a high level intermediate representation called Hydrogen, and a low level representation called Lithium. If you've worked with LLVM before, Hydrogen's structure may seem somewhat familiar. A function is represented as a control flow graph of basic blocks where each block contains a sequence of instructions in static single assignment (SSA) form. Each instruction has a list of operands and a list of uses, so you can think of a data flow graph being layered on top of the control flow graph. Each Hydrogen instruction represents a fairly high-level operation, such as an arithmetic operation, a property load or store, a function call, or a type check.
Most optimizations are performed on Hydrogen or while Hydrogen is being constructed. Here is a brief list of optimizations that Crankshaft performs.
- Dynamic type feedback: During graph generation, most operations are specialized to work on a single type. The types are extracted by examining the status of inline caches in full-compiled code. For instance, if the IC implementing a load operation only ever loaded a property from a single type of object, the specialized Hydrogen code will be optimized for handling only that kind of object. Type checks will be inserted as necessary.
- Inlining: Performed during the graph generation phase. The inlining heuristic is very simple: basically, if the function being called is known, and it is safe to inline, the function will be inlined. Large functions (>600 source characters including whitespace or 196 AST nodes) will not be inlined. Cumulatively, a total of 196 AST nodes can be inlined in a single function.
- Representation inference: Hydrogen supports three representations for values in registers: tagged, integer, and double. Tagged values are boxed. Strings and objects are always tagged, but numbers don't have to be. Using raw integers or doubles is more efficient, but all values must be tagged before they are stored in memory or passed to another function. This pass decides what representation each value should have.
- Static type inference: In this pass, Crankshaft attempts to determine types of values in the function. This pass is fairly ineffective, since it only operates on one function at a time, and the types of most operations (such as function calls and property loads) cannot be inferred. However, some type checks can be eliminated with accurate type information.
- UInt32 analysis: V8 represents small signed integers using 31 bits. The low bit is reserved by the garbage collector to distinguish pointers and integers (0 indicates an integer, 1 indicates a pointer). Crankshaft can use the full signed 32-bit range for local variables and temporary values which don't get stored in normal memory. Special handling is required when this range is overflowed, though. Semantically, JavaScript considers all numbers to be 64-bit doubles, so allowing integer wrap-around would be incorrect. This pass marks some operations as "unsigned" so that small overflows don't cause any trouble in certain situations. This is particularly useful for applications like cryptography, compression, and graphics.
- Canonicalization: This is a simplification pass. It eliminates unnecessary operations and simplifies others.
- Global value numbering (GVN): This is a standard optimization which eliminates redundancy. Each instruction is processed sequentially. When an instruction is processed, a hash code is generated based on its opcode, its inputs, and any data associated with it. It is then added to a hash table. If there is already an equivalent instruction in the hash table, GVN deletes the later instruction and updates its uses to point to the one already in the hash table. Instructions which have side effects can't be eliminated by GVN, but when such an instruction is encountered, the GVN table is cleared of instructions sensitive to those kinds of side effects. For example, it does not make sense to consolidate two identical loads if there is a store between them.
- Loop invariant code motion (LICM): This is done at the same time as GVN. Instructions which are in a loop but don't depend on anything inside the loop are hoisted into the loop pre-header. Type checks for loop invariant values may also be hoisted out of loops, but this may cause them to be executed in situations they normally wouldn't execute, so the compiler may be conservative about this.
- Range analysis: This pass determines an upper and lower bound for each integer operation. This can eliminate some overflow checks. In practice, it's not very accurate.
- Redundant bounds check elimination: Eliminates some extra array bounds checks done for array element access operations.
- Array index dehoisting: This reverses the effect of LICM for very simple expressions: array indices plus or minus a constant. All architectures supported by V8 have addressing modes that handle loads or stores with these indices with one or two instructions, so hoisting these operations is usually less efficient.
- Dead code elimination: This is a "clean up" pass that eliminates Hydrogen instructions with no uses or side effects, which are a byproduct of other optimizations. This can't usually eliminate dead code written by the programmer; V8 may need to switch between optimized and unoptimized code in the middle of a function, and if the optimized code doesn't compute a value which the unoptimized code expects, the unoptimized code would crash.
Lithium is V8's low level, machine-specific representation. Actually, it's not very low level at all: each Hydrogen instruction is lowered into at most one Lithium instruction. Some Hydrogen instructions such as HConstant
and HParameter
just become Lithium operands. Other Hydrogen instructions are lowered to Lithium instructions, which are connected to each other by operands (as opposed to Hydrogen, where instructions are connected to each other directly). An operand can be a constant, a register, or a stack slot. The register allocator determines the type and location of each operand. Most instructions only support register operands, so the register allocator automatically inserts move, load, and store instructions between Lithium instructions as needed.
Native code is generated directly from Lithium instructions. Simple Lithium instructions may generate only one native instruction, but some operations may produce on the order of 10-20 instructions (on ARM; other platforms may vary).
It may surprise you how complicated JavaScript operations end up being, even in optimized code that only handles common cases. For instance, here are the instructions generated for adding a new property to an object.
;; HCheckNonSmi: check low bit to see if receiver is integer ;; Integers always have the low bit clear. 40 tst r0, #1 44 beq +188 -> 232 ;; HCheckMaps: check receiver's map against known value 48 ldr r9, [r0, #-1] 52 movw ip, #41857 ;; object: <Map(elements=3)> 56 movt ip, #17120 60 cmp r9, ip 64 bne +172 -> 236 ;; HCheckPrototypeMaps: check receiver's prototype chain in case ;; a read-only property with the same name has been added. 68 movw r2, #38241 ;; object: Cell for <an Object> 72 movt r2, #15696 76 ldr r2, [r2, #+3] 80 ldr r1, [r2, #-1] 84 movw ip, #41937 ;; object: <Map(elements=3)> 88 movt ip, #17120 92 cmp r1, ip 96 bne +144 -> 240 100 movw r2, #46869 ;; object: <an Object> 104 movt r2, #14674 108 ldr r1, [r2, #-1] 112 movw ip, #36057 ;; object: <Map(elements=3)> 116 movt ip, #17120 120 cmp r1, ip 124 bne +116 -> 240 ;; HConstant: this is the value we're storing 128 mov r1, #198 ;; HStoreNamedField: everything below is the store operation ;; First, we load the transition map. 132 movw r9, #41977 ;; object: <Map(elements=3)> 136 movt r9, #17120 ;; Next, store it in the first word of the object. 140 str r9, [r0, #-1] ;; The garbage collector may need to be notified of the map change. ;; This is called a write barrier. It is necessary for incremental ;; marking to work. 144 sub r2, r0, #1 148 bfc r9, #0, #20 152 ldr r9, [r9, #+12] 156 tst r9, #4 160 beq +36 -> 196 164 mov r9, r0 168 bfc r9, #0, #20 172 ldr r9, [r9, #+12] 176 tst r9, #8 180 beq +16 -> 196 184 movw ip, #37056 188 movt ip, #10611 192 blx ip ;; Now we can finally store the value. Since we are only storing an ;; integer, we don't need to tell the garbage collector, but normally ;; we would need another write barrier here. 196 str r1, [r0, #+11]
On Stack Replacement
When we discussed the runtime profiler, I noted that the compiler is normally invoked the next time a function is called after the profiler marks it as "hot". When the compiler finishes, execution resumes at the beginning of the newly optimized code. This works well for most functions, but it doesn't help with functions that contain hot loops but are only called once.
function calledOnce() { for (var i = 0; i < 100000; i++) // ... complicated things in need of optimization here }
These functions still need to be optimized, so a more complicated method of control transfer was devised. If the profiler is triggered by a function which has already been marked for optimization but has not been called again, the profiler will attempt on-stack replacement. The profiler invokes Crankshaft directly, and optimized code is generated immediately. After Crankshaft finishes, V8 constructs a new stack frame for the optimized code based on the contents of the unoptimized frame currently on the top of the stack. The old stack frame is popped, the new stack frame is pushed, and execution resumes in the middle of the optimized code.
This process requires some support from both compilers. The full compiler needs to generate some information about the location of each value in the stack frame. Crankshaft needs to generate information about where these values should be stored. Crankshaft also needs to generate a "landing pad", which loads values out of the stack frame into the correct registers.
Deoptimization
As mentioned above, Crankshaft generates code which is specialized to handle only the types of data previously encountered by the full-compiled code. Since Crankshaft code cannot handle every possible value, there needs to be a mechanism to gracefully fall back to full-compiled code when a type check fails or an arithmetic operation overflows. This mechanism is called deoptimization, and it is basically the opposite of on-stack replacement. However, there is a twist! Since Crankshaft supports inlining, and type checks may fail inside inlined functions, the deoptimizer may need to replace a stack frame for optimized code with multiple stack frames for unoptimized code.
To make the deoptimizer's job easy, Crankshaft generates deoptimization input data. Each point where deoptimization can occur has a sequence of commands associated with it. The commands are used by the deoptimizer to translate registers and stack slots in the optimized frame into values in full compiled frames. Each command indicates something should be fetched and pushed onto the stack. For example, "get the value in register r6, box it as a number, and push it into the next stack slot" would be a typical command. The deoptimizer simply looks up the correct commands, executes them in sequence, pops the optimized frame, then pushes the unoptimized frames.
Conclusion
Crankshaft is V8's secret performance sauce. Using the runtime profiler, V8 can identify which functions are most in need of optimization and spend time speeding up execution where it really counts. Since deoptimization allows V8 to fall back to unoptimized code at any time, Crankshaft can be speculative, generating code which only handles cases which are actually observed.
In the future, Crankshaft will likely be more parallelized. This will allow V8 to achieve even greater performance since more time can be spent optimizing individual functions and more functions can be optimized in the background.