Hi everyone, my name is Anciano, this is Dominik. We both work on V8, which is a JavaScript engine that is used in many places, including Chrome. Today we will give you a high-level overview of the JavaScript execution pipeline in V8. In this talk we will mostly focus on the JavaScript execution pipeline, but we will also talk about memory management, a bit of WebAssembly, and DevTools. And at the end we will give you an overview of what the V8 team is currently working on.

So let's first define some terminology that often comes up around JavaScript and V8. V8 is Chromium's JavaScript implementation, or more precisely, Chromium's ECMAScript implementation. V8 is named after the car engine type, and you will see that many components in this project have similarly engine-inspired names. These days V8 implements not only JavaScript but also WebAssembly. And V8 is used not only in Chrome but also in many other projects, most prominently in Node.js. In Chrome and Blink we use a convenience library for interacting with V8 called gin, and this library is also used in other projects, for example the PDF viewer. Blink then adds another library on top of that, the so-called Blink bindings, and this layer is responsible for exposing web APIs to JavaScript.

A single instance of V8 is called an isolate, and each isolate has a separate garbage-collected heap. That means it's not possible to pass an object directly from one isolate to another. In Blink we use one isolate per renderer process, and all websites in the same process share the same isolate. There's an exception to that rule: web workers, since each worker gets its own isolate. Within an isolate there are then one or more V8 contexts. In Chrome we have one context for each iframe, and in addition, for each extension we also have a separate context per iframe. So you can imagine that there can be quite a lot of contexts within a single isolate. Blink bindings usually refer to a context through the ScriptState class, and there's a one-to-one relationship between a ScriptState and a V8 context.

So let's start looking into how JavaScript code gets executed in V8. In the first step of the execution pipeline we need to load some source code. V8 actually doesn't know how to download a file; it's the embedder's job to provide the script. In the case of Blink, the script is loaded like any other resource, so it might come from the network, the cache, or a service worker. V8 might also be able to cache the bytecode for a script, so on a warm load we can skip parsing and directly load the cached bytecode for the file. Another feature that V8 supports is streaming parsing, which allows us to parse a file while it is still being downloaded. This helps us hide the cost of parsing behind the network time.

The next step in the execution pipeline is the parser. First the scanner takes the stream of characters from the script and produces a stream of tokens. That stream of tokens is then passed as input to the parser, a handwritten recursive descent parser, which produces an abstract syntax tree. In V8 there are actually two kinds of parsers, the preparser and the full parser. The preparser helps us defer the cost of fully parsing a function, which improves startup. Preparsing is cheaper because it only does a syntax pass and some early error checking; for example, it doesn't produce an abstract syntax tree yet.
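To make lazy parsing concrete, here is a hedged sketch; the exact heuristics vary between V8 versions, and the function names here are made up:

```js
// `helper` is only preparsed at startup: V8 checks its syntax but builds
// no AST and generates no bytecode until the function is actually called.
function helper(x) {
  return x * 2;
}

console.log('startup work');  // top-level code is fully parsed and run

helper(21);  // first call: V8 now fully parses `helper` and compiles it
```

One well-known consequence is that wrapping a function in parentheses, as in an immediately invoked function expression, hints to the parser that it will be called right away, so it gets parsed eagerly instead.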
So let's look at a simple script and what the parser generates for it. Here's a script with a function foo, a local variable p, an assignment to that local variable, and a return statement. The parser generates this abstract syntax tree: there's a function literal for the function itself, a block node for the function body, a variable declaration for the local variable p, and two nodes for the assignment and the return statement. You can also see that there are variable proxies, which are references to variables. Initially those two variable proxies are unresolved; that means we don't yet know which variable those two nodes refer to.

The parser not only generates the AST but also constructs scopes, and each scope contains a list of all the variables declared in it. In our little script, the global scope would contain the function foo, and the scope for the function would only contain the variable p. After parsing is completed we run scope analysis, and in this phase we figure out that both variable proxies refer to the same variable p.

Then we move on to the next part of the pipeline, which is the interpreter, called Ignition. As we just saw, the output of the parser is the AST. This is fed as input to the bytecode generator, which generates a stream of bytecodes, and these bytecodes then get executed one at a time by the interpreter. There are several ways of building an interpreter. One would be a huge switch statement that describes what to do for each particular bytecode. V8 instead has an indirect threaded interpreter: we have a big table of handlers keyed by bytecode, so for a particular bytecode we can look up its handler, jump to it, and start executing.

Our interpreter has registers. They function similarly to machine registers, but they actually live on the stack. Bytecode is the source of truth for the interpreter, for our optimizing compiler TurboFan, and for the DevTools. This means that once we have generated all this bytecode, we can discard the AST, because we are not going to need it again. We record all sorts of metadata alongside the bytecode, such as source positions and the exception handler table.

In this example we have a function and the bytecode we generate from it. The bytecodes have been carefully selected to represent JavaScript efficiently and compactly, as one of Ignition's main goals was to save memory. Ignition has a single accumulator register that most operations reference implicitly. For instance, in TestEqualStrict you don't see the accumulator as an operand; you just see a0, the function's argument. That's because the accumulator is used implicitly there: we compare the argument a0 against the accumulator. We also have a stack pointer that is updated once per call frame, and it provides the base for a set of stack-based registers; these often hold the local variables of a function. JavaScript objects, which are allocated on the JavaScript heap, cannot be directly embedded in the bytecode. Instead we have a fixed array called the constant pool: we store the value there, and its index in that array is what we embed in the bytecode.
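As a concrete illustration, here is a small function together with the kind of bytecode Ignition might generate for it. This is a hedged sketch: the exact listing depends on the V8 version, but you can inspect the real output with `d8 --print-bytecode`:

```js
function isOne(a) {
  return a === 1;
}

// Plausible Ignition bytecode for isOne (comments added):
//   LdaSmi [1]               ; load the small integer 1 into the accumulator
//   TestEqualStrict a0, [0]  ; compare argument a0 against the accumulator,
//                            ; recording type feedback in slot 0
//   Return                   ; return the accumulator
```

Note how compact this is: a handful of bytecodes for the whole function body, with the accumulator never spelled out as an operand.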
V8's optimizing compiler ships with a code generator that uses a custom, machine-independent, portable assembler. The interpreter uses this assembler to generate the code for the bytecode handlers; that way we get portability and interoperability with the optimizing compiler. This assembler sits on the lowest tier of TurboFan, our optimizing compiler, so we get some optimizations for free.

Here is an example of how a handler for a bytecode is generated at build time; this produces the machine code that then gets run by the interpreter. In the highlighted line, we generate the code that obtains the register by looking at the next byte in the bytecode stream. Then we generate the code to load that register, load the accumulator, and perform the JS add operation. Finally we generate the code to store the result back into the accumulator and dispatch to the next bytecode.

Once we have all this bytecode generated, we can start executing it. JavaScript is a dynamic language, and even simple things like loading o.foo can have complex semantics. For instance, it could mean loading a plain property, calling a getter, or walking a prototype chain, and figuring out what to do dynamically every time is rather slow. What we do instead is cache what we did, and hope that the next time we encounter the same situation we get to do the same thing. In this scenario we emit a GetProperty operation with two slots in the feedback vector. In the first slot we store the shape of the object, its structure, and in the second one a bitfield describing what to do. If the shape matches, so if the cache actually hits, we already know what to do and don't have to do any dynamic lookups. If the feedback isn't there yet, or if it's not a match, we have to figure out dynamically what to do and then update the feedback, so that the next time we encounter this load we hopefully don't have to figure everything out dynamically again.

I talked about the structure of objects; how do we describe it? Object shapes are also called hidden classes in the literature; in V8 we call them maps, but it's roughly the same concept. They represent the structure of a JavaScript object: how its properties and elements are stored. First we create an empty shape for the empty object literal; it's empty because the object has no properties whatsoever. Once we create a new JS object, we store a pointer to this shape in the JS object. Then, when the constructor hits this.x, it has to add this property to the shape. It creates a new shape with the information about the new property: where to find it, its name, and any attributes it may have, for instance whether it's read-only. Then it updates the shape pointer of the JS object to the new shape, which now represents an object with the property x at offset 4. We also add a transition from the old shape to the new shape. This is done so that the next time we encounter this.x, we don't have to create a new shape; we just follow the transition tree. When we hit the second property, we do the same thing: create a new shape, update the shape pointer, and add a transition.
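Here is a hedged JavaScript sketch of that transition chain; the constructor and property names are made up to match the description:

```js
function Point(x, y) {
  this.x = x;  // shape transition: {} -> {x}
  this.y = y;  // shape transition: {x} -> {x, y}
}

const a = new Point(1, 2);  // creates the two shapes and the transitions
const b = new Point(3, 4);  // follows the existing transitions: a and b
                            // end up sharing the same final shape
```

Sharing shapes like this is exactly what lets the inline caches described above hit: every Point built this way looks the same to the feedback slots.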
Now that we can describe object shapes, let's see how we collect all this type feedback while we're executing. Again we have an o.foo load. At first we don't have any feedback, so everything is uninitialized. We have two slots, as I said: in the first one we store the shape, the structure of the object, and in the second one a bitfield describing what to do. We encounter an object, it has foo, perfect: we store the shape and the bitfield, and we transition from the uninitialized state into a monomorphic state. That means we have seen only one shape.

What happens when we encounter a second object? Well, maybe this object doesn't even have the foo property; it's a totally different object, so we would need to return undefined. Now we have seen two shapes, so we change from the monomorphic state into a polymorphic state. We store a special polymorphic marker in the slot where the shape would be, and the second slot now becomes a pointer to a handler array, an array of shape/bitfield pairs. If we hit this load again, we go through the handler array and check all the shapes.

The interpreter, Ignition, only gets us so far. If a function gets hot enough, we optimize it with our optimizing compiler to make it faster. And the type feedback not only helps execution in the interpreter; we also use it as input to the optimizing compiler, which generates optimized code based on the type feedback we have seen so far.

TurboFan is a sea-of-nodes, graph-based compiler. Nodes are expressed as inputs to one another, and at first we only know which nodes depend on which; scheduling happens much later. This gives us a lot of flexibility in moving code around the graph, for instance hoisting nodes out of loops. As I said, functions are compiled individually once they're hot; we don't compile everything ahead of time, and we've found that this works better. And everything gets generated from the Ignition bytecode, so there's no need to build the AST all over again.

TurboFan works in stages, or phases, where we go from a high-level to a low-level representation. We start with a representation that is similar to JavaScript, for instance "load a value from an array". Then we lower it into the simplified representation, where we have, for instance, a bounds check followed by the load of the value from the array if the bounds check succeeds. We lower even further into the machine representation, which is roughly at assembly level. The last step is machine code generation. That last part is architecture-dependent, while most of the other phases are architecture-independent, so we can reuse the same optimizing compiler across all architectures.

These are some examples of the reductions we can do. We can do constant folding, reducing several nodes into one. With strength reduction we can make some operations faster. GVN, or global value numbering, reduces duplicated nodes in the graph. And we can also do algebraic reassociation, which is in a way similar to constant folding, but a bit more involved.

So we optimize all of these functions based on the type feedback we have seen so far. But what happens when this assumption about types turns out to be wrong? Then we have to deopt, or deoptimize, and go back into the interpreter. Sadly, we have to throw away the optimized code, go back into the interpreted code, and resume execution from there. We also update the type feedback, so that the next time we reach this point we hopefully don't deopt and can keep running optimized code. After going back to interpreted code, if the function gets hot enough again, we will re-optimize it later on. To make all of this possible, we have to associate metadata with every point in optimized code that can cause one of these deopts.
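Here is a hedged sketch of how a deopt can be triggered from JavaScript; the function name and iteration count are arbitrary, and whether and when TurboFan kicks in depends on V8's internal heuristics:

```js
function add(a, b) {
  return a + b;
}

// Hot loop with small-integer arguments: the collected type feedback
// lets TurboFan optimize `add` for integer addition.
for (let i = 0; i < 100000; i++) {
  add(i, 1);
}

// A call that violates the recorded type assumptions: the optimized code
// bails out, execution resumes in Ignition, and the feedback is updated.
add('de', 'opt');
```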
So another important component in V8 is the garbage collector. The garbage collector manages memory and, among other things, is responsible for allocating and deallocating objects. V8's garbage collector is called Orinoco. Orinoco is a tracing garbage collector, as opposed to a reference-counting one, for example. That means that during a GC cycle, Orinoco computes the set of all reachable objects; all objects that are not reachable by the application are considered garbage, are dead, and can be reclaimed by the garbage collector.

V8 also uses a generational heap layout. That means the heap is split into a new space and an old space: new objects are allocated in the new space, while the old space is used for long-living objects. The generational heap layout is based on the assumption that most objects die young. Then there is another space in our heap, called the code space. We have that because memory pages for generated code need to be marked executable, and we don't want to do that for regular objects.

V8 also aims for both low latency and high throughput. These goals usually conflict, and we try to achieve both by performing work incrementally, in parallel (that means on multiple threads), and concurrently (that means, for example, on a background thread while the application keeps running). V8's GC is also tightly integrated with Blink's Oilpan GC. It's quite tricky to synchronize two independent GCs with each other; in V8 we solve this by simply computing the set of live objects across the JavaScript/C++ boundary. This makes it possible for us to handle reference cycles between JavaScript and Oilpan objects gracefully, without leaking memory.

In V8 we actually have two different kinds of collections, a minor collection and a major collection. A minor collection only reclaims memory in the new space, while a major GC reclaims memory in the old spaces. The new space is itself split into two equally sized semispaces, and a minor GC copies live objects from the currently used semispace into the other one. Objects that survive a second minor GC are promoted into the old space. The typical pause times of these collections are up to one millisecond.

The major GC, as already mentioned, reclaims memory in the old spaces, and it usually consists of multiple phases. Initially we mark all reachable objects, and that work is performed incrementally, concurrently, and in parallel. A major GC also performs compaction: live objects on mostly free pages are evacuated to other pages to reduce memory. Since we move objects, we also need to update references to the objects that have been moved. A major GC also needs to sweep pages; that means we iterate over a page and add its free areas to the free list for subsequent allocations. The major GC also always runs Blink's Oilpan GC: each time a major GC is performed, Oilpan's heap is collected as well. The typical pause times of these collections are up to ten milliseconds.

As already mentioned, we also support WebAssembly. WebAssembly is a low-level execution format, designed to be used as a compilation target. WebAssembly is statically typed, and there is a well-defined text encoding and a binary encoding, with a one-to-one translation between them. On the slides you see both encodings for a simple function that takes a 32-bit integer as an argument, multiplies it by two, and returns it.
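For readers of the transcript, here is a reconstruction of that slide's example in the WebAssembly text format; the function name is made up, and the original slide may have differed in details:

```wat
(module
  ;; take an i32 argument, multiply it by two, return the result
  (func $timesTwo (param $x i32) (result i32)
    local.get $x
    i32.const 2
    i32.mul))
```

In the binary encoding this function body boils down to just a handful of bytes, which is part of what makes WebAssembly fast to decode and validate.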
V8 fully validates each module before it gets executed. Similar to JavaScript, WebAssembly uses multiple tiers for compilation and execution. Liftoff is used as the baseline compiler, and the goal of the baseline compiler is to produce machine code as fast as possible, to help with startup. WebAssembly then uses TurboFan for optimizing hot functions. The machine code generated by TurboFan is faster compared to Liftoff's, but on the other hand, TurboFan also takes a lot longer to compile the code. Unlike for JavaScript, no deoptimizations are performed for WebAssembly.

V8 also implements the backend part of Chrome DevTools. DevTools supports basic debugging functionality like setting breakpoints and single-stepping through code, but it also supports more powerful features, for example LiveEdit, where you can edit the source code without reloading the page. You can also take performance profiles of a website, and to better understand memory usage you can, for example, take heap snapshots or profile allocations.

So we've now seen all the various parts of V8, but where does V8 itself actually spend most of its time during execution? Traditional benchmarks like Octane had a strong focus on executing optimized code, but when we look at the loading profile of real-world websites, the picture is quite different: a lot less time is spent in generated code, and much more in the runtime and in parsing. Based on these observations, we pick the areas to focus on and optimize for. As a result of these efforts, page load has improved quite a lot compared to previous releases; for the particular Facebook site that we test against regularly, page-load time has more than halved.

So we've seen where V8 spends its time, but what are the V8 developers spending their time on? We are constantly implementing new ECMAScript features as they are being standardized. For example, we recently shipped class fields, and currently we are working on weak references and optional chaining. In WebAssembly we are trying to improve and extend the standard, and we are implementing features like reference types and SIMD.
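To close with a hedged sketch of the two in-progress JavaScript features mentioned above (semantics as proposed at the time of this talk):

```js
// Optional chaining: short-circuits to undefined instead of throwing a
// TypeError when an intermediate value is null or undefined.
const user = { address: null };
const city = user?.address?.city;  // undefined, no exception

// Weak references: a WeakRef does not keep its target alive, so deref()
// returns the object, or undefined once the GC has reclaimed it.
const ref = new WeakRef({ payload: new Array(1e6).fill(0) });
const maybeObject = ref.deref();
```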