Hi, I'm Lin Clark, and I make code cartoons. I also work at Fastly, which is doing a ton of cool things with WebAssembly to make better edge compute possible. And I'm a co-founder of the Bytecode Alliance. We're working on tools for a WebAssembly ecosystem that extends beyond the browser. And it's one of those tools that I wanted to talk to you about today. JavaScript was first created to run in the browser so that people could add a little bit of interactivity to their web pages. No one would have guessed that 20 years later people would be using JavaScript to build all sorts of big, complex applications to run in your browser. What made this possible is that JavaScript in the browser runs a lot faster than it did two decades ago. And that happened because the browser vendors spent that time working on some pretty intensive performance optimizations. This started with the introduction of just-in-time compilers around 2008, and the browsers have built on top of that, continuing these optimization efforts. Now we're starting work on optimizing JavaScript performance for an entirely different set of environments, where different rules apply. And this is possible because of WebAssembly. So today I want to explain what it is about WebAssembly that enables this. But first, I want to give you a heads-up. This talk is structured a bit differently than speaking experts would tell me I should structure a presentation. I'm going to start by telling you how we're making this work at all. And once you've heard that, you might not be on board. You might think that this is a pretty ridiculous idea. So that's why I'm going to explain why you would actually want to do this. And then once you're bought in, and I know you'll be bought in, I'm going to come back and explain exactly how it is that we're making this fast. So let's get started with how we're running JavaScript inside a WebAssembly engine.
Whenever you're running JavaScript, the JS code needs to be executed as machine code in one way or another. This is done by the JS engine using a variety of different techniques, from interpreters to JIT compilers. I explained this all in more detail in my first set of articles about WebAssembly back in 2017, so if you want to understand more about how this works, you can go back and read those articles. Running JavaScript code is really quite easy in environments like the web, where you know that you're going to have a JavaScript engine available. But what if your target platform doesn't have a JavaScript engine? Then you need to deploy your JavaScript engine with your code. And that's what we need to do to bring JavaScript to these different environments. So how do we do this? Well, we deploy the JavaScript engine as a WebAssembly module. That makes it portable across a bunch of different machine architectures. And with WASI, we can make it portable across a bunch of different operating systems as well. This means that the whole JavaScript environment is bundled up into the WebAssembly module. Once you deploy it, all you need to do is feed in the JavaScript code, and the JavaScript engine will run that code. Now, instead of working directly on the machine's memory like it would in a browser, the JavaScript engine puts everything, from the bytecode to the garbage-collected objects that the bytecode works on, into the WebAssembly module's linear memory. For our JS engine, we went with SpiderMonkey. That's the JS engine that Firefox uses. It's one of the industrial-strength JavaScript virtual machines, because it's been battle-tested in the browser. And this kind of battle-testing and investment in security is really important when you're running untrusted code, or running code that processes untrusted input.
SpiderMonkey also uses a technique called precise stack scanning, which is important for some of the optimizations that I'll be describing a bit later in the talk. So far, there's nothing revolutionary about the approach that I've described. People have already been running JavaScript inside of WebAssembly like this for a number of years. The problem is that it's slow. WebAssembly doesn't allow you to dynamically generate new machine code and run it from within pure WebAssembly code. This means that you can't use a JIT; you can only use the interpreter. Now, given this constraint, you might be asking why we're doing this at all. Since JITs are how the browsers made JS code run fast, and since you can't JIT compile inside of a WebAssembly module, this just doesn't seem to make sense. But what if, even given these constraints, we could actually make this JavaScript run fast? Let's look at a couple of use cases where a fast version of this approach could be really useful. There are some places where you can't use a just-in-time compiler due to security concerns, for example iOS devices, or some smart TVs and gaming consoles. On these platforms, you have to use an interpreter. But the kinds of applications that you run on these platforms are long-running, and they require lots of code. And those are exactly the kinds of conditions where, historically, you wouldn't want to use an interpreter, because of how much it slows down your execution. If we can make our approach fast, then these developers could use JavaScript on JIT-less platforms without taking a massive performance hit. Now, there are other places where using a JIT isn't a problem, but where startup times are prohibitive. An example of this is serverless functions, and this plays into that cold-start latency problem that you might have heard people talking about.
Even if you're using the most pared-down JavaScript environment, an isolate that just starts up a bare JavaScript engine, you're looking at about five milliseconds of startup latency. Now, there are some ways to hide this startup latency for an incoming request, but it's getting harder to hide as connection times are being optimized in the network layer with proposals such as QUIC. And it's also harder to hide when you're chaining different serverless functions together. But more than this, platforms that use these kinds of techniques to hide latency also often reuse instances between requests. In some cases, this means that global state can be observed between different requests, which can be a security issue. And because of this cold-start problem, developers also often don't follow best practices. They stuff a lot of functions into one serverless deployment. This results in another security issue: a larger blast radius. If one part of the serverless deployment is exploited, the attacker has access to everything in that deployment. But if we can get JavaScript startup times low enough in these contexts, then we wouldn't need to hide startup times with any tricks. We could just start up an instance in microseconds. With this, we can provide a new instance for each request, which means that there's no state lying around between requests. And because the instances are so lightweight, developers could feel free to break up their code into fine-grained pieces, and this would bring the blast radius down to a minimum for any single piece of code. So for these use cases, there's a big benefit to making JavaScript on Wasm fast. But how can we do that? In order to answer that question, we need to understand where the JavaScript engine spends its time. We can break down the work that a JavaScript engine has to do into two different parts: initialization and runtime. I think of the JS engine as a contractor.
This contractor is retained to complete a job: running the JavaScript code and getting to a final result. Before this contractor can actually start running the project, though, it needs to do a little bit of preliminary work. This initialization phase includes everything that only needs to happen once, at the very start of the project. One part of this is application initialization. For any project, the contractor needs to take a look at the work that the client wants it to do and then set up the resources that it needs in order to complete that job. For example, the contractor reads through the project briefing and other supporting documents and turns them into something that it can work with. This might be something like setting up the project management system with all of the documents stored and organized, and breaking things into tasks that go into the task management system. In the case of the JS engine, this work looks more like reading through the top level of the source code and parsing functions into bytecode, or allocating memory for the variables that are declared and setting values where they're already defined. So that's application initialization. But in some cases, there's also engine initialization. You see this in contexts like serverless: the JS engine itself needs to be started up in the first place, and built-in functions need to be added to the environment. I think of this like setting up the office itself, doing things like assembling the IKEA chairs and tables and everything else in the environment before starting the work. This can take considerable time, and it's part of what can make the cold start such an issue for serverless use cases. Once the initialization phase is done, the JS engine can start its work of running the code. The speed of this part of the work is called throughput, and this throughput is affected by lots of different variables.
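To make that split concrete, here's a small JavaScript sketch of which work lands in application initialization versus runtime. The names are purely illustrative, not anything from a real engine:

```javascript
// Illustrative sketch: application initialization versus runtime work.

// Application initialization: top-level code is read, functions are
// parsed into bytecode, and declared variables are allocated and,
// where already defined, filled in.
const config = { retries: 3 };            // allocated and set at init
function handler(request) {               // parsed to bytecode at init
  return `handled ${request} with ${config.retries} retries`;
}

// Runtime: actually invoking the code. How fast this phase goes
// is the engine's throughput.
console.log(handler("GET /"));
```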
For example: which language features are being used, whether the code behaves predictably from the JS engine's point of view, what sorts of data structures are used, and whether or not the code runs long enough to benefit from the JS engine's optimizing compiler. So these are the two phases where the JS engine spends its time: initialization and runtime. Now, how can we make the work in these two phases go faster? Let's start with initialization. Can we make that fast? Spoiler alert: yes, we can. We used a tool called Wizer for this. I'll explain how that works in a minute, but first I want to show you some of the results that we saw. We tested with a small markdown application, and using Wizer, we were able to make startup time six times faster. If we look at this case in more depth, about 80% of the time was spent on engine initialization, and the remaining 20% was spent on application initialization. Part of that is because this markdown renderer is a very small and simple application. As apps get larger and more complex, application initialization just takes longer, so we would see even larger comparative speedups for real-world applications. Now, we get this fast startup using a technique called snapshotting. Before the code is deployed, as part of the build step, we run the JavaScript code using the JavaScript engine to the end of the initialization phase. At this point, the JS engine has parsed all of the JS source and turned it into bytecode, which the JS engine module stores in the linear memory. The engine also does a lot of memory allocation and initialization in this phase. Because this linear memory is so self-contained, once all of the values have been filled in, we can just take that memory and attach it as a data section to a Wasm module. When the JS engine module is instantiated, it has access to all of the data in the data section.
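The snapshotting idea can be sketched in plain JavaScript. This is not the real Wizer tool, which snapshots WebAssembly linear memory into a data section; it's just a stand-in illustrating the principle of running initialization once at build time and restoring the result at startup:

```javascript
// Minimal sketch of the snapshotting idea (illustrative, not Wizer).

function expensiveInit() {
  // Stands in for engine + application initialization.
  const state = { builtins: ["parse", "print"], heap: [] };
  for (let i = 0; i < 3; i++) state.heap.push(i * i);
  return state;
}

// "Build step": run initialization once and snapshot the resulting
// state. Here it's serialized as JSON; the real system captures the
// initialized WebAssembly linear memory instead.
const snapshot = JSON.stringify(expensiveInit());

// "Deploy": each new instance restores from the snapshot directly,
// skipping the initialization work entirely.
function newInstance() {
  return JSON.parse(snapshot);
}

console.log(newInstance().heap); // [ 0, 1, 4 ]
```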
Whenever the engine needs a bit of that memory, it can copy the section, or rather, the memory page that it needs into its own linear memory. With this, the JS engine doesn't have to do any setup when it starts up. All of this is pre-initialized, ready and waiting for it to start its work. Currently, we attach the data section to the same module as the JS engine. But in the future, once WebAssembly module linking is in place, we'll be able to ship the data section as a separate module. So this provides a really clean separation and allows the JS engine module to be reused across a bunch of different JS applications. The JS engine module only contains the code for the engine. That means that once it's compiled, that code can be effectively cached and reused between lots of different instances. Now, on the other hand, the application-specific module contains no WebAssembly code. It only contains the linear memory, which in turn contains the JavaScript bytecode, along with all of the rest of the JS engine state that was initialized. This makes it really easy to move this memory around and send it wherever it needs to go. It's kind of like the JS engine contractor doesn't need to set up its own office at all. It just gets this travel case shipped to it. And that travel case has the whole office with everything in it all set up and ready to go for the JS engine to just get to work. And the coolest thing about this is that it doesn't rely on anything that's JS dependent. It's just using an existing property of WebAssembly itself. So you could use the same technique with languages like Python, or Ruby, or Lua, and other runtimes, too. So with this approach, we can get to this super fast startup time. But what about throughput? Well, for some use cases, the throughput is actually not too bad. If you have a very short running piece of JavaScript, it wouldn't go through the JIT anyways. It would stay in the interpreter the whole time. 
So in that case, the throughput would be about the same as in the browser, and it will have finished before a traditional JavaScript engine would have finished initialization, in the case where you need to do engine initialization. But for longer-running JavaScript, it doesn't take all that long before the JIT starts kicking in. And once this happens, the throughput difference does become pretty obvious. Now, as I said before, it's not possible to JIT compile code within a pure WebAssembly module at the moment. But it turns out that we can apply some of the same thinking that comes with just-in-time compilation to an ahead-of-time compilation model. One optimizing technique that JITs use is inline caching, which I also explained in my first series about WebAssembly. When the same bit of code gets interpreted over and over and over again, the engine decides to store its translation for that bit of code to reuse next time. This stored translation is called a stub. These stubs are chained together into a linked list, and they're keyed on what types are used for that particular invocation. The next time the code is run, the engine will check through this list to see whether it has a translation available for those types. And if so, it'll just reuse the stub. Because IC stubs are commonly used in JITs, people think of them as being very dynamic and specific to each program. But it turns out that they can be applied in an AOT context too. Even before we see the JavaScript code, we already know a lot of the IC stubs that we're going to need to generate. And that's because there are some patterns in JavaScript that just get used a whole lot. A good example of this is accessing properties on objects. This happens a lot in JavaScript code, and it can be sped up by using an IC stub.
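Here's a rough JavaScript sketch of that inline-caching idea: a chain of stubs keyed on object shape. Real engines do this with engine-internal hidden classes and compiled stub code; every name here is illustrative:

```javascript
// Illustrative inline cache for property access. Each stub records a
// "shape" and the offset where the property lives for that shape.

function shapeOf(obj) {
  // Stand-in for a hidden class: the ordered list of property names.
  return Object.keys(obj).join(",");
}

function makePropertyIC(propName) {
  const stubs = []; // the chain of stored translations
  return function get(obj) {
    const shape = shapeOf(obj);
    for (const stub of stubs) {
      if (stub.shape === shape) {
        // Cache hit: reuse the stored offset instead of looking up.
        return Object.values(obj)[stub.offset];
      }
    }
    // Cache miss: do the slow lookup, then store a stub for next time.
    const offset = Object.keys(obj).indexOf(propName);
    stubs.push({ shape, offset });
    return obj[propName];
  };
}

const getX = makePropertyIC("x");
console.log(getX({ x: 1, y: 2 })); // 1 (miss: stub gets added)
console.log(getX({ x: 7, y: 9 })); // 7 (hit: same shape, reuse stub)
```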
For objects that have a certain shape, or hidden class, that is, objects where the properties are laid out in the same order, when you get a particular property from those objects, that property will always be at the same offset. Now, traditionally, this kind of IC stub in the JIT would hard-code two values: the pointer to the shape and the offset of the property. That requires information that we don't have ahead of time. But what we can do is parameterize the IC stub. We can treat the shape and the property offset as variables that get passed in to the stub. This way, we can create a single stub that loads values from memory, and then use that same stub code everywhere. We can just bake all of the stubs for these common patterns into the AOT-compiled module, regardless of what the JavaScript is actually doing. And we discovered that with just a couple of kilobytes of IC stubs, we can cover the vast majority of all JS code. For example, with two kilobytes of IC stubs, we can cover 95% of the JavaScript in Google's Octane benchmark. And from preliminary tests, that percentage seems to hold up for general web browsing as well. Now, this is just one example of a potential optimization that we can make. Right now, we're in the same kind of position that the browser JS engines were in in the early days, when they were first experimenting with just-in-time compilers. We still have a lot of work to do to find the clever shortcuts that we can use in this context, but we're excited to be starting that work, and excited for the changes to come. If you're excited like we are about this and want to contribute to the optimization efforts, or if you want to try to make this work for another language like Python or Ruby or Lua, we'd be happy to hear from you. You can find us on the messaging platform Zulip, and feel free to post there if you want to ask for more info.
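The difference between a hard-coded stub and a parameterized one can be sketched like this in JavaScript. Again, this is purely illustrative: the real stubs are compiled WebAssembly code, and the "shape" is an engine-internal pointer, not a string:

```javascript
// What a JIT would effectively emit: one stub per shape, with the
// shape check and the property offset baked into the code.
function getX_hardcoded(obj) {
  return Object.values(obj)[0]; // assumes shape {x, y}: "x" at offset 0
}

// The parameterized, AOT-friendly version: the expected shape and the
// offset are inputs loaded at runtime, so one generic stub serves
// every (shape, offset) pair.
function getPropStub(obj, expectedShape, offset, slowPath) {
  const shape = Object.keys(obj).join(","); // stand-in for a shape pointer
  return shape === expectedShape
    ? Object.values(obj)[offset] // fast path: layout is known
    : slowPath(obj);             // shape mismatch: fall back to full lookup
}

const point = { x: 3, y: 4 };
console.log(getX_hardcoded(point));                 // 3
console.log(getPropStub(point, "x,y", 1, o => o.y)); // 4
```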
You can also find links to the projects that I mentioned in my recently published blog post on the Bytecode Alliance blog. I want to say thank you to the organizers for inviting me to speak here today, and thank you all for listening.