So thank you so much for being here. I'm really excited to share this work with you. If you're here for the component model, you are in the wrong room.

Oh — I'm Tal Garfinkel, I'm at UC San Diego, and I'm going to be talking about work done in collaboration with colleagues all over the place, so credit is spread out.

So, WebAssembly is kind of this amazing technology from an isolation standpoint. It gives us overheads that are orders of magnitude less than what's possible with existing hardware isolation mechanisms, and that has enabled all sorts of very cool use cases. Unfortunately, Wasm also has some fundamental limitations today in terms of security, scalability, performance, and — oh, compatibility. There we go.

So I'm going to talk about two things. First, a couple of optimizations for current hardware called Segue and ColorGuard that give us a really nice performance boost on existing x86 machines — and potentially we could apply them on Arm as well. Segue cuts about 25% off the virtualization tax — the Wasm-ification tax versus native — on workloads, and ColorGuard lets us increase the number of instances we can run by about a factor of 12, so it addresses scalability. After that, I'm going to talk about hardware-assisted fault isolation, an extension to modern processors that we've been working on for a couple of years with Intel and other colleagues. It's a very simple set of extensions that is pretty much exactly what we want for addressing all of these problems in the Wasm space.

So Wasm enables many cool use cases, as I said, and it's because of these unique properties: instance-creation costs that are orders of magnitude lower than what you can do on something like Lambda — microsecond scale — nanosecond-scale context switches, and very small footprints. All of this makes it possible to build the high-concurrency, low-latency edge-compute platforms we're seeing from folks like Cloudflare and Fastly. You couldn't do this with existing hardware isolation.

Another very cool thing this has enabled is compartmentalization of untrusted code. For the last few years we've been working with the really great folks at Mozilla on sandboxing untrusted libraries — they depend on all sorts of third-party libraries written in C. You may have gone to the RLBox talk earlier in the week. And the reason we're able to do this is really because of WebAssembly. People have tried to sandbox things with processes for the last two decades — as long as I've been in security — and it really doesn't work out. Because inter-process communication is so expensive, you have to refactor your whole application, so most people just don't do it. Things like OpenSSH, or browsers' use of processes, are really the exception rather than the norm.

Unfortunately, Wasm has these fundamental limitations in terms of performance, scaling, security, and compatibility. But don't let that get you down. My background is in virtualization — I was there in the early days — and when I started working on virtual machines, the overhead for using a virtual machine was 30%.
So when I see 20% for Wasm, I'm like: wait five years, it'll be OK. But there are some real challenges here to tackle, and that's what I'm going to talk about: first these limitations, then the optimizations, and finally what we're doing in hardware.

Really quick review, just in case you didn't take undergrad computer architecture recently: page tables. This is what containers and virtual machines use. You do a load or a store, and the hardware looks at the address. Part of it is the virtual page number that identifies the page you're after, and the rest is the offset into that page. You go to the TLB; hopefully you hit in the TLB. If not, you walk the page table in main memory, do some permission checks, and that's how you translate to your physical address. The reason I bring this up is that these are the building blocks we're going to talk about, so I just want you to page this back in.

That's one way to do memory protection. The other way people usually do memory protection is segmentation — and when I say usually, a lot of old hardware did this. If you're old like me, you remember doing this on the x86; it's largely not in the x86 anymore. The basic idea is that you describe memory in terms of extents — a base and a bound. You say: I want to grant access to this base and bound. If I load an address using segment-relative addressing, I ask: starting from the beginning of the segment, where does my address land? If it lands inside the segment, the access is allowed; otherwise it isn't. There are some permissions, and the address is computed relative to the start of the segment. Sorry, I hope that wasn't too brief.

Why am I bringing this up? What is WebAssembly doing? I think one way to think about WebAssembly is as a poor man's segmentation. We're taking hardware that no longer does segmentation for us, and we're using the compiler to give us a segmented memory model instead. The dope thing about a segmented memory model in software is that it lets us avoid the overheads of hardware: the TLB flushes, the TLB shootdowns, the context-switch overheads — all the things that happen when you get involved with traditional hardware protection.

Unfortunately, there are trade-offs, and there are two ways we can implement this. The simplest thing we could do is add an upper-bounds check and a lower-bounds check to every load and store instruction. You can do that, but it's expensive — and it's also not Spectre-safe. So what most Wasm runtimes actually do — if you're running Chrome on a 64-bit machine, or Wasmtime in the cloud — is a scheme called guard regions. The way guard regions work is: I've got a 4-gig address space, and I allocate another 4 gigs after it as my guard region. I take the 32-bit offset I'm loading, add it to my heap base, and get at most a 33-bit result — 8 gigs. So every access either lands in my address space or lands in my guard region, and if it lands in the guard region, it traps.
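To make the two strategies concrete, here's a minimal sketch in C — my own illustration under an assumed `trap()` helper, not any particular runtime's code:

```c
#include <stdint.h>

extern void trap(void);   /* assumed: aborts the offending instance */

/* Strategy 1: explicit bounds check on every load/store.
 * A compare and a branch per access, and not Spectre-safe on its own. */
static inline uint8_t load_checked(const uint8_t *heap, uint64_t heap_len,
                                   uint32_t idx) {
    if (idx >= heap_len)
        trap();
    return heap[idx];
}

/* Strategy 2: guard regions. The runtime reserves the 4 GiB heap plus
 * ~4 GiB of PROT_NONE guard pages right behind it. A Wasm access is a
 * 32-bit index (plus, at worst, a 32-bit constant offset) added to the
 * heap base, so it can never reach past base + 2^33: it lands either in
 * the heap or in the guard region, where the MMU faults for us. No
 * branch at all on the hot path. */
static inline uint8_t load_guarded(const uint8_t *heap, uint32_t idx) {
    return heap[(uint64_t)idx];
}
```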
There are other things that are done to actually control the end of your heap. So this is great — much more efficient than those conditional checks. But there are costs at startup, teardown, and resize, because you're still involved with the MMU and still making system calls. It wastes a lot of virtual memory, which impacts scaling. And it's also not Spectre-safe.

So, more about these scaling limitations. If you do the basic math: a 2^47-byte user-space virtual address space on x86, and 2^33 bytes of virtual memory per instance — 2^47 divided by 2^33 is 2^14, which is 16K. If you organize things a little more efficiently, you can get to about 20K. But that's the most Wasm instances you're going to pack into that 128-terabyte virtual address space, which is kind of depressing when you think about it. This is a bummer if you're doing serverless and spinning up an instance per request — if you're Fastly, this is a problem; maybe for Cloudflare this is a problem. And it's going to get worse with the component model, because instead of each application being one linear memory, maybe each application is ten linear memories. I don't know what your applications look like, but anytime I compile something, the list of dependencies gets quite large. So scaling, I think, is already a problem, and it's going to become a much bigger one.

Another thing is that we don't always get to use guard regions. If you want to do 64-bit memories, we're back to those conditional bounds checks — 64-bit memories in Wasm are just not efficient. And guard regions don't work on 32-bit processors at all. We personally care about this because we ship sandboxing in Firefox to older desktop machines, and there are really a lot of Android devices in the world that are still 32-bit, so this directly impacts us and we have to pay that tax. Even with guard regions, I'm still adding a heap-base addition to every load and store, so there's a cost associated with that as well. And finally, we can't support everything out there. There's a lot of really great work going on in the standards bodies to address some of this; some of it will never be addressed, and that's just a limitation.

So next, let's talk about these optimizations. Again: Segue is a fun trick you can play in your runtime that speeds up Wasm, and ColorGuard is something that gives you great scalability. Segue is just a cute way to use x86 segmentation. So if you're old — who remembers 32-bit segmentation on x86? You, that guy over there, right? Back in the day you were using segments for all sorts of things; it was great. In 2003, when AMD released x86-64, they took this out. Mostly. Your segment registers are still there, but most of them just address relative to zero, and the more sophisticated segmentation hardware is gone. But we've still got two segment registers left that actually do something: FS and GS. The one thing they still do is segment-relative addressing, and what operating systems use them for today is thread-local storage. If you're accessing thread-local storage, your compiler is doing segment-relative addressing relative to either FS or GS — which one depends on which operating system you're running on. But that leaves one segment register free, and we can actually reuse the other one too, if we're careful.

So Segue reuses this remaining segment register for the heap-base addition. Really simple idea. But the cool thing is that this frees up a general-purpose register, which enables more optimized instruction encodings. What's the consequence? If we look at the before and after, we had an extra instruction and an extra register in use, and with Segue we can get rid of both and give our compiler more freedom. You can look at our paper on this for more details; it gets gory.
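Roughly, the before and after look like this — a sketch of the idea, not any engine's actual codegen; the GS-base write assumes the FSGSBASE instructions (or Linux's arch_prctl fallback) are available:

```c
#include <immintrin.h>   /* _writegsbase_u64; compile with -mfsgsbase */
#include <stdint.h>

/* One-time setup: point GS at the sandbox heap. Since Ivy Bridge the
 * FSGSBASE instructions let user space do this directly; otherwise
 * Linux offers arch_prctl(ARCH_SET_GS, base). */
static void set_heap_base(void *heap_base) {
    _writegsbase_u64((uint64_t)heap_base);
}

/* Before Segue, the compiler pins the heap base in a general-purpose
 * register (say r14) and pays an addition on every access:
 *
 *     mov eax, dword ptr [r14 + rcx]
 *
 * With Segue, the base addition rides along on the segment override,
 * and r14 goes back to the register allocator:
 *
 *     mov eax, dword ptr gs:[rcx]
 */
static inline uint32_t heap_load32(uint64_t idx) {
    uint32_t val;
    __asm__("movl %%gs:(%1), %0" : "=r"(val) : "r"(idx) : "memory");
    return val;
}
```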
So what goodness do we get from this? Number one, a code-size reduction, which is very cool because we're shrinking the number of instructions — good for our i-cache. It gives us a nice speedup on SPEC, and when we've looked at specific applications, we've seen even bigger speedups. This is shipping in WAMR right now. And I really wanted to call out the report of reduced compilation times, because we have no idea why that's the case. It just is, and when they posted it we were like: this is great — our optimizations are making the world better in ways we didn't even know. So, let's segue.

The second thing I want to talk about is ColorGuard. This is something that helps scalability. More review — and this is a very, very simplified view of memory protection keys (MPK). Going back to our page table: each page table entry is there for doing virtual address translation, and memory protection keys add four bits to each page table entry. These four bits we refer to as a memory tag, and we can think of them in terms of colors. So think of it as having 15 different colors, with each page colored one of them, kind of like is depicted here. And on each core we have a register that says: this is the color my core can access right now — say "thread" instead of "core", for those who are into high-level abstractions. So if my register says I can access blue memory, then I can only access blue memory.

How is this useful? Well, I talked about the way we use guard regions in Wasm: for each instance, we're essentially burning eight gigs' worth of address space. It's not totally burned — some of that address space I'm going to use — but a Wasm instance typically uses maybe a couple hundred megs of its virtual address space, maybe more, maybe less. All the rest — the guard region and the unused address space — I'm getting no value from. So instead of wasting it, what if I could put other VMs there? That's what ColorGuard lets us do. Now that we have these colors, say I have one VM — my blue VM — and say instances are one gig each: I'm going to put seven other VMs beside it, each in a different color. That way, when I do my heap-base addition, I either land in my blue VM or I land in memory of a different color, which faults. So I get the same property I got from guard regions — that same nice trick for accelerating bounds checks — but I'm not wasting the space. We assign colors to each sandbox and just repeat this pattern of different colors side by side, over and over, through memory.
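In Linux terms, the primitive looks something like this — a minimal sketch using the glibc pkey wrappers (pkey_alloc, pkey_mprotect, pkey_set), with the sandbox size and struct invented for illustration; the real thing lives inside the runtime's codegen and instance allocator:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdint.h>

#define SANDBOX_SIZE (1UL << 29)   /* e.g. half-gig instances */

struct sandbox {
    uint8_t *base;
    int      pkey;                 /* this instance's "color" */
};

static void color_sandbox(struct sandbox *sb, uint8_t *slot) {
    sb->base = slot;
    /* Only 15 usable keys, so a real deployment reuses colors across
     * groups of neighbors: no two adjacent sandboxes share a color. */
    sb->pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS);
    /* Tag ("color") this sandbox's pages with its key. */
    pkey_mprotect(sb->base, SANDBOX_SIZE, PROT_READ | PROT_WRITE, sb->pkey);
}

/* Flip PKRU before running instance code on this thread. A stray
 * heap-base + 32-bit-offset access now lands in a neighbor of a
 * *different* color and faults, so instances can be packed
 * back-to-back instead of each burning gigabytes of guard pages. */
static void enter_sandbox(const struct sandbox *sb) {
    pkey_set(sb->pkey, 0);                    /* allow our color */
}

static void exit_sandbox(const struct sandbox *sb) {
    pkey_set(sb->pkey, PKEY_DISABLE_ACCESS);  /* revoke it again */
}
```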
So here's the punchline. If you're Wasmtime and you've done the smartest thing you can — unsigned addition instead of signed addition, two-gig guard regions shared between neighbors, so four gigs, two gigs, four gigs, two gigs — you can pack roughly 20K instances into that process address space. With ColorGuard and half-gig sandboxes, you can now do a quarter of a million sandboxes — 2^47 over 2^29 is 2^18, about 260K. So it's a really nice win in terms of scaling. Again, this is on Intel MPK, which has also been available on AMD since EPYC Milan, so it's on a lot of hardware right now. And I think we'll be able to do the same thing with Arm's memory-tagging hardware when it's finally available.

So now I'm going to talk about what we can do in next-generation processors. After working with this for a few years, we were like: man, we really need some help from hardware to overcome these limitations. This is the result of a multi-year collaboration — Shravan, in the back, did a lot of the great work, together with great folks at Intel and Rivos and the other folks I listed at the beginning.

So what did we want from the hardware design? Three things. Number one, we wanted it to be dead simple, and there are a lot of good reasons for that. We wanted it to run on low-end hardware like IoT devices and on state-of-the-art server-class processors. As one architect put it: when you're putting stuff on the data path, that's the Manhattan of chip real estate — for every gate, you'd better have a really good reason for putting it there. Second, we wanted minimal OS changes. If you follow how long it takes for a feature to get into the Linux kernel — it took years and years to get MPK in, and there's still no MPK support in Windows. So if you want a hardware feature to have impact, staying in user space, out of the kernel's way, is the way to go. Third, we wanted to support several critical use cases. We wanted to support Wasm — and there are things you want for Wasm that are different from CHERI and all these other compartmentalization mechanisms, because we've got a trusted compiler. We wanted to sandbox unmodified native binaries: sometimes you've got a system library you just want to load as-is, and coming from our weird world of using Wasm to sandbox legacy code, this is something we really needed. And if you're in the serverless space and want to run, say, a Java JIT, that's also a really nice thing to have. Finally, we wanted to support custom SFI systems — custom isolation systems.
For example, V8 has its own sandboxing system, called Ubercage, that they use to contain JIT vulnerabilities. We talked with those guys, asked them what they want, and they said: we want that thing you're doing. Awesome. So there's a lot of love there. So we want to support both of these environments. And it's interesting: when we can trust the compiler, we can do optimizations that are not possible with native code — because we trust the compiler, we can hand it this finite set of resources that enforce isolation and let it manage them, spill them, and so on, and we get a lot of flexibility from that. But we also want to support native binaries, so we're trying to satisfy both sets of requirements.

Our solution is hardware-assisted fault isolation (HFI). What does this give us? It provides fast, secure isolation and sharing. Memory isolation is free: there is no overhead for bounds checks at all, because we carry out those checks in parallel with the TLB lookup — we fit within the time it takes to execute that pipeline stage, so we add no overhead. We can do zero-copy sharing: we can map things in and out of the sandbox in user space without touching page tables. We get system-call interposition for almost no overhead, which is great for native binaries. We've got Spectre safety — and if you want to talk more about that offline or at question time, we can; it's a nuanced issue. We can also scale without limit: you can put as many one-meg sandboxes into your virtual address space as you want, and there's no tax for that. I don't want to do the math on that, but it's a lot. We get near-zero-cost sandbox setup, teardown, and resizing, because we're no longer making system calls, messing with the MMU, or doing TLB shootdowns to set up and tear down these address spaces — we're at ordinary function-call speeds for those operations. And finally, we can provide compatibility with existing code.

The architecture itself is very, very simple. We have this notion of regions. Regions are, again, base-bound pairs — it's like segmentation. We set up our regions: here's our virtual address space, and I'm going to map this range, this range, and this range into my sandbox. When I execute HFI enter, the only thing my code can access is those ranges of memory. I run my sandboxed code, and when the guest executes HFI exit, it hands control back to some trusted runtime — my Wasm runtime, or some other runtime. There are additional details; I'll talk about some of them later, and hopefully I'll have time for questions too.
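To give a feel for the lifecycle, here's what a runtime might look like on top of this. HFI is a proposal, not shipping hardware, so every hfi_* name below is invented for illustration and only loosely follows our model:

```c
#include <stdint.h>

struct hfi_region {
    uint64_t base;
    uint64_t limit;
    uint32_t perms;                     /* read / write / exec bits */
};

/* Hypothetical intrinsics standing in for the new instructions. */
extern void hfi_set_region(int slot, const struct hfi_region *r);
extern void hfi_set_syscall_handler(void (*handler)(void));
extern void hfi_enter(void);

static void run_guest(uint8_t *heap, uint64_t heap_len,
                      uint8_t *code, uint64_t code_len,
                      void (*guest_entry)(void),
                      void (*syscall_trampoline)(void)) {
    struct hfi_region data = { (uint64_t)heap, heap_len, /* RW */ 0x3 };
    struct hfi_region text = { (uint64_t)code, code_len, /* X  */ 0x4 };

    /* Pure user-space register writes: no syscalls, no page-table
     * edits, no TLB shootdowns, so setup and teardown are near free. */
    hfi_set_region(0, &data);
    hfi_set_region(1, &text);

    /* Syscalls and exits from the guest bounce to the runtime. */
    hfi_set_syscall_handler(syscall_trampoline);

    hfi_enter();       /* from here on, only the regions are reachable */
    guest_entry();     /* guest runs until it executes hfi_exit        */
}
```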
One of the challenges here is that if you do this naively, you say: well, I have regions, I'll just do two 64-bit comparisons — base and bound — and that's fine, right? That's not going to work out, because it's a lot of circuitry. Again, we have the constraint that we must fit within the time it takes one pipeline stage to execute, and we can't afford the time for two of those comparisons. We also can't afford that circuit area. So we need a cheaper way to do this, and our answer is to specialize regions.

So we have a variety of region types that give us flexibility but also let us use less hardware to implement the checks. The first type is called an explicit region, and this is segment-relative addressing. We have four of these, and they're great for implementing a Wasm heap. We have a choice: either the regions address more memory and are 64K-sized-and-aligned, or they address a smaller space and are byte-aligned — which, for us doing sandboxing, is great, because we want to be able to say: here's a data structure, map it into my sandbox in place, without modifying the code. The way we access these regions is that we actually add new instructions to the x86, with the region number encoded in the instruction. The reason we do it this way is that we don't want to change the expressiveness of the x86 move instruction: we still get all our addressing modes and don't have to constrain our compiler at all, but we get this nice restriction. There is one additional restriction — the index now needs to be positive — but as far as we can tell, that's not a significant restriction in terms of compiler optimization. And we can do the check with one 32-bit comparator, because instead of checking an upper and a lower bound, we only have to do a single upper-bound check.

The other types of region are called implicit data regions, and these get applied to every load and store. If you're used to the x86-32 world, this is a similar idea: for every memory operation, we go through the regions and ask, are you in this range, and do the permissions match? If it's a load, I need read permission. If everything lines up, the load executes. I've got four implicit data regions, plus two code regions. For the implicit regions I just use masking — no comparator at all, just simple ANDs and ORs. That brings size and alignment restrictions, but we have the explicit data regions when we need more granular access, and for heaps and larger areas of memory, I think this gives us the expressiveness we need. And these choices — why four, why two — came out of our experience sandboxing lots of different code. Our colleagues at Fastly told us what numbers they need. Of course we can spill and reuse these, but we don't want to do that too much. Sometimes you pick magic numbers, and these are our magic numbers.
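Here's a software model of the two check flavors — remember the real thing is a handful of gates racing the TLB lookup, not C:

```c
#include <stdint.h>
#include <stdbool.h>

extern void trap(void);   /* assumed: kills the sandbox */

/* Explicit region: segment-relative, named by the instruction itself,
 * so the hardware needs only ONE unsigned comparison (offset vs. limit)
 * rather than an upper- and a lower-bound check; the index must be
 * non-negative. */
static inline uint64_t explicit_access(uint64_t base, uint32_t limit,
                                       uint32_t offset) {
    if (offset > limit)                /* single 32-bit comparator */
        trap();
    return base + offset;              /* address relative to the base */
}

/* Implicit region: checked on every ordinary load/store. Power-of-two
 * sized and aligned, so membership is mask-and-compare — just ANDs and
 * an equality test, no comparator at all. */
struct implicit_region {
    uint64_t base;                     /* aligned to (mask + 1) */
    uint64_t mask;                     /* size - 1, size a power of two */
    uint32_t perms;
};

static inline bool implicit_hit(const struct implicit_region *r,
                                uint64_t addr, uint32_t need) {
    return ((addr & ~r->mask) == r->base) && ((r->perms & need) == need);
}
```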
But regions alone are not enough: we don't just need memory isolation, we need some grasp on control. A couple of years ago, Shravan and I were talking about this, and he asked: what should we do about system calls? And I said: man, I've written kernel-level system-call interposition mechanisms, I've written library interposition mechanisms, and I hate them all. This is what I want — support from the processor. If you've done this before, it hurts so much, and it's really great to just say: take all my different entry points — syscall, sysenter, int 0x80, whatever — and redirect them to a handler. So if you exit the sandbox or make a system call, it hands control right off to the runtime. And again, unless you've been through the pain, you don't appreciate how great this is — I hate ptrace so much. This is implementable with simple conditional logic on a smaller processor, or in microcode on something like a big x86 core.

So HFI has these isolation primitives. The explicit data regions are great for Wasm, because we can use them for Wasm heaps — Wasm linear memories. The implicit regions are great for protecting the runtime itself from Spectre attacks, and the same goes for the code regions: the software CFI that Wasm relies on is not Spectre-safe, but we can use these regions to make it Spectre-safe. For native isolation, we can reuse pretty much all the same mechanisms, and if you can actually modify your native app, you can use the explicit data regions for sharing too. So we've got a very cool two-in-one solution here.

How does it perform? Way better than bounds checks. We get a little speedup over guard regions, pretty much across the board — but we can do this on 32-bit processors, we can do this with 64-bit memories, we can do it without worrying about Spectre, and we can do it without scaling limits. And for some workloads we see a bigger boost: with Firefox image rendering, we see a pretty big speedup from getting rid of the overheads of bounds checking.

So, good — I have some time for questions. That went a little faster than I was planning, but it's okay. What can you do? If you're interested in this, first, ask your runtime vendor about Segue and ColorGuard. I'm looking around — there's some guy over there who's supposed to implement ColorGuard in Wasmtime, so he's on the hook now. We're getting this stuff out into production for folks to play with. If you're interested in HFI, come talk to us, or talk to your CPU vendor about HFI support. We've been talking with folks at Intel and folks in the RISC-V community, and working within that standards body. And basically, the money in the bank when you're pushing architecture features is big vendors saying: this is exactly what we want. When we talked to the Chrome folks, they said: okay, this is exactly what we want for sandboxing V8. So if you're a cloud vendor, or a company that says, yes, this is what we want — please talk to us, because your voice matters so much. And finally, come talk to us about your production needs. If ColorGuard — if the scalability or performance — could benefit you, we'd really like to know. Shravan and I, the UCSD and UT crew, we're all about performance, scalability, and security, so if you have practical challenges, we'd like to hear about your workloads and how we can help. So I'd love to take some questions, if you have some.

Things are going well with the people in the Pacific Northwest. I don't know — they're kind of secretive; I don't want to scare them away, like a small animal. And we have really good momentum in the RISC-V community. We haven't made good connections in the Arm world, so if you know folks in the Arm world, that'd be great. But no — I mean, look, Intel folks were involved with the design of this, so there's definitely a lot of interest there.
And then some of those folks went to Rivos, and they're in the RISC-V community, so there's a lot of first-person involvement — ownership is good. Do you have a question?

Oh, no — there used to be six different segment registers. Four of them now do nothing: if you do segment-relative addressing with them, it's just relative to zero. Two of them actually still do segment-relative addressing; they don't do bounds checks anymore. When they were first added, you couldn't set them from user space, but in Ivy Bridge — probably twelve years ago or something — Intel made it possible to access them from user space. So yes, we use them for thread-local storage; they're there on every core, and you can just use them.

Oh, MPK? Yes — with MPK, each thread uses the PKRU register to say: here's the color my current thread can access, because you're only running one thread at a time on a core. And if you have hyper-threads, each of those has its own PKRU register.

For which one — HFI, or Segue and ColorGuard? Oh, ColorGuard is only available on modern x86 hardware, which is 64-bit. Arm is going to have MTE coming out on Cortex parts, but I think it's probably going to be the bigger Cortex parts, so I don't think we're going to see it down at the really small core sizes. We'll have to figure something else out there — we can talk about that.

Other questions? If you're sitting on something where you're like, man, I just have this really confrontational question — I'm Israeli, so it's really hard to offend me. What's that? With HFI? Easier and simpler, you mean, than page-based? Sorry, I don't think I quite understand the question. So with HFI — yeah, as long as the check is in... Yeah, exactly. The checks are in parallel, so you don't pay anything.

Yeah — I definitely mean it simplifies your implementation a lot. For me, there's an analogy to hardware virtualization. When I started at VMware, it took a room of the smartest people I know to virtualize the x86. In the original VMware system, the way we dealt with the fact that the x86 wasn't virtualizable was that we literally built a JIT for x86 — we had the HotSpot guys working on it. It was badass, and nobody else had it for five years. But once hardware virtualization came out — once we had VT and EPT — writing a virtual machine monitor became a project for a graduate student. The bar got so much lower. And I think HFI can do the same thing for fine-grained isolation: instead of having to modify, say, Python for years, and have a Wasm compiler and all this stuff, you just set up some regions and go. So I think it really, really will lower the bar for fine-grained isolation and open up a lot of new possibilities for how we do serverless and how we do security.

Yeah — well, it's published. I had a fight with people at UCSD who were like, you should file, blah, blah, blah, and I said: this is why we are not going to file IP on it.
So, I mean, I could have just waited them out for another five months, and then it would be in the public domain anyway. But no — because it's been done with UCSD and so many other folks, and because it's going into the public domain, it's not encumbered. And the RISC-V work will obviously all be royalty-free. So we're trying to let everybody know: we're the opposite of secretive about this.

Any other questions? Segue, yes. ColorGuard, no. Yeah? Well, it will help you with the heap-base addition, but you still have to do the conditional check yourself. I don't think we have a good answer for making 64-bit memories very fast — we're just going to need better hardware. I mean, you could design other stuff, but that's the only fast path I know right now, and I'm looking.

Anyone else, any thoughts? We've got two more minutes here. Shravan, you wanna troll me? You wanna troll me? Oh no. Ben, you wanna troll me? It's actually my controls. Okay, cool. What's that — 128-bit memories? I've had people at Intel say: I think at some point we're going to need 128-bit addressing. And I'm like, I think the physics gets harder on that one. But maybe — you can do a lot with those upper bits, right? We keep asking what to do with the top unused bits; with 128 bits, we'd have a lot of stuff to do.

Anyway, thank you so much, everybody, for coming. I really appreciate it. And please come talk to us after — I'd love to chat more about this.