So, this is the talk on gVisor. We're going to be talking mostly about container sandboxing, the security threat model with regard to containers, and some of our thought process when building gVisor: the architecture, what we were thinking, and the pros and cons of the other approaches we considered along the way. As an introduction, my name is Ian Lewis. I'm a developer advocate at Google and a part-time contributor to gVisor and other container-runtime-related projects. When thinking about sandboxing: normal containers are a great way to run applications, but they aren't always a great way to sandbox untrusted code. In a lot of cases you have code that you want to run, but you don't want to expose your infrastructure to it. That might be code that users upload to your service, as in software-as-a-service or serverless scenarios. It might be third-party code: a vendor provides you with a binary that you need to run in your cluster, but you don't know what's in that binary, whether it's safe, or what vulnerabilities it might contain. Another case is running complex processing on untrusted user input; code that takes untrusted input and does something complicated with it is more likely to have vulnerabilities. That might be video encoding, image encoding, machine learning, that type of thing. Or you might just have code you wrote yourself, and you don't entirely trust your own abilities.
So you want to sandbox that code to make sure it's safe to run in your cluster. As I said, typical business use cases are SaaS, serverless, video and image transcoding, and machine learning, along with running third-party vendor code. So let's break down why we need to sandbox and what the attack surface looks like when you run something in a container. Here's our application, and this application is not trusted: it can and may run code we don't expect, and it could be completely malicious. What applications typically do as they run inside a container on a Linux kernel is make syscalls when they need access to resources. As the application runs on the CPU, it writes some data to memory and makes a call into the Linux kernel, which triggers a context switch; the kernel reads that memory and executes the system call — in this case, opening a file. The important part is that after that context switch, the host kernel is running code in privileged mode. So if there's a bug in the Linux kernel or the host OS, you could potentially get arbitrary code execution with all of the privileges that the kernel itself has. And that's something we really don't want.
So as part of the normal running of the application, it's calling into the Linux kernel and executing code as the kernel. That's the code that's suspect in terms of privilege: if there's a bug in it, it can compromise the security of the overall infrastructure. This is why we need container sandboxes. There are a lot of projects in this space — gVisor is one of them, but there's also Kata Containers, Firecracker, rust-vmm, and so on — all building sandboxes for containers, or at least for container-like things. They're built to run arbitrary code so that people can't attack or escape from the container environment. And that code is completely untrusted: it could be doing anything, it could be 100% malicious, and we have to assume that it is, because the application inside could have been compromised in some way. The goal of sandbox isolation is to reduce the attack surface, and part of that is reducing the execution of trusted, privileged code in the kernel itself. That's typically done through some virtualization or abstraction of the host. You might have a virtual machine that abstracts away the hardware, or some other layer of virtualization. Essentially you create a virtual host that, if it gets compromised, doesn't expose the entire host or the entire infrastructure to the attacker.
As we were building gVisor — and this is one of the principles Google itself applies when it builds sandboxing systems — we wanted two layers of isolation. The idea is that we don't want to expose the infrastructure to any one single bug: you don't want to be a single bug away from an attacker getting at things like user data. The reasoning is that while a single exploitable bug may not be all that uncommon, it's virtually unheard of for an attacker to have two bugs that work at the same time to get out of a sandbox or runtime environment. By requiring attackers to get past two layers before reaching user data or the host, we greatly reduce — or at least we believe we greatly reduce — the chances of a sandbox breakout. As we thought about building gVisor, there were a lot of alternative ways to build sandboxes to consider. One option is just running the code in a normal container. But, as we said, that leaves you one kernel bug away from a breakout, because containers run directly on the host: any bug in any one of the interfaces the container has access to is potentially a way of breaking out.
And a lot of applications need access to a good number of syscalls. Dirty COW is an example where the bug was reachable through madvise, which is needed by pretty much every application out there, so you can't simply filter syscalls to sandbox the application. Another idea is unikernels. Unikernels create sandboxes by building the application with a guest OS linked in as a library, then running that in a hypervisor or in a locked-down container. It's an interesting idea, but one difficulty is that in most cases you can't bring your own container. Containers on their own aren't good for security isolation: they give you only one layer, and it's hard to reduce the attack surface using just the features available to containers, like seccomp or AppArmor. Unikernels help with some of that, but you have to build a specially crafted image that contains the operating system, so you bring a lot of OS-level support along as part of the container. That's not ideal: you need a different build process in a lot of cases, and there are a lot of limitations on the kinds of applications you can run. That's one of the reasons we think a different approach is the way to go.
Virtual machines are probably the best-understood and most widely used way of sandboxing applications. A VM gives you a guest OS — typically you bring a whole VM image, guest OS and application together — with an isolation layer at the hypervisor, which is a pretty strong boundary between you and the host. Depending on whether you're using a type 1 or type 2 hypervisor — that is, whether or not there's hardware support for the hypervisor — you get roughly one or two layers of security. Typically this gives you a reasonably good level of security. With type 2, the hypervisor is built into the kernel and the VM runs as a regular process on the machine, which gives you a decent single boundary — but you're still one bug in the hypervisor away from potentially having a full host compromise. With type 1, an attacker might need a hypervisor bug plus some hardware issue — say in memory fencing — to break out, which is a little better. But virtual machines have other downsides: flexible resource usage is hard. You have to assign full sets of memory and CPU to the sandbox and allocate them completely up front. You can do some overcommitting and so on, but it's not very flexible.
You can do memory ballooning, but that typically requires cooperation from the OS inside the sandbox, which may not be trusted, so you don't really want to have to rely on it. You also want quick startup time. If you're using one of the microVM projects you can get better startup times, but if you need to support a lot of devices, that can increase startup time again. Another thing we wanted for gVisor was something that integrates well with container tooling: something very container-like that actually gives you a sandbox. It should integrate easily into something like Kubernetes and give people a container-like experience for assigning and managing resources, while also providing a sandbox — so you don't have a totally different paradigm for your sandboxed versus unsandboxed workloads. So we thought about how to build a system that would give us all of the things we wanted. With virtual machines, to build a sandbox you typically virtualize the system at the hardware layer: you provide an interface at that layer, and everything above it is untrusted. With gVisor, we decided to virtualize at the OS level instead: we virtualize the OS, and people can bring their untrusted application and run it in a sandbox. I'll talk more about what that means in a bit.
The virtualized OS is trusted — at least up until the point the application actually runs; I'll get to that shortly. This gives us the two layers of isolation we want, because when we boot the container we control the OS: we can ensure the guest OS is the one we expect it to be, and then put a fairly strong layer of isolation at the host kernel level. What it's doing is intercepting the application's syscalls and handling them in a user-space kernel; that entire package — the sandbox — runs in user space and only needs to make a limited set of syscalls to the host in order to implement itself. By doing that we reduce the number of syscalls reaching the host, and we get more control over the arguments that are sent to them. So this gives us two layers of isolation. It uses the same principle of virtualization as VMs, but at the OS level rather than the hardware level. It reduces the host attack surface by reducing the number of syscalls handled by the actual host kernel, and the syscalls that do get sent to the host are generated by the first layer of isolation, the user-space kernel we call the Sentry. An attacker would have to find a bug in the Sentry and get past that layer before they could send any arbitrary syscalls to the kernel — and even then, the set of syscalls they can make is vastly smaller than what a normal application needs to run.
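To make the "user-space kernel" idea concrete, here's a toy sketch — entirely my own illustration, nothing like gVisor's actual implementation, which implements a large subset of the Linux syscall ABI. The point is the shape: syscall numbers are dispatched to in-process handlers, and anything unhandled simply fails without ever touching the host kernel.

```go
package main

import (
	"errors"
	"fmt"
)

// handler services one syscall entirely inside the sandbox process.
type handler func(args ...uint64) (uint64, error)

// toySentry is a hypothetical user-space kernel: a dispatch table from
// syscall numbers to in-process implementations.
type toySentry struct {
	table map[uint64]handler
}

var errNoSys = errors.New("ENOSYS: not implemented in sandbox")

func (s *toySentry) dispatch(num uint64, args ...uint64) (uint64, error) {
	h, ok := s.table[num]
	if !ok {
		// Unknown syscalls never reach the host; they just fail here.
		return 0, errNoSys
	}
	return h(args...)
}

func main() {
	s := &toySentry{table: map[uint64]handler{
		// 39 is getpid on x86-64 Linux; answered in user space with a
		// virtualized pid, no host involvement at all.
		39: func(args ...uint64) (uint64, error) { return 4242, nil },
	}}
	pid, _ := s.dispatch(39)
	fmt.Println("virtual pid:", pid)
	if _, err := s.dispatch(999); err != nil {
		fmt.Println("unhandled syscall:", err)
	}
}
```

In the real Sentry, a handler that genuinely needs a host resource (say, reading a file) is the only place a real, filtered host syscall would ever be made.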
So I'm going to talk a little more about the actual architecture of the sandbox, how it starts up, and how that makes a container secure. The way we integrate with Kubernetes is through containerd. containerd provides an API that lets you run something called a shim, which then runs an actual container, and containerd implements the CRI interface that Kubernetes needs in order to run containers. It's very similar to the way normal containers run: containerd executes a shim for runc for regular containers, and this works in much the same way. Through Kubernetes we can provide what's called a runtime handler to containerd to tell it which runtime to use — in this case runsc. The first thing gVisor does is set up the sandbox environment — this part in the dashed lines here is what we call the sandbox, and it's where the application itself is going to run. At this point the sandbox is still trusted, in the sense that we're not running any untrusted code yet. We have a Sentry — the user-space kernel — that we know is the one we expect it to be, and we set up a component called the Gopher, which does a lot of our file I/O; I'll talk about that later. We use either ptrace or KVM — what we call platforms — to intercept syscalls, and once that's done we encapsulate both the Sentry and the Gopher in seccomp sandboxes and namespaces.
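The ptrace platform builds on standard Linux syscall-stop semantics. As a rough, Linux-only illustration of the mechanism — not gVisor's actual implementation — a tracer can observe every syscall a child makes before the kernel services it:

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
	"syscall"
)

// traceSyscalls runs the program at path under ptrace and counts the
// syscall-stops it observes. Each stop is a point where a supervisor
// (like gVisor's ptrace platform) could inspect or redirect the syscall
// instead of letting the host kernel handle it.
func traceSyscalls(path string) (int, error) {
	// All ptrace requests must come from the same OS thread.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	cmd := exec.Command(path)
	cmd.SysProcAttr = &syscall.SysProcAttr{Ptrace: true}
	if err := cmd.Start(); err != nil {
		return 0, err
	}
	// The child stops with SIGTRAP at its first exec; Wait collects
	// that stop (and returns a non-nil "stop signal" error we ignore).
	_ = cmd.Wait()

	pid := cmd.Process.Pid
	stops := 0
	for {
		// Resume the child until its next syscall entry or exit.
		if err := syscall.PtraceSyscall(pid, 0); err != nil {
			break
		}
		var ws syscall.WaitStatus
		if _, err := syscall.Wait4(pid, &ws, 0, nil); err != nil || ws.Exited() {
			break
		}
		stops++
	}
	return stops, nil
}

func main() {
	n, err := traceSyscalls("/bin/true")
	if err != nil {
		panic(err)
	}
	fmt.Printf("observed %d syscall stops (entries + exits)\n", n)
}
```

Even a trivial program like `/bin/true` generates dozens of stops, which hints at why ptrace interception has real overhead and why gVisor also offers the KVM platform.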
So we run these in a fairly restricted container, and then we actually run the application itself. Once it starts running, it can make syscalls to the Sentry, and at this point we consider the entire sandbox untrusted — we treat the sandbox as potentially compromised, because it's running untrusted code. The idea is to handle most syscalls entirely within the sandbox, and in the cases where we do need host resources, the Sentry does some filtering to make sure the syscalls made to the host kernel aren't malicious or problematic in any way. We can also run multiple containers inside one sandbox, which is how we implement pods in Kubernetes: a single sandbox per pod, with multiple containers running inside it. These are namespaced — gVisor's namespace support is slightly specific, but essentially containers are namespaced in a similar way to what you'd expect from a pod — so you get an experience very similar to normal containers: they share the same IP address, they can talk to each other over localhost, that sort of thing. Turning to the design principles behind gVisor: one is that we have these two security layers. Part of that is the Sentry plus the seccomp sandbox around it, which gives us two layers of isolation from the Linux kernel's syscalls. But for things like file operations, we go through a separate process called the Gopher, to separate out the ability to open files.
So if you wanted to open an arbitrary file, you'd need to break through the Sentry and then find another bug in the Gopher before you could open and write files you wouldn't normally be able to write to. That provides our two layers of isolation for file system access. Some other design principles: we provide minimal access to the host. No syscalls are passed through — it's not just a seccomp sandbox; the syscalls made to the host are made as part of the implementation of the sandbox itself. And we limit the number of syscalls allowed: instead of the roughly 200 in the default Docker seccomp profile, gVisor allows something like 30. We run the sandbox itself in user mode, so even if the sandbox is compromised, the attacker doesn't have root access on the host. We also implement the Sentry and the other parts of gVisor in Go, to avoid a lot of the problems you get writing applications or kernels in C, like buffer overruns: Go has array bounds checking and similar protections built into the language, so it's memory safe, and a lot of these security features come for free. Other practices help too, like reviews of unsafe code, and the fact that binaries are statically linked, so we don't have issues with someone overwriting a dynamically linked library. These are some of the design principles we build into the development of gVisor.
And then we use a lot of built-in host features: cgroups, namespaces, and chroot — pivot_root in our case — and we run the actual sandbox under the UID and GID of nobody, so the process doesn't have escalated privileges or privileges it doesn't need. We drop pretty much all capabilities for the sandbox, because most of them can be implemented by the sandbox itself. As we start it up, the sandbox is encapsulated in a cgroup to make sure it doesn't take up too many resources on the host, and it's encapsulated in namespaces. So the sandbox itself is very similar to an actual container, but a very restricted one: it's fixed to running as the UID and GID of nobody and has a fairly restricted seccomp profile. As we create the sandbox, we set up the namespaces and pivot_root into a directory that contains basically nothing except runsc — the sandbox itself sees almost no file system at all. Then we drop all capabilities, set the UID and GID to nobody, and install a restricted seccomp profile that only allows around 30 to 40 syscalls. For file operations — because the sandbox can't do things like open a file — we go through the separate process called the Gopher. The Gopher is confined in a similar way, except it actually has access to the file system that's used by the container.
So the Gopher also runs in a cgroup, has a reduced set of namespaces, and is pivot_root'ed into the actual file system, with the file systems the container needs mounted in. It also has reduced capabilities and a fairly restricted seccomp profile. It can do a bit more than the sandbox can — it can actually open files and things like that on the machine — but what it's allowed to do is restricted enough that even if you are somehow able to exploit a bug in the Gopher, you're hopefully not able to break out of the Gopher itself. Now, some things to think about: what's not protected. The sandbox can only do so much. In particular, when you're using Kubernetes, a lot of its defaults are optimized for ease of use, not security. The same goes for Docker: Docker allows things like raw sockets, for instance, in its default set of capabilities. These are things you have to keep in mind. In the case of raw sockets, gVisor doesn't actually give you access to them — but the broader point is that Kubernetes gives you a lot of defaults that aren't necessarily meant to be secure, and you have to keep that in mind when you run a pod. You need to set things like memory and disk limits for the actual application. runsc will set those up for the sandbox if you provide them via the OCI spec, but you need to make sure they're there in case the sandbox decides to try to use up all the memory on the host, for instance.
There's also network and disk isolation. Network access is something to be aware of, because pods are given an IP address and sit on the pod network, so you need to use network policy or some other mechanism to limit what the sandbox can do on that network. Then there's arbitrary packet injection — the raw sockets issue I mentioned. The Sentry provides isolation here in the sense that the network stack runs inside the sandbox, but the sandbox itself has access to write arbitrary packets: if somebody takes control of the sandbox or the Sentry, they can write arbitrary packets on the network, and you need to be aware of that. Another thing is file writes and permissions: if people can write to the root file system of your container, they could potentially overwrite files or executables that later get executed, which can lead to vulnerabilities. So it's ideal to use read-only file systems for the base root file system and anything you don't expect to change within the container. Also, gVisor doesn't really have a built-in throttling mechanism, so you need to use cgroups or otherwise rely on the host for that. I'm going to try to wrap up here. gVisor itself is very container-like, and we're trying to make it as close to the container experience as possible and integrate it really well with Kubernetes. Currently you can use RuntimeClass in Kubernetes to specify gVisor as a runtime class and have that go through containerd to run containers in gVisor. We also have a minikube add-on that lets you try out gVisor and play with it without having to set up a whole Kubernetes cluster yourself. So give that a try.
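For reference, the RuntimeClass wiring looks roughly like this — treat it as a sketch: the handler name must match whatever your containerd configuration registers for runsc, and depending on your Kubernetes version the API group may be `node.k8s.io/v1beta1` rather than `v1`.

```yaml
# RuntimeClass mapping a name to the containerd handler for runsc.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
# A pod opts into the gVisor sandbox with a single field.
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-nginx
spec:
  runtimeClassName: gvisor
  containers:
  - name: nginx
    image: nginx
```

Everything else about the pod spec — resources, volumes, probes — stays exactly as it would for a regular container.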
On Google Cloud we also have GKE Sandbox, which gives you gVisor support out of the box. To support all of this, we're working on and maintaining the gVisor containerd shim, which provides containerd support for running gVisor sandboxes. gVisor is an open source project: we have a website at gvisor.dev, and you can check out the code on GitHub. We have a Gitter community for chat about gVisor development and usage if you have questions, and a couple of mailing lists for folks doing development on gVisor and for gVisor users to ask questions, find out more, and interact with the folks working on gVisor. I have basically one minute left, and I was going to show you what it looks like to install gVisor and have it running in Kubernetes, but I got disconnected, so I think I'll give up on that. What I wanted to show was using a runtime class as part of the pod: it's essentially just putting that field in the pod spec, and you get mostly the same experience as a regular container — you can exec into the container, get the logs, all of that sort of stuff. So I'll finish up there since I'm out of time. If you have any questions, I'll be around here, and I have some stickers if you want stickers. We also have Kevin from the development team here to help answer questions. So thanks a lot.