So my name is Luboslav Pivarc. I'm working at Red Hat as a software engineer, on the KubeVirt community project, which is about virtual machines on Kubernetes. I've been there for a couple of years, and I'm an approver. The title of the talk is Running KubeVirt Workloads With No Additional Privileges. So let's start. What is our agenda today? I'm going to give you a quick crash course on Kubernetes and KubeVirt. Then we will see how security is enforced inside Kubernetes, because we are running within Kubernetes. Then we will look at what is actually enforced by these security mechanisms. And then we will take a close look at KubeVirt and how it stands with respect to these mechanisms. So what is Kubernetes? Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. Words don't tell me much, so let's look at a few images and workflows. The Kubernetes API is based on objects, and these objects represent a lot of features. So when a user wants to run a workload, they need to find what type of object represents it, and that object is called a Pod. When they want to run the workload, as I said, they need to find the schema, the API version, and so on, and craft the spec of the containers they want to get from the system. Then they post it to the Kubernetes API. The Kubernetes API validates the semantics and the schema and stores it in its storage. After that, the scheduler can pick it up and choose a node suitable for the containers to run on. So let's look at the node perspective. There is a node agent called the kubelet. The kubelet watches the Kubernetes API for workloads, and with the help of a container runtime, it creates the Pod. A Pod is really just an abstraction for one or more containers.
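The Pod object the user crafts and posts can be sketched roughly like this; the field names follow the core/v1 Pod schema, while the object name and image are illustrative placeholders:

```python
# A minimal sketch of the kind of Pod manifest a user posts to the
# Kubernetes API. Only a tiny subset of the schema is shown; the
# metadata name and image reference are made-up placeholders.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "demo-pod"},
    "spec": {
        "containers": [
            {
                "name": "app",
                "image": "registry.example/app:latest",  # placeholder image
            }
        ]
    },
}
```

After validation, an object like this is persisted, and the scheduler assigns it to a node where the kubelet materializes it as running containers.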
These containers then share the resources assigned to the Pod. The kubelet is also responsible for the lifecycle of the Pod, which is necessary for some kinds of interaction with the system. So that's what we need to know about Kubernetes. Let's look at what KubeVirt actually is. KubeVirt is a Kubernetes extension that allows running traditional virtual machine workloads natively, side by side with container workloads. So if you are on the move to containers but you still acknowledge that you need virtual machines, this is the project you are looking for. And how do we do that? As I said, Kubernetes is based on objects, and it's really extensible. KubeVirt hooks into the API and provides its own object called VirtualMachine, and we have webhooks and an API that can validate the schema and so on. So the user interacts with it just like with any other workload type in Kubernetes and posts a spec of the virtual machine to the Kubernetes API. There we pick it up and translate it into a Pod, the workload specification. Kubernetes does some work for us, and we end up on the node. Here there are two important things to talk about. The first is that we get a Pod, which we call virt-launcher. Here we are running components such as libvirt and QEMU, which you are probably familiar with, and this is the component that runs the workload for the user. So we don't trust this component. There is also a second component, our node agent called virt-handler, on the left side. This is a privileged container, which allows us to see other processes on the node, because containers are just processes and we can mess with them. This is really important for my talk, because we need to do some privileged setup for the virtual machines to be able to drop the privileges on the user's side, the virt-launcher. Great, so how are Kubernetes security mechanisms implemented? Back to the first picture.
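The VirtualMachine object the user posts looks schematically like this; the field names follow my reading of the kubevirt.io/v1 API, and the names and image are placeholders:

```python
# Sketch of a KubeVirt VirtualMachine manifest. KubeVirt's controllers
# translate an object like this into a virt-launcher Pod. Field layout
# is based on the kubevirt.io/v1 schema; exact fields may differ by
# version, and the disk image reference is a made-up placeholder.
vm = {
    "apiVersion": "kubevirt.io/v1",
    "kind": "VirtualMachine",
    "metadata": {"name": "demo-vm"},
    "spec": {
        "running": True,
        "template": {
            "spec": {
                "domain": {
                    "devices": {
                        "disks": [{"name": "rootdisk", "disk": {"bus": "virtio"}}]
                    },
                    "resources": {"requests": {"memory": "1Gi"}},
                },
                "volumes": [
                    {
                        "name": "rootdisk",
                        "containerDisk": {"image": "registry.example/disk:latest"},
                    }
                ],
            }
        },
    },
}
```

The point is that, to the user, this is just another Kubernetes object posted to the same API as a Pod.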
Just as we can hook into the API, other mechanisms can hook into it too. For example, there is an OpenShift-specific one called Security Context Constraints, and in plain upstream Kubernetes there is Pod Security; that's not a separate project, it's integrated into Kubernetes itself. Most of the time these mechanisms provide some kind of pod security policies, and these policies cover a broad spectrum. On one side you have restrictive policies, which are like hardening best practices, what you should allow a normal user. On the other side there are privileged policies, under which, short of outright privilege escalation, you can do basically whatever you want on the system; those should not be assigned to just any user, you should know and trust them. The problem here is that these security enforcement mechanisms are often not flexible enough. If you need some privileges for the virtual machines, oftentimes you need to give privileged access to the user, who can then craft any arbitrary workload with the same privileges, or even more privileges, than we are using. So this is a real issue, and we need to integrate with the Kubernetes model to allow normal users to run virtual machines. So what is actually restricted? The list is long, you don't need to read through it all. It comes down to two things: restricting a few of the features that Kubernetes provides, and plain security measures you would apply to processes on the node as usual. We are going to focus on three categories, or three features: capabilities, SELinux, and running as non-root. So what was the first step for us? The first step was unprivileged networking, because we used a lot of capabilities for networking, and we had to cut down on them. Some background: Kubernetes has a default network.
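The two ends of that policy spectrum show up concretely in a Pod's securityContext. A sketch of what a restricted-style policy pushes a workload toward, next to the privileged end, assuming the standard core/v1 securityContext fields:

```python
# What a "restricted"-style Pod Security policy expects from a
# container's securityContext: non-root, no privilege escalation,
# all capabilities dropped, default seccomp profile.
restricted_context = {
    "runAsNonRoot": True,
    "allowPrivilegeEscalation": False,
    "capabilities": {"drop": ["ALL"]},
    "seccompProfile": {"type": "RuntimeDefault"},
}

# The other end of the spectrum: a privileged container, which should
# not be grantable to arbitrary users.
privileged_context = {"privileged": True}
```

The talk is essentially about making KubeVirt's virt-launcher Pod fit the first shape instead of the second.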
When the workload is submitted, it gets an IP address; the container gets an interface. To connect that interface to the guest, QEMU provides a few options. For example slirp, which would be an unprivileged way to do it, or a tap device. But configuring the tap device is a privileged operation; in fact, most networking configuration consists of privileged operations. So we needed to solve this problem somehow, and at the same time we need to provide the IP address of the workload to the guest, for which we use standard DHCP. The way to address the problem is to offload the networking setup to the privileged component, virt-handler. But this also requires the well-known management tools like libvirt to give us the option, for example, not to manage the networking for us. There is an "unmanaged" option in libvirt that basically does what you need. This also means that existing management tools need to know that there are cases where they lose their privileges and should not implement features in a privileged way. One problem remains: we are left with the NET_BIND_SERVICE capability. I'll get back to why that is a problem, but it is one. So what about running the workloads as a normal user? It can be as easy as setting the user on the container and then using libvirt in session mode. What session mode does is tell libvirt: you don't have many privileges, please don't do much, and please run the virtual machine as the user who is requesting it.
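The "unmanaged" option shows up in the libvirt domain XML as an attribute on the interface's target device. A sketch of what such an interface definition might look like, checked with Python's XML parser; the exact attribute placement follows my reading of the libvirt domain XML format, and the device name is a placeholder:

```python
import xml.etree.ElementTree as ET

# Sketch of a libvirt <interface> element where libvirt is told not to
# create or configure the tap device itself (managed='no'): the
# privileged component pre-creates "tap0" and libvirt just consumes it.
# Attribute placement is an assumption based on the libvirt domain XML
# format; "tap0" is an illustrative device name.
iface_xml = """
<interface type='ethernet'>
  <target dev='tap0' managed='no'/>
  <model type='virtio'/>
</interface>
"""

iface = ET.fromstring(iface_xml)
target = iface.find("target")
```

The key point is that the unprivileged process never performs the tap setup itself; it only references a device someone privileged already prepared.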
So, yeah, it can be that easy, but not for us, because some security policies actually dictate that you run as an arbitrary user, which means you get pre-allocated ranges of UIDs and you need to cope with that. That's a good thing: for example, when you're running QEMU processes on the host, you don't want them all to be the same user, because then they could reach each other's resources and so on. But it brings challenges with containers, because the file permissions on the file system are set at build time of the container image, and modifying the container file system might not be a good idea, since these can be copy-on-write file systems. The solution is to use Kubernetes for this, with a feature called emptyDir, which basically gives you a tmpfs whenever you need it; the permissions are open, and then you can manage it on your own. This removes the burden on vendors of keeping track of which parts of the file system structure they need to open up to the user. The next feature you would probably like to use is some kind of storage. Kubernetes allows two types of storage, filesystem and block volumes, and they don't really have standard permission models. So what basically happens when you are running as a non-root user is that you probably get permission denied. For this, there is a feature inside Kubernetes called fsGroup. It makes sure that the files and directories belong to a special group, and you are then in that group. But that doesn't work well with all the storage providers out there, and the feature is actually restricted by some of the policies, so we can't use it. The solution is, again, to manage the permissions inside the privileged component, virt-handler. What about devices? QEMU needs to access some devices, for example KVM for acceleration, or vhost-net for better networking performance. What about vGPU passthrough, or SR-IOV networking?
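The two Kubernetes features just mentioned, emptyDir and fsGroup, both live in the Pod spec. A sketch combining them, with illustrative names and an arbitrary group id:

```python
# Sketch of a Pod using the two mechanisms discussed above:
#  - an emptyDir volume backed by tmpfs ("medium": "Memory"), whose
#    contents the workload can own regardless of the image's build-time
#    file ownership;
#  - the pod-level fsGroup, which (where the storage provider supports
#    it) makes volume files group-accessible to the workload.
# The pod/container names, image, mount path, and group id 107 are all
# made-up placeholders.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "vm-launcher"},
    "spec": {
        "securityContext": {"fsGroup": 107},
        "containers": [
            {
                "name": "compute",
                "image": "registry.example/launcher:latest",
                "volumeMounts": [
                    {"name": "scratch", "mountPath": "/var/run/vm"}
                ],
            }
        ],
        "volumes": [
            {"name": "scratch", "emptyDir": {"medium": "Memory"}}
        ],
    },
}
```

As the talk notes, fsGroup is restricted by some policies and uneven across storage providers, which is why the permission fixing ends up in virt-handler instead.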
All these things need access to devices. Kubernetes has a mechanism called device plugins, which makes sure the cluster sees the resources, the scheduler can schedule against them, and the container gets the devices. But what happens when you run as a non-root user? You actually don't have access, or access is inconsistent. Why? Because the device plugins copy the permissions from the host. So depending on the vendor or operating system you are using, it can vary. It's not usable, it's inconsistent. Before the solution, I need to mention that Kubernetes has actually acknowledged this and has a follow-up to allow access for the unprivileged user. But we need a solution right now. So, again, we offload fixing the permissions onto virt-handler. This has drawbacks, because we only know about the devices we manage ourselves; if you are integrating a new device through our extension points, this can fail. I mentioned we kept one capability for the networking, and we actually introduced another one, SYS_PTRACE, when we introduced vTPM support in KubeVirt. The problem with capabilities and non-root containers is that the calculation of the capabilities of the new process on exec, so for the container, is different when you are running as root versus as non-root. You have probably seen this man page. The trick is that you need to use file capabilities: the first way is to set file capabilities on the executed binary, to make sure the process gets the capabilities. The second way is to use ambient capabilities. That is the better way, and I will tell you why, but it is not integrated in Kubernetes, so again we need upstream support for that. Why are ambient capabilities better? They don't require any changes to the images, so you can run out of the box.
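The root versus non-root difference on exec can be seen from the transformation rules in the capabilities(7) man page. A small Python model of those rules (simplified: it ignores setuid binaries and the securebits, and models the root special case only via the file-capability parameters):

```python
# Simplified model of the execve() capability transformation from
# capabilities(7):
#   P'(ambient)   = 0 if the binary has file caps, else P(ambient)
#   P'(permitted) = (P(inheritable) & F(inheritable))
#                 | (F(permitted) & P(bounding)) | P'(ambient)
#   P'(effective) = P'(permitted) if F(effective) else P'(ambient)
# For a root process the kernel behaves as if the file granted all
# capabilities, which is why root keeps them and non-root does not.
def caps_after_exec(p_inheritable, p_ambient, p_bounding,
                    f_permitted=0, f_inheritable=0, f_effective=False):
    new_ambient = 0 if (f_permitted or f_inheritable) else p_ambient
    new_permitted = ((p_inheritable & f_inheritable)
                     | (f_permitted & p_bounding)
                     | new_ambient)
    new_effective = new_permitted if f_effective else new_ambient
    return new_permitted, new_effective

NET_BIND_SERVICE = 1 << 10  # capability bit number 10

# Non-root, no file caps, no ambient caps: everything is lost on exec,
# even if the capability was requested for the container.
assert caps_after_exec(0, 0, ~0) == (0, 0)

# File capabilities on the binary restore it...
perm, _ = caps_after_exec(0, 0, ~0, f_permitted=NET_BIND_SERVICE)
assert perm == NET_BIND_SERVICE

# ...and so does the ambient set, with no change to the binary at all.
perm, eff = caps_after_exec(0, NET_BIND_SERVICE, ~0)
assert perm == eff == NET_BIND_SERVICE
```

This is why an unmodified image running as non-root silently ends up with empty capability sets, and why the two tricks in the talk are the only ways out.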
And some security policies dictate that you run with the allowPrivilegeEscalation: false flag in Kubernetes, which sets the no_new_privs bit for the process. That means file capabilities are going to be ignored, so you don't get them, and you end up being denied access to the operations you want to do. Another drawback of file capabilities is that you always need to request the capabilities for the workload, because if you did not, the creation of the process would fail: the process is not allowed to have a capability that is set on the binary. So this prevents an optimal approach. For example, if you don't need vTPM, we would like to drop SYS_PTRACE, but we cannot. The last thing I want to target is SELinux. We keep it enabled. What's the problem there? Some of the rules we need were not considered suitable for the normal container_t type, meaning normal container workloads should not have those rules, but the situation is getting better and better. We have just two pain points left. One is virtiofsd, which requires a lot of rules. The other is the hugetlb file system: the policy allows you to create a file, but not a directory, and a directory structure gets created inside the hugetlb file system, which gets tricky with SELinux. That's not good, but this rule can probably be upstreamed. So what's left? An arbitrary user needs to be used for the workload to comply with some of the policies. This can get tricky, because user namespaces are going to be a thing in Kubernetes, so we need to think about that. Also, we are trying to remove the custom SELinux policy and only set it when we need it, upstream all the rules we require, or use alternative APIs for the things we need.
For example, for hugepage support, you can use a different approach than the file system: the memfd API, which is not privileged. And, yeah, we are waiting for ambient capabilities upstream. One more thing we can all do is to switch to a restricted-first approach and think about what kind of privileges we really need for the operations we are doing, as management tools or other emulation programs. That's all. Thank you. You can get in touch with us via our Twitter handle and on Slack. Any questions? Thank you very much.