So thanks, everyone, for joining in. I am Fabiano Fidêncio, a software engineer working for Intel as part of the Cloud Hypervisor and Kata Containers team. I'm here today to talk a little bit about, well, Kata Containers; we'll go through that quite quickly. Then the confinement work that we have done, because this talk is about confining the extra security layer. So we are going to go through how we got Kata Containers to properly behave using SELinux, what is still missing, the benefits to the community, and then a quick recap. I must tell you, I have a lot of slides, so I may not be able to get to the Q&A here. I'm going to have to speak quite fast, but I will be outside afterwards, so we can just chat there. Don't worry, I'm not going to be running; I cannot run anymore. So, Kata Containers. I guess the best way to start talking about Kata Containers is to start talking about traditional containers. Is everyone here familiar with traditional containers and Kata Containers? OK, cool. Traditional containers are pretty much a process running on top of the host Linux kernel. They are confined by cgroups and namespaces, they can sometimes have mandatory access control, which a bunch of people just disable, and they can be less or more privileged according to capabilities and seccomp. Kata Containers is pretty much the same thing, but we provide an extra hardware virtualization layer where you actually run your container process. So in the case here, on the right, if there is an escape, you are pretty much on the host Linux kernel: you can do pretty much whatever you want if it is not confined by some mechanism like SELinux. With Kata Containers, though, if you have an escape, you are inside a really, really lightweight virtual machine, and the only things you can see there are the resources that were allocated specifically for your pod.
So you cannot mess around with other processes unless you are also able to escape the virtual machine, which is, well, good luck. OK, how does Kata Containers actually work? Folks here are familiar with Kubernetes? OK. Kata Containers, since version 2.0, is quite tailored to Kubernetes; we don't have Docker or Podman support anymore. On the Kubernetes side it works like this: you have the kubelet daemon running on a node. A user creates a pod, kubelet receives the request, then passes it down to the CRI engine, which is either CRI-O or containerd. That in turn starts containerd-shim-kata-v2; this long name is the name of the Kata runtime. I will interchangeably say containerd-shim-kata-v2, Kata shim, Kata runtime: it's the same thing. The Kata runtime then starts the VMM process. In Kata Containers we have support for, well, we actually test in our CI, three different VMMs: QEMU, which I guess everyone is familiar with, and which was the first one we started using; Firecracker, which relies on rust-vmm; and Cloud Hypervisor, which is a project from Intel, from my team, also based on rust-vmm, but really, really tailored for cloud use cases. OK, the Kata runtime starts the VMM process, and the VMM process brings the guest up. Inside the guest we have an init process that is the agent, and the agent is responsible for the lifetime of the containers. You have to share content between the host and the guest: the container rootfs, and if there are volumes mounted, those have to be shared between the host and the guest as well. There are two ways to do that. The first one is using device mapper, so you can just attach a virtio-blk or virtio-scsi device there. Or you can use something like virtiofsd or nydusd, shared file system daemons that allow you to do that.
virtio-fs started as a replacement for the old 9pfs, and we all hope it stays that way. So, enough talk; let me show you a little bit of what I've been telling you. Here I have a Kubernetes cluster running with CRI-O as the CRI engine, and I'm going to start two NGINX containers. The first one is going to run with the runc runtime using the NGINX image. The second one has one main difference: this runtimeClassName, which is just telling Kubernetes I'm going to use the Kata Containers runtime with QEMU as the VMM. So I start both pods; they are both running. Let's actually check that they are behaving as expected. For both of them we're just going to check that the NGINX server is up. It works very well for runc, which is the default container runtime, and it also works very well for Kata Containers. Now, let's take a look at the processes that are running on the host side. Let me just go back here. We have the Kata runtime process, we have virtiofsd, which I mentioned is for sharing the file systems between the host and the guest, and we have QEMU. Those are the processes running on the host side, and those are the processes through which, if there is an escape, an attacker could actually get access to our host. So we have to confine them better. What is the best way to confine a process that, as we see here, is running as root? Because rootless Kubernetes is still not a thing yet; they are making progress, but what is the best way to confine a process that is running as root so it does not have access to other content? SELinux. So what is SELinux? Red Hat has a really nice definition on their web page. It is Security-Enhanced Linux; it was started by the NSA and got into the Linux kernel in 2003. So you know this is a quite mature and quite well-developed project.
And it has been expanding since it started to accommodate new technologies, which is amazing. I like the definition, though, from Lukas Vrabec from Red Hat, who says that SELinux is a technology for process isolation to mitigate attacks via privilege escalation. How does it do it? I guess the main thing we have to understand here is the difference between mandatory access control (MAC) and discretionary access control (DAC). DAC is what we have with any Linux system: we have a file, you can set who owns the file and the permissions on that file, and that's pretty much it. But we have a problem. Imagine that I have a read-only file that can only be accessed by the user fabiano. But someone breaks into my machine and gets root access. They can simply change the permissions, they can simply change the ownership, they can access the file, they can do whatever they want. With mandatory access control, this is slightly different, because MAC actually defines access controls for the applications, processes, and files on the system. It basically defines the interactions that a process can have with the other things you have running on your host. It uses a set of rules about what can and cannot be accessed; we call those policies. And when a process requests access to an object, the permissions are checked, and access is either denied or granted. So with SELinux, you may be root, you may try to access a file that has 777 permissions, and if your process is not labeled correctly, you will not get access to it; you'll just get permission denied. So this is what we want to ensure. In order to ensure this for Kata Containers, a new policy had to be created. Dan Walsh from Red Hat, also known as Mr. SELinux, created this policy called container_kvm_t. The whole idea is: we take the container_t policy, which is what containers should be using, and which is a very, very restrictive policy,
and we expand it a little bit to allow whatever a VMM and virtiofsd have to do in order to actually be able to run Kata Containers. The container_t label can access all the things that carry the container_file_t label, so everything that is inside the container. And if a container breaks out, it will just get blocked, because if it tries to write anything on /root, /usr, or /var, it has no permission to do so. However, it can still read and execute binaries from /usr, because it is useful to be able to create a pod where you link a binary from your host into the container. The problem is that hypervisors and shared file systems need different access. If you are thinking about a hypervisor, it somehow has to have access to TUN/TAP devices in order to pass them to the guest side so we can have connectivity inside the guest. virtiofsd has to mount directories and volumes on the host side. And we really don't want the normal container_t label to be able to do so. So instead of expanding container_t, the decision was to create a new container_kvm_t type. And mind you, this is way more restrictive than the normal VM label, because we went in the opposite direction: we took the container policy and just tuned it to what we need for Kata Containers. And we hoped this would work out of the box for all VMMs and shared file system solutions. We were wrong. So anyway, this was introduced. Most of the work on this front happened in two projects. There is container-selinux, which is the policy; Dan Walsh added the new policy there. And then there is the SELinux library for Go, where we had to add support for checking the correct file, seeing whether the label exists on the system, and then properly passing the label along and setting it on the process. So those two projects were involved. And then we wanted to have this actually working as part of Kata Containers.
So the approach taken was: we should receive a label from the upper process. The CRI engine should be the one that tells me, as the container runtime, "I want this label; I want you to spawn the VMM and the virtiofsd process with this label." And we pass it down via the OCI spec. Dan Walsh, again, did the majority of the work here; I was pretty much only making sure that we were setting the label in the correct parts of Kata Containers. Then I started working on the CRI-O support, because if you want to have this running on OpenShift, and OpenShift uses CRI-O, we have to have it running with CRI-O. The CRI-O support was added by Urvashi. But there was one mistake there: she added this when the container is created, and we create the container already inside the virtual machine. We don't care about this being confined inside the virtual machine; we want it confined on the host side. So I ended up changing it to set the label on the OCI spec at the time the sandbox is created, so it gets to the Kata runtime, which is then able to spawn everything with the right label. So let me just show you. I'm not going to run the whole demo again, because it's exactly the same recording; I have the links for that. This is just the second part of the recording. I'm going to show you here that the first demo ran with SELinux set to enforcing. SELinux has three modes: disabled (don't do that, please); permissive, which is "you can run everything, but I'm going to complain about it and log all the errors"; and enforcing, which is "if there is an error, I'm just going to block it." So everything was running as enforcing. And here you can see the labels of the processes. We have virtiofsd here, running as container_kvm_t, and the same for QEMU.
But CRI-O is not the only CRI engine we are interested in. There is containerd, and containerd is by far the most used CRI engine out there. Does it work? There is only one way to know, right? This is, again, a Kubernetes cluster, this one using containerd. And I'm going to do exactly what I did before: start the NGINX pod using Kata Containers' runtime class with QEMU as the VMM. Apply the pod, see that it's up and running, check that everything is actually working as expected, and then check the process labels. And, oh boy. You can see here that virtiofsd is running with the container_runtime_t label. It's not container_kvm_t; it's actually the same label that the Kata runtime is using. This label is really permissive: it allows you to play with cgroups, it allows you to create namespaces, it allows you way more permissions than you should actually have. So this is weird; there is something wrong here. The same happens with QEMU, the VMM process. So let's compare. Up there we have containerd, down here we have CRI-O. We have the same labels being used for the Kata shim, the Kata runtime, but we have different labels being used for the virtiofsd and QEMU processes. Down here are the right ones; up there are the way more permissive ones that we don't want. So I started debugging this. What is going on? The first thing that came to my mind was: is this actually enabled on the containerd side? This is the configuration I'm using, and here you can see it is enabled, so I didn't miss that part. Does containerd actually have support for passing this container_kvm_t label down to Kata Containers? It does. It was added by Michael Crosby in that commit over there; it was almost a birthday gift for me, one day early, in August 2020. And I'm actually running a version of containerd that has that patch: I was using 1.5.2, and it is part of 1.5.2.
So there must be something fishy in the code. Sorry if this is too small; this is the best I could do. What does containerd do when it's setting up the label? It comes here and tries to modify the process label. It checks: is this process Kata Containers? If it is, let's get the KVM label. That call gets to this function here, which reads all the content from that file, and this one here checks: if container_kvm_t is part of that customizable_types file over there, we return it, "OK, here is the KVM label." Otherwise, it just returns nothing, which means the process will be running with the same label as the parent process. And you can see that, as it was running with container_runtime_t, it had exactly the same label as the parent process. So let's take a look at the file it is actually using. If we open that file, we don't see the container_kvm_t label there; it's just not there. So what do we know? We know that containerd searches for a label in a file where the label is not there. This label, though, is present in another file, the SELinux contexts file under /usr/share/containers. Those are two different projects, two different packages: the first one comes from the distro's selinux-policy-targeted package, the second one comes from the container-selinux package. So how do we fix that? What shall we do? Just remove a bunch of code; best way, right? What I'm doing here is, instead of doing this whole dance to get the label or not, I just rely on the SELinux library, because that library actually knows where to look. If the label is present there, it returns the right label; if not, it returns an error. So we just do this, rely on that, and let's see if it works. Same drill as before, right?
So: containerd cluster, same Kata Containers NGINX pod. It's running, as it was before. Let's see if it's working as before. And now let's take a look at the labels of the processes. You can see here that virtiofsd is running confined by container_kvm_t, and the same for the QEMU process. So now we have the right level of restriction. This was merged as part of containerd 1.6.0-beta.5; the fix is there, and I backported it to containerd 1.5.9. So, cool: containerd now has support for this, and it is working as expected. But we are talking about QEMU all the time. I work for Intel; I have a really strong interest in having Cloud Hypervisor running everywhere possible. So does it work? Again, only one way to check. Same CRI-O machine as the first one; we have everything set to run as enforcing. Here we have the runtime class, the same as for QEMU, but this time using Cloud Hypervisor as the VMM. So we start the pod and take a look. ContainerCreating for four seconds; this is fishy. Something's going on here. When we describe the pod, we can see there was an error. And if we take a look at whether SELinux caught something, you can see there was an AVC denial. Don't worry too much about reading it right now, because I have it here: up there is the error that happened when trying to start the pod, down here is the AVC. It basically says that opening the tap device failed because of permission denied. The Cloud Hypervisor binary, which is running with the container_kvm_t label, is not able to open a TUN/TAP device. In my mind I was like: how is this different from what QEMU was doing? I took a look at the Kata Containers code and at what QEMU actually does: it receives the file descriptor of an already-open TUN/TAP device. While with Cloud Hypervisor, what we do is open the TUN/TAP device as part of the Kata shim, get its name, close it, pass the name down to Cloud Hypervisor, and say: hey, open this for me.
And then it's not allowed to do that. Does it have to be fixed on the Kata Containers side or the Cloud Hypervisor side? Yes, both. So let's go through a little bit of how Cloud Hypervisor works. Cloud Hypervisor is quite neat. When you launch Cloud Hypervisor, it basically starts an HTTP server for you, and then you talk to it via a REST API. Then you create the VM. But creating the VM just means you pass the configuration; Cloud Hypervisor will use that configuration whenever the VM is booted. So create-VM and boot-VM are two different things. But we have a problem, because create-VM does not have the capability to receive a file descriptor for the network device. You can do that when you are attaching a new network device; however, that was only possible when the VM was already booted. And attaching a network device after the VM is booted is not the way containers work in general: we expect the network to be up when the sandbox is created, so it would just fail. And then we have another problem: the bindings for this REST API are generated by a project called OpenAPI. It auto-generates Go code from the API definitions, and that is what we use inside Kata Containers. And it has no notion of socket control messages. So that was hard. What I did to fix this on the Cloud Hypervisor side: I talked a lot with Robert Bradford and Sebastien Boeuf, who are the maintainers, and we decided to just allow people to create the VM, meaning create the configuration, and then patch it. If you are trying to hot plug something while the VM is not started, you just patch the configuration; it does not require the VM to be up. You just change the configuration, that's fine, and then when the user boots it, everything is in place. This work was done and merged, and this is what we are using on the Cloud Hypervisor side.
On the Kata Containers side, though, we had to make a few changes, because we were getting the network device, creating the VM, and then booting it. We had to postpone the network addition until after the VM is created but before the VM is booted, because we want to do this change in the config where we can actually pass the file descriptor down to Cloud Hypervisor. And instead of passing the name, we pass the file descriptor, because if you pass the name, Cloud Hypervisor gives it preference and will just try to open the device by its name. And because of the limitations of OpenAPI, what we had to do was take care of the request ourselves, since OpenAPI has no notion of sending and receiving file descriptors over sockets. So we just do a simple PUT request and get the response on our side. This is work that is still in review; it got some really good feedback, and I hope to get it merged next week, so it will be part of the next release. So let me give you a demo of this running. You can see the same cluster as before. Now we are creating the NGINX pod, same as before, but using Cloud Hypervisor in the runtime class name. Let's get the pod and make sure everything works; this is the pod here, it works. Now let's take a look at the labels. Yay! You can see here that virtiofsd is running as container_kvm_t, and the Cloud Hypervisor process is running as container_kvm_t as well. And you can see SELinux is set to enforcing. So, yay, we have this done. But is this work completed? Not only no, but hell no. We are still missing things. We have to make sure this actually works with kata-deploy; kata-deploy is a daemon set that we have to ease the deployment of Kata Containers in a Kubernetes cluster. We have to get Firecracker support added; the changes will be really similar to what we have done for Cloud Hypervisor, as it uses the same rust-vmm crates.
But I'm not sure the community will actually accept the changes; we have to check with them. There is Nydus support, a virtiofsd-like shared file system solution, which I have never tested with this. Should we have SELinux support inside the guest? Maybe. Do we want to have AppArmor support? Maybe. If someone is interested in working on any of those, I would be really, really happy to mentor. Let me quickly go through the benefits to the community. With this, we can actually run everything using Kata Containers on clusters using SELinux by default, which basically means RHEL- and Fedora-based Kubernetes clusters. We have Kubernetes-specific distros that are using containerd and SELinux, like Flatcar, Rancher, and Typhoon. And maybe, I really hope that at some point, as part of OpenShift sandboxed containers, we can have a hypervisor that is slightly more tailored to the cloud use case than QEMU. Don't get me wrong, I love QEMU, but if we can have something more modern with the right use cases, I would prefer that. So, quick recap. Kata Containers can run under the container_kvm_t label. Support for this has been there since 2.0.0, but only for QEMU. CRI-O support has been working since version 1.19.0. The containerd patch was merged for 1.6.0 and backported to 1.5.9. Cloud Hypervisor support has been there since v22.0. The Kata Containers changes for Cloud Hypervisor will hopefully be merged next week and will be part of the next release. This is going to help us expand where we can easily run Kata Containers. And if someone is interested in doing some work on this, I'm happy to mentor. And that's it. Thanks a lot, and I went over my time by two minutes. Thanks.