Hey folks, welcome to our talk about having the performance cake and eating it securely, too. I'm Sasha, one of the maintainers of CRI-O, and I'm also an upstream contributor to Kubernetes, and I have the pleasure to be here today with Peter.

Hey everyone, my name is Peter Hunt. I'm also a CRI-O maintainer, as well as a contributor to Podman, conmon, and Kubernetes. I'd like to thank everyone for joining us today. Back to you, Sasha.

Thanks, Peter. Let's see what we are talking about today. First of all, we will speak about how SELinux works together with Kubernetes and CRI-O. We will discuss the seLinuxOptions field in the security context, the SELinux options that are available in general, and how labeling affects containers in production. After that, we will speak about the problem statement: we will see what the timeouts for the recursive relabeling look like, for example, and what the performance impact of something like this is. After that, we will speak about possible solutions to the problem, for example whether we can skip the relabel if the container is super privileged. And then we will discuss future work: everything that comes to our mind, which we will probably add later on for optimization.

So how does SELinux work in Kubernetes? Assigning SELinux labels to whole pods or single containers can be done by using the seLinuxOptions field of the security context. Processes as well as files are then labeled with an SELinux context that contains additional information: for example, the SELinux user, its role, its type, and optionally a level. Those are reflected in the seLinuxOptions field. We have a user, which is an identity that is authorized for a specific set of roles. We have roles, and those work in a similar way to the Kubernetes RBAC system: they basically control which objects can be accessed. And we also have types, and the types are used for type enforcement.
So they define a domain for processes and a type for files. The level is used for multi-level or multi-category security. It's written as sensitivity:category set, so for example it can be s0:c1,c2,c3, and it can also be specified by using ranges. The file /etc/selinux/targeted/setrans.conf maps those levels to human-readable forms. For example, s0:c0 can be CompanyConfidential, and s0:c3 can be TopSecret. One requirement for the Kubernetes distribution is that the underlying operating system has to support SELinux, for sure, and it also has to come with some predefined policies. For example, the Red Hat CoreOS operating system used by OpenShift already ships such a set of rules.

We can use ls -Z to find the SELinux context of a file. In our case, for /etc/os-release, the user is called system_u, the role is object_r, the type is etc_t, and we have the level s0 for this file. To find out more about the SELinux users available on the local machine, you can use semanage user -l. This prints an overview of the predefined users, their assigned roles, and the available levels.

If we now specify SELinux options in a Kubernetes pod manifest, how do they get translated to the underlying container runtime? First of all, the provided SELinux label will be passed one-to-one to the kubelet and then to the underlying container runtime, which is CRI-O in our case if we use OpenShift. CRI-O will then use this data in the pod sandbox and container creation process to pass it down to the underlying container/storage library. container/storage will generate a new process label and a file (aka mount) label, which will then be used by the sandbox. The file label is used for the root file system of the sandbox, and the process label is used by the processes within the sandbox and the containers.
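To make this concrete, a minimal pod manifest using the seLinuxOptions field could look like the following sketch. The pod name, container name, and image are placeholders, and the level reuses the s0:c3 "top secret" example from above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: selinux-demo              # hypothetical name
spec:
  securityContext:
    seLinuxOptions:               # applies to all containers in the pod
      user: system_u
      role: system_r
      type: container_t
      level: "s0:c3"              # the 'top secret' category from the example
  containers:
  - name: test
    image: registry.fedoraproject.org/fedora:latest
    command: ["sleep", "infinity"]
```

Any field left out of seLinuxOptions falls back to what the container runtime generates on its own.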
The mount label will also be used for the volume mounts if they support SELinux relabeling, and in most cases the volume driver will support that. For example, if we now run a container with a security context and just change the SELinux level to s0:c3, top secret in our case, then we can run a test pod with those SELinux options and see that every file which is part of the root file system inside of this container is now labeled with s0:c3.

There are some corner cases. For example, if we share the host PID and IPC namespaces with the pod, then we will not apply any SELinux label at all. Privileged containers will not be relabeled either. The whole relabeling process works by changing the SELinux context on disk, and this can cause performance issues.

There is also a special type available for super privileged containers, called spc_t. This basically disables SELinux for the container or the whole pod. spc_t is almost similar to the unconfined_t type, but it has some differences: container runtimes are allowed to transition to spc_t, and confined processes can communicate with sockets using spc_t. So if you see this type in the wild, then you can be sure that SELinux is mostly disabled for this workload. In this case, I would like to remind you: please don't run unconfined containers if it's not absolutely necessary. One more thing to note: when it comes to distributing SELinux policies inside of Kubernetes clusters, you can use the security-profiles-operator to distribute them across all nodes. It also supports a custom CRD, which makes writing SELinux policies from scratch even easier. And with that, I would like to pass over to Peter.

Thanks, Sasha. At this point, the state of Kubernetes and SELinux may seem great. We have a way for containers to access content they're allowed to, as well as a way of disallowing containers from accessing content they do not own.
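For reference, the spc_t type can be requested through the same seLinuxOptions field. A hypothetical sketch (names and image are placeholders), which should only ever be used for fully trusted workloads:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spc-demo                  # hypothetical name
spec:
  containers:
  - name: trusted
    image: registry.fedoraproject.org/fedora:latest
    command: ["sleep", "infinity"]
    securityContext:
      seLinuxOptions:
        type: spc_t               # effectively disables SELinux confinement
```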
This prevents the possibility of a container process breaking out of its chroot and infecting anything on the host it's not allowed to touch. However, there's also a problem. Note how, before, the user or the container/storage library chose a new level for each pod on startup. As each container gets created, any volumes it has access to must be relabeled so that the container can access them. Since this label is owned by CRI-O, as it sometimes generates it and is responsible for creating the container's root file system and mounting the host volumes onto it, CRI-O must spend the time on each container creation to relabel the volume. The number of files in the volume is not restricted in any way, and thus relabeling can take an arbitrary amount of time. When the kubelet asks CRI-O to create a container, it must cap the amount of time CRI-O has and cancel the request after that time; otherwise, the request could go into the void and the container creation could never happen. When CRI-O can't relabel the volume in time and the kubelet times out, the two processes bicker about it: the kubelet asks CRI-O to create the container, and CRI-O says "I'm working on it" and fails. This creates a lot of log churn and scary messages in the Kubernetes event API. We must be able to do better.

In other words, we need a solution that satisfies the following conditions. A container must still be able to have a process label that can access the volume, or else what is the point of mounting the volume in the first place? A volume must be labeled so that the authorized container can access it, but non-authorized containers cannot. And we want to relabel as few times as possible, to ensure that container creations and restarts don't time out if that can be avoided.

To satisfy these conditions, we came up with two solutions. The first solution is to conditionally skip the relabel if the top-level directory is already labeled correctly, assuming the container was specifically allowed to do so.
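Wired together, the opt-in for this first solution has two halves: the admin allows the annotation in the CRI-O configuration, and the pod author sets it on the pod. A rough sketch follows; the drop-in file name is made up, and the annotation name is written from memory and may differ between CRI-O versions, so check your release's documentation:

```toml
# /etc/crio/crio.conf.d/01-skip-relabel.conf (hypothetical drop-in file name)
[crio.runtime.runtimes.runc]
runtime_path = "/usr/bin/runc"
allowed_annotations = [
    "io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel",
]
```

The pod then opts in with the matching annotation in its metadata:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: skip-relabel-demo         # hypothetical name
  annotations:
    io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: "true"
```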
We use CRI-O's allowed annotations functionality to ensure an admin wanted this behavior to be enabled on the node and that the pod author actually wanted to opt into it. Some advantages of this solution: the container still gets a label properly confining it, in case it isn't totally trusted. It can be enabled on a per-pod granularity, so only containers that need it get this functionality. It is more friendly to restarts than the default behavior: if the container was created once and kept the same label, then it will inherit the work that was previously done to relabel the volume. The same is the case for multiple containers: if multiple containers in a pod access the same volume, only one container incurs the cost of relabeling. And the volume can be relabeled ahead of time, so that no container incurs the cost of relabeling at all.

However, as with any solution, there are some compromises. The relabel does have to happen at least once, and if the volume isn't relabeled ahead of time, the first container in the pod is liable to time out; subsequent containers or restarts of the same container will succeed, though. If a file in the volume was relabeled outside of this process, then the container will be denied permission. We don't really expect this to happen, but we made the feature opt-in to make it more obvious where to look when a container suddenly gets denied permission to something it previously had access to. Luckily, we came up with an alternative to mitigate these issues.

Our second solution is to skip the relabel if the container is sufficiently privileged. Remember the special SELinux type for super privileged containers? We leverage the fact that the container is essentially unconfined to avoid relabels completely. This solution is faster than the default or the first solution, as no relabeling is ever needed; simpler, as it doesn't require configuring CRI-O or adding any special annotations to the pod; and portable.
Any volume can be mounted into any privileged pod and never incur the cost of relabeling. However, as is likely obvious, this solution is not very secure. The CRI-O team does not typically recommend giving your pod so much privilege unless you absolutely trust it; otherwise, a container breakout could cause serious issues on the host, as that pod is completely unconfined by SELinux.

In other words, the CRI-O team has presented users with three options for their pods, depending on how secure and safe they want their pods to be versus how quickly they'd like this relabeling done. The first option, which is the default, is the least performant, as it relabels every time, but it is the most secure and reliable, as the volume is correctly labeled each time. The second option, conditionally skipping, is more performant, as it skips the relabeling sometimes, but it is slightly dangerous to the pod, as the contents may change label and the pod will be denied permission; though it is similarly secure to the first option, as both of them give the pod a process label confined to only touch the things it's allowed to. The final option, always skipping if privileged, is the most performant, because the relabel never happens; however, such a powerful pod must be trusted.

With such a variety of options, you may be asking: what future work could there possibly be? If you did ask yourself that, or are now, I'd like you to imagine a world very similar to the one we have now, but where CRI-O doesn't need to relabel at all. In cases where the kubelet knows the label, it can use a mount flag to label the volume when the volume is being mounted, and then it can tell CRI-O not to relabel. Thus, the volume gets correctly labeled and CRI-O doesn't need to do it. This means that the time CRI-O takes to create a container won't be proportional to the size of the volume, and it will reduce the risk of container creations timing out.
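The mount-time labeling described here builds on the kernel's context= mount option for SELinux, which stamps every file on the filesystem with one label at mount time instead of walking and relabeling it afterwards. A rough fstab-style sketch, where the device, mount point, and category set are all placeholders:

```
# /etc/fstab sketch: label the whole volume at mount time via the SELinux
# context= option; quotes keep the commas in the category set from being
# parsed as option separators
/dev/sdb1  /mnt/volume  ext4  defaults,context="system_u:object_r:container_file_t:s0:c1,c2"  0 0
```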
There is a push to do just this in Kubernetes right now, in Kubernetes Enhancement Proposal 1710; you can check that out for more information. In this future world, we will likely still have the other two CRI-O workarounds, as they cover cases the kubelet may not be able to, such as when CRI-O chooses the label after the volume has been created. And with that, I'd like to thank everyone for joining. Here are some resources to learn more about CRI-O, and now I'd like to transport from the recorded realm into the live realm for questions.