Hey, everyone. Welcome to the talk. My name is Peter Hunt. I'm a senior software engineer at Red Hat working primarily on CRI-O, but sometimes on the kubelet, runc, Podman, and other container-related technologies. My name is Urvashi Mohnani, and I'm a senior software engineer on the OpenShift container team at Red Hat and a CRI-O maintainer. I'm Rinal Patel. I'm a CRI-O maintainer, and I work for Red Hat. Hey, everyone. Thank you for joining our talk. I'm Sascha, one of the maintainers of CRI-O, and I really hope you enjoy this session. All right, so today we're going to talk about how CRI-O remains secure, performant, and boring as ever. CRI-O is a lightweight OCI-based container engine implementing the Kubernetes Container Runtime Interface. It supports all OCI-based container images, including older Docker-formatted ones, and it also supports all OCI container runtimes, such as runc, Kata Containers, gVisor, and crun, to name a few. CRI-O is built on top of various building blocks focusing on different aspects, such as storage, images, and networking. With this structure, CRI-O is able to pick and choose features from the building blocks while those projects evolve at their own rate. This ensures that we maintain stability while enhancing CRI-O with new features at the same time. From day one, CRI-O was built with only Kubernetes in mind. Our versions move in lockstep with those of Kubernetes, and CRI-O evolves as Kubernetes does. CRI-O also focuses on making workloads running in production as secure as possible. A few quick examples: we have fewer capabilities enabled by default, and we also give the user the ability to run containers in read-only mode. So let's take a quick look at some of the updates from the last few months. CRI-O now supports the v1 implementation of the CRI, as the kubelet has finally transitioned to it.
While the kubelet was transitioning from v1alpha2 to v1, CRI-O added support for both CRI implementations to ensure compatibility. CRI-O also now creates fewer objects for common operations by only allocating a new object when a change has been detected. This helped reduce the CPU overhead of the Go garbage collector, improving overall performance. And finally, a CVE regarding sysctls was discovered that allowed an attacker to bypass the safeguards we have in place and set arbitrary kernel parameters on the host. CRI-O uses pinns to set kernel options for a pod, and there was a bug in it which allowed an attacker to sneak in additional sysctls that would give them root access on the host. We quickly fixed this up and released patched versions of CRI-O starting from 1.19, which is when the vulnerability was introduced. And now over to Sascha, who will give us an overview and demo of conmon-rs. Thank you. Let's talk about how we can rethink container life cycle management in CRI-O. For many releases, conmon has been the designated OCI runtime monitor, which is able to interact with runc and crun as lower-level container runtimes. You can check this project out under containers/conmon on GitHub if you would like to. In general, this little helper tool ensures the communication between CRI-O and the OCI runtime. For example, it takes care of container creation. It also takes care of the cleanup when the container terminates. And it provides endpoints for executing processes inside of containers, as well as endpoints for attaching to a container, for example. conmon is capable of writing the container log files for Kubernetes, which have to be written in a special format, and beside that it serves these use cases for Podman as well. conmon is generally designed to have the lowest possible memory footprint, but there are some drawbacks with this. It's written in C, and therefore finding new maintainers for conmon is really hard for us.
And the main interface to interact with conmon is the CLI. So over the past years, we extended the CLI in the same way as we added features to conmon, which makes the usage a bit more complicated than we initially thought. Adding new features to conmon therefore also increases technical debt. So not only do we have issues finding new maintainers for conmon, but we also keep increasing technical debt; we are kind of moving into a sinkhole here. And beside that, it still has some runtime dependencies like glibc, and this makes static linking hard. For CRI-O, for example, we provide a static binary bundle which contains all runtime dependencies out of the box, and conmon relies on glibc, so we have a bit of trouble linking it statically. And today, we're happy to announce the possible successor of conmon, which is conmon-rs. conmon-rs has a completely new software architecture. For example, we can utilize the programming language Rust and also the asynchronous Tokio runtime to rethink what we have done in the past. Besides that, we can also target the lowest possible memory usage. We can support having multiple containers, or even pods, in one instance of conmon. And we can reengineer the in-container process execution. Besides that, we can provide an extensible API and a Golang client out of the box, to be used from container runtimes like CRI-O or Podman. But how to provide an API between Rust and Golang? Well, we decided to choose Cap'n Proto for that. Cap'n Proto is an insanely fast data interchange format and a capability-based RPC system. So it's like Protocol Buffers, but a little bit faster, and it also has a smaller memory footprint. So we thought it would be a great use case to try Cap'n Proto out. Cap'n Proto is faster than Protocol Buffers and also has a smaller memory footprint than a gRPC runtime.
So for example, if we want to create a container, then we can do it in the same way as we would in gRPC. We define a CreateContainer call, which consists of a request as well as a response. The CreateContainerRequest itself requires all the necessary data we usually passed over to conmon via the CLI before. So we need the container ID and the bundle path, and we have to decide if we want a terminal or not. And for example, we can also specify different log drivers now. This is new: we can specify multiple log drivers. For that, we create a new structure, LogDriver, which has a type and a path. For now, we only support the container runtime interface log driver, which provides the necessary logs to Kubernetes. The response then contains the container PID. From a use case perspective, we run the conmon server, which then waits for incoming connections from the client, which would be CRI-O. CRI-O tells conmon how to create a container and how the container life cycle should work. Cap'n Proto allows us to provide clients for multiple languages, so we can create a Golang client without the user having to struggle with the Cap'n Proto types at all. It also trims down the CLI to only require the runtime, like runc, and the directory for state handling. So we only require two CLI arguments to start conmon. Besides that, it allows us to use the streaming capabilities of Cap'n Proto rather than using sockets for it. Right now, we still rely on sockets because we want to mimic the old behavior of conmon. conmon-rs is unfortunately not ready for production use yet, but right now we are working on the integration into CRI-O. This helps us because we already have a feature-complete implementation, though for sure it has to be battle-proven. So in the future, we are going to follow up to make it ready for Podman as well.
Check the project out on GitHub and tell us what you think. Now I would like to give you a little demo of conmon-rs. To actually use conmon-rs, we have to run a modified version of CRI-O which already implements the current work-in-progress state of conmon-rs. So now let's do that. We start CRI-O, and it starts up as usual, as expected. Then we can run, via crictl, a test container inside of a test sandbox. If we do that, everything looks as we would expect from CRI-O: we run a new container and get the container ID back. If we run crictl ps, we can see that the container is up and running. But now let's double-check that by looking at the local processes. If we look at the current processes running on my system and grep for conmonrs, we can see that we have a conmonrs instance up and running, using the runtime runc and the runtime directory there. And if we look into this runtime directory, we can see a conmon socket, which is the main socket that CRI-O connects to, using conmon-rs as a server. Now we can, for example, also exec into the container and execute a command, whose output then gets returned. So this works as intended. Right now, our overall plan is to integrate conmon-rs the same way we integrate conmon into CRI-O and make the process of transitioning from conmon to conmon-rs as easy as possible by making everything work out of the box. Thanks for listening to this demo. Thanks, Sascha. Now we'll go over some optimizations that CRI-O recently made to its handling of SELinux volume relabeling. Before we do so, we'll quickly review how SELinux works in Kubernetes. Pods or containers can be assigned an SELinuxOptions structure. In the structure, the user can specify every field needed to construct an SELinux label: the user, role, type, and level. There's a summary of what each of these fields is here.
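As a sketch of how those four fields compose into a full SELinux context string: unset fields are filled with defaults, and the result is a colon-joined label. The defaults below are the common container-selinux values, used here purely for illustration; this is not CRI-O's storage-library code.

```go
package main

import (
	"fmt"
	"strings"
)

// buildLabel shows how user, role, type, and level combine into a full
// SELinux context. Defaults mirror the common container-selinux values;
// illustrative only.
func buildLabel(user, role, typ, level string) string {
	if user == "" {
		user = "system_u"
	}
	if role == "" {
		role = "system_r"
	}
	if typ == "" {
		// Process label type; files typically get container_file_t instead.
		typ = "container_t"
	}
	return strings.Join([]string{user, role, typ, level}, ":")
}

func main() {
	// Only the level is usually chosen per pod; the rest are defaulted.
	fmt.Println(buildLabel("", "", "", "s0:c217,c629"))
	// prints system_u:system_r:container_t:s0:c217,c629
}
```

This is why the talk focuses on the level: it is the per-pod piece, while user, role, and type normally stay at their defaults.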
The field we're most concerned with is level, as the other fields don't typically change unless the user specifies otherwise to access a file on the node. The level field is a way of subdividing content among like types; in this case, it's how different containers are prevented from accessing files they themselves are not given explicit permission to. The SELinuxOptions structure is passed from the API server to the kubelet and forwarded to CRI-O. CRI-O then takes every unfilled field and populates it with its storage library. The full label will be generated: the mount label will be used for the container's rootfs and volumes, while the process label will be used for the container process itself. Note, the mount label will only be used if the volume plugin opts into it, and hostPath as a volume plugin doesn't opt into this, so you don't get automatic relabeling if you try to mount something from the host. The final and important piece of information is the special type spc_t, or super privileged container, which essentially disables SELinux for a pod or a container. If possible, we recommend not using this type, though it can be useful when pods need access to many files on the host and are trusted. The Security Profiles Operator, or SPO, is useful for distributing SELinux profiles that can help avoid setting your containers to spc_t if they're blocked by SELinux. At this point, the state of Kubernetes and SELinux seems great. We have a way for containers to access content they're allowed to, as well as a way of preventing containers from accessing content they don't own. This prevents container processes that break out of the chroot from affecting anything on the host they're not allowed to touch. However, there's also a problem. Note how, above, the user or the container storage library chooses a new level for each pod on startup. As each new container gets created, volumes that it has access to may need to be relabeled so the container has access to them.
Since this label is owned by CRI-O, as it sometimes generates it and is responsible for creating the container's rootfs and mounting the host volumes into it, the volume plugin instructs CRI-O to relabel the volume on each container creation. The number of files in a volume is not restricted in any way, and thus relabeling can take an arbitrary amount of time. When the kubelet asks CRI-O to create a container, it must cap the amount of time CRI-O has and cancel the request after that time. Otherwise, the request could go into the void and the container creation could never happen. When CRI-O can't relabel the volume in time and the kubelet times out, it causes the two processes to bicker about it: the kubelet asks CRI-O to create the container; CRI-O says, "I'm working on it," and fails. This creates a lot of log churn and scary messages in the Kubernetes events API. What's worse, this process happens on every container restart, as CRI-O needs to be sure the volume is fully and correctly labeled. Our goals to fix this problem are as follows. A container must still be able to have a process label that can access the volume, or else what's the point of mounting that volume? A volume must be labeled so that authorized containers can access it but unauthorized containers cannot. And we want to relabel as few times as possible, to ensure container creations and restarts don't time out if it can be avoided. To satisfy these conditions, we came up with two solutions. The first solution is to conditionally skip the relabel if the top-level directory is already correct, assuming the container was specifically allowed to do so. We use CRI-O's allowed annotations functionality to ensure an admin wanted this behavior to be enabled on the node and that the pod author actually wanted to opt into it. Some advantages of this solution: the container still gets labeled, properly confining it in case it isn't totally trusted.
It can be enabled at per-pod granularity, so only containers that need it get this functionality. It is more friendly to restarts than the default behavior: if a container was created once and kept the same label, then it will inherit the work that was done previously to relabel the volume. If multiple containers in a pod with the same label access the same volume, only one container incurs the cost of relabeling. The volume can even be relabeled ahead of time, so no containers incur the cost of relabeling. However, as with any solution, there are a couple of compromises. The relabel does have to happen once: if the volume isn't relabeled ahead of time, the first container in the pod is liable to time out. However, subsequent containers and restarts will succeed. If a file in the volume was relabeled outside of this process, then the container will be denied permission. This isn't an expected case, but we wanted to make the feature opt-in to make it more obvious where to look if the container spuriously gets denied permission to something. Luckily, we've come up with an alternative that mitigates these issues. Our second solution is to skip the relabel if the container is sufficiently privileged. Remember the special SELinux type super privileged container? We leverage the fact that the container is essentially unconfined to avoid relabels completely. This solution is faster than the default or the first, as no relabeling is ever needed. It is simpler, because it doesn't require configuring CRI-O or adding special annotations to the pod, and portable: any volume can be mounted into any privileged pod and never incur the cost of relabeling. However, as is likely obvious, the solution is not secure. The CRI-O team does not typically recommend giving your pod so much privilege unless you absolutely trust it. Otherwise, a container breakout could cause serious issues on the host.
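Putting the default path and the two skip paths together, the decision can be sketched roughly like this. This is simplified pseudologic, not CRI-O's implementation: the allowed-annotations gate and the top-level directory label comparison are reduced to plain booleans here.

```go
package main

import (
	"fmt"
	"strings"
)

// needsRelabel sketches the three-way decision described above.
// Simplification: the real checks go through CRI-O's allowed-annotations
// machinery and actually inspect the top-level directory's label.
func needsRelabel(processLabel string, optedIn, topLevelMatches bool) bool {
	// Solution 2: a super privileged (spc_t) container is essentially
	// unconfined, so no relabel is ever needed.
	if strings.Contains(processLabel, "spc_t") {
		return false
	}
	// Solution 1: admin and pod author both opted in, and the top-level
	// directory already carries the correct label.
	if optedIn && topLevelMatches {
		return false
	}
	// Default: relabel on every container creation.
	return true
}

func main() {
	fmt.Println(needsRelabel("system_u:system_r:container_t:s0:c1,c2", false, false)) // true
	fmt.Println(needsRelabel("system_u:system_r:container_t:s0:c1,c2", true, true))   // false
	fmt.Println(needsRelabel("system_u:system_r:spc_t:s0", false, false))             // false
}
```

The three branches correspond directly to the trade-off the talk walks through next: full relabel, conditional skip, and privileged skip.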
In other words, the CRI-O team has presented users with three options for their pods, depending on how secure and safe they want their pods to be versus how quickly they'd like the relabeling done. The first option, the default, is the least performant, as it relabels every time, but it is the most secure, as the volume is correctly labeled each time and the pod is fully confined. The second option, conditionally skipping, is more performant, as it skips the relabeling sometimes, but is slightly dangerous to the pod, as the contents may change label underneath CRI-O and the pod will be denied permission if it tries to access them. It is, however, similarly secure to the first option, because the pod is still confined to the files it is meant to access. The final option, skipping if privileged, is the most performant, because the relabel never happens. However, such a powerful pod must also be trusted. In the future, with KEP-1710, we hope to avoid this relabel entirely. The way the KEP works is that the kubelet, instead of passing down the relabel option to CRI-O, will pass a mount flag to the mount command upon mounting the volume, which will do the relabeling for it. So none of this overhead will be needed, because the volume will be relabeled upon mount. Thank you. Native containers on Linux share the same underlying kernel. Container security is multi-layered, like an onion: the more layers you have, the more secure you are. For example, you can think of SELinux, user namespaces, and seccomp as different layers. Seccomp allows you to restrict the syscalls that your container processes can make, effectively reducing the attack surface of the underlying kernel. Container runtimes allow you to install and pick seccomp profiles that specify exactly which syscalls are allowed. Container runtimes ship with pretty good default profiles that allow only a subset of syscalls that are considered safe.
I've been involved in various container CVEs, and quite a few of those CVEs could have been mitigated if seccomp was enabled. Unfortunately, seccomp isn't enabled by default in Kubernetes yet. By default, all pods are unconfined unless they're configured to pick the runtime default or a particular profile. We worked upstream on the SeccompDefault feature gate that allows one to change the default to the runtime default profile. This feature gate is still alpha and so isn't enabled by default in production deployments of Kubernetes. You can read the linked blog for more details on how to use that feature gate. So we decided to enable this by default in CRI-O until we graduate the feature gate to beta and GA, so CRI-O users are protected by seccomp by default. CRI-O added this flag a few releases ago and is switching it to true by default in 1.24. Okay, now let's take a look at what's next for CRI-O. In our previous KubeCon talks, we introduced the concept of runtime classes that can be set in the CRI-O config, allowing users to use different container runtimes, such as runc, gVisor, etc., for different workloads in the same cluster. We're now extending runtime classes to allow users to use multiple storage drivers. A major use case of this is giving VM-based containers the ability to use a different storage driver, such as devicemapper, instead of the standard overlayfs, making the performance of VM-based containers with CRI-O much better. The next one is cosign and sigstore support, which allows signing a container image and storing that signature in a registry so it can be used for image verification later on. A lot of work is being done in the containers/image backing library for this, and once that is ready, we will be adding support for it to Podman and CRI-O. Additionally, there is work being done in upstream Kubernetes to support forensic container checkpointing.
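For readers who haven't seen the config side of runtime classes, a per-class entry in crio.conf looks roughly like the following. This is an illustrative sketch: the table layout follows CRI-O's documented config format, but the paths are examples, so verify option names and binary locations against the crio.conf man page for your installed version.

```toml
# Illustrative crio.conf fragment: a "kata" runtime class entry.
# Paths are examples; check your distribution's actual binary locations.
[crio.runtime.runtimes.kata]
runtime_path = "/usr/bin/containerd-shim-kata-v2"
runtime_type = "vm"
```

A pod then selects this runtime by setting `runtimeClassName: kata` in its spec; the storage-driver extension discussed above would let such a class also pick a driver like devicemapper.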
This will allow users to take a snapshot of a running container that can then be transferred to another node, and the original container will not know that a snapshot of it has been taken. Only the kubelet API is being extended, and only the checkpointing feature is being added to the kubelet, which will be an alpha feature in version 1.25. The restore portion of this will be implemented in the container engines. So once the CRI API has been extended for checkpointing, CRI-O will add the restore support as well. And finally, CRI-O can now export OpenTelemetry trace data when needed. It is currently in an experimental phase as we wait for the kubelet to get support for it as well; this will be happening in version 1.25. So yeah, a lot of exciting stuff coming up for CRI-O. And these are all the updates that we have for you on how CRI-O continues to remain secure, performant, and boring as ever. We can take any questions now, or you can also find us in the CRI-O channel on the Kubernetes Slack if you have any questions later. Thank you for attending our talk.