Hello, everyone. My name is Urvashi Mohnani and I'm a software engineer at Red Hat, as well as a CRI-O maintainer. Hey, folks. Thanks for joining. My name is Peter Hunt. I'm a software engineer at Red Hat, primarily working on CRI-O. Hey, friends. I'm Sascha, one of the maintainers of CRI-O, and I'm happy to be here today. Hello, everyone. I'm Mrunal Patel. I'm also a software engineer at Red Hat and a CRI-O maintainer. Welcome to our talk, CRI-O Still Loves Kubernetes.

CRI-O was designed as a runtime just for Kubernetes, and in today's talk we will go over some of the things we have been working on in CRI-O to make your Kubernetes clusters run even better. I will first give a brief overview of the latest developments in CRI-O, and then my co-presenters will dive deeper into some of the notable features we have added.

We have been adding more metrics, such as improved CRI-level stats that can be exported and queried in Prometheus. Sascha is also working on adding OOM-related metrics that will give you better visibility into OOM kills of your pods. These metrics provide a good foundation for adding custom alerts to your cluster.

You may have seen our prior talk about dropping the infra container. I'm happy to report that the feature is now getting close to stable, as we have been fixing bugs and issues there.

Podman added a feature called short-name aliases that we have now integrated into CRI-O as well. This feature allows you to create aliases for your preferred images and prevent registry image squatters from running malicious images on your cluster. You can configure aliases for your critical images to better protect your cluster. You can read more about this feature in the article linked in the slides.

Another small feature I would like to highlight is pprof support over a UNIX socket. By default, pprof works over an HTTP endpoint, which needs to be protected. Enabling it over a UNIX socket keeps endpoint access locked down by default and makes valuable profiling information easily available to admins and developers.

We have been using the pattern of enabling experimental features in CRI-O via annotations, to test and refine them before proposing them for inclusion as first-class features in Kubernetes. Urvashi will cover user namespaces and Sascha will cover seccomp profile generation later in the talk. Besides those, we have support for specifying shm size and enabling devices in containers using annotations. You can check out our docs for more details.

If you aren't aware, the dockershim is slated to be removed from the kubelet in Kubernetes 1.24. You can read the linked KEP for more information. The guidance from SIG Node is to move to a CRI runtime, and we want CRI-O to be your preferred runtime.

Now, how can you actually switch to CRI-O? It's simple, actually. There are two steps. The first step is to install and start the CRI-O service on your node; we package CRI-O for several distributions. The second step is to configure the kubelet to point to the CRI-O socket as the container runtime endpoint, and you are all set.

Folks are used to the Docker CLI for debugging and building images today, and we have alternatives for that. crictl is a debugging tool that can be used to interact with a CRI runtime directly. You can list pods, list containers, inspect them, and exec into them using crictl. You can check out the linked documentation for more details.
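To make those two steps and the crictl workflow concrete, here is a rough sketch of switching a node over. The package name, socket path, and kubelet flag spellings vary by distribution and Kubernetes version, so treat this as illustrative rather than exact:

```sh
# Step 1: install and start the CRI-O service (packaging varies by distro)
sudo systemctl enable --now crio

# Step 2: point the kubelet at the CRI-O socket as its runtime endpoint
# (flag names as used by pre-1.24 kubelets; check your version's docs)
kubelet --container-runtime=remote \
        --container-runtime-endpoint=unix:///var/run/crio/crio.sock

# Day-to-day debugging with crictl against the same socket
crictl --runtime-endpoint unix:///var/run/crio/crio.sock pods
crictl ps -a           # list containers
crictl inspect <id>    # inspect a container
crictl exec -it <id> sh
```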
We recommend using Podman for your local container development workflows, including building your container images, tagging them, and pushing them to a registry.

Next up, I'm going to go through a case study we did and describe some improvements we've made in how CRI-O handles heavy load. This will hopefully highlight a way in which CRI-O is uniquely capable of tuning its behavior to the needs of the kubelet to improve these kinds of heavy-load situations. In the following slides, I will use the word "resource" to mean a pod or a container: something the kubelet asks CRI-O to create, whose creation time partially depends on the load of the node.

So let's talk through the fundamental problem. Much like the dockershim and other CRI implementations, the kubelet and CRI-O have a client-server relationship. Because of that, the kubelet must set timeouts on resource creation requests; this is to guarantee the "eventual" part of eventual consistency. But what happens when one of these timeouts is actually hit while a pod or container is being created?

I'm going to walk through this scenario. This is how CRI-O behaved before CRI-O 1.19. The kubelet requests a resource from CRI-O. Early on in that process, CRI-O reserves the name. This is a primary key used to make sure multiple clients don't try to create the same thing; it includes the attempt number, the namespace, and the pod name or the container name, depending on what type of resource it is. Due to system load, at some point the request takes longer than the kubelet is expecting, and the kubelet times out. The timeout is four minutes for pod and container creation, and it can be hit because of time spent provisioning networking resources in the CNI plugin, I/O throttling from the cloud provider, or CRI-O recursively relabeling volumes for SELinux. Whatever the reason, the original request fails from the perspective of the kubelet, because the timeout happened and the kubelet doesn't know that the request didn't just go into the void. Because of that, it sends a new create-resource request. But CRI-O is already in the middle of creating the original resource, so CRI-O returns a "name is reserved" error, because that unique name is already actively being created in another goroutine. Eventually, CRI-O finishes creating the resource, detects that there was a timeout, and cleans up all of its work. This prevents the resource from leaking until the kubelet garbage collector has to come in and clean it up.

But this is not very elegant, for a couple of reasons. Number one, during this time the kubelet and CRI-O are doing this dance where the kubelet requests the resource and CRI-O returns saying the name is reserved. That happens every 10 seconds, and every time it happens, an error propagates through the Kubernetes event system saying the name is reserved and the pod sandbox creation or container creation failed. For that reason, it's an ugly user experience. On top of that, CRI-O completely undoes all of its work after the timeout happens. CRI-O cleans up the resource so that nothing leaks, but then we end up redoing all of that work again. This is not only wasteful, it further exacerbates a situation that is already heavy load: the node is already under load, which is how we got into this situation, and this only makes matters worse. So we need to do something better.
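To make the "name is reserved" mechanics concrete, here is a minimal Go sketch of the kind of name registry described above. This is illustrative only, not CRI-O's actual code; the type and function names are invented:

```go
package main

import (
	"fmt"
	"sync"
)

// nameRegistry sketches the reservation idea: the unique name acts as a
// primary key so two requests can't create the same pod or container at once.
type nameRegistry struct {
	mu       sync.Mutex
	reserved map[string]struct{}
}

// uniqueName mirrors the key described in the talk: attempt number,
// namespace, and pod (or container) name.
func uniqueName(attempt int, namespace, name string) string {
	return fmt.Sprintf("%d_%s_%s", attempt, namespace, name)
}

// reserve fails if another goroutine is already creating a resource with
// this name -- the "name is reserved" error the kubelet keeps seeing.
func (r *nameRegistry) reserve(name string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.reserved[name]; ok {
		return fmt.Errorf("name %q is reserved", name)
	}
	r.reserved[name] = struct{}{}
	return nil
}

func (r *nameRegistry) release(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.reserved, name)
}

func main() {
	r := &nameRegistry{reserved: map[string]struct{}{}}
	n := uniqueName(0, "default", "mypod")
	fmt.Println(r.reserve(n)) // <nil>: the first request wins the name
	fmt.Println(r.reserve(n)) // error: a retry arrives while creation is in flight
	r.release(n)
}
```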
Luckily, CRI-O can rely on the fact that it's not a generic container manager: its behavior is specifically tailored to the needs of the kubelet. So one thing we can do to mitigate one of these problems is finish creating the resource and then save it until the kubelet asks again. The process is largely the same: the name is reserved, the timeout happens, a new create-resource request comes in, and a "name is reserved" error goes back, so those errors are still propagating through the system. But when CRI-O detects the timeout, instead of undoing all of the work it had previously done, it stashes the resource; it saves it. It's able to do this because the kubelet is a special client with very specific behavior. We, the CRI-O developers, know that the kubelet is going to re-request that same resource until it gets created; it's just going to keep going and going. Not only that, but it's going to keep the same unique name, the name that we had reserved, unless the API server changes the request, and in that case it's a new attempt number, which is a new name. So we know the request is going to stay the same for that unique name. For that reason, we can save the result knowing that if a new request comes in with the same name, it's the same request, and we know that request is coming because we know how the kubelet behaves.

So this is great; this solves one of our problems. We're no longer redoing work that we had previously done. When the new request comes in, we realize, hey, I've already saved this resource, and we return it to the kubelet. Problem solved. But there's still the problem of the API events. During this process, if CRI-O is taking five minutes to create this resource, the kubelet is going to re-ask for it about every 10 seconds. Five minutes minus the four-minute timeout leaves one minute, so that's roughly six times this event goes through the system. That's not a very pretty user experience. The user is going to see that their pod is failing because the name is reserved, and it's going to be failing a lot: name is reserved, name is reserved. We can do better.

So another thing CRI-O is able to do, and has done, is throttle new requests. I've changed the diagram a little bit here, because it wasn't entirely clear that there are different routines: "CRI-O R1" stands for routine one, the goroutine that is processing the original request. The same thing happens: the request comes in, CRI-O reserves the name, and the kubelet times out. That's all normal. The kubelet sends a new request, which lands in a new CRI-O routine, as was happening before. But instead of immediately responding with a "name is reserved" error, CRI-O is going to wait. It stalls until either the request times out from the kubelet's perspective, or until a fixed time, say six minutes, just to make sure the kubelet didn't disappear into the void and we're not waiting forever. Basically, we're stalling the kubelet, which would otherwise be sending these requests every 10 seconds but is now forced to wait up to its own timeout length. Eventually the original routine detects that the timeout happened, and it saves the resource.
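Here is a rough Go sketch of the stash-and-stall pattern just described. Again, this is not CRI-O's implementation; the names, structure, and timeout handling are invented for illustration:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

var errNameReserved = errors.New("name is reserved")

// resourceStore sketches stashing a finished resource after the kubelet's
// request has timed out, so a later retry can be answered immediately.
type resourceStore struct {
	mu      sync.Mutex
	done    map[string]chan struct{} // closed when creation finishes
	stashed map[string]string        // finished resources, keyed by unique name
}

// create runs the (slow) creation and stashes the result instead of
// tearing it down when the original caller has timed out.
func (s *resourceStore) create(name string, build func() string) {
	s.mu.Lock()
	ch := make(chan struct{})
	s.done[name] = ch
	s.mu.Unlock()

	res := build() // slow work: networking, relabeling, ...

	s.mu.Lock()
	s.stashed[name] = res // save the work instead of undoing it
	close(ch)             // wake any stalled retry
	s.mu.Unlock()
}

// get stalls a retry until the original creation finishes or ctx expires,
// instead of failing with "name is reserved" every 10 seconds.
func (s *resourceStore) get(ctx context.Context, name string) (string, error) {
	s.mu.Lock()
	if res, ok := s.stashed[name]; ok {
		delete(s.stashed, name)
		s.mu.Unlock()
		return res, nil // already created: hand it straight back
	}
	ch, creating := s.done[name]
	s.mu.Unlock()
	if !creating {
		return "", fmt.Errorf("unknown resource %q", name)
	}
	select {
	case <-ch:
		// Return an error even though creation just finished, to avoid
		// racing a client that timed out at the same moment; the next
		// retry hits the stash above and succeeds immediately.
		return "", errNameReserved
	case <-ctx.Done():
		return "", errNameReserved
	}
}

func main() {
	s := &resourceStore{done: map[string]chan struct{}{}, stashed: map[string]string{}}
	name := "0_default_mypod"
	go s.create(name, func() string {
		time.Sleep(100 * time.Millisecond) // stand-in for slow creation
		return "sandbox-id"
	})
	time.Sleep(10 * time.Millisecond) // let create() reserve the name
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(s.get(ctx, name)) // stalled retry: "name is reserved"
	fmt.Println(s.get(ctx, name)) // next retry: served from the stash
}
```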
This is consistent with what we did before. The second routine also needs to return an error, and this is to mitigate against the situation where the kubelet decides the request has timed out at the same moment CRI-O decides the resource is available: CRI-O would attempt to send the response, and it would get lost. So, to protect against races, the second routine returns an error. Luckily, CRI-O knows the kubelet is going to re-request that resource pretty much immediately, 10 seconds later. On that next re-request, CRI-O can look into its resource store, realize it already has it, and return it immediately. It's a much better user experience, and it also mitigates the load: instead of returning one of these errors every 10 seconds, it's once every four minutes. And we're not redoing all of the work from the original routine; we're saving that resource, just as we should, and the moment we're able to, we send it back. This is all possible because CRI-O is able to tune its behavior to the kubelet.

So next up, I'm going to go through a quick demo. Here we have a local Kubernetes cluster set up. We're going to see two different versions: one where we throw away the progress and the "name is reserved" error pops up, and one, the new version, where we save the result instead of throwing it away and also stall the kubelet. Here the node is just bootstrapping. I have to admit I was not able to accurately create this situation: it's hard to put a node under load to the point where it pops up without totally overwhelming the node, so it's hard to do artificially. So I have spoofed it. I hope you'll trust me that this situation is largely mitigated; in production it came up quite a bit, and we're seeing a lot of improvements, but yes, it's difficult to reproduce artificially.

The way I artificially did it was with a CNI plugin I created, called bridge-timeout. All it does is, if the CNI command is ADD, which is called on a RunPodSandbox request, it sleeps for four minutes, and after that it calls the bridge plugin. It's very simple; it basically does what the bridge plugin does, but a lot worse. The motivation here is that we have seen production situations where CNI plugins are overwhelmed by the rate of sandbox creation and stall the node from actually creating the resources, and we ran into a lot of "name is reserved" errors with CNI plugins as the bottleneck. So this is a realistic situation, just spoofed.

So we have the node running with the old CRI-O, and we're going to create this resource. The resource is just going to be a pod that calls top; it's my favorite one to demo. Then we're going to time travel a bit, because I don't want to wait four minutes. Soon you're seeing 2:58, 3:40, so we're about to finish creating the pod, and we will see that the request is going to fail. That's because the kubelet is going to hit that four-minute timeout, and we're going to get a context deadline exceeded error. So this is our first failure, and it's expected, because the CNI plugin took that long. Then the kubelet is going to attempt to recreate the pod.
So here we see this "name is reserved" error, and it propagates all through the system. This is basically the kubelet re-requesting, and CRI-O saying, "I'm in the middle of it, calm down, relax," while doing nothing to actually slow the kubelet down. This error is going to pop up a couple of times; as we can see here, they're separated by about 15 seconds. So this is a bad user experience. Eventually, this pod will attempt to be recreated, and that could pass, or, as we know, it will fail because four minutes will be exceeded again. In this situation, the kubelet and CRI-O would never reconcile. But we can do better.

So next up we have an example with our new CRI-O binary. We're going to do the same thing: create this top pod, but with the new binary, running in the same way, with the CNI plugin taking four minutes, much too long for the kubelet's request. Let's fast forward to when the timeout is about to happen: 3:32, 3:48, 3:55. Here we have the expected context deadline exceeded, which we can't do anything about; it's the CNI plugin that's taking a long time, and CRI-O can't make it go any faster, the same way it can't fix latency on the cloud provider's side. But what CRI-O can do is successfully save that resource and allow the kubelet to move forward.

What we're about to see is a bit confusing, but we're going to show the pod's describe output. Here we have the failed sandbox request 19 seconds ago, and then we see the pulling message. What this means is that in between, the pod sandbox creation succeeded, because once the pod sandbox request succeeds, the kubelet starts pulling the images for the containers themselves. So we see that we're pulling the alpine image, the container is created and started, and we have a running pod. It only took about 14 more seconds, the time for the kubelet to re-request from CRI-O. We can see this in the CRI-O logs: we had the deadline exceeded issue and some "name is reserved" errors, but eventually, once the resource was actually created, CRI-O realizes it's in its resource cache and returns it. This means we successfully saved the work we did before returning the error, and we only returned one error through the whole system. A user will have seen the context deadline exceeded, but soon thereafter the request will have passed, and they'll be much happier.

So that's it for the demo. All in all, we have a situation where CRI-O is able to tune its behavior to the needs of the kubelet, and this is uniquely helpful here, where we can mitigate the problems that come with heavy load and a lot of the pitfalls that resource creation falls into.

Alright, so now we're going to talk about user namespaces. A lot of work has been done on enhancing user namespace support in Kubernetes and CRI-O over the past few months. User namespace support is currently in an experimental phase as we continue to polish it up, but it is something you can try out and use today. There are also a lot of discussions going on around the upstream Kubernetes enhancement proposal for user namespaces, to drive this toward being a fully supported feature in Kubernetes. You can enable user namespaces for a certain runtime by adding the io.kubernetes.cri-o.userns-mode annotation to the list of allowed annotations for that runtime.
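Reconstructed from the slides described next, the two pieces look roughly like this. The runtime table name and especially the annotation's option syntax are illustrative, so check the CRI-O docs for your version; the demo also adds matching ranges for a `containers` user in /etc/subuid and /etc/subgid (for example `containers:200000:65536`):

```toml
# crio.conf: allow the userns-mode annotation for the runc runtime
[crio.runtime.runtimes.runc]
runtime_path = "/usr/bin/runc"
allowed_annotations = [
    "io.kubernetes.cri-o.userns-mode",
]
```

```yaml
# Pod requesting a user namespace via the annotation
# (option syntax shown for illustration)
apiVersion: v1
kind: Pod
metadata:
  name: userns-pod
  annotations:
    io.kubernetes.cri-o.userns-mode: "auto:size=65536;keep-id=true"
spec:
  containers:
  - name: sleeper
    image: alpine
    command: ["sleep", "infinity"]
```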
The picture on the left here is an example of such an entry in crio.conf. You can also use runtime classes to create a custom runtime with this annotation enabled, so that if you want to use user namespaces on certain pods, such as your build pods, you can specify that runtime class in the pod YAML while your other workloads run without user namespaces. The picture on the right is an example of a pod YAML that has the user namespace annotation.

There are certain options you can pass with this annotation. One of them is size, which sets the size of the range of UIDs for the user namespace you want. The map-to-root option maps your UID to root inside the container. If you want to run your container as non-root, with some other user such as 1000, you can use the keep-id option, and it will keep that UID. You can also specify custom UID-map and GID-map ranges as options here, which will be specific to your container. One more thing: newer 5.x kernels have added many more features that will make user namespaces better, and we're working on incorporating those into CRI-O as well.

I've also put together a quick demo showing how all of this works. I have a local Kubernetes cluster up; as you can see, my node is up and ready, and I have no pods running right now. This is the pod YAML I will be using to create my pod. As you can see, I set the size to 65536, and I want to keep the ID, as I'm going to be running as user 1000. To make user namespaces work, I have also added a range for the containers user in my /etc/subuid and /etc/subgid, so it will map from 200,000 up to whatever this large number is. What this means is that UID 200,000 on the host will map to UID 0 in the container, and so forth: 200,001 will be UID 1, 200,002 will be UID 2 in the container. The next thing I'm going to show is the annotation I have added for my runc runtime in my crio.conf, as runc is the runtime I'm currently using. As you can see, I have set the userns-mode annotation for runc.

Alright, I have all my setup done, and now I'm going to create the pod I showed earlier; we'll just wait for it to come up. It should take a few seconds. Okay, it's creating now, and awesome, it's running. Now let's exec into the pod and check what UID the pod is running with. It should be 1000, as I specified in my pod YAML, and it is 1000. Now let's check what UID it's running as on the host, so I'm just going to look for that process. As we can see, it's running as UID 201,000, because the mapping was 200,000 to 0, and so forth. So my pod is running in a user namespace, with more isolation. And that's it for the user namespace part.

Alright, I would like to talk about how we can record seccomp profiles using CRI-O and Kubernetes. The first thing to note is that seccomp profiles for Kubernetes have to be available as JSON files on disk. So we need something like this: we have to specify a default action, which in our case throws an error; we have to specify architectures, which in our case is our local 64-bit architecture; and we need a list of syscalls which are allowed. In our little example here, we just allow the clone syscall.
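That minimal profile, as just described, looks roughly like this; this is reconstructed from the slide, using standard seccomp JSON action and architecture constants:

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["clone"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```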
This profile has to be on every node's disk, because if the scheduler schedules the workload onto a node, CRI-O assumes the profile is available there and passes it down to the low-level OCI runtime. If it's not available, creation of the workload will simply fail. This has a major drawback, because different OCI runtimes have different minimal syscall footprints for actually being able to run a container. runc, for example, needs this list of syscalls, and if we compare it to the list for crun, a comparable runtime written in C, we can see that they actually differ. These lists are also not guaranteed across versions: they could change with a new runc release, and likewise with a new crun release. This is the main drawback: if we want to create seccomp profiles, we need detailed knowledge of the underlying OCI runtime used by CRI-O, how the cluster is configured, and which versions we actually use. We also necessarily need the application's syscalls that may be executed in all code paths; if we miss a code path, the workload may get terminated at runtime, which is not what we want.

So what is the solution? I think a good one is to record seccomp profiles in a test environment and then apply them to the production workload, and how this works is part of my demo today. This demo should show you how we can record those seccomp profiles using CRI-O. To run it, we have to use the latest version of CRI-O, 1.21, which may not be released yet. We can verify which CRI-O version we're running by just running crictl version, and here we can see that we're using a 1.21 development version of the runtime, and that we're actually talking to CRI-O, which is important as well.

It is also necessary to use a project like the OCI seccomp BPF hook, which has to be installed and configured on the system. This is a prestart OCI hook, which runs right before the container workload starts. It creates a child process that compiles a BPF module and attaches a tracepoint to our running workload, which lets the hook record all syscalls into an intermediate structure. When the application terminates, or the pod gets deleted, the hook writes the resulting seccomp profile to disk. This has a few drawbacks: for example, the hook needs CAP_SYS_ADMIN, so it's highly privileged, and it's only recommended to run it in a test environment. It also compiles C code on the fly, which means we need the Linux kernel headers available on the local system. But I can recommend giving this hook a look, because you can also run it with Podman and play around with it.

Now we have to look at our CRI-O configuration. What we can see is that the hooks directory is configured correctly to point at the OCI seccomp BPF hook, so that CRI-O is able to pick it up. In our case we're using crun as the default runtime, and we allow the annotation io.containers.trace-syscall, which is the annotation that causes the hook to be executed by CRI-O.
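Pieced together from the demo, the configuration and the first recording step look roughly like this. Paths, the runtime table name, the image, and the kubectl flags are illustrative, so adjust them for your system:

```toml
# crio.conf: let CRI-O find the hook and allow the recording annotation
[crio.runtime]
hooks_dir = ["/usr/share/containers/oci/hooks.d"]

[crio.runtime.runtimes.crun]
runtime_path = "/usr/bin/crun"
allowed_annotations = [
    "io.containers.trace-syscall",
]
```

```sh
# Record a profile: the "of:" prefix names the output file the hook writes
kubectl run mypod --restart=Never \
  --annotations="io.containers.trace-syscall=of:/tmp/mypod.json" \
  --image=alpine -- mkdir test
```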
This acts as a security mechanism; it will be enabled by default with 1.21, but in other configurations we have to ensure this annotation is actually allowed. The annotation key io.containers.trace-syscall provides all the information CRI-O needs to record a seccomp profile. We can now create a first workload via kubectl run, as shown in the sketch above: we create a simple pod which never restarts, run mkdir test in it, and also pass the annotation io.containers.trace-syscall, specifying our output file as /tmp/mypod.json. Once the container has run and terminated, it gets deleted by kubectl immediately afterwards. Now that the pod has been deleted, and CRI-O has used the of: part of the annotation as the output file, we can have a look at the actual seccomp profile.

Our profile looks like this, which is kind of interesting, because it looks nearly the same as the minimal set of required syscalls for crun that we saw earlier, but it also includes our mkdir syscall. This would already be a usable seccomp profile.

But usually we run multiple containers within a pod, and CRI-O is also able to separate those containers and record into different files. For that, we made a modification to CRI-O so that it has the intelligence to split up the annotation based on the actual container workload. For example, if we run a pod like this, we can specify multiple annotations using the trace-syscall annotation suffixed with the container name. Here we use the nginx and redis containers, which are down here, and reference them by name to specify different output files; otherwise they would overwrite each other if we specified the same output file, which wouldn't make any sense at all.

If we now run that pod and wait for it to be ready and actually running, we can check that the hook is running and attached to the containers. If we look at the processes on my local system and grep for the OCI seccomp BPF hook, we can see two instances of the hook running, recording into different files. We then have to delete the pod again to see what actually gets written into those files. Now the pod got deleted, and CRI-O has used the annotation to separate our profiles. If we give our profiles a first quick look, we can see that they look different, and to confirm they really differ, we can diff them: now we can see the profiles have different syscalls in them and are not the same. So this works.

All these profiles in Kubernetes have to live inside the seccomp root, which is usually the kubelet root plus /seccomp, though it can be configured to a different directory. In our case, I just have to copy our seccomp profiles into /var/lib/kubelet/seccomp. Once we've done that, we're able to use both profiles in our workload. If we now create a deployment workload, for example with two replicas, we can use the new securityContext seccompProfile field, which was introduced in Kubernetes 1.19, to reference our localhost profiles: the nginx profile for the nginx container and the redis profile for the redis container, along the lines of the sketch below. So we create this deployment and wait for it to be available in my local cluster setup.
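A sketch of that deployment follows. The container names, profile placement, and replica count follow the demo as described, but the exact manifest is reconstructed, not copied, so the metadata and file names are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: recorded-profiles
spec:
  replicas: 2
  selector:
    matchLabels:
      app: recorded-profiles
  template:
    metadata:
      labels:
        app: recorded-profiles
    spec:
      containers:
      - name: nginx
        image: nginx
        securityContext:
          seccompProfile:
            type: Localhost
            # resolved relative to the kubelet's seccomp root,
            # usually /var/lib/kubelet/seccomp
            localhostProfile: nginx.json
      - name: redis
        image: redis
        securityContext:
          seccompProfile:
            type: Localhost
            localhostProfile: redis.json
```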
Then we can look at the pods that got created. If we pick one of them, we can see that we now have the container seccomp profile annotations available for both containers: the nginx annotation for the nginx profile and the redis annotation for the redis profile. And if we want to check what this actually looks like at the low-level OCI runtime, we can use crictl inspect. In our case we just look for our nginx workload and choose one of the two running containers. We can see that a seccomp profile is applied, and it matches exactly what we recorded. So much for recording seccomp profiles with CRI-O.

One more thing I'd like to mention is that CRI-O aims to make Kubernetes more secure by default. Our overall idea is to make the runtime's default seccomp profile the default for all workloads. Right now, all workloads run unconfined by default, and we created a KEP for that. So if you're interested in providing feedback on that Kubernetes enhancement proposal, feel free to ping us on Slack or comment directly on the pull request.

Okay, and that's it for our talk, CRI-O Still Loves Kubernetes. We'll take some questions now. Thank you all.