If you can believe it, this is my first time giving a talk since November 2019 at KubeCon San Diego, and it's the first time our little group of maintainers has seen each other in about the same amount of time. So we're excited to see everyone again and give you an update on the project. If you were at KubeCon San Diego, or you watched what we called the mini summit, we've given other updates since then; obviously there have been some virtual KubeCons and other events. But I thought, since it's our first time back together in person, let's just recap the last two years for those who may have seen the last 18 months as a fog and don't really remember what happened in that time.

From a release standpoint, we've had three major release lines, and the third one is just now starting. If you saw our tweet today or the release that came out yesterday, we're just starting the 1.6.0 beta. In those two years we've come out with 1.4 and 1.5, and if you count up all the service and point releases, we've done 29 total releases since KubeCon San Diego across three release branches, including 1.3, which was in service at the time and is now end of life. Also in that time, we've fully automated our release process; we don't rely on Derek running some commands on his laptop anymore. We moved our entire CI to GitHub Actions, migrated all our sub-repos, and have had all of that working for almost two years now.

The team at Microsoft, some of whom are reviewers and committers in the containerd project, have improved the Windows support and the CRI support for Windows in each of those major releases, and now there's official packaging for Windows with each of our releases. Also in that time, one of our maintainers has added testing for ARM64. We haven't integrated it into our release process; those of you who use GitHub Actions know there are no built-in ARM64 runners, and we haven't been able to do self-hosted ones. But that's another architecture that's getting tested automatically, and hopefully it will be part of our release process soon.

We also firmed up our disclosure and security process, and we've gotten to test it with three CVEs in the last couple of years, so that seems fairly well streamlined as well. We even have a new role, a security advisor role, that came along with rethinking our process, and there are some other governance updates; you can find those in our containerd/project repo. We've added new reviewers from multiple employers, and there have been over 370 unique contributors to the project in just the last two years. Ten percent of those aren't one-time contributors; they've actually submitted 10 or more PRs. So we've got a great community that's continuing to grow.

A little more detail on that: if you ever use DevStats, which the CNCF has put together, there are lots of other community statistics you can dig into. You can see that 67 unique companies are participating. We now have 13 committers and 14 reviewers from, again, a good cross-section of companies, and we like to see that as the community comes together to build containerd.

One of the other things, if you were at that mini summit or watched the video or have followed since then: we've added this concept of non-core sub-projects of containerd. At the mini summit, I believe Kohei was there and presented the stargz snapshotter; that's now a non-core sub-project in the containerd GitHub organization.
Image encryption was being worked on at the time; that's also now a sub-project in the containerd organization. And then there's something that didn't exist a few years ago but has gotten a lot of community activity recently: nerdctl, which Akihiro, one of our maintainers, put together. It's also now a non-core sub-project of containerd, and one of the nice things about it is that it packages up a lot of these features into a simple, installable client that lets you play with things like lazy loading, container encryption, and even rootless mode.

The rest of this presentation is going to be a lot about what's new; we're not going to try to be exhaustive. There's tons happening in the CRI that Mike will cover. We've started to add capabilities for OpenTelemetry tracing and more metrics. There's been interest from the Confidential Computing Consortium in adding capabilities to containerd around that; in fact, the Inclavare Containers project is already using containerd. And there's a bunch of work Maxim will talk about around the Sandbox API and the shim, and a bunch of rework going on right now that we hope to get into the 1.6 release. So with that quick intro, I'm going to turn it over to Derek to continue.

Okay, so I'm going to talk about the deep dive section of containerd, and I'll break it up roughly into three different parts so everyone can be familiar with what we're talking about. In containerd we have the client; Phil mentioned nerdctl, which is one of the clients. ctr, if you're familiar with containerd, is our debugging tool that we ship with containerd, and it calls into all our underlying functionality. Our Go library we also consider part of our client, and if you're integrating containerd, you're probably using our Go library today. Our CRI implementation actually uses that same Go library.

Then there's the containerd daemon, which you're probably all familiar with running on your hosts. There we have our API server, the actual CRI plugin, and then all our resource managers that do data storage, garbage collection, and everything around shim management, which actually owns the containers. And then there's the containerd shim: we have a shim per container or pod instance, and this is what actually manages all the running processes.

Broken out, it ends up looking like this: you'll see the client in the box all the way on the left, the daemon in the middle, and the shim on the side. In the client, this is roughly how our Go library is laid out. All image distribution is actually done inside the Go library; it's not inside the daemon as a service today. Everything related to setting up a container, so creating your OCI spec and setting all your container options, is defined inside our Go library, as is all the configuration for the services. For the client, you can use service proxies which talk over our gRPC API, but we use the same client in the CRI plugin, which doesn't go through the gRPC API; it directly uses the service implementations. Inside the daemon, we have our API servers, we have our CRI plugin, and in the core we have service implementations for all of the services that we define.
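If you're integrating through that Go library, connecting to the daemon and pulling an image looks roughly like this (a minimal sketch; the socket path, namespace, and image reference are just examples):

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	// Connect to the containerd daemon over its gRPC socket.
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Everything in containerd is namespaced; pick one for our resources.
	ctx := namespaces.WithNamespace(context.Background(), "example")

	// Pull resolves the tag, fetches the content, and (with this option)
	// unpacks it into a snapshot ready to run.
	image, err := client.Pull(ctx, "docker.io/library/alpine:latest",
		containerd.WithPullUnpack)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("pulled", image.Name())
}
```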
So everything related to containers, to accessing content, to accessing the actual images, namespaces, and snapshots has a definition here, and the API gives you access to those through gRPC. If you know the containerd API today, you might be familiar with some of these APIs. They're very low-level; if you're used to using Docker, there's probably a lot here that you wouldn't use directly. The things you're used to, like pull and push, you'd find inside the actual Go library.

And then here's where we actually do our data storage. In our metadata layer, we store everything: every label we have, every single image and what it stores, every snapshot, every container. And we have a garbage collection process that will go in and delete anything that's no longer referenced, which is how we keep our approach of failing clean: if something becomes unreferenced, we'll clean it up. The backend underneath that is where the actual snapshots live, where the large content blobs you pull down from the registry live, and where the clients that talk to the shims sit; the metadata layer manages everything done there.

The shims are probably where you've seen a lot of discussion lately: gVisor, Kata Containers, runc, which everybody knows, and Firecracker all implement this shim interface. What this allows us to do is separate the underlying container implementation from our resource management of that container. From a containerd perspective, we can treat these shims as a resource; we don't have to worry about everything being managed underneath, whether you're using cgroups or a VM implementation. All of that is up to how the shim implements it, and we can work with it more abstractly. Since the last time we did this talk (actually, I guess we didn't have this diagram two years ago, but at least since last KubeCon), the diagram includes the Sandbox Manager. That's a new interface we're adding, and Maxim will discuss it a little more coming up.

All right, so let's quickly go through what a pull looks like in this diagram. When you do a pull, the first thing that happens is we get a lease. A lease in containerd is basically a way to tell the garbage collector, "I'm using this content right now." If something fails, the garbage collector will eventually clean it up once that lease expires. As we all know, pulls don't always complete, or your process gets killed randomly; this is how we make sure we never leave anything behind. After that, the pull process resolves your content: it resolves your tag against some OCI registry, grabs that data from the registry, and puts it into the content store. The path it takes from the client is through gRPC, through our service layer, through our metadata layer, and then to the storage. After that, every pull unpacks onto disk: the pull creates the snapshots, and for every snapshot it creates, it unpacks that data using our diff service. All the diff service does is take these layers and unpack them into the snapshots. And then after that, we can just create an image. This is the lightweight part; it just links to the content that we've now unpacked on disk.
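That lease step is available from the Go client as well. Pull handles leases for you, but if you're orchestrating the lower-level steps yourself, holding one explicitly looks roughly like this (a sketch, continuing from the client and ctx in the previous snippet):

```go
// Acquire a lease so the garbage collector treats everything we create
// as referenced while we work. If our process dies partway through,
// the lease eventually expires and the GC cleans up after us.
ctx, done, err := client.WithLease(ctx)
if err != nil {
	log.Fatal(err)
}
// Releasing the lease lets the GC reclaim anything left unreferenced.
defer done(ctx)
```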
Likewise, let me go through the run flow real quick. Unlike pull, we don't have a top-level run function, so nerdctl or ctr will orchestrate these steps. It does the same thing first: it grabs a lease, and then it resolves your image. Normally when you run a container, you start with an image; in this case it just reads what that is. It resolves it, reads the content, and looks at the image configuration, so that it can work out the configuration you want to use for your container. It sets up the snapshot, which is normally the read-write layer you're going to use inside your container. It creates the OCI specification, using that image config it read back from the content store to generate the OCI configuration, and creates the container. Now, the container here is actually a metadata object we use in containerd to keep track of the container as a resource.

After that is where we actually start the task, and the task doesn't have data that we store in containerd; you can see it kind of skips over our storage layer there and goes directly to the backend. That's because tasks are the ephemeral objects associated with the actual running container. If you were to restart your machine at this point, you'd still see that container metadata, but you wouldn't see any of the task data, with how it's implemented today. From here, all your remaining operations go through this task interface: you start the container, and exec, kill, or whatever else you do goes through this same flow from the client all the way back to the shim.

And as I mentioned before, I added sandboxes in here. This is something we're in the process of adding; we don't have a flow all the way back through the client today because the sandbox API is still being worked on, but it's one of the areas we're working on now. It will help in cases where you have a VM sandbox, so that you can manage that sandbox and its lifecycle through the containerd API. This will be useful for some of the more advanced use cases that we'll be discussing later.
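From the Go client, that run flow comes together roughly like the sketch below. It assumes the image was already pulled and unpacked as above, and the container and snapshot IDs are just examples:

```go
package main

import (
	"context"
	"log"
	"syscall"
	"time"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/oci"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	ctx := namespaces.WithNamespace(context.Background(), "example")

	// Resolve the previously pulled image; its config seeds the OCI spec.
	image, err := client.GetImage(ctx, "docker.io/library/alpine:latest")
	if err != nil {
		log.Fatal(err)
	}

	// Create the container: a metadata object, plus its read-write
	// snapshot and an OCI spec generated from the image configuration.
	container, err := client.NewContainer(ctx, "example",
		containerd.WithNewSnapshot("example-snapshot", image),
		containerd.WithNewSpec(oci.WithImageConfig(image)),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer container.Delete(ctx, containerd.WithSnapshotCleanup)

	// The task is the ephemeral, running instance backed by a shim.
	task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio))
	if err != nil {
		log.Fatal(err)
	}
	defer task.Delete(ctx)

	// Set up Wait before Start so we can't miss a fast exit.
	exitCh, err := task.Wait(ctx)
	if err != nil {
		log.Fatal(err)
	}
	if err := task.Start(ctx); err != nil {
		log.Fatal(err)
	}

	// Let it run briefly, then stop it and reap the exit status.
	time.Sleep(2 * time.Second)
	if err := task.Kill(ctx, syscall.SIGTERM); err != nil {
		log.Fatal(err)
	}
	status := <-exitCh
	code, _, _ := status.Result()
	log.Println("exited with status", code)
}
```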
So now we're going to dig into CRI a little deeper with Mike.

Hey, everyone. Well, you said deeper, but really I'm going to step back a bit and talk more about how we interact with Kubernetes itself. A lot of you are Kubernetes users, and you may not really know what goes on on a node except that you configure a container runtime, so I'll talk about this a little. The idea is that on your worker node you have the kubelet registering with the API server to host the pods, and the resource types pods need, on that node. It's the job of the kubelet to run its managers and converse with the container runtime, allocating volumes and attaching things to containers through the CRI APIs. The CRI API was created by some folks at Google about five years ago as a way to change how we were integrating with Docker for launching and managing containers. We thought it would be a good idea to have a CRI API so that other people could implement it without having to fork the kubelet, which is basically what we were doing originally; there was an rkt integration and a Docker fork.

Let's go to the next slide. Currently there's some code in the kubelet that implements the dockershim; you can actually find it in the tree there. When the kubelet comes up, it launches the dockershim to host these CRI services, but there are also some internal integration techniques in there, and it was decided that all of that needs to be pushed over into the CRI, so that container runtimes using only the CRI interface would have a more legitimate API to work with. So what has happened is we've deprecated the dockershim. That's the internal version, not the external version; the external version of the dockershim is still being worked on, and it'll still work with Docker. In case you don't know, underneath the covers Docker actually uses containerd too. But we've got work going on to make all the test buckets pass for the container runtimes, whether CRI-O and/or containerd, so that you can trust that everything that can happen with Docker can also happen with these other container runtimes you're all migrating to. When it gets completely pulled out, I guess maybe 1.24 or 1.25, we'll see what happens.

Along this route, the CRI API has been in alpha for about three years. I'm sure you know the API cycle for Kubernetes APIs can be very long, but we are moving to v1 status, and Sasha has a pull request right now to add the v1 API as an optional API in the 1.23 cycle: the v1 CRI API in Kubernetes 1.23. So what we're going to do is make sure that all of the container runtimes implementing the CRI interface support both the v1alpha2 API and the v1 API at the same time. When the kubelet comes up, it'll check to see whether the container runtime has v1 or not. If you have v1, it'll start using that as soon as this PR gets merged in Kubernetes 1.23. And then it'll all be great, right? Later on, if containerd restarts and the kubelet has to reconnect to containerd, it'll continue using the v1 API. If, however, you're using an older version of containerd or CRI-O, and the kubelet connected initially over the v1alpha2 API, it'll keep using v1alpha2 unless you restart the kubelet, at which point it rechecks whether v1 is there. That gives you an idea that yes, there's going to be a migration issue of sorts, but we're going to tackle it in the container runtimes and in the kubelet so that either side is OK with the other migrating up; you'd have to restart both to get to the next version of the API. And I imagine we'll do the same thing when we go from the v1 beta to the v1 GA, okay? So that's going to happen.
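In pseudocode, that negotiation behaves something like the sketch below. To be clear, this is purely illustrative and not actual kubelet code; the types and helper functions are made up to show the prefer-v1-then-fall-back logic:

```go
package main

import (
	"errors"
	"fmt"
)

// runtimeService is a hypothetical stand-in for the kubelet's CRI client.
type runtimeService interface{ Version() string }

type v1Client struct{}

func (v1Client) Version() string { return "v1" }

type v1alpha2Client struct{}

func (v1alpha2Client) Version() string { return "v1alpha2" }

// dialV1 pretends to probe the runtime for the v1 CRI API. Here it
// always fails, as an older containerd or CRI-O would.
func dialV1(endpoint string) (runtimeService, error) {
	return nil, errors.New("v1 CRI API not served")
}

func dialV1Alpha2(endpoint string) (runtimeService, error) {
	return v1alpha2Client{}, nil
}

// negotiate prefers v1 and falls back to v1alpha2. Whichever version
// connects sticks for the life of the process; restarting the kubelet
// re-runs this check, which is how you move up after upgrading the
// runtime.
func negotiate(endpoint string) (runtimeService, error) {
	if svc, err := dialV1(endpoint); err == nil {
		return svc, nil
	}
	return dialV1Alpha2(endpoint)
}

func main() {
	svc, err := negotiate("unix:///run/containerd/containerd.sock")
	if err != nil {
		fmt.Println("no usable CRI API:", err)
		return
	}
	fmt.Println("using CRI", svc.Version())
}
```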
Confidential computing. Very exciting; there's a lot of interesting stuff going on in multi-tenancy, and we've run into some issues. One of the oldest issues in Kubernetes is this little problem: if pod A pulls an image down to the node and pod B wants to use that image, and your policy only says "pull if not present," then pod B gets to use the image from the first pull whether or not it had access to it. So if you're storing an image as a private image that should only be accessible with a password and authentication, it may already be on the node. What we want to do with the "ensure secret pulled images" work is actually keep a running list of who pulled each image and whether auth was required. If you follow these links, you'll find the KEP; it's been approved for 1.23, and we've got the code for it as well. Just a heads up that this problem should get fixed. For now, the workaround for multi-tenancy is to force, through a controller, the image pull policy to "pull always." Now, if you use the always-pull image policy, what the container runtimes do is pull only the manifest: if it's a one-gigabyte image, they pull just the manifest part and make sure you have authentication, and if we've already got the blobs locally cached, we'll just use those.

The other thing we've been running into a lot with confidential computing is that we need the ability to decide where we store the container image cache, the metadata, the snapshots, and whatever data you're storing in your container. What the Kata Containers folks and a few other groups want to do is put that information inside the virtual machine. It sounds like it'll be expensive, but it'll be very secure. So instead of today's model, where we pull images and cache them on your host, if you say you want a runtime handler that's, say, secure Kata Containers with internal image pull, that runtime handler would switch to a configuration that pulls the image into the virtual machine. So that's going on; we can answer questions on this later, and I'm hoping you're interested.

Another set of changes we'd want to make: if you're looking at Knative or Istio, or maybe edge-type situations, you want pods to be really, really fast. Well, there are a couple of problems with how Kubernetes is currently designed. It wasn't designed to run pods every couple of milliseconds; it was designed to run pods based on how they've been deployed and distributed across the nodes, and then after a couple of minutes it checks whether they're finished or still running, and if they're not, it reruns the pods to keep matching the contract you have. Kubernetes isn't really built for a contract that's just a couple of milliseconds for a quick service, or even half a second or two. So we'd like to improve how we handle probes, moving them from a per-second level to milliseconds. We'd also like to change how the kubelet process determines the current state of pods. Today it determines the current state on a configurable cycle time that defaults to, I believe, two minutes, so the scheduler side won't know until after a couple of minutes whether or not it needs to run another pod, and that's not fast enough for a Knative kind of scenario. So we'd like to move to a subscription process. It sounds obvious: we'll move to eventing. We already have eventing support for tasks inside containerd, so we'd like to move that outside.
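That eventing support is already visible from containerd's Go client today. Here's a small sketch of subscribing to the event stream (continuing from the client and ctx in the earlier snippets; the filter string is just an example):

```go
// Subscribe to containerd's event stream, filtered here to task events.
ch, errs := client.Subscribe(ctx, `topic~="/tasks/"`)
for {
	select {
	case env := <-ch:
		// Each envelope carries the namespace, topic, and a typed event.
		log.Printf("event: %s %s", env.Namespace, env.Topic)
	case err := <-errs:
		log.Fatal(err)
	}
}
```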
And that's it.

I'd like to briefly highlight what we've been working on lately and what to expect in upcoming containerd releases. We plan to focus more on our shim v2 runtime, the one responsible for container tasks and for managing container processes. Up to containerd 1.5, the shim runtime interface didn't change much: we implement just one runtime service, called the task service, to manage the whole container lifecycle, and it was sufficient to run containers and even to launch pods and sandboxes. However, with microVMs there are a lot of new use cases we want to cover, and some of them are driving the runtime changes we're working on. The Sandbox API has been in progress for quite a while; the goal is to make a microVM, or any other sandbox, a first-class citizen in containerd. There's an effort with Confidential Containers to apply more security to container images so the host can't access sensitive data the image may contain. There's also a proposal to add a port-forwarding API to our shims. And there are many more potential use cases that we're not yet aware of.

For the new runtime, we added plugin support to our shims in the 1.6 beta. They use the same plugin system as the containerd daemon, so it's easy to add new services to the runtime, define dependencies between these plugins, and so on. In order to support new services on the daemon side, we want to introduce a new shim service that will manage shim processes independently from the API those shims implement. And we give more control to the client over how to schedule and manage workloads through the containerd APIs.

A quick recap of how our runtime system works today. As I mentioned, shims implement the task service interface, and the containerd daemon offers the same API on its backend to manage tasks from the client, which hides the shim lifecycle and the communication with the runtime. Whenever we launch a container from the client, we call containerd's task service API; behind the scenes it starts a new shim process for us and then forwards the create request to the shim we just created. The same applies to subsequent calls: the daemon keeps a list of running shims, looks up the one we need, and forwards requests to the proper instance.

With the new approach, we allow calling the shim's API directly. The containerd daemon still manages the shim process lifecycle, so we can ask containerd to start a new shim for us, or request an existing one by ID, but the daemon is not aware of what kinds of services a particular shim implements. Instead, we let the containerd client talk to the shim, and depending on the services the runtime offers, the client can decide the proper flow of calls. So, for instance, if a shim implements the task service, we launch containers from the client the same way we do today. If, in addition to that, a shim supports the sandbox API, the client can start a sandbox instance before launching containers. And if, on top of that, the shim supports the confidential containers APIs, the client can prepare an image for that sandbox. This is flexible enough to enable all kinds of service combinations, and it makes it easy to support new cases.

The concept behind the shim service is really simple: it can create and it can delete shims. To be precise, we already do that today; we're just splitting the existing task service so that shims and tasks are managed independently, as two different plugins, and we'll expose its API to clients so they can control shims. containerd still expects to track the shim lifecycle, so whenever a shim process dies, containerd will clean it up. This service is going to be a foundation for higher-level services that depend on specific shim implementations.
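As a rough sketch of that concept in Go (hypothetical; the real shim service API was still being worked on at the time of this talk, so these names are illustrative, not the actual containerd interfaces):

```go
package shimservice

import "context"

// ShimInstance is a hypothetical handle to a running shim process.
type ShimInstance interface {
	// ID returns the identifier the shim was started with.
	ID() string
}

// ShimService sketches the idea described above: the daemon creates and
// deletes shim processes without knowing which services (task, sandbox,
// confidential containers, ...) each shim happens to implement.
type ShimService interface {
	// Start launches a new shim process for the given ID.
	Start(ctx context.Context, id string) (ShimInstance, error)
	// Get looks up an already-running shim by ID.
	Get(ctx context.Context, id string) (ShimInstance, error)
	// Delete stops the shim and cleans up its resources; the daemon also
	// does this automatically when it notices a shim process has died.
	Delete(ctx context.Context, id string) error
}
```

Higher-level services like the task service and the Sandbox API would then be built on top of handles like these, invoking whichever services a given shim advertises.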
The task service is the only service we require to be implemented today, so containerd can launch new containers by first calling the shim service to start a new shim instance and then calling the task service API on that shim; this way we keep backward compatibility with previous versions of containerd. In the same way, the Sandbox API will invoke a different set of services offered by shims; it will be one more consumer of the shim service. It aims to bring the notion of a group of containers to containerd. The goal is to provide a well-defined shim interface, alongside the Task API, that can be used to add new sandbox implementations at the runtime level. What sits behind the sandbox concept can be almost anything, depending on the runtime implementation: for instance, a pause container, a microVM, or even a full virtual machine. That's all I have. Thank you so much for listening, and we'll be happy to answer your questions.