Welcome to the containerd introduction and deep dive session. I'm Derek McGowan, a maintainer on the containerd project at Apple. I'm joined today by fellow maintainers Phil Estes from IBM, Michael Crosby from Apple, and Wei Fu from Alibaba. We're going to give you a brief introduction to containerd today and deep dive into various parts of containerd. First, we're going to discuss what containerd is and why to use it. Then we will give an overview of its architecture. We will deep dive into the new Node Resource Interface, also called NRI, and containerd's other plugin systems. Then we will discuss containerd's most recent release and what's coming up in future releases. So let's start off by handing it over to Phil.

Hey everyone, I'm Phil Estes, one of the maintainers of containerd, and I'm just going to take you through a brief introduction to containerd, mostly focusing on things that are new or that we didn't get a chance to touch on, since it's only been a few months since the virtual KubeCon EU. So let's take a quick look at containerd and a few new items, and let's get started. We always try to start out by giving a little bit of background on what containerd is. Obviously, we're a CNCF project, one of the graduated projects, as a container runtime. And really, maybe the best way to see that is to see us below higher-level platforms like Docker or Kubernetes, but above lower-level runtimes like the OCI's runc, or the various sandboxes like Kata Containers, Firecracker, or Google's gVisor. So really, containerd is the resource manager that sits between a higher-level platform, like the two we mentioned, and manages things like the processes, the images, the snapshots on the filesystem, and all the metadata about containers and their dependencies. And I think the important thing there is that we don't just manage those resources: the way containerd is designed, it's a very extensible system, and a lot of the projects that we've seen use or grow out of containerd have used this extensibility to great advantage. It's also a tightly scoped project, so we don't change the scope of containerd often. The only scope change we've really made in the last four years is bringing in the CRI plugin so that Kubernetes had a natural plug point to use containerd as a runtime. There are a number of governed sub-projects within the containerd repositories and organization. We now have a Rust ttrpc implementation, container image encryption support, and the stargz snapshotter, which handles things like lazy image pull, and you're going to be hearing about NRI as well during this presentation. So that gives you a quick snapshot of what containerd is. Now, we walked through containerd's history in more detail just a few months ago at the virtual KubeCon EU, and you can find that talk on YouTube already. But just for a brief history: containerd came up alongside Docker. It's not a fork, and it doesn't inherit any part of the Docker code base, but effectively it grew from a container supervisor used by Docker into a full runtime in the 2016-2017 era. During that time, we created completely new interfaces for managing containers and images, and after defining those in gRPC, we also created a Go client API that's used heavily by the CRI plugin, for example, as it implements the container runtime interface for Kubernetes; many other projects use that clean API to manage containers, images, and the container lifecycle.
Now you might ask: okay, I see how containerd came to be and how you designed it, but what are the usage models or patterns we see as containerd has matured? One very common use case, having mentioned the CRI, is that if you're using Kubernetes, containerd is the most stable, best supported, and most mature runtime in use today underneath Kubernetes. Every major cloud provider has a managed Kubernetes offering; they use either Docker or containerd today, and many of them have switched fully to containerd. If you're a developer, you probably aren't going to use containerd directly, but if you're using Docker, by nature you're already using containerd, and BuildKit, which can be used separately or as the build system within Docker, heavily uses the containerd API and can use containerd as the runtime for builds. On the edge, which is getting a lot of focus and interest these days, containerd is seen as a memory-efficient and stable runtime used in popular projects like Rancher's K3s. Functions as a service has also been a growing area, and it even has some blurred lines with pure containers-as-a-service offerings like Google's Cloud Run or Fargate. And again, containerd is seen as a fast and efficient runtime used in projects like faasd or IBM Cloud Functions, or even IBM's new Cloud Code Engine service, built fully around containerd. Now, we don't have a lot of new information on containerd adoption across the ecosystem, since we just shared many of these numbers and percentages a few months ago. But there is a CNCF survey now out capturing some of the 2020 changes to that information, and we're hoping to see Sysdig's latest usage report very soon, which should be out by the time you hear this or maybe in the next month. The CNCF survey update did show that containerd usage grew by over a third since the last time the community was surveyed. Production use of containerd is also growing: 36% of respondents noted that they were directly using containerd in production. So again, without a lot of new data, we definitely see continued growth in the use of containerd, and we also see a lot of new activity in our community from people utilizing containerd, maybe in scenarios that aren't going to be shared publicly, but that obviously shows growing use of containerd across the ecosystem. For example, we've already mentioned major cloud providers; you can see many of their logos on this slide offering Kubernetes as a service where they use containerd. And again, we've already mentioned Docker's use of containerd. Many of the isolators and sandboxes have written shims directly for containerd, so you can use things like Kata or Firecracker or gVisor or others right within containerd, using the same API to access these sandbox isolation technologies. Containers as a service is a very interesting area. Obviously, we already knew about Google Cloud Run and AWS Fargate; Azure has Container Instances, and IBM just announced the Code Engine project, which is also effectively a serverless containers-as-a-service offering. Almost all of these projects, at least where we have public information, are directly using containerd to build these as-a-service technologies for their customers. And then many developer tools, like the already mentioned BuildKit, but also Ignite and kind, are using containerd.
And then, new on the scene in the last few months: maybe you've heard of LinuxKit, but Amazon recently released Bottlerocket OS as an open source project that again uses containerd for a containerized Linux OS idea that they've now shared with the broader community. Now, we've given you a lot of information on how containerd is used, who might be using it, and various products, development tools, and offerings that we know use containerd, but maybe you're interested in getting involved directly in our open source project. Of course, we're always looking for those who'd love to come contribute, who have issues they found using containerd, and who want to join in our discussions. We're on the CNCF Slack in the #containerd channel, and #containerd-dev for those who want to help with the development of containerd. We'd love to see many people get involved. We'd also love to improve our documentation; that's an area we've been talking to the CNCF about. So no matter what your skill level or interest is, we'd love to have you join us in the containerd community. If you can't figure out where to get involved, definitely reach out to us and we can find a way to get you involved.

Thanks, Phil. Now let's take a look at containerd's architecture. In this diagram, we have broken containerd down into three main components. The client is containerd's Go client, which contains most of the user-facing container logic you may be familiar with. In the middle is the containerd daemon, broken down into API, core, backends, and plugins. This is where containerd's service interfaces live, and it handles all data storage and resource management. The shim is what owns the container processes and communicates low-level operations to and from containerd; the shim is responsible for interacting with the lower-level container runtime. As a user of containerd, the client is the first place you will interact with. If you are building a system on top of containerd, the client is what you will be working with. You can see in this diagram that some core functionality often associated with container runtimes is implemented here in the client, such as container management, which handles creation of the OCI runtime spec and preparation of the container snapshot, as well as starting individual container tasks. Image pulling is also completely implemented inside the client. Because of this approach, clients are mostly limited to using our Go library. Even though Python, Java, or another language could talk to the containerd API using gRPC, basic functionality would need to be re-implemented to use the full runtime. The CRI interface is probably much easier to work with if you are not implementing your project in Go. One of the first things you will come across in the client is the options pattern. With containerd, we chose to keep the core as simple as possible. Defining clean, simple interfaces in the core meant that the client had to have more functionality and configurability. In the containerd core, there may just be generic variadic options on a core type, but in the client you will see "With" options used on these types to implement features. You will commonly see this used for labeling and filtering on types. We were also hoping this would allow clients to have their own opinions rather than be forced to work around opinions in the containerd daemon.
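To make that options pattern concrete, here is a minimal sketch using the containerd Go client, following the pattern from the project's getting-started documentation; the socket path, namespace, image reference, and container ID are placeholders for illustration.

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/oci"
)

func main() {
	// Connect to the containerd daemon over its unix socket.
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// All client operations are scoped to a namespace.
	ctx := namespaces.WithNamespace(context.Background(), "example")

	// Pull an image; WithPullUnpack is one of the variadic "With" options.
	image, err := client.Pull(ctx, "docker.io/library/redis:alpine", containerd.WithPullUnpack)
	if err != nil {
		log.Fatal(err)
	}

	// Create a container by composing client-side options: a new snapshot
	// from the image and an OCI spec generated from the image config.
	container, err := client.NewContainer(ctx, "redis-example",
		containerd.WithNewSnapshot("redis-example-snapshot", image),
		containerd.WithNewSpec(oci.WithImageConfig(image)),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer container.Delete(ctx, containerd.WithSnapshotCleanup)
}
```

Each "With" option just shapes the request on the client side before it reaches the daemon, which is how features are layered on without adding opinions to the core.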
The core of containerd contains the components we consider most critical in terms of stability. Here we define the main data types and interfaces used by containerd. If you think containerd overuses interfaces, that is partially intentional. We want data and operations flowing through the core, and everything else wrapping those types or using data that is tracked by the core. The core has a main implementation of these interfaces, and these interfaces are extended all the way to the client. The design is such that all data flows through the core so that plugins and other components don't need to store data themselves. This is important because we have implemented garbage collection inside the core metadata store; if data is being stored somewhere else, it is really hard to track that data and make sure it is removed when it is no longer needed. This makes containerd more stable by keeping storage and memory usage stable, and it avoids data inconsistency and crashes. We don't expect as many changes in the core relative to other parts of containerd. Features generally aren't implemented inside the core; however, some features will require small changes inside the core. For example, we recently added a feature to support remote snapshotters. The high-level feature of remote snapshotters is not something the core knows anything about, but one piece of functionality we did add to the core was the ability for the snapshotters themselves to report that they already know about a snapshot as it is flowing through the core. So when you create a new snapshot, you can pass information that says "this is the target snapshot I'm trying to create," and the core will pass that to the backend. The backend can then communicate back up to the core to let it know it already has that snapshot. That allows the client to know the snapshot is already available as a remote snapshot when it sees the core report the snapshot as existing. Meanwhile, the core has no knowledge of what a remote snapshot is. It is really important for us to keep the core of containerd un-opinionated and to make sure we don't get feature creep showing up in the containerd daemon. We also don't want opinions showing up in the daemon, where new features or requirements become limited by previous opinions added to the core.
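As a rough sketch of that remote snapshotter handshake, this is roughly how a client can hint the target committed snapshot to a snapshotter using the "containerd.io/snapshot.ref" label, with the backend answering "already exists" if it can serve the snapshot remotely; the key, parent, and chain ID here are placeholders.

```go
package example

import (
	"context"

	"github.com/containerd/containerd/errdefs"
	"github.com/containerd/containerd/snapshots"
)

// prepareWithTarget hints the snapshotter about the committed snapshot we
// ultimately want. A remote snapshotter that already has that snapshot can
// answer with ErrAlreadyExists, letting the client skip download and unpack.
func prepareWithTarget(ctx context.Context, sn snapshots.Snapshotter, key, parent, chainID string) (already bool, err error) {
	opt := snapshots.WithLabels(map[string]string{
		"containerd.io/snapshot.ref": chainID, // target snapshot hint
	})
	if _, err = sn.Prepare(ctx, key, parent, opt); err != nil {
		if errdefs.IsAlreadyExists(err) {
			// The backend reports it already has this snapshot,
			// for example lazily available from a registry.
			return true, nil
		}
		return false, err
	}
	return false, nil
}
```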
One thing you may notice is that there are no tasks in the metadata store. Even though tasks are considered a core type with a core service, the data associated with a task is completely ephemeral and only related to the actual running container. This data may be something like process information or Unix sockets. When a node is restarted, all the process information will be gone, as expected; however, all the metadata associated with the container, including its specification and snapshots, is persisted in the metadata store. This information can be used by clients to recreate the container and its tasks. The diff service is another core service without any state; it operates on snapshots and content but does not store its own information. Notice that the CRI plugin sits inside the containerd daemon. In current versions of containerd, the CRI plugin exists as a single imported plugin. In containerd 1.5, we are merging the CRI code base and integrating CRI's components. Today, CRI operates by using the containerd client package directly and calling out to separate plugin interfaces such as CNI and NRI. Let's take a look inside the CRI plugin. CRI consists of two gRPC services, Image and Runtime. Both of these services share a service backend in the plugin today and call into the containerd client. The CRI plugin is responsible for defining the pod by sharing the namespaces and cgroups between all containers in a pod. It is also responsible for managing the pod's sandbox container, which is responsible for holding namespaces open. In the future, the sandbox container will be able to go away, since there are other ways this could be accomplished today. The CRI plugin invokes CNI for setting up container networking. This call is made after containerd creates the first container, so that CNI can set up networking using the container's namespace. The CRI plugin also calls into the new Node Resource Interface. Let's hand it over to Michael to discuss the Node Resource Interface.

Alright, I'm going to show you some of the work on the Node Resource Interface for containerd and Kubernetes. When you think about resource management, we have different workload requirements. You have customers that are running batch workloads, and also customers that have latency-sensitive workloads that may be in the request-response cycle for various users interacting with a service. We need to be able to co-locate both of these types of workloads, but they have very different performance requirements that their users expect. There are also various user requirements, such as priority and SLAs and SLOs for how eviction happens on these nodes, that we all have to manage. When you think about resource management, we have various resources on a compute host. You have CPUs, which you may want to dedicate or restrict access to depending on the workload. You also have NUMA on multi-socket systems. Then we can expand out into L3 cache, huge pages, and alignment, making sure your workload runs on the same NUMA node that your GPUs are connected to. As for current solutions, the kubelet does have a CPU manager, and there are efforts to improve it, expanding into NUMA support and CPU pinning, but when I was looking into this, I found the UX was a little weird; it wasn't very explicit. I started looking at prior art in the container space, and one of the interfaces that I like the most is CNI, the Container Network Interface. I believe the way it uses plugins, and the interface for how a CRI container runtime interacts with CNI to set up networking for your containers, is very elegant, and it makes it very composable as well. So I thought: let's make CNI for resources. And this is where NRI comes in, because CRI was already taken. I would rather it be named the container resource interface, but we'll stick with Node Resource Interface for now. So myself and some others in the community felt like the kubelet is not the right abstraction for things like this. The lines between the kubelet and the lower-level CRI are a little too blurry, and we think that being able to hook into the lifecycle of containers at the CRI level is the right way to do resource pinning and topology on a node. CRI implementations like containerd know how to interface with the underlying host systems. We can get the information for what devices are available, and this is where things like CNI for networking also hook in. So that's why I wanted to start building out NRI, and with this we can do various things such as managing the topology and quality of service for different workload types. We can have latency-sensitive workloads pinned on CPUs with very simple plugins, as well as batch workloads that execute across multiple cores and then get their cores taken away if higher-priority workloads land on the node.
Inside the containerd org we have an NRI repo where we've been working on the API and interface of NRI, and you can see how it takes a lot of inspiration from CNI: there is a config, you can chain plugins together, and there's a specific order in which these various binary-based plugins are executed. We have clearly defined inputs and outputs for these plugins, so that they're able to get the access they need to the underlying pod or container information they're setting up resources for, and they have defined outputs and defined points in the container lifecycle that they hook into. So go ahead and check it out at github.com/containerd/nri and let us know what you think as we build out this interface.

Thanks, Michael. Let's go back to the diagram. For the underlying runtime, there's a component in the containerd daemon and an external shim for managing container processes. The runtime component in the daemon is responsible for starting up new runtime shims. It will pass the OCI runtime spec and commands to the shim. The runtime shims are the boxes on the right side. These shims are what actually own the container processes; the containers are parented directly to the shim. The containerd daemon talks to the shims over ttrpc. We use this lightweight RPC to reduce the memory footprint of the shim. If containerd gets restarted, it will reconnect to the shims in order to send commands to the containers. There are many different runtime shim implementations. The runc implementation is the most common. There's runhcs for running on Windows. There are also sandbox shims, such as Firecracker and Kata Containers. Since the shims only implement this RPC interface, anyone can implement their own shim to be used by containerd pretty easily. containerd just needs to know about the shim in order to use it to run containers. Plugins can call into the core of containerd, and the core can call out to different backends. The backends themselves are also pluggable. For example, snapshotters are all implemented as plugins. Each plugin can define its own configuration, which gets included in containerd's global configuration object. If you have a gRPC service, it can be added as a plugin and call into the core services directly. In the client there are many options for customization. By default, the containerd client communicates with the containerd daemon through the gRPC API. However, a client can be instantiated with any custom implementation of a service, even allowing clients to operate completely without a daemon. You can also add your own resolvers. Resolvers are what are used to push or pull images between containerd and a registry. The default implementation communicates with a Docker registry using the standard OCI distribution API. However, you can implement your own very easily if you have a different or faster way to distribute images. Including your own plugin is pretty easy: you can compile in a snapshotter and register it as a plugin just like every other built-in snapshotter, or you can implement your own snapshotter as a proxy plugin.
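As an illustration of that client-side customization, here is a rough sketch of pulling with a custom resolver. This one just wires static credentials into the stock docker resolver, but the same option is where an implementation of the remotes.Resolver interface with a different distribution mechanism would plug in; the registry, credentials, and image reference are placeholders.

```go
package example

import (
	"context"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/remotes/docker"
)

func pullWithCustomResolver(ctx context.Context, client *containerd.Client) (containerd.Image, error) {
	// The resolver decides how image names are resolved and how content is
	// fetched; a custom remotes.Resolver could replace this entirely.
	resolver := docker.NewResolver(docker.ResolverOptions{
		Hosts: docker.ConfigureDefaultRegistries(
			docker.WithAuthorizer(docker.NewDockerAuthorizer(
				docker.WithAuthCreds(func(host string) (string, string, error) {
					return "user", "secret", nil // placeholder credentials
				}),
			)),
		),
	})

	// WithResolver swaps the resolver used for this pull.
	return client.Pull(ctx, "registry.example.com/app:latest",
		containerd.WithResolver(resolver),
		containerd.WithPullUnpack,
	)
}
```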
Let's hand it over to Wei Fu to deep dive into containerd's plugin system.

Hello everyone, thanks for joining me. My name is Wei Fu. I'm sure you have all learned a lot about containerd's architecture from Derek. The core of containerd provides a plugin mechanism for how to manage OCI images and how to run containers. For common cases, containerd has built-in plugins that provide the default functionality. On the Linux platform, containerd can download a gzip-compressed image and unpack it as a bundle on an overlayfs filesystem, and then we can run it with runc. But that is just one common case. From the OCI perspective, the OCI specs outline how to unpack and run a filesystem bundle. For example, the filesystem bundle can be built and unpacked in various ways, and the OCI runtime spec describes the lifecycle of a container, which can be handled in a guest kernel for additional isolation. For different scenarios, there are still so many combinations here. But don't worry about it: you don't need to put your special implementation into the containerd code base. containerd can work with your external binary plugin as one of the backends. So let's see how it works. The first one is generic image layer support. As we know, each OCI image layer is a tar file, which may be compressed or encrypted, and the OCI image spec uses media types to describe the image layer format. When you pull an image with containerd, the workflow is like this: the containerd diff service reads the layer data from the content store and then unpacks it into the snapshot filesystem. By default, containerd can detect that a layer is in gzip format. If the layer is not gzip or plain tar, containerd will need a stream processor plugin. The stream processor is a binary plugin that acts as a media type converter. Look at the configuration for a zstd stream processor: when containerd detects that the media type is zstd, it will call the zstd binary for decompression and return a stream with the converted media type. A stream processor plugin can accept multiple custom media types, but that depends on your own implementation. The last step of pulling an image is to unpack the tar stream into a specific filesystem. In order to support different filesystems for images, containerd exposes the snapshotter proxy API for third-party plugins. A proxy snapshotter plugin is a long-running process, and containerd talks to it over a gRPC channel. After pulling an image, the container runtime can run the filesystem bundle. containerd uses the containerd shim v2 as a control layer for the different kinds of runtimes, since each runtime has its own values and design. It's not practical to integrate every kind of runtime through a single command-line interface, so shim v2 provides a general control API for the container lifecycle, hiding the implementation of the underlying runtime. No matter what kind of runtime it is, containerd can communicate with a third-party runtime implementation through the shim v2 API, and shim v2 also provides a binary naming convention to locate your own shim. For example, the runtime name io.containerd.runc.v2 translates into the binary name containerd-shim-runc-v2. In release 1.4, the runc v2 shim is the default shim in containerd. containerd has also improved the logging plugins. In the beginning, the container's standard I/O was always redirected through named pipes, but that requires the data receiver to stay alive; otherwise the pipe fills up and it impacts the running container. So containerd enhanced container logging with a URI scheme. Right now containerd supports file redirection: the container's standard I/O can be written to a file directly. The scheme also supports binary plugins, which set up a long-running background process to handle the I/O. So with the file and binary log schemes, the container I/O no longer goes through containerd, and containerd will not be impacted by the logging flow.
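A rough sketch of how a client might opt into that binary logging scheme through the Go client is below; the logger path is a placeholder, and cio.LogFile works the same way for the file scheme.

```go
package example

import (
	"context"
	"net/url"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
)

// startWithBinaryLogger creates a task whose stdout and stderr are handed to
// a long-running binary logger spawned by the shim, instead of the default
// FIFO pipes back to the client. The logger path is a placeholder.
func startWithBinaryLogger(ctx context.Context, container containerd.Container) (containerd.Task, error) {
	u, err := url.Parse("binary:///usr/local/bin/example-logger")
	if err != nil {
		return nil, err
	}
	task, err := container.NewTask(ctx, cio.LogURI(u))
	if err != nil {
		return nil, err
	}
	if err := task.Start(ctx); err != nil {
		return nil, err
	}
	return task, nil
}
```

With this, the shim supervises the logger process, so the container is not blocked by a slow or absent log reader.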
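And the stream processor and proxy snapshotter registrations described above are configured in containerd's config.toml along roughly these lines; the processor ID, binary path, snapshotter name, and socket address are placeholders, while the media types follow the OCI image spec.

```toml
# Sketch of config.toml entries for external plugins.

# A stream processor converting zstd-compressed layers back into plain tar.
[stream_processors]
  [stream_processors."io.containerd.processor.v1.zstd"]   # hypothetical ID
    accepts = ["application/vnd.oci.image.layer.v1.tar+zstd"]
    returns = "application/vnd.oci.image.layer.v1.tar"
    path = "zstd"
    args = ["-d", "-c"]

# A proxy snapshotter served by a long-running process over gRPC.
[proxy_plugins]
  [proxy_plugins.mysnapshotter]                            # hypothetical name
    type = "snapshot"
    address = "/run/mysnapshotter/mysnapshotter.sock"
```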
So, that's it. That is all about the external plugins. It is easy to integrate with containerd, right? Thanks.

Now let's do a quick recap of containerd's architecture. Going back over the whole containerd diagram, you can see how operations flow from the client on the left all the way to the runtime and backends on the right. The components on the left tend to be more stateless, and the components on the right have more state. Our garbage collector sits in the middle; it tracks resources used by all the backends, but it operates completely invisibly to the client. Our end goal is to track any resource that may be used by containerd, to guarantee the long-term stability of the runtime. Now let's go over containerd 1.4, which was released back in August. It includes support for cgroups v2, improved SELinux support, support for remote snapshotters, and support for CRI on Windows. We are about to start the beta process for containerd 1.5. A significant change in this release is the merging of the CRI code base into the main containerd code base. This is not a significant change for the end user, but it allows us to better integrate CRI components into containerd, tying CRI directly into containerd's core services. The new sandbox API will not only allow for better integration with virtual machine sandboxes, but also allow us to clean up the way pods are managed by containerd. This will also be the first release with support for NRI. Feel free to join the discussion today for containerd 1.6. That release will continue the effort of integrating CRI into containerd's core and improving resource management. We're also looking into ways to improve the API between containerd and Kubernetes. If you are a containerd user, please file issues. We're also looking for feedback from end users on our roadmap and documentation. We know our documentation is lacking for a project as widely used as containerd, so if you have a passion for documentation and the user journey, we would like your help. If you are interested in being a maintainer, we encourage you to get involved in contributing and reviewing code. containerd has a high quality threshold for code contributions, and any extra help reviewing code is appreciated. The maintainers also recently launched a security advisor role. This role is intended to get individuals who have a vested interest in container runtime security involved in our security release process. As the cloud native community has adopted containerd, our users have a greater need for access and communication within the community. We have been doing the best we can as maintainers, but we would really appreciate having additional support in managing the community. If this or any other role interests you, please reach out so we can help you get involved.