Hello, and welcome to KubeCon Europe 2023. This is the Kubernetes SIG Node intro and deep dive, and today your speakers are Dawn Chen from Google, Derek Carr from Red Hat, Sergey Kanzhelev from Google, and Mrunal Patel from Red Hat. We are all chairs and TLs of SIG Node, and we're happy to present to you what's happening in SIG Node and what we're working on. Before we begin, please remember to be nice to each other, and read the code of conduct if you're interested.

Previously, we updated the community on SIG Node achievements and covered everything that happened up to Kubernetes 1.26 at KubeCon North America 2022. The recording and slides are all available online, so if you're interested in what was happening before, please go there. We covered some interesting deep dives during that presentation; you may be interested in cgroup v2 and in-place pod update.

Today we will go through a SIG Node overview again, talk about SIG Node areas of interest, and then deep dive into kubelet resource management. After that, we will talk about what happened in 1.26 and 1.27 and what our plans are for 1.28, and we'll go into one of the highlights of recent developments: sidecar containers. Finally, we will talk about leadership updates and the ways you can get involved in the community. With that, let's go to the SIG Node overview.

SIG Node is responsible for many things happening on Kubernetes nodes. It covers multiple areas, such as the kubelet. It deals with node and pod lifecycle: how they are managed and what stages they go through. It also covers resource management, which we'll go into deeply during today's presentation, and it communicates with operating systems through the container runtimes. Our charter is to be responsible for all the components that control the interaction between pods and host resources.

SIG Node is a vertical SIG: we control a specific area of code. As opposed to some horizontal SIGs like SIG Instrumentation, being a vertical SIG means that we own specific components, and if somebody wants to change any feature touching these components, we need to be involved. We have multiple sub-projects. One of the more horizontal ones is the CI sub-project, where we watch the reliability of SIG Node components and keep an eye on CI status. We also have sub-projects for specific components: the kubelet, the container runtime interface, the node problem detector (something that will notify you about issues with your nodes), and many more. With all that, I want to pass it to Derek, who will go into kubelet resource management.

Thanks, Sergey. As Sergey noted, in SIG Node one of our primary responsibilities is to figure out how to give pods access to host resources and how to make sure those resources are fairly shared among the pods and containers on the same node. Today the kubelet supports a number of resources, CPU, memory, disk, ephemeral storage, that you often see in your pod specs when you're deploying to Kubernetes as an application deployer. Then there are some resource types that you don't see in the foreground of your pod spec but that the kubelet manages in the background for fair sharing; one of those would be PIDs. And we also support frameworks in the background that allow device plugins, for example, to advertise dynamic resources for your application to consume. Basically, this is a rich, growing, and diverse set of resources that we need to manage in the SIG.
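To make that concrete, here is a minimal, illustrative pod spec showing how the user-facing resources Derek mentions appear in the API. The pod name, image, and values are invented for the example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo          # illustrative name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      requests:                # what the scheduler and kubelet reserve
        cpu: "500m"            # half a CPU
        memory: "256Mi"
        ephemeral-storage: "1Gi"
      limits:                  # the ceiling the kubelet enforces
        cpu: "1"
        memory: "512Mi"
        ephemeral-storage: "2Gi"
```

Note that background-managed resources like PIDs do not appear here at all, which is exactly the foreground/background split described above.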
Next slide. If you're interested in understanding how we in the SIG approach this problem of advertising a resource: oftentimes community members come to the SIG and say there's a new entity they want to make known within Kubernetes, whether that's a new device, a new class of resource, or a new specialization of one of those resources. One of the first questions we have to ask is how we want to make that resource known. How does the user express an intent to desire it? And how do we model it relative to the concrete representation of that resource? A lot of times, as Kubernetes users, we don't think about what it means to claim one or two or three CPUs, but we in the SIG are responsible for figuring out, for example, how that maps to fractional shares in a cgroup scheduling system. Many resources are static and never change over the life of the node; other resources are dynamic, and how we reflect those back up to the control plane and the scheduler are important questions to reason through. And particularly for some of the items we're exploring in the future, we have to be very careful to avoid node bootstrapping loops: we have to make sure the kubelet can work absent the presence of this resource being known. It's a perpetual challenge to reason through these things. Next slide.

At the end of the day, users want to make a claim on a resource. Some resources are fixed: for things like memory or disk space, you can very easily count how much of that resource you want, and you can then hold that container or pod within that allocated budget. Other resources we could describe as non-countable; these might be class-based resources, where you're just trying to express a certain service quality around that resource for a pod. So if you have a new resource that you're interested in articulating, one of the first questions is: is it actually countable or not?

Similarly, not all resources can be overcommitted. In Kubernetes, you can make a request for a minimal amount of a resource, and the kubelet guarantees you will get at least that much: you will never get less, and you always have that reservation of the resource. We map that to the request field. For resources that can be overcommitted, we allow you to give a separate field, which we call a limit. For example, you can say a container can get between one and two CPUs at any given time, or between one gig and two gigs of memory at any given time; there's an example of this shortly. But not all resources can be overcommitted. A good example would be huge pages: right now, if you have a pod that requests huge pages, you cannot overcommit that resource. You request a certain amount of huge pages, and you can't hand out more of them than exist on the node.

As we think about these resource problems, mostly we're trying to make sure the cluster scheduler can make an informed scheduling decision. But we also have to keep in mind that there's a certain amount of overhead just to run the management components on the node, and for the lifecycle of the pod itself, and part of our work is figuring out ways to make that reservation known so that the cluster in the end is more reliable.
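Here is an illustrative sketch of those two cases in one spec: CPU and memory are overcommittable (the request sits below the limit), while huge pages are not, so Kubernetes requires the huge-page request and limit to be equal. Names and values are invented:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: overcommit-demo        # illustrative name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: "1"               # guaranteed minimum; may burst toward the limit
        memory: "1Gi"
        hugepages-2Mi: "128Mi" # not overcommittable
      limits:
        cpu: "2"
        memory: "2Gi"
        hugepages-2Mi: "128Mi" # must equal the request for huge pages
```

Because the huge-page claim is strictly counted, the scheduler can never place more requested huge pages on a node than physically exist there.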
If you're interested in scheduling problems, deciding the best node to fit a given budget, that's typically the domain of SIG Scheduling, and a lot of these resource management problems are worked across the two SIGs. But if you're interested in figuring out how to make sure the request is actually fulfilled once scheduled, that's where SIG Node really shines. Next slide.

Other things that are typically beyond the kubelet but come up when thinking about how to support new resources are limit ranges, where you can set a policy rule that says a pod in a given namespace must request between one and two CPUs, or must request no less than five megs of RAM and no more than 100 gigs of RAM. You have a way of setting fairly subtle policy windows that define valid resource requests; that's typically the domain of the limit ranger. If you have a new resource, it is sometimes worth asking whether cluster admins would benefit from being able to constrain the windows of consumption for that resource. Typically even more important than that are resource quotas, which basically cap the amount of a resource that a given namespace can claim in your cluster as a whole. A lot of users partition their Kubernetes clusters into a set of namespaces, and they want to control one namespace's ability to consume resources relative to another. If there's a reasonable expectation that the resource you want to introduce into Kubernetes might be viewed as precious or scarce, it's often important to ask whether it should be incorporated into quota. I'll show a small example of both in a moment. Next slide.

Once the kubelet knows how to advertise your new resource, and once the scheduler schedules against that resource, the kubelet sees that the pod has been assigned to its node and runs its own local admission check. A lot of people are familiar with admission controllers or webhooks in Kubernetes today that extend the control plane and intercept control plane API requests. Complementary to that, on the node we make admission decisions that intercept pod scheduling acknowledgments on the kubelet and ensure the kubelet can actually meet the needs expressed by that pod; we do checks to ensure the resources are actually available on that node to support the pod. In some cases, for more exotic or specialized node-local topology decisions, it might make higher-order checks: for example, a pod wants to use one CPU and also wants a GPU, and it has expressed the desire for those two things to be co-located on a common NUMA node. For some of those node-local topology decisions, a scheduler at the cluster level doesn't have the total system view; today that system view is known only at the kubelet, and the kubelet has to ask whether the feasibility constraints are actually satisfied on that node. This is an area where we continue to try to improve, and for folks interested in contributing to the SIG, it's an area where we try to avoid race conditions and false positives or false negatives when accepting a workload. Because at the end of the day, the goal is that once the pod is scheduled, we have good confidence that the pod will run. Next slide.
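As promised, here is what those two policy objects could look like, following the windows described above. The names, namespace, and values are invented for the example:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-bounds       # illustrative
  namespace: team-a
spec:
  limits:
  - type: Container
    min:
      cpu: "1"                 # no container may request less than this
      memory: "5Mi"
    max:
      cpu: "2"                 # or more than this
      memory: "100Gi"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota           # illustrative
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"         # total the whole namespace may claim
    requests.memory: "20Gi"
    limits.cpu: "20"
```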
Once the kubelet acknowledges the desired intent of that pod running on that node, it often has to start the process of allocating or reserving resources for the pod. Many resources the kubelet allocates are stateless and often just controlled by cgroups: for example, if a pod desires a particular amount of CPU or memory, you can think of this allocation step as the kubelet just creating the cgroup structure that constrains the pod. But there are other resource types that might require the kubelet to do something more. For example, if your pod has an emptyDir volume backed by memory, to store state independent of any one container, the kubelet might need to allocate a location on tmpfs to fulfill that request (the example below shows this). And you can imagine that for other types of resources, particularly new, emergent class-based resources, you might want to dynamically attach a given device to the node to fulfill the request. There are a lot of things to explore in this area. We also typically need to make sure that when the pod completes, we can clean up any allocations, and we want to know when an allocation has been confidently deallocated as part of the pod lifecycle, because ultimately a new pod is going to run on that node and expect those resources to be available to it. Next slide.

As we discussed earlier, some resources can support over-provisioning; not all can. And as we said earlier, if you're interested in exploring new resources in the community, this is one of the first questions we would want to ask. A good way of highlighting this: today a given node may support many pods, and in effect, while each pod makes a request for a certain amount of CPU, the set of concrete CPUs running those pods is shared across the node; ultimately pods get a kind of time share of CPU time. For other resources, you can imagine things like GPUs: when you want a GPU in your pod, typically that GPU is for your pod and no other. That would be an example of a resource that is not over-provisioned.

For a certain set of resources we have a concept of a quality-of-service class, which says that depending on how you request and limit particular resources, we might put you into a guaranteed quality-of-service bucket versus, say, a burstable bucket. In cases where you're in the guaranteed quality-of-service class, we might support higher-order node-local optimizations to give you greater performance for your workload. For example, if you make a request for an exclusive CPU and you want a GPU at the same time, you probably want those two things close together on the physical host. Lots of interesting things to think about here. Next slide.
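A hedged sketch tying those two ideas together: a pod whose requests equal its limits lands in the Guaranteed QoS class (and, with whole CPUs, becomes eligible for exclusive cores when the kubelet runs the CPU manager's static policy), and a memory-backed emptyDir is the tmpfs allocation mentioned above. Names and values are invented:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-demo        # illustrative name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:                 # requests == limits -> Guaranteed QoS class
      requests:
        cpu: "2"               # whole CPUs: eligible for exclusive cores
        memory: "1Gi"          #   under the static CPU manager policy
      limits:
        cpu: "2"
        memory: "1Gi"
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir:
      medium: Memory           # the kubelet backs this with tmpfs;
                               # usage counts against the pod's memory
```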
At the end of the day, you're running your workloads on nodes and you want to know: are you actually sizing your workloads properly? How much resource are your workloads actually using in production? Our goal in SIG Node is to make sure you can make the most optimal usage of your resources possible, so that you get the best density and the best utilization in your environment. So typically, when a new resource is explored, we want to follow up with questions about how we know it's being used. If it's counted, for example, is it being observed in things like cAdvisor? Is it being fed back into metrics loops for monitoring stacks to scrape and measure? And beyond that, once these things can be measured, the question becomes: can we support adjacent systems' ability to make better dynamic resource resizing requests? A lot of neat things can come out of this. Next slide.

Sometimes the node, particularly for overcommitted resources, may find that the actual amount of resource being used is greater than what is available on the node, and unfortunately, when that happens, you have to make a decision about what to do. There are capabilities in the kubelet today for this, and when we think about new resources we have to ask: if this resource is scarce, what do we do when scarcity occurs, and how quickly can we handle those problems when they occur? For example, if the node is running low on available memory because pods are using more than they actually requested, the kubelet might decide to evict a lower-priority pod, relative to another, that's consuming more than it requested, trusting that the workload can be rescheduled to another location; and we want to make sure the workloads that remain on the node stay healthy. Next slide.
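The memory-pressure behavior described here is driven by eviction thresholds in the kubelet configuration. A minimal sketch of what an operator might set, with invented values (the signal names are the kubelet's standard eviction signals):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:                  # evict immediately once crossed
  memory.available: "200Mi"
  nodefs.available: "10%"
evictionSoft:                  # evict only after the grace period below
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
```

When a threshold is crossed, the kubelet ranks pods for eviction by whether their usage exceeds their requests, by priority, and by relative consumption, which is why a pod using more than it requested is chosen first.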
In summary, hopefully from this brief overview you can sense that resource management touches a lot of topics: API modeling, the lifecycle of a node, the lifecycle of a pod, the health of a host. It's actually quite a complicated topic that takes a large number of people's collective brain power to reason through. A lot of these resources are very specialized and have subtle behaviors that are hard for any one of us to keep in mind, so if you're interested in this space, I think we're all happy for more collective brain power to be brought to the problem. There are a few areas we're exploring today that are highlighted here; I'll just call out a few. One, of course, is that we want workloads to be more performant and more optimized on the nodes they run on, so better understanding node-local topology, in the NUMA space, has been an interest of ours of late, as well as being able to expand to a greater set of resources or specialized resources, whether that's network bandwidth or other class-based entities. Exciting work ahead. With that, we'll turn it over to Mrunal.

There's been a focus since 1.20 on not keeping betas around forever. Next slide. We've done a bunch of work to either deprecate or graduate lingering betas, and these are some of the features that were either graduated or deprecated. Next slide. These are the features we graduated in 1.26 and 1.27: device plugins, the CPU manager, the downward API for huge pages, and kubelet credential providers, and the topology manager graduated in 1.27. Next slide. That said, we still have a bunch of work to do, because we still have some features stuck in beta, and as a SIG we're trying to either graduate or deprecate them. As we go into 1.28 planning, we'll try to address as many as we can.

So let's take a look at what we did during 1.26 and 1.27. These are the three new features we want to highlight in 1.26. The first is dynamic resource allocation (DRA). It's a whole new API to request, share, initialize, and clean up resources; it's like a generalized version of how storage is accessed today. I would encourage you to read a blog post or watch another talk about this feature. It's an exciting new feature that opens up possibilities like splitting a GPU into multiple slices and using them from pods. The next one is Evented PLEG, in alpha. Typically the kubelet relists the pods and containers from the container runtime every few seconds, and this puts a lot of load on both the kubelet and the runtime; Evented PLEG is an effort to make these updates event-driven rather than relying on frequent relisting. The alpha was added in 1.26. And we added improved multi-NUMA alignment options to the topology manager in 1.26.

In 1.27 we landed a lot of stuff. A couple of GAs to call out: first, gRPC container probe support; second, SeccompDefault, which enables better security by default. Then Evented PLEG moved to beta. Another one to call out is in-place pod vertical scaling. This was a big effort; it took a lot of time from the community as well as the reviewers and approvers to get it right. It basically allows you to resize pod resources, by adjusting the cgroups, without having to restart the pod, and it will be useful in various autoscaling scenarios.
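A sketch of what that can look like, assuming a 1.27+ cluster with the InPlacePodVerticalScaling feature gate enabled; the resizePolicy shape follows the alpha API, and all names and values are invented:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resize-demo            # illustrative name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resizePolicy:              # per-resource behavior for in-place resizes
    - resourceName: cpu
      restartPolicy: NotRequired      # adjust cgroups without a restart
    - resourceName: memory
      restartPolicy: RestartContainer # restart this container on memory resize
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "1"
        memory: "512Mi"
```

You can then patch the running pod's container resources, and the kubelet reconciles the cgroups to the new values instead of recreating the pod.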
Then we added some other utility features, like kubelet tracing, which allows you to add OpenTelemetry tracing and look at traces of what's happening inside the kubelet as requests flow in. And we made some improvements to image pulls, so we have better control over how many images are pulled at the same time.

What's coming up next? We have sidecar containers, which Sergey is going to cover in more detail soon. We also did some initial work on swap; we want to explore the next steps there and see if we can provide more controls over what percentage of swap can be used by workloads. Another intersection here is with cgroup v2: once we enable swap support, it allows us to enable user-space OOM killers like oomd, so we have better predictability over which containers and pods are OOM-killed compared to what the kernel does today. Then there will be further improvements in DRA, hopefully we'll graduate Evented PLEG based on feedback, and we have a couple of announcements in the area of image pulls. There are a lot of other topics being explored too, so we encourage you to come join the SIG Node planning meetings and help us move things forward. Next, Sergey is going to cover sidecar containers.

Thank you, Mrunal. Lots of exciting things are happening in Kubernetes and SIG Node. I want to talk in detail about sidecar containers. Sidecar containers as a concept were introduced a long time ago; I don't even have a date, but the pattern has been in use for a long time. People were using it in continuously running pods, typically for web services and such. Sidecars were introduced to support generic functionality like log forwarding or metrics collection, and they worked reasonably well for long-running, continuously running pods that never finish. With Kubernetes growing and supporting more workloads, we started putting more emphasis on jobs, and besides the other improvements we're making for jobs and batch-like workloads, we see that the sidecar pattern does not work for jobs. When you start a job, or any batch workload that runs to completion, sidecars prevent the pod from being terminated and removed, so people came up with various workarounds to support sidecars for jobs. And jobs do need sidecars: typical examples are log forwarding and metrics collection. You want to know what's happening with your job, to understand its status and state, and having a generic sidecar helping with that is very useful.

Another example where sidecars are useful is service mesh, where a proxy is installed in the pod to monitor connections into and out of the pod, look at what's happening on those connections, and modify them for security and so on. These are all important scenarios. Before built-in support for sidecar containers, service meshes worked reasonably well with continuously running workloads: they had the same pod-termination problem as jobs, but for regular containers they worked fine. The remaining problem was that init containers weren't covered by the service mesh. If you have init containers that need all the security benefits a service mesh provides, you couldn't express that with the typical tooling Kubernetes provides built-in, and we wanted to address this problem as well. There are also many scenarios where sidecars are used for continuously refreshing secrets or reading configuration files; that's also a generic task that can be implemented in a single container, and it would be nice to implement it as a sidecar that doesn't prevent the pod from terminating.

In 1.27 we worked on a proposal for sidecar containers. The proposal is that we won't introduce any "sidecar" term into the Kubernetes API; what we will do is allow init containers to run continuously. An init container can be marked with restart policy Always, which indicates to us that this container is a sidecar: it starts in the same order as the other init containers, but the containers after it will not wait for its termination or completion before starting. Subsequent init containers only wait for the sidecar to reach the started state, that is, for its startup probe to complete successfully; after that, the remaining init containers and then the regular containers run, while the sidecar keeps running. And if it fails, even in a pod marked with restart policy Never, it keeps being restarted, because you want monitoring or log forwarding for your job to continue even if your log forwarder crashed or was killed for some reason. A sketch of what this looks like follows at the end of this section.

I also wanted to talk a little bit about the development process for this feature. It was proposed a long time ago, and we've been working on the implementation. The difficulty was the API surface: any API surface was hard to agree on, and we needed to make sure it was future-proof, so we went back and forth between very simple and very complicated designs. We ended up creating a working group to lock down decisions and make it more efficient to work through these conversations, and the working group was very successful: over time we locked down decisions relatively quickly and then proceeded with the implementation and the description of implementation steps. For the implementation steps, we took some lessons from other major KEPs and split the work into some refactoring that we do before the API change, then one big PR with the API change, including the specific features we want to support as an MVP, and then, after the API change, follow-ups with additional features. This is mostly to make sure that the big PR that lands is not overwhelmingly big, doesn't take too much time to maintain, and doesn't take approvers too long to review.
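Here is a sketch of the proposed shape, with invented names, images, and commands; this is how the feature eventually shipped as alpha in Kubernetes 1.28 behind the SidecarContainers feature gate:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: job-with-sidecar       # illustrative name
spec:
  restartPolicy: Never         # a run-to-completion (job-style) pod
  initContainers:
  - name: log-forwarder        # the "sidecar": an init container that...
    image: busybox:1.36
    command: ["sh", "-c", "tail -f /dev/null"]
    restartPolicy: Always      # ...keeps running alongside the main containers,
                               # is restarted if it crashes, and does not block
                               # the pod from completing
  containers:
  - name: job
    image: busybox:1.36
    command: ["sh", "-c", "echo processing; sleep 5"]
```

Once the job container exits, the pod can complete even though the sidecar is still running, which is exactly the jobs scenario described above.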
Oh, and the status of it: we have a work-in-progress PR from 1.27, but we didn't merge it, for multiple reasons; one is that we didn't want to put two major features into the one 1.27 release. So we will likely merge it early in 1.28; this is the plan. We still have some open questions to follow up on after alpha, primarily around termination ordering: how we gracefully terminate sidecars and how we allow ordering of termination.

With that, I want to say thank you to the people who participated in the working group; I just took the list from all the notes. Thank you to everybody who joined and gave an opinion or presented something. And I want to give a special shout-out to the people who have been implementing this work-in-progress PR. Thank you, we really appreciate everything that's happening, and sidecar support promises to be a very successful and much-needed feature. With that, I want to pass the deck to Derek to go through the leadership update.

Thanks, Sergey. This is our last topic. Many of you who track our mailing list might have seen an update to the SIG leadership structure. It's been a long-term goal of ours to support the growth of new members in the SIG, and one way we've done that is to split our chair and technical lead roles within the SIG and its corresponding governance. I want to recognize my co-presenters here and thank both Sergey and Mrunal for agreeing to step up and take the chair role in the SIG, relieving both Dawn and myself of that role and helping make sure the SIG runs smoothly, executes well, and meets the needs of the community. So a big thank you to Sergey and Mrunal for taking that on. We've also expanded our roster of technical leads in the SIG to include Mrunal, to make sure we can help navigate the tough decision-making or conflicts that might arise in the SIG around what to do next and how we look to achieve it. A big thank you, Mrunal, for helping the team there; we look forward to the success that brings this year. Next, I think you're up, Sergey.

Thanks. I wanted to say thank you for all the trust the community has put in us, and with that I want to talk about how you can contribute to SIG Node. First of all, whenever you contribute, what we ask is that you pay attention to stability. Stability always comes first: any bug fix submitted without tests raises questions, because we need to make sure we're not introducing new problems and that old features and old behaviors are covered very well by tests. Then we want to make sure we do optimizations: Kubernetes keeps growing, and the workloads we can run vary significantly from where we started, which is why the Evented PLEG feature is so desired, why we worked hard on it, and why we want more optimizations and improvements in this space. Then come features. And finally, we also care about user and developer experience: if you can contribute documentation, or just give us feedback on PRs and issues, please come and contribute.

You can contribute by attending our SIG meetings; we have two, one for CI and one main meeting. You can also participate in bug triage. You can join the regular meetings on Tuesdays at 10 Pacific and the CI meeting on Wednesdays at 10 Pacific. We are available in the #sig-node channel on Kubernetes Slack, and we have a mailing list as well, where you know how to reach the chairs and TLs. Thanks for joining this talk; we hope it was useful for you, and we look forward to hearing from you.