Hello. You are here for the Kubernetes SIG Node maintainer track. So if you're lost, I'll close my eyes and you can run away. Yeah, but really, welcome. Really happy to see all of you. I'm Sergey Kanzhelev. I'm a chair of SIG Node and work for Google. Hi, folks. I'm Mrunal Patel. I work for Red Hat, also a SIG Node chair and tech lead.

So today marks our first in-person maintainer track in a very long time. If you go to YouTube and search for recordings of the maintainer track, the last in-person one is from maybe seven years back. Since COVID started, we published our recordings at every KubeCon, KubeCon Europe and KubeCon North America, so twice a year. The last one covered everything about the roadmap up to 1.27; oh, sorry, there's a typo there. It was KubeCon Europe 2023, back in Amsterdam. And it was also online. So this is the first in-person one. And if you're from China and can do Chinese, there is also KubeCon China. This year there was a maintainer track for SIG Node there, and I put some notes there if you're interested. It's quite interesting that we have coverage in China now, and it's great.

So today we will be talking, as usual, with the same general structure for our talks. We talk about a SIG Node overview. Then we cover some features we recently worked on and plan to work on. Then we try to go a little bit deep into one area. And finally, we ask you to join and tell you how to do it. So yeah, let's get going.

First, what is SIG Node? When you look at SIG Node and think about containers, you may imagine this nice, beautiful ship that is full of containers that are perfectly stacked together, and the ship is sailing out to sea. The horizon is clear. Nothing is happening. What typically happens in this naval analogy in our life is that some containers just catch on fire all the time. Have you heard of CrashLoopBackOff? This is a container catching on fire again and again, over and over. And we're like, OK, fire extinguisher. OK, fire extinguisher. Yeah, we'll just put a person here standing next to you. There is nothing else we can do. We can't just drop you off. But yeah, we'll just carry you. The other case is even worse. Somebody, instead of a container with a steel metal frame and borders, brings a platform with a bag of rice on it. Like, oh, it's just a bag of rice. It has no limits, no borders, nothing. And yeah, just carry it. Put it in a corner. Put nine of them in the corner, in fact. And scale it, happy, happy. Yeah, it's just a bag of rice. It's small. And then water drips into it, and it grows nine times in size and eleven times in weight. And now you need to throw things away because this rice is so huge. Like, OK, let's throw this container away, that container. No, no, now you can throw the rice away. And yeah, life is messy.

And if you go from the naval analogy back to real life, the Kubelet is in a constant fight between what the API server imagines the configuration needs to be and reality. It has a declarative configuration for every pod, what it needs to be doing. And the Kubelet is trying to apply this configuration as best as possible to real life, constantly reconciling with the container runtime, asking how is this container doing? Let me poke it. If it's still working, I will report back that it's OK. If it's not OK, I will tell that it's not OK. So it is constantly reconciling state, working with resource managers, with the runtime, with CNI, with storage, with everything. And it is a constant battle. There are all sorts of conditions and race conditions that need to be taken care of.
So this long description of SIG Node's work can be boiled down to our charter. Our charter says SIG Node is responsible for the components that support controlled interactions between pods and host resources. I think it's a definition that covers most of what we do. And SIG Node is a vertical SIG. In Kubernetes, we have vertical and horizontal SIGs. A horizontal SIG covers something like logging as a piece of functionality, owned by SIG Instrumentation across all the components. SIG Node is a vertical SIG: it owns components. Owning components is harder than applying some feature to all the components, from my perspective, because we need to be always there. And SIG Node owns a lot. When I got introduced to the Kubernetes code base, it was described to me as: half of it is API machinery, then most of the rest is SIG Node, and then everything else. Some SIGs are tricky: they put their components outside the core, so it looks like just a small component here, while in reality it's huge and so much logic is hidden somewhere else. But if you look at the Kubernetes core, API machinery and SIG Node are the biggest pieces of the code base. And as I said, we own a lot. We not only own the Kubelet, we also have other components, like the container runtime interface. We have many people working on the runtimes themselves, and that is a very close connection. We own smaller components, like the node problem detector and its kernel monitors. They are all very important and need attention, because they also have very specific customer problems, all running in production. And we run in many environments.

So where are we right now? The last recording, if you had a chance to watch it, covered how we fought perma-betas for a long time and how we were almost winning. And yeah, we almost won. I mean, the numbers lie a little bit, and I will explain why. I took all the feature gates, and there are 40 of them right now related to SIG Node, and five of them are in GA. A feature gate that is in GA means the feature was just promoted to GA, no more than two releases ago, meaning that we're still doing a good job of promoting betas and getting them shipped. So we've been concentrating a lot on reliability, stability, and promoting perma-betas, so people wouldn't be confused using functionality that has been in beta since 1.11; they just want the things they use to be GA. And if you look at the other feature gates that we have, the majority of them, 19, are alphas. So we do a lot of experimentation these days. We got through the stability phase, we have a lot of people trying to bring new features and new ideas, and we are experimenting. You can see 19 alpha features. Some of the beta features are also still experiments. For instance, node swap: we promoted it to beta, but it's still not at the stage we want it to be, so it's kind of an alpha-quality beta. And we still have CPU manager policies that we're experimenting on, which are also beta. So yeah, there are lots of experiments and also many drafts.

And you may ask what we experiment on, what we care about the most right now. You can split all the experimentation into two buckets, plus a third bucket I will explain later. The first bucket is new workload types. When I say new, nothing is really new. I mean, you can either run a job or a web server, basically. I mean, there are other things.
But what we ultimately want to do is to make sure that for every new type of application it's not only possible to run that application on Kubernetes, but it's also comfortable to run it on Kubernetes. So we need many features to improve the experience for those workloads. And then we also want to better understand hardware. As we run in more and more environments, more and more HPC applications want to run and fully utilize the hardware, so we need to understand that hardware better. We have all sorts of resource managers, but we also want to understand other resources. We want to extend into a pluggable model, because the way we taught Kubernetes to understand resources is too general, and we want a more precise understanding. And I think Mike will have a talk tomorrow about resource management and what we do in this area. Yeah, this is Mike, ask him. And then a big area is just quality-of-life improvements and some things that we wanted to do for a long time and are just finishing up. It's a lot of things that will help cluster administrators and end users do Kubernetes better. So this is what we're doing. And Mrunal will cover what exactly we did in 1.29.

Yeah, so I'm going to talk about some notable features that we have worked on in 1.29. This is a big one: sidecars have been a hugely demanded feature for a while. It took a long time, but I feel that we now have an elegant way to specify sidecars in pods, by just setting restartPolicy: Always on an init container (there's a small sketch of this below). That allows things like Istio, service meshes, and log forwarders to wrap around the lifecycle of your regular containers. In 1.29, sidecars went to beta and we improved the termination ordering of the sidecar containers. Sergey and Todd are doing a talk tomorrow; it'll be really interesting if you want to learn more about it. And I would also like to call out Gunju and Matthias, who worked a lot on this feature.

Next, user namespaces. This was primarily worked on by Giuseppe, Rodrigo, and Sascha. So what are user namespaces? User namespaces are yet another namespace in the kernel that allows you to change which user is in your container. You could technically be root inside your container while being non-root on the host. So if there's a container breakout and your process is able to escape the container, it's not really root on the host. It limits the harm that it can do; it can't read other pods or do anything on your host system. This is yet another layer in the onion of container security. This feature has taken a long time. We finally have features in the kernel like ID-mapped mounts, which make it easier to create pods with user namespaces. The latest update here is that we added changes to pod security admission so that users can specify root inside the container and are allowed to do that even under the baseline policy. The way you enable user namespaces is by setting hostUsers: false in your pod, as sketched below. So I implore you to please try it out and give us feedback so we can take it to beta.
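To make that concrete, here is a minimal sketch of the sidecar pattern. The pod name, container names, and images are made up for illustration; the only essential part is restartPolicy: Always on the init container:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar                              # hypothetical example name
spec:
  initContainers:
  - name: log-forwarder                               # the sidecar: starts before the main
    image: registry.example.com/log-forwarder:latest  # containers and terminates after them
    restartPolicy: Always                             # this marks the init container as a sidecar
  containers:
  - name: main-app
    image: registry.example.com/main-app:latest
```

With restartPolicy: Always set, the Kubelet keeps that init container running for the whole lifetime of the pod instead of waiting for it to exit before starting the regular containers.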
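And here is a similarly minimal sketch for user namespaces, again with made-up names; the only part that matters is hostUsers: false:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo                      # hypothetical example name
spec:
  hostUsers: false                       # run the pod in a new user namespace:
                                         # UID 0 inside maps to an unprivileged host UID
  containers:
  - name: app
    image: registry.example.com/app:latest
```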
The next feature is a quality-of-life improvement: Kubelet configuration drop-ins, worked on by Sohan, Peter, and Jerry. If you're used to traditional UNIX or Linux processes and daemons, they have a way to specify configuration through drop-in files. Systemd, for example: you can go to a specified directory and just override the settings you want to. We are doing the same thing for the Kubelet now, which makes it very easy for you to manage your Kubelet configuration. These drop-ins are applied in alphabetical order, so it's easy to know and group the settings as you override them. (There's a small sketch of a drop-in file below, after the image features.)

The next three are all related to images. The first one is splitting disks, really file systems. What happens in practice is that a lot of people want a separate disk to store their container images, because they want to separate the IO used for pulling images from their workloads' IO. And you could have huge container images, so you don't want them constantly being pulled, deleted, and lifecycled on your main disk. But a problem we had before this work is that when you move your container images to a separate disk, the writable layer of your containers also gets moved to the second disk. Now, when your container starts writing files to the writable layer, the Kubelet is no longer able to monitor that and evict pods based on what's being used there. So we are trying to fix that with this feature. And potentially in the future, we can even separate out logs and other bits that are covered by ephemeral storage. This is driven primarily by Kevin Hannon.

The next one is parallel image pulls. This is a nice quality-of-life improvement. We didn't have a good way to actually control how many image pulls are happening on a node. And this could lead to situations, say on a reboot or something, where you have 100 new pods scheduled and you're suddenly pulling hundreds of images. You don't want your node to come to a standstill because you no longer have any IO remaining on the node. So this lets you restrict the number of images being pulled in parallel. We just made this feature beta in 1.29. This is primarily worked on by Ruiwen.

The final one is image GC. The way image garbage collection has worked so far is that you have a threshold, typically the default is 85%, and when the disk reaches that much usage, the Kubelet will go and try to remove images that are not used to make up space. So there is definitely scope for having other policies to remove images. Peter, who is here, added this feature where the Kubelet can start removing images that haven't been used for a specified age. If an image hasn't been used in a day, the Kubelet will go and proactively clean up the disk space rather than waiting for disk pressure to kick in. So there's a lot of work happening in the area of images. Peter led an image working group, and Sergey is going to cover some of what that group discovered in terms of features that are needed.
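To tie a few of these Kubelet settings together, here is the drop-in sketch promised above: a single hypothetical drop-in file that enables parallel image pulls and the new age-based image GC. The directory, file name, and exact field spellings are assumptions to verify against the KubeletConfiguration reference for your release:

```yaml
# Hypothetical drop-in, e.g. /etc/kubernetes/kubelet.conf.d/50-images.conf,
# picked up via the Kubelet's --config-dir and merged in alphabetical order.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: false        # allow pulling images in parallel...
maxParallelImagePulls: 5          # ...but limit how many pulls run at once
imageGCHighThresholdPercent: 85   # classic disk-usage-based GC threshold
imageMaximumGCAge: "24h"          # 1.29 alpha: also remove images unused for a day
```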
Thank you, Mrunal. You have an amazing memory, remembering all these contributors and every KEP. I think we all share the same passion for the people contributing and going through all the hurdles to deliver features. It's not an easy task, and we appreciate everybody.

So, image pulls. How do we do them in Kubernetes? The Kubelet itself knows which images it needs to run pods, but it doesn't do the downloads itself. Historically, we split the download logic into a separate interface, the image service interface. This interface has a very limited number of methods, and you can describe it very easily. The Kubelet talks to the runtime: hey, get me this image. No, I don't have it. OK, pull it for me. The runtime pulls it for you and gives it back, and now the Kubelet can use this image. Pretty straightforward. Sometimes it's a little bit strange, because it goes: hey, give me this image. No, I cannot pull it. Please, here are the credentials. And it will give you an image and say, here you go, but you asked for foo and its name is bar. So yeah, that happens. I'm oversimplifying here, but that's roughly what's happening. And then what we also have is the Kubelet running out of disk, looking at the runtime and saying, what do we have? What can you help me clean up? And the runtime responds with a list of images. And then we go and remove one image after another while the disk is on fire, already past the disk pressure threshold. Like, let's do it already. So yeah, those are all the interesting interactions we have, and most of these problems are due to the limited interface that we have between the Kubelet and the runtime.

So we looked at those interactions and the problems that we have in this space. The biggest problem for us is that the image service API is totally independent from the CRI runtime API. And it was designed that way intentionally. We wanted the image service to be as straightforward as possible. We thought it would only store images for us: we would teach it to pull images, put them on disk, and give us information about them. So it wasn't originally designed to know about things like image usage. We wanted the Kubelet to hold all the knowledge about what is used, why it's used, and when it's needed. And then we also said that the Kubelet would hold the credentials. Because when we're talking to containerd, the containerd maintainers don't really want to deal with credentials: credentials in memory means we need to do a threat review, no, thank you. You already have the credentials, you already downloaded them from somewhere, deal with them. Give them to us temporarily, we will keep them for the duration of the request, and we're done. That's typically how those interactions go, and that's why the credential provider stays in the Kubelet (I'll show a small sketch of that configuration at the end of this part). And that creates a lot of problems, because the Kubelet knows about credential providers and it knows about credentials for all the registries, while the runtime knows which registry it wants to download the image from, because it knows about mirrors. So if you have a credential for the mirror, and the runtime wants to use that mirror, there is no way for the Kubelet and the runtime to agree on how to get those credentials and pass them there. It's mostly just historical reasons: that's how Docker was designed, and the API was basically covering what Docker could do. But that's the reality we need to deal with and try to address slowly.

And then there are all the other problems, like wanting to download images based on certain attributes of pods. Maybe you want to distinguish whether to use a mirror or not based on pod properties, pod attributes; I'm just coming up with a scenario. And it wouldn't be possible, because the Kubelet knows about one part, the image service knows about another part, and they're not connected right now. So we've been thinking in this working group: do we keep all the pod attributes in the Kubelet, or do we want to pass some of them into the image service? And some of them are needed in the image service right now. For instance, we started passing the runtime handler. On Windows, images are differentiated by runtime handler: depending on which runtime handler you use, you pull images differently. So you need to pass this information to the image service, and now it knows more, probably including things it doesn't really care about.
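Since the credential provider keeps coming up, here is the sketch mentioned earlier of how that knowledge lives on the Kubelet side. This is a rough illustration of an exec-based credential provider configuration; the plugin name and registry patterns are invented. Note that matchImages is matched against the image name the Kubelet knows about, not whatever mirror the runtime may actually pull from, which is exactly the mismatch described above:

```yaml
# Hypothetical file passed to the Kubelet via --image-credential-provider-config
apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
- name: example-registry-plugin        # made-up plugin binary name
  matchImages:
  - "*.registry.example.com"           # matched against the image as named in the pod spec
  defaultCacheDuration: "12h"          # how long the Kubelet caches the returned credentials
  apiVersion: credentialprovider.kubelet.k8s.io/v1
```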
So yeah, many interesting aspects here. And finally, disk usage patterns. As we noticed, there are more and more people implementing better, I can't promise they're better, but faster ways to pull images. And these ways don't necessarily use the disk the way we are used to. Typically, as Mrunal mentioned in one of the KEPs, we have read-only image layers with a writable layer on top, but there may be other patterns. And we don't know about these patterns; we want to know more about what the runtime did with the disk, and the Kubelet needs to know about it because the Kubelet is responsible for evictions and garbage collection. So this is all what we discussed in the working group that Peter led. And there were even more interesting conversations: I posted a link to the notes, and we have some recordings somewhere if you're interested.

And then I also wanted to talk about what is top of mind for disk space, because disk space got interesting. Remember, we do a lot of experimentation, and I mentioned that new workloads are the number one priority. Disk space, container image space, became interesting because we see more and more AI/ML images: they become heavy, they become more distributed, and sometimes they change more than a typical image would change. So there are different patterns with the new workloads, and the old way we deal with container images doesn't always work. So there is a lot of work that people have proposed and are discussing right now. One of them is download progress: right now we have no way to know at which stage of the download we, or the container runtime, are. So we want communication between the Kubelet and the runtime on where the image download currently stands. Maybe some progress reporting; it would be nice. Then we think about better security. Today, if you pull an image with a secret and then somebody else wants the same image, they just use the same name and it will be given to them even if they don't have the secret. So it's not quite secure. Yeah, Mike is smiling here. But yeah, we need to fix this situation. And we have another request: some people don't want images to be available once the pod is gone. So yeah, something to be done here; we don't have any native mechanism right now to implement this scenario, but it's definitely near the top of the requests that come to SIG Node. And faster image download is king. There are many people trying to implement different strategies, different patterns: plugging in disks with prefetched images, attaching and mounting them, doing lazy downloads of different sorts, doing priority-based downloads where they defer everything else and get only what's needed. So all those features are coming into the queue and we are looking at them. Unfortunately, the community is not that huge; otherwise we'd take all of them. So if you want to work on them, please come discuss; we are definitely interested in hearing what you think about it.

And here is how you can get involved in this feature set or any other area you're interested in. First of all, I wanted to repeat our contributor priorities. Those priorities didn't change. Every year we talk about them, and we open a new slide like, let's check whether our priorities changed. No, they have not changed. So reliability and stability come first. We appreciate everybody coming with new features, but if you want to bring a feature, maybe think about tests first. Is the area you want to change tested enough?
Then look at the bugs opened in that area and help us address those bugs before making a major change there. We prioritize and value contributions in this order. And then optimizations are always welcome. If you have a choice and you don't really care which one to do, an optimization versus a feature, please come with the optimization first. It will help us a lot. The Kubelet grew over time, and it's not a small puppy anymore, so any optimization you can make is very valuable. We have a lot of KEPs around optimizations today, but more can be done here. Features are always welcome too. It's not a quick process, as we discussed; we don't have many features emerging every release, but if you are persistent enough and put energy behind it, it's always doable. And if you're not ready to commit to anything major, you can always come help us with documentation, with logging improvements, and with troubleshooting improvements. People all over the world are struggling to understand what a particular Kubernetes error means, so maybe you can help us improve that error. And just staying on top of PRs and issues is always welcome.

So yeah, please attend our meetings; it's a good way to help us. We have two meetings a week. As I said, we are a reasonably big SIG, so with two meetings a week you can attend either one: one is about features, the other about tests and stability. And we have a triage guide where you can find out how to help us triage things, and you can always just look at the boards and see what's happening. Those are the contacts and timings for the meetings. If you're interested, please join us. So now it's a thank you, and time for questions.

This is probably not about most of the things that you talked about; I have a question about logs from pods. Maybe it's an optimization, in terms of the priorities. Currently, if I'm not mistaken, it's the container runtime or the Kubelet that writes the logs from pods to disk, to the node's disk, right? And if I have a cluster and I want to read those logs from the pods, I need to deploy a DaemonSet that mounts the proper volume, reads the logs from those files, parses them, and does something with them, right? Is it possible, or do you think it will be possible, to have a different way of sending the logs from pods to an agent rather than just storing them on disk, maybe sending them via OTLP to an endpoint that is already there?

So I think when we started the CRI, we defined the CRI log format, and since then we haven't really done any work in the area of improving how we send logs. Downstream, on the container runtime side, I see different log drivers that can potentially forward them. But maybe we should have that conversation and see if there's any scope for standardizing common patterns, like using a Vector agent or a Fluentd agent, and how we not only write to the CRI format but also forward to these agents. And then the other question is how the lifecycle of these agents works, right? As you mentioned, there are DaemonSets; should they instead be run and lifecycled by the runtime? How is their memory and CPU allocation managed? And I think we have documented some logging patterns, so maybe we can also improve and add to the documentation of the existing patterns and have a conversation about how we improve the situation.

Right, so you mentioned the logging drivers. I know Docker used to have logging drivers. Would that be included in the CRI?

No, I think this would have to be something done at the runtime layer.
containerd or CRI-O would have extensions that would allow you to additionally forward logs somewhere.

Right, so currently it's not customizable at all, right?

Yeah, all the CRI in the Kubelet expects today is that the logs are written in the CRI format. If they aren't, your kubectl logs won't get you anything.

There's also a way, I think, to get the logs from the API server. Is that right? So when you do kubectl logs, you get redirected and the logs get slurped up.

Yeah, yeah, yeah.

OK. Thank you very much. So if I want to push this forward, do I go to SIG Node or SIG Instrumentation?

I would suggest you join one of the SIG Node meetings and bring this up as a topic of discussion.

All right, thank you. I was just curious about some of the feature flags you were talking about. Is that on a per-node basis, or is it for the cluster as a whole? I mean, I often see clusters that have uneven node sizes and things like that, where I might want to do garbage collection differently per node.

Yeah, I think that would probably be the next phase. First we have to allow a way to separate the disk, and I imagine your nodes would have to be homogeneous, with similar size disks attached. Beyond that, we have to, I guess, discuss more and come up with a design.

OK, thank you.

OK, if there are no more questions, thank you everybody for coming. Please leave us good feedback, and maybe we'll get a better time slot next time. Thanks, folks, thanks for joining.