This may be the last talk of today, maybe even of this week. My name is Dawn Chen, from Google. Together with me we have Matthias from ARMO, Peter from Red Hat, and Dixie, also from Google. We are going to introduce SIG Node, what we have done, and our work ahead.

Next slide. This is a picture of the four of us on the first day, right in front of this conference room, at our booth. You can see we were excited and having a good time, but exhausted, because some of us had jet lag; and you can see us today, still exhausted but satisfied, here on the stage. Before this trip, Matthias here kind of suggested, or maybe forced, us to reflect on our lives 360 degrees, so that we could get to know each other better. We have been working together virtually for so long, because of the pandemic and all those kinds of things, that we only knew the work part of each other. That is what we put together here to share with everyone.

Here is today's agenda. My job is very simple: introduce SIG Node, what we are doing, and the path forward. Then Peter is going to talk, proudly, about our achievements; Dixie is going to talk about a selected project, what we are looking at and where we put our time and energy; and Matthias is going to talk about future directions. A lot of this may be familiar if you went to other talks at this conference, and a lot of the work, like DRA, has actually been discussed for more than a couple of years; we have finally reached a certain stage, so we want to share it with you. OK, let's go.

I think everyone in this room is familiar with this diagram: the API server, etcd, the controller manager, and the scheduler all work together to form the control plane for a given cluster, which manages many, many nodes. And if you are here, it means we haven't done a good job, because you care so much about the kubelet. As infrastructure-focused software, the kubelet should always be invisible: you call us, and it just works. But obviously you are here because you want to know more. Either we are not bringing you the reliability you need, or we haven't done enough to support your new needs. Let's figure that out today.
So let's quickly go through the critical components on the node. Together, they form the backbone of the cluster and make sure we can run containers as applications.

The first one is obviously the kubelet, which acts on the node as the agent of the control plane. It ensures we can run and manage pods, handling pod lifecycle management, and it collaborates with the container runtime to ensure container creation and deletion; we also think of it as the allocator of resources on the node.

For the container runtime, the popular choices are containerd, CRI-O, and historically Docker. The runtime handles container lifecycle management, creation, deletion, all those kinds of things, and it also pulls images and unpacks image layers to ensure we can run everything.

This is the first time I am calling out resource management separately, because there is so much discussion going on. At least today, the kubelet, together with the container runtime, acts as the resource governor for all the running containers and pods, allocating CPU, memory, and other resources to the running applications. Effective resource management ensures performance and avoids unnecessary resource contention on the node. You must have heard a lot about device plugin management and the ongoing Dynamic Resource Allocation (DRA) effort: both try to discover the specifics of specialized hardware on the node and advertise its availability to Kubernetes, so that the control plane can schedule the pods that need those devices and let them access and use them. There has been a lot of discussion; both Matthias and Peter are going to touch on this later and recap what has been discussed this week and over the last many months.

The next one is Node Problem Detector. Just like any watchdog, it runs on the node and actively monitors for issues: hardware failures, kernel deadlocks, even an unresponsive kubelet or container runtime. Node Problem Detector reports those problems to the control plane, so the control plane can take corrective action, for example rescheduling pods or cordoning the node. I think over the last several years we didn't invest heavily in Node Problem Detector, and we should put more effort there.
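As a minimal sketch of what that reporting looks like from the API side (assuming in-cluster credentials and a hypothetical node name, worker-0), the conditions a watchdog like Node Problem Detector sets can be read back with client-go:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs inside the cluster with RBAC permission to read nodes.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "worker-0" is a hypothetical node name.
	node, err := client.CoreV1().Nodes().Get(context.TODO(), "worker-0", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// NPD surfaces problems as extra conditions (for example KernelDeadlock)
	// alongside built-in ones like Ready, MemoryPressure, and DiskPressure.
	for _, cond := range node.Status.Conditions {
		fmt.Printf("%-20s %-6s %s\n", cond.Type, cond.Status, cond.Message)
	}
}
```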
The very last one, though it is not really last, is the storage and networking components, for example CSI and CNI. Basically, they make sure communication is smooth and that we can integrate with storage resources.

So which team, and which people, manage all of those components and make sure they work together? Next slide. That is SIG Node, a virtual but critical team in the Kubernetes community, responsible for ensuring smooth pod execution on all those worker machines. We often hear complaints from the open community that everything contributed to SIG Node is really slow and makes little progress. That is true, and we apologize for it. But I also want to point out how many active members are involved in SIG Node activities. I pulled this information together just a week before this conference: you can see how many people are involved, more and less active, how many projects and subprojects we run concurrently, and the working groups we participate in. I also went back through the releases since last year and the KEPs and features we have been working on: if you compare SIG Node with the other SIGs, we are consistently the one that has to deliver, and does deliver, the most KEPs.

We also have several separate efforts. In addition to the weekly SIG Node meeting, we have the weekly triage meeting. Just last night I asked one of our subproject leads, who is also a SIG chair, for the data: on average we triage about 12 items per week. We have a new goal, Dixie is the new lead here, and working together we want to push that target to around 20 or beyond this year. And our running average of PRs opened and reviewed weekly is a little more than 20. Those are the facts I just collected. OK, now I will hand over to Peter to talk about what we have achieved.

Hey everyone, thank you for joining and sticking around to the end of KubeCon. My name is Peter Hunt, I also work on SIG Node, and I'm here to talk about some of the stuff we're currently working on and thinking about. We'll start off with KEPs. Here is a list of all the KEPs we worked on in 1.30, or rather, all the ones that merged; we actually worked on more, but you know how KEPs go. We have these 12 KEPs that I'm going to go into in a little more detail, but first I'd like to highlight a statistic which I'm quite proud of, and I think all of us up here, and anyone who works on SIG Node, should be. It's a little bit of bragging, but SIG Node has the most KEPs that merged or made progress in 1.30, which is pretty cool. As Dawn alluded to, some people say that SIG Node is a very slow SIG, but it's also a very large SIG, and as a result we're actually very productive; we just have a lot to do, and we're always looking for more help, which Matthias will talk about a little later.

I've bucketed the KEPs into a few different groups. The first bucket, which I'm also quite proud of, is the very old KEPs we made a lot of progress on in 1.30. You see this number: KEP 24, AppArmor support, is actually the oldest currently open KEP. It has languished a little bit in beta, and it's still in beta, we haven't graduated it yet, but we made progress in 1.30 with the AppArmorFields feature gate, which lets you specify the profile with a formal API, similar to seccomp profiles, so you don't need to use annotations anymore. That's quite good, and we're hoping to take it to stable in the future.
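For illustration, here is a minimal sketch of what that field-based API looks like, assuming the AppArmorFields feature gate is enabled; the pod name, image, and the choice of the RuntimeDefault profile are illustrative:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// appArmorPod builds a pod that sets its AppArmor profile through the new
// securityContext field rather than the legacy
// container.apparmor.security.beta.kubernetes.io annotation.
func appArmorPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "hello-apparmor"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "registry.k8s.io/pause:3.9",
				SecurityContext: &corev1.SecurityContext{
					AppArmorProfile: &corev1.AppArmorProfile{
						Type: corev1.AppArmorProfileTypeRuntimeDefault,
					},
				},
			}},
		},
	}
}
```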
We also made progress on user namespaces, which graduated to beta in this release, which we're very excited about. We updated it so that it now requires both the CRI implementation and the underlying container runtime to support ID-mapped mounts before the feature can be used, which closes a potential security hole where the kubelet thinks ID-mapped mount support is there but actually it is not. So this is good, and we're looking forward to expanding the use cases for user namespaces and their pervasiveness.

Another one we made progress on is memory swap, which we have retargeted to beta 2. We added an additional option, NoSwap, which means you can have swap enabled on the node while none of the containers are actually given any swap. That's useful for cases where some other entity gives a container swap, say an NRI plugin, without coordinating with the kubelet: you get a little more customizability, and the kubelet is not going to fail just because swap is enabled. We also dropped support for UnlimitedSwap, because it introduced a whole bunch of issues and node instability and we didn't like it, so we took it out. Sorry if you liked it, but it was not very good.

We also made some progress on ProcMountType, which was first introduced in 1.12 as alpha and had made no progress since; it's still alpha. But we now rely on user namespaces to specify it, because all the use cases we can think of, typically a nested container inside a container that's in a pod, are really only relevant together with user namespaces, and introducing ProcMountType without user namespaces is almost like introducing another semi-privileged option. So we have that dependency now as well.

The next bucket of features I won't go into in as much depth, but these are features that were formerly alpha and are now beta, following the normal KEP lifecycle, and we're excited about them. Drop-in kubelet configuration lets you specify drop-in config files for the kubelet, making it easier to customize the configuration. Image garbage collection by maximum age lets you specify that an image should be cleaned up after it has been unused for a certain time period. Then there is a pod sandbox creation condition, which is useful for batch workloads, and a Sleep action for the container PreStop hook, letting a container pause before it is stopped.
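As a rough sketch of how a couple of those options land in the kubelet configuration, here they are with the Go config types; in practice you would write the same fields as YAML, for example in a drop-in file. The values are illustrative assumptions, and the image GC field additionally requires the ImageMaximumGCAge feature gate:

```go
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubeletv1beta1 "k8s.io/kubelet/config/v1beta1"
)

// nodeConfig sketches a kubelet configuration that keeps node swap on while
// giving no swap to containers, and ages out unused images after a day.
func nodeConfig() *kubeletv1beta1.KubeletConfiguration {
	return &kubeletv1beta1.KubeletConfiguration{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "kubelet.config.k8s.io/v1beta1",
			Kind:       "KubeletConfiguration",
		},
		// Swap may be enabled on the node, but containers get none of it.
		MemorySwap: kubeletv1beta1.MemorySwapConfiguration{
			SwapBehavior: "NoSwap",
		},
		// Unused images become GC candidates after 24 hours, regardless of
		// disk or memory pressure on the node.
		ImageMaximumGCAge: metav1.Duration{Duration: 24 * time.Hour},
	}
}
```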
The features up here are brand new in alpha, and we're also feeling excited about them, but I'm running a little over time, so I'm just going to glance over them; you can look them up, they're in progress, and we're looking forward to making more progress on them as well.

The final bucket is actually just one KEP, but it's very relevant now, everyone is thinking a lot about it, and we're celebrating it this week: it's DRA-con, and these are all the talks that were about DRA. But actually, in reality, the reason it's so relevant is that it's AI-con, and here are all the talks about AI and machine learning. Obviously the Kubernetes community, and cloud native generally, is thinking a lot about being relevant for AI workloads and being able to handle those types, and DRA is one of the things SIG Node is thinking about a lot in enabling those special workloads. I'm going to go over it only briefly, because there are all those other talks you can go and reference.

DRA is basically a way to teach the scheduler and the Kubernetes API about special devices. It's relevant for enabling GPUs and special networking cards; in the context of AI we're really thinking about GPU enablement, but it can enable all of that. It defines what we call topology, which is just a fancy way of saying node resource alignment: you have resources that sit on a node, and you want to be able to align pods with those resources.

Here is a brief diagram of how DRA currently works. It was alpha pre-1.30 and it's still alpha, though we've changed it a little, and I'll get into that. Pre-1.30, it worked like this: a vendor registers a special API with the DRA driver controller, and that sends information to the scheduler about what resources are available on a node. But there are a couple of problems with it that we're looking to fix. One is expressiveness: there is a use case where multiple GPUs are plugged into one node and you want to choose which GPU a pod should be aligned to, and the current API wasn't really able to express that. There is also the delayed-allocation problem: the way scheduling currently works, the scheduler has some information about what is on the node, but really it just tries nodes sequentially, asking "can I put this here?", and the pod gets denied at runtime rather than at scheduling time, which is inefficient. And a consequence of that pattern is that there is no way to signal the cluster autoscaler: because the denial happens at runtime rather than at the scheduling level, there is no way to tell an autoscaler "we don't actually have enough of these GPUs or special networking cards, so we need to make more nodes."

The proposed fix is a new setup where the DRA driver has two pieces: one publishes what the resources are, and the kubelet broadcasts that information to the scheduler, so the scheduler has in-depth knowledge about what these resources are and what they have. That means it can deny pods at the scheduling step, and other components listening to the scheduler can react to that. This has been done with KEP 4381, structured parameters. We're all thinking a lot about this, and it's probably going to be the hottest topic of the next year, so stay tuned, and let us know if you have any other use cases for it. Obviously, we're looking forward to seeing you at AI-con in Salt Lake City.
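For a flavor of what this looks like from the pod's side, here is a hedged sketch using the core v1 API shape of the 1.30-era alpha; since the API is still changing, treat the field layout as an assumption, and the claim template and pod names are hypothetical:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// draPod sketches a pod that consumes a device through a DRA resource claim
// instead of a counted extended resource like nvidia.com/gpu.
func draPod() *corev1.Pod {
	templateName := "gpu-template" // hypothetical ResourceClaimTemplate
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "inference-worker"},
		Spec: corev1.PodSpec{
			// The pod declares a named claim, backed by a template that a
			// vendor DRA driver or cluster admin would define.
			ResourceClaims: []corev1.PodResourceClaim{{
				Name: "gpu",
				Source: corev1.ClaimSource{
					ResourceClaimTemplateName: &templateName,
				},
			}},
			Containers: []corev1.Container{{
				Name:  "worker",
				Image: "registry.k8s.io/pause:3.9",
				// The container references the pod-level claim by name.
				Resources: corev1.ResourceRequirements{
					Claims: []corev1.ResourceClaim{{Name: "gpu"}},
				},
			}},
		},
	}
}
```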
With that, I'm going to pass to Dixie, who is going to take another deep dive into SIG Node. Hello everyone, I'm Dixita, and I go by the alias Dixie. I'm going to deep dive into in-place pod resize. Why in-place pod resize in particular? Because I hope that next time this also becomes resource-management-con. This is one of the features that can actually impact the cost of the resources being used, and it has the prospect of reducing the cost of running your workloads.

So what is in-place pod resize? Today, if your workloads require more resources, or fewer resources, while they are running, and you were to resize them, those pods need to be recreated. With in-place pod resize, you should be able to resize your pods on the fly, dynamically, without any disruption to the pods. Why can it be helpful? When you have workloads that are bursty, or that use more resources at the beginning or for a bounded period while they run, in-place pod resize can help by resizing the resources based on the pod's actual usage. Like I said, it can help with cost reduction, and it can also help when the resources are not specified properly in the pod spec to begin with.

So what are the changes we made to the container spec for this feature? If you see here, we added a new field called resizePolicy, where you specify, per resource, whether you want your container to restart or not when scaling CPU or memory, depending on your needs. The resources themselves were made mutable, which lets you change the values on the fly depending on usage, and this works with vertical pod autoscaling. The other change this feature required was to the pod status: it has been extended to show the amount of resources that are allocated to your pod versus the amount you desire the pod to have. allocatedResources is what is allocated at a point in time, and when you do a resize you can see another field, resize, which shows the status of the resize. In the example shown, the resize is in progress, and the desired state is specified in the resources section for CPU and memory.

As for the status of this feature: it has been alpha since 1.27, and we have been trying to promote it to beta for a while now; we are seeking user feedback. There are a number of blockers, which I have listed here with links; in the interest of time we won't deep dive into them. In 1.31 we are trying to address at least some of the outstanding issues even if we can't make beta, but we are actually aiming for it. Some issues we have already addressed include support for the CPU manager static policy, among others; the rest are added in the link.
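To make that concrete, here is a minimal sketch of the resize API just described, assuming the InPlacePodVerticalScaling feature gate is enabled; the pod name, namespace, container name, and resource values are hypothetical. The policy opts out of restarts for CPU changes but requires one for memory changes:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// resizePolicy declares how the container reacts to each resource change.
func resizePolicy() []corev1.ContainerResizePolicy {
	return []corev1.ContainerResizePolicy{
		{ResourceName: corev1.ResourceCPU, RestartPolicy: corev1.NotRequired},
		{ResourceName: corev1.ResourceMemory, RestartPolicy: corev1.RestartContainer},
	}
}

// resize patches a running pod's CPU request in place; because resources are
// now mutable, this is an update rather than a delete-and-recreate.
func resize(ctx context.Context, client kubernetes.Interface) error {
	patch := []byte(`{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"2"}}}]}}`)
	_, err := client.CoreV1().Pods("default").Patch(
		ctx, "my-pod", types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```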
And with that, I will hand it over to Matthias. Thank you.

Hello. So let's now look at the future of SIG Node. You probably heard there was an anniversary: Kubernetes turned 10. Last year, during the Chicago KubeCon, there was a great keynote talk from Tim; I included the link so you can watch it later. During this keynote he asked several community engineers what the next decade of Kubernetes should look like. One of them was Clayton, and Clayton said that Kubernetes is very good because we can throw any workload at it and it's going to be average, but it will work. So now, as SIG Node, to support this statement, we took the decision to shift SIG Node's focus more from infrastructure to workloads. And it has already started: we already have a set of features and KEPs in flight to support this. We are enhancing lifecycle management with sidecar containers, we are already working on declarative node maintenance, and we are focusing on addressing the hardware better: NUMA support, GPU and TPU accelerators, storage. Another very important thing when you focus on the workload is how you isolate it. And of course we collaborate with other SIGs on this, namely SIG Apps, Architecture, Autoscaling, Scheduling, and Batch.

But as Peter said, there is an elephant in the room: all this work is not sufficient, because of AI, and especially one kind of AI we weren't prepared for, which is inference. If we focus on inference, or GenAI, it is really different from a normal ML workload where you train your model, because in this case you need to serve, and keep on serving, whereas for training you can run a batch, save, and be finished. Usually you schedule one workload per node and it takes all the resources on it, and by the sheer size of the models it takes a while to start. Today, different vendors are trying to use StatefulSets for this, but they are not suitable: we have scheduling issues, and we need a better way of allocating resources. Autoscaling also needs improvements, and that was actually pointed out both for DRA and for some future work I'm going to talk about. So SIG Node and other SIGs are proposing to create a new working group specifically for inference, or serving, we haven't decided the name yet, together with SIG Apps, Architecture, Autoscaling, and Batch. It was discussed last Monday during the contributor summit, and we plan to keep evolving DRA, improve in-place pod resizing, and continue with swap and huge-page support.

Another plan for the future is to improve StatefulSets for HPC workloads, which translates into revisiting the hardware model for the node. There was also an unconference on Monday, and here are some of the key points to remember from it: we want to make efficient use of the hardware, we want to be workload agnostic, we want better scheduling and predictive scaling, and, as I said in the beginning, to focus on the workload. This unconference was just the beginning; we haven't started working on it yet, but at least all the right people were in the same place and the discussions have started. So stay tuned, as we said, and watch for improvements in the future.

Also, we cannot do this alone. SIG Node is a big thing and we have a lot of work to do, so we need your help. For those of you in this room thinking about contributing, there are many ways to contribute; you don't need to be a super talented coder. For code contributions, I listed them here by priority of what we need: you can increase the test coverage, please fix bugs, or do code reviews. But you can also contribute even if you don't code: try new features, improve the documentation, give feedback on the user experience, translate the docs, or organize events to spread the word and make sure everybody knows how great Kubernetes is.

Now I'm going to try to convince you why you should invest particularly in SIG Node: because we're a great team, that's it. We enjoy food, we enjoy drinks. No, to be more serious: as Dawn said in the beginning, we are at the center of everything. We contribute to components all across Kubernetes, ranging from problems in hardware and networking, to discussions with the CRI, to scheduling and the pod lifecycle, so there is definitely something for you to work on. And of course, like the other SIGs and the whole Kubernetes community, we welcome anyone who wants to help, and you should really try to reach out to us.
Try to join the meetings; I think I have it here, yes. There is a website that helps you get started with community contributions. I gave the links for the community meetings that Dawn mentioned, the main page for SIG Node, all the working groups (there will be new ones), and some links about mentoring, so please consider that. And if you want the slides of this talk, I will upload them to the schedule; there is a QR code to reach them. Now I think we can go to questions.

Q: The KEP in 1.29, it was talking about the kubelet GC. Is that for all images, or is that just for the kubelet image?

A: That is for container images, so all container images.

Q: Right, you specify, "hey, you're too old"?

A: Yeah, exactly. You specify a maximum age, and after an image is unused for that amount of time, it qualifies for garbage collection regardless of node pressure, like memory or disk pressure on the node.

Q: Thank you. We've been doing a lot of work, and we have a lot of experimental progress. For various reasons we actually chose to base everything on 1.28. What is the actual process for getting feedback on how to refactor what we've done so that it is acceptable, before we go through the whole process? I don't want to waste our time and your time creating a KEP that is just going to go through unnecessary editing cycles. For example, we can do a short version where we explain what our objectives are, and then also list all the follow-on objectives, because we identified a lot of different things we want to contribute, particularly to the kubelet and the runtime. And if you have plans ahead, you know, we also want to fit in with them; even though we've done it one way today, I don't want to go too far and then have to refactor.

A: Thanks for this one. I have to honestly say, most of the people who give us feedback come to the weekly SIG Node meeting. But as you raise this question, I realize the bar for the community to give us feedback is too high: you have to go to the SIG meeting, with the time differences and all those kinds of things. There is another way to give us feedback, the SIG Node mailing list, which reaches most of the engineers, but I also think that bar is high, because so many people subscribe to it. So maybe after this we can talk, and we should figure out a way that doesn't hit everyone, some way we can accept feedback without you having to disclose your personal contact information to the whole community. And on the earlier image garbage collection question, I just want to add one thing: there is also a discussion about images that are not widely used but should be kept, about how to pin those images so that even if you set that maximum age, they are exempt. That discussion is ongoing; I just wanted to add that.

Q: Hi, my question is around in-place pod resize. It's an amazing feature, and I'm very new to SIG Node, as a consumer of the feature.
My question is: we typically recommend to app platforms, when they're gauging or estimating their resource requirements, to aim for higher density on the nodes. How does in-place pod resize set itself apart, or provide additional functionality, when the nodes are highly dense? How can it avoid rescheduling, and how can it separate itself from the Vertical Pod Autoscaler? We're essentially in a situation where we'll end up rescheduling that pod on a different node, and since something similar to the Vertical Pod Autoscaler already exists, does in-place pod resize have any impact on the future of the Vertical Pod Autoscaler as well?

A: In-place pod resizing is really just an enhancement of the existing VPA. The existing VPA basically restarts those pods: you update the resources, then you restart. In-place means that, if possible, we avoid that. Of course we cannot guarantee it, especially for memory: if you shrink memory you could cause an OOM, so the container might still die. But basically, it just means that, if possible, the container keeps running. It does not get rid of the VPA; it's an enhancement.

Q: So the VPA is part of SIG Autoscaling, and this is extended functionality in SIG Node; is that how we should look at in-place pod resize?

A: Yes. To make in-place pod resizing work, we have to work together with SIG Autoscaling; otherwise, how are you going to get the monitoring and make the decisions? A lot of these decisions are not made only by the node: they may be initiated and enabled by the node, but for this particular feature the enablement is cross-organization, not just node, but also autoscaling. And another thing, maybe we are oversimplifying here: how you do the in-place update eventually has to follow some priority, and that is how we are going to make it work.

Q: My question is about the VPA and in-place resizing as well. From what I understand there's no restarting, but is there going to be any kind of impact on the workload? And how does it work for pods that are running something like a JVM, where increasing the memory for the pod may not necessarily increase it for the JVM?

A: Right, in that case you have to have a workload that is able to use the memory you have given it. So no, in the case of the JVM, or if you use Go with GOMEMLIMIT or something like that, it won't help; there is no magic. But at least it's something that is possible today, and you could still tell the VPA to restart, to create a new container, to use the new values. Thank you.
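As a small illustration of that last point, here is a hedged Go sketch of a workload that re-reads its cgroup memory limit so an in-place resize can actually take effect; the cgroup v2 path and the headroom factor are assumptions, and a JVM has no comparable knob without a restart:

```go
package main

import (
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// syncMemoryLimit reads the container's cgroup v2 memory limit and tells the
// Go runtime to respect it; calling it again (periodically, or on a signal)
// lets the process pick up an in-place memory resize without restarting.
func syncMemoryLimit() {
	data, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		return // not on cgroup v2, or not in a container
	}
	s := strings.TrimSpace(string(data))
	if s == "max" {
		return // no memory limit set
	}
	limit, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return
	}
	// Leave roughly 10% headroom under the cgroup limit for non-heap memory.
	debug.SetMemoryLimit(limit - limit/10)
}

func main() {
	syncMemoryLimit()
	// ... application work ...
}
```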