Hi, everyone. My nickname is Dims. I'm on Slack and Twitter and things like that as Dims. This is the SIG Architecture intro and update session. If you're not interested in the topic, then you're in the wrong room. Hopefully you're in the right room for the right reasons. So before we get started, this is my friend. Hi, everybody. I'm John Belamaric. And Dims and I are two-thirds of the SIG Architecture chairs, co-chairs. Right. Yeah. And he works for Google. I work for AWS. And here we are in the community, trying to do things together, for sure.

So you're all here at KubeCon. You've seen how many people are attending this KubeCon. And the reason you are here is because you want to learn Kubernetes: how Kubernetes works, how the Kubernetes community works. The basic goals of the Kubernetes project, when it started, looked something like this: it's got to be portable, general purpose. So do you think we actually do all these things? Yes, no, nod your head, raise hands. Yeah, pretty much, right? OK.

We also have community values. Kubernetes is not just something somebody is building as a proprietary thing. We have a community around it, and we do open source. We have these community values in the Kubernetes community. I'll let you read some of these for yourself, but one of the things right at the bottom is that evolution is better than stagnation. You saw all the keynotes from yesterday. So we are trying. We are trying to iterate so we can do all the things that you want us to do in the ecosystem, in the communities, in the projects, in the companies that you work at. We also think that being inclusive is better than being exclusive. We want all of you to come here and help us make Kubernetes better. So that is the reason for giving this talk: to introduce ourselves, how we work in the community, how we work on things in SIG Architecture, and hopefully, out of this talk, you'll figure out which areas we work on and how you can help us better.

To start the topic, here is the overview of the Kubernetes project. We have a bunch of SIGs, special interest groups. Anybody here who does not know what a SIG is? Right, so you've heard about SIGs. A special interest group is where people get together around a specific topic, whether it is networking, or storage, or testing, or usability, and things like that. So we do have a well-defined structure. In the end, Kubernetes is a CNCF project. So right at the top, we have the steering committee for Kubernetes, which takes care of figuring out the needs of the project: what do the people need? What are the CI/CD needs? How much should things cost? They look after the overall picture, negotiate with the CNCF and other entities within the CNCF and even outside. They are the leadership body, essentially. So steering does not do technical stuff, by design. What they do is hand that to SIG Architecture to deal with, and SIG Architecture, in turn, talks to each of the other SIGs to figure out what each of them wants to do. So everybody writes a charter. All the different SIGs have a well-defined, well-scoped charter: which repositories and areas do you work on, what do you want to do, how do you want to do it, and things like that. Most of these SIGs have multiple chairs and technical leads and so on.
So suffice it to say that all the other SIGs doing in-depth work in specific areas essentially report up to SIG Architecture, in the sense that if there are conflicts between SIGs, they end up coming to SIG Architecture to say, hey, we are disagreeing about something, or can you tell us how to make things better, or come up with ideas, extensibility points, and things like that. SIG Architecture is the common ground where we all meet and decide on things going forward.

Just like the other SIGs, SIG Architecture also had to write a charter, and this is what we wrote. We take care of multiple things, mainly things that cut across different SIGs. Take the deprecation policy: there is a feature, how do we go about deprecating it? We don't want every SIG to have its own deprecation policy. We want a centralized one, so anybody who comes to Kubernetes knows what to expect when a specific feature is deprecated. What are the API conventions? What is a conformance test? If each vendor goes around implementing whatever they want, you're not going to get a consistent experience between the various Kubernetes distributions and vendors, and the workloads you run are not going to be portable across them. So we have a centralized process where we define tests that are mandatory for everybody. It's also used for: hey, I'm a vendor, I want to use Kubernetes in my product, I want to use the Kubernetes trademark in my product, then what do I do? Make sure you pass conformance. So that is the scope of SIG Architecture: the technical side of things, not the people side of things. The people side is mostly on the steering side.

So we're talking about cross-cutting processes. You don't have to take a screenshot of all this; the slides are going to be available, so you can click the links to figure out, hey, how do I create a new enhancement? How do I ask for an enhancement? How do I know if a feature is production ready or not? We have these cross-cutting processes and we try to maintain them in a central location for people to look up, in the enhancements repository and so on. What other kinds of issues do we deal with? Behavioral questions, and we occasionally have conflicts on technical matters. If there are multiple choices of how to do things, maybe one of them is a tactical thing that can be done and shipped quickly but might not work in the long run, and there might be another one that is a longer-term approach, and so on. So some of us end up trying to build consensus around the best way to do things. We have a mailing list, we have Slack channels, we have bi-weekly meetings. We welcome you to all of these, so come talk to us if you have any issues or questions about what we do.

We also have a bunch of sub-projects. You all know about API design, right? How the API is designed and so on. We document the design principles and evolve them over time. We have some examples later of new things we are doing, but we also take care of things like code organization. For example, how many of you have run into problems with Golang upgrades?
Like, you have to upgrade to a newer Go version, you want to use new Go features: can we do that across the entire Kubernetes code base or not? So we take an opinion on which versions of Go we should use and which stable branches need to be upgraded to which Go version. That is something where we set the tone for the entire community as well.

So do you want to start here? Yeah. Sure, that'd be fine. I'll just say one thing first, thank you, Dims. Recently, just in the last few days, I've been thinking about one of our roles, and we haven't even talked about this, but it's just another way to think about it. If we go back and look at this overview: I don't know how many of you have heard of Conway's law in software engineering? Yeah, okay, that's pretty good. It's a law that says, basically, that your software ends up looking like your organization. Well, our organization looks like this, and our software ends up looking like this unless somebody actively works against that. So the processes that we have, both the API conventions and the review processes for API reviews, are in part geared towards trying to prevent those silos from forming and our software just ending up as a bunch of independent components. Because, in fact, it is structured like that: the kubelet is owned by SIG Node, kube-scheduler is owned by SIG Scheduling, and kubectl is from SIG CLI. Right, shocking. What can happen is that people have a scope of control, they want to implement some functionality, and they do it within their scope of control, and that might not be the right answer for the project as a whole. So part of our role is trying to fight Conway's law.

Along those lines, we'll talk a little bit about each of those sub-projects and the processes they own. API review first. In Kubernetes, when you want to add a feature, very often it has some API associated with it, and we want to make sure that the APIs coming out of those different SIGs work the same way. So there's a review process that every API going into Kubernetes needs to go through, and that is run by this sub-project. This is a super interesting area if any of you are interested in getting involved in Kubernetes. It's a long haul. It's a crazy area, we learned a lot over time, and the number of things we have seen is mind-bending. Just a simple example: take IP addresses. When we go from IPv4 to supporting IPv6, well, everywhere throughout the API it just says "IP address." So now all of a sudden we need to pluralize that, and maybe it'll have one, or maybe it'll have two, maybe it'll have one that's IPv4 and one that's IPv6. How do we handle that on upgrade? How do we handle that on rollback? And how do we make sure that all the components are updated with the same thing? Exactly. (There's a rough sketch of how that ended up looking just below.) So it's a very challenging area, and also something that takes years to get to the state where you're one of the reviewers, but still a super valuable thing. Yeah, and the fun thing about the API reviewers is that we run a cohort where we onboard people who are already working in SIGs to become API reviewers. So it's not like we're going to leave you dangling.
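To make that concrete, here is roughly what that pluralization looks like in the Pod status API today. This is a simplified sketch based on the real types in k8s.io/api/core/v1, with most tags and comments trimmed, so treat the details as illustrative rather than authoritative:

```go
// Sketch of the dual-stack pluralization in the Pod status API
// (simplified from k8s.io/api/core/v1; not the exact upstream definition).
package v1

type PodIP struct {
	// IP is a single IPv4 or IPv6 address assigned to the pod.
	IP string `json:"ip,omitempty"`
}

type PodStatus struct {
	// PodIP is the original singular field. It is kept around, and kept
	// consistent with PodIPs[0], so clients written before dual-stack
	// continue to work unchanged.
	PodIP string `json:"podIP,omitempty"`

	// PodIPs is the newer plural field, with at most one address per IP
	// family. Older API servers simply don't know about this field, which
	// is exactly why the upgrade and rollback questions above matter.
	PodIPs []PodIP `json:"podIPs,omitempty"`
}
```

The general convention is that the singular field doesn't go away: the plural field is added alongside it and the two are kept in sync, which is exactly the kind of rule the API review process is there to enforce consistently across SIGs.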
So if you decide to engage, there is a process for getting to the point where you become an API reviewer, but it's going to take a while, and you need to put in the work and learn what we've learned over the years. Exactly, exactly.

Code organization: Dims talked about it already. This is about what all our dependencies are, how we manage those dependencies, and how we upgrade them. When you upgrade a dependency, maybe you break something else, right? So it's complex, and Dims does a lot of this work. How many of you know how many dependencies we have in Kubernetes, in the vendor tree? It's more like 300. We essentially pull in repositories from everywhere, they end up in the vendor directory, and we have an active process for pruning things, making sure there are no duplicates, making sure things are updated, fixing security bugs, and so on. And it never ends, because there is always churn. Sometimes repositories go missing: somebody's personal repository, or they change the name of their GitHub org, and poof, it's gone. So we have to keep up with the times. Sometimes they get orphaned and nobody's maintaining them. Exactly, that has happened so many times. Then we need to find out: did anybody else fork it? Is there an active fork? Or do we need to fork it and maintain it ourselves? And how do we maintain a fork when we don't know everything that's in it? So we have scars from a lot of those battles. We do, but this is actually a super interesting place for somebody who knows Go but doesn't necessarily know all the vagaries of how Kubernetes works, where you can get involved immediately and make a real impact. Yeah, and the fun thing here is that it's easy to get into. We have a shell script that says, I want to update this dependency to this version; anybody can run it. And then there is one more shell script that says, hey, make sure everything updated fine; anybody can run that too. And then you send it in a test PR and you can watch the stuff explode. That's the explosion of red. Yeah, exactly. But that's a learning curve, right? Once you get to that point, it's like, okay, maybe I shouldn't use this specific version and I should use a lower one, or maybe I need to go open an issue with that other repository and wait until they fix the problem we see before we update ours.

Enhancements. Dims mentioned earlier: what do you do if you want to enhance Kubernetes? We have a process for that. If you have an enhancement that's significant, not just a bug fix (which obviously we love, so please keep them coming), but a functional change or a change that's potentially disruptive during upgrade or to users in some way, then it needs a lot of eyes on it. So this is a design process where you put down: here's what I'm interested in doing, here's why I'm interested in doing it, and, since any change typically comes out of a SIG, here's the SIG that would be sponsoring it. Then you try to bring in the right people to review it and put it together. And it's a living document, right? The people who wrote the initial KEP (Kubernetes Enhancement Proposal) might not be around in a year or two.
So this is the way, and this is the place, we remember things, and how we know what to do next. When we are writing the KEP we'll say: for alpha, here is what we need to do; for beta, here is what we need to do. And then we keep tweaking it. When we realize that we've painted ourselves into a corner, we try to get out of it by saying, okay, we are still going to do this in beta, and we'll do these other things when we hit GA, things like that. Exactly, exactly.

SIG Architecture owns that process, and there's a group of people that manages it, because, as you can imagine, if you're a software developer, having to go through design reviews and all of this, people get irritated by the whole process. So this is one of the areas where we would really love help making the process smoother and better. For example, we have an issue right now where the burden involved in producing a KEP with all the details you need for a major feature doesn't necessarily apply to certain kinds of very lightweight changes, or policy changes within the organization, but people end up going through the full process anyway. So we could definitely use help from, say, a project or program manager. And this is not just for you, it's for all of us too: we have to write KEPs for everything, all the time. We try to keep ourselves honest that way. Yes, for sure.

And then conformance testing. Dims mentioned that we have a conformance testing program in Kubernetes. The way this works is that it's run jointly by us and the CNCF; the CNCF runs the program itself. So every time GKE comes out with a new version, we have to run a set of conformance tests, we submit the results of those tests to the CNCF, and then the CNCF says, yes, you passed, so you can still use that K in GKE. Same with EKS? Same with any vendor, exactly. Unless we pass those tests, the word Kubernetes cannot appear in our product name. So the process of evaluating the test results is owned by the CNCF, but the actual contents of those tests, what those tests are, is owned by SIG Architecture. Right, essentially by the community, not the vendors. We are wearing our community hats when we are talking to you about this. Yes, getting back to the value of community over product or company, which is really, really well respected in the Kubernetes community.

Production readiness review. This is another one of these cross-cutting processes, and it's part of that KEP, that enhancement proposal process. Any time you introduce a new feature to a software system, you're probably going to break somebody, because you'll create a bug or something like that. So the production readiness review process isn't about "is Kubernetes production ready?" We know it's production ready; it's used all over the place. It's about: is the new feature production ready? We have a bunch of questions we ask the developers as part of that design process. The developers are building their feature; they're thinking about their users and how to meet their users' needs. They're not always the people who are running the 10, 20, 30, 50,000-cluster fleets. And, you know, at least for the cloud providers, that's a big concern. We have many, many thousands of clusters.
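One of the standard things that questionnaire drills into is whether the feature can be switched off cleanly, which is why almost every enhancement lands behind a feature gate. As a rough sketch of what that pattern looks like in code, using the real component-base packages but a made-up feature name:

```go
// Sketch of the Kubernetes feature-gate pattern. The package paths are real;
// MyFancyFeature is an invented example, not an actual gate.
package example

import (
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/component-base/featuregate"
)

// Each enhancement registers a gate with its current maturity level.
const MyFancyFeature featuregate.Feature = "MyFancyFeature"

func init() {
	// Off by default in alpha; a later release flips it to beta/on and
	// eventually GA, following the graduation criteria written in the KEP.
	utilruntime.Must(utilfeature.DefaultMutableFeatureGate.Add(
		map[featuregate.Feature]featuregate.FeatureSpec{
			MyFancyFeature: {Default: false, PreRelease: featuregate.Alpha},
		}))
}

func doWork() {
	// New behavior is guarded, so a cluster administrator who hits trouble
	// can turn it off with --feature-gates=MyFancyFeature=false.
	if utilfeature.DefaultFeatureGate.Enabled(MyFancyFeature) {
		// new code path
		return
	}
	// old code path
}
```

The PRR questions then ask things like what happens when that gate is flipped off on a cluster that has already used the feature, and which metrics tell you the feature is misbehaving.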
And if a new feature gets added, we want visibility: did that break any of our users, before our users see it? Do we have the ability to turn it off if it's a beta feature and it's breaking a user somewhere? Is anybody using it? How often are they using it? Right, think about monitoring. Can people switch it on and switch it off? Can they go backwards and forwards if they need to? Can they roll it out? A lot of the time it's about trying to make people think, especially developers who might not be used to putting themselves in the shoes of the cluster administrator. Yeah, take the IPv6 case again. When IPv6 support came in, I was doing the production readiness review, and I asked the question: okay, what happens when you roll back or roll forward? And the answer was, well, we have to basically recompute all of the endpoints in this service. So if you've got a service with 5,000 endpoints or something, what happens to the clients of that service during this process? The developers may not have a fix for it, and that may be fine. But what we at least get is a document that says: here's a potential risk. So as a cluster administrator, I can say, here's the risk; when I turn this feature on, I can watch for this, I know what to look for, I know which metrics to check to see if it's causing problems. Or I can warn my users: hey, we're going to have some potential issues here.

So those are really our sub-projects. The other aspect is something we focus on in the design process of Kubernetes. How are we doing on time? We have another 10 minutes left. We probably need to speed it up. Oh, we should pick it up, okay. So part of our goal with the architecture of Kubernetes is extensibility. We talked about that at the beginning, and we keep a focus on it and make sure that people are doing that. These are some of the extension mechanisms built into Kubernetes; this is part of what makes it so flexible. There's an incredible number of ways to extend Kubernetes, and we're quite proud of that.

Some of the recent activities: you want to cover this? Sure. There's a new sub-project called APISnoop. APISnoop helps us figure out: are there any endpoints we recently added which don't have test cases? It raises a flag that says, hey, developer, you added a new feature, but it's not well tested; some of the API surface you added is not being exercised. So it gives us a warning and we can go back and fix the problem, because we want to make sure that every API we ship has complete tests. For a long time we did not test everything that we shipped, and over time we've gotten to the point where we are relatively good at this.

We also constantly think about upgrade problems. How many of you have upgrade problems in your clusters? Right, it's a common pattern. So we have to figure out: is there a better way to do this? Can we run new binaries, with fixes for CVEs, but with old feature flags and things like that? We are thinking about a new KEP for that, KEP-4330. And in KEP-3935, we were essentially thinking about what the set of kubelets is that can be supported by an API server. The idea there is: can you frequently update all your nodes and only deal with the API server once in a while?
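The skew in question is easy to see from a running cluster, since the API server reports its own version and every kubelet reports its version in node status. Here's a minimal client-go sketch of that check, assuming a kubeconfig at the default location; it's just an illustration, not part of the KEP:

```go
// Print the control-plane version and the kubelet version on every node,
// which is the version skew the KEP above is concerned with.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Version the API server reports about itself.
	sv, err := client.Discovery().ServerVersion()
	if err != nil {
		panic(err)
	}
	fmt.Println("control plane:", sv.GitVersion)

	// Version each kubelet reports in its node status.
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		fmt.Printf("node %s: kubelet %s\n", n.Name, n.Status.NodeInfo.KubeletVersion)
	}
}
```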
So that version-skew KEP helps with that problem. And you all have scanners in your companies, and if you have problems, you go open an issue or talk to a vendor and say, hey, this container image, this binary, has this CVE and we've got to fix it, otherwise the security people are going to come after us, and so on. We use Go to compile and build, and we have a set of supported stable branches. For a while, we had supported branches that were using old versions of Go which were no longer supported by the Go community, so we had to shuffle things around to make sure we are able to compile all our supported branches. We did that work; it took probably two months of effort, about six or eight months ago.

So, do you want to talk about this one? Sure. The last one here is: you may have seen the keynote yesterday that Kevin did. Essentially, we want to take that work that's coming out of DRA, and we're taking a look at the way Kubernetes manages hardware resources in general, trying to improve the granularity of the visibility into hardware that Kubernetes has. I bring it up here because it's very cross-cutting. It touches networking, because of multi-network setups and multiple network interfaces. It obviously touches node, because of the devices on the nodes. It touches autoscaling and scheduling. It's very broadly impactful, so we take a look at it as part of SIG Architecture as well. The goals of that effort are really to enable the efficient use of all of these devices: sharing them, like if you have a pool of these devices, sharing them across many workloads, but also dividing those devices and being able to allocate portions of them, and having the scheduler, the cluster autoscaler, and the rest of the Kubernetes control plane aware of that. And also to do that in a workload-oriented API.

This is where I get back to Conway's law. We have certain APIs within Kubernetes that were built, say, by the node team, and that operate at the node level. So you can configure NUMA topology policies at the node level, but it's difficult for a workload author, other than by using a node selector, to declare their intent about what type of policy to use. So we're hoping to take this opportunity to reorient back towards a more workload-focused API, where the workload author, in their pod spec, says: for this particular pod or this particular deployment, here's the type of topology policy I need. And then we implement that on the node, either by targeting a node with that topology policy or, hopefully in the longer run, by evolving the topology manager to support not just a static policy but a dynamic, per-workload policy.

So here, if you were paying attention, the question you might ask is: hey, you just told me about all the extensibility stuff, why can't this be external? Can you not do it with a CRD and stuff like that? And the answer is that you need all of these in-tree components, like the scheduler and the cluster autoscaler and the kubelet, to understand the same language, the API that's built in. And when we add it here, we support all of this in the internal components, plus the same thing can be used by all the external components that are out there. Right, Karpenter and all those things, yeah. Yeah, external schedulers and Volcano and everybody else can inherit the same benefits that our in-tree stuff gets, too. Exactly.
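That's the flip side of putting the API in-tree: anything built into the API is equally visible to out-of-tree components through the same client libraries that the in-tree components use. A typical external scheduler or autoscaler just watches the built-in types with an informer; here's a minimal sketch (watching Pods, assuming the component runs in-cluster), purely to illustrate the mechanism:

```go
// Minimal external component consuming a built-in API with client-go informers.
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes we run inside a pod
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Shared informers give every consumer, in-tree or external, the same
	// cached, watch-driven view of the built-in API types.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*v1.Pod)
			fmt.Printf("saw pod %s/%s\n", pod.Namespace, pod.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // keep watching
}
```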
So we're running low on time. I had added another slide, but we don't need to do it. Yeah, we can skip this. All right, you want to take this one? Sure. So we do focus on extensions, and we need to build out conformance a little bit more. Another problem we usually have is around beta features: some of the beta features are on, some of them are off. We need to switch them on and take them to GA, so you know you can rely on them being available across all the vendors. We also have to figure out how to scale our organization a little bit more. We have a bunch of sub-projects, and we are trying to invite all of you in. There are some people already working on some of these things, but we need your help, so there are a lot of leadership opportunities in the different sub-projects that we have, and we are also writing an official report on what we did well last year and where we need help, things like that. If you come, there are plenty of things to do; there is no shortage of things we need help with. There are few hands right now carrying a lot of the heavy load, and it would be really good to have you all there to help us with it. Yeah, exactly.

And so I think we have five minutes left for questions. Any questions here? There are mics you can go to. Two mics, if you have a question. We'll be here after that too, Dave. Yes, there's a mic, it's behind you.

This might be a dumb question, but does the DRA stuff have any interaction with the containerd NRI, the node resource interface, or is that totally separate? That is not a dumb question at all, and the answer is it probably will at some point, but not at the moment. Yeah, so there are a few things here. First, if you're talking about DRA as it is right this minute, there is the concept of structured versus unstructured (opaque) parameters that we have to change to get to better parameterization. Then we are talking about this KEP here that is going to land. Those are all things we need to do in Kubernetes itself. And then the next thing we have to think about is: do we need to change the CRI API? Is there any more information that containerd needs from the API server, or through the workload pods and pod spec and things like that? If we need to extend the CRI API, then we add more things to the CRI API. That makes sense. And then we update containerd and runc to make use of the information that we are sending them. So that is the flow of how things work. I'll give you an example: right now there is a KEP in flight for swap. We started the process on the API server side, the CRI already has an implementation, and the containerd implementation is almost done as well, I think. So there is a flow of all this stuff across the different components in the ecosystem. I hope that answers your question. But for DRA specifically, we have dedicated Slack channels beyond SIG Node; you can come to SIG Node first, and from there you can hop into the other channel where we're talking about all the DRA-related stuff. And pick one of us on Slack, Dims and John Belamaric. Yeah, yeah.

And I'll just say one more thing about that. The way I think about it is that it's a sort of decoupled model, where there are nodes, or even controlling components, publishing capacity information.
There are schedulers and autoscalers using that information to make allocations. And then there are engines for actuating that allocation. The kubelet is one of those; the kubelet with the topology manager could be one, and NRI could be one. But those aren't clear yet; that's sort of a step down the path. What we would take from them is requirements, to make sure they get the information they need to actuate properly. Yeah, and it's harder to do this because it almost feels like we are doing it in a vacuum, but we are talking to the people who are working on the schedulers. We actually stopped DRA from being switched on in beta because we realized that the autoscalers didn't have enough information. So we are iterating. Like I said, it's a living document and so on. Yeah. Thanks very much. Yeah. Yeah, Shriram.

So this might be an actual dumb question, but I wanted to know more about the production readiness review process and, please correct me if I'm wrong, as I understand it, it is limited just to the KEP and not the actual code review, and the PRR process ends at the enhancements freeze or sometime after the enhancements freeze? So part of our community values, right, is that we trust our users and we trust our developers. The PRR process is about making people think about their design. The PRR reviewers aren't going to go into each of those code paths and look at the code that's actually being submitted; that's up to the approvers for that area of code. So PRR is at the design level: did you think this through? Did you think that through? Okay, add this feature gate. But that gets captured in the KEP, right? It basically affects the evolution of the KEP. And then from there, we rely on the actual owners of the components being modified to validate that the code being checked in matches the KEP. Thank you.

Any other questions? Thanks a lot. Hope this was useful to you, and we hope to see you on Slack at least. Thank you. Okay.