All right. Hello, everybody. Welcome. We're going to talk today about Kubernetes SIG Architecture: what it's all about, how you can join in, and how you can help out. A little bit about who we are. That's me — I'm John Belamaric from Google. And Dims is here. Why don't you take over?

My nickname is Dims. You can find me as dims on Twitter, GitHub, and the CNCF and Kubernetes Slack channels. I work for Amazon Web Services as a principal engineer on EKS.

Let's start with the design principles. Some of this is looking backwards at where we ended up; some of these were goals that were there right at the beginning: general purpose, portable, meet users halfway. This is why you're here, right? You're here because we were able to achieve some of these goals. There are still a lot of things to do, and that's why we are here. The purpose of SIG Architecture is to serve as a space for the rest of the SIGs in Kubernetes to come together and hash things out: if there are technical decisions to be made, if there are multiple options on the table between multiple SIGs, how do we come to a single consensus? These are the goals that typically guide our decision-making process as well.

Okay. So while we are going through the process of making technical decisions, we often go back to the values we've written down in our community, because it's the same issue: how do we make sure that we are making the right decision? For example, automation over process: if the tendency is to say, hey, every release you have to do a certain thing, do we do it manually or do we automate it — so that when we push a tag for a new release, boom, some of that work is already done? Community over product or company is very obvious: we compete for customers, right?
He works for Google, I work for AWS, and we are always competing and cooperating at the same time. Inclusive is better than exclusive: one story I like to tell is that we have people who have been in the industry for twenty years at IBM Research, and we also have people who are in their first year of college. We welcome folks of all experience levels to join and work in the SIG and learn from each other. There is always somebody in the room who knows more than you, and there's always somebody in the room who knows less than you, so you have to be inclusive everywhere.

Evolution is better than stagnation: if we're stuck with a bad call that we made a long time ago, we need to figure out how to move out of it. The most recent example is that we deprecated the dockershim — but we did it in a phased, controlled manner, by sending information out to a whole bunch of people to say, here is the set of things you need to do, and here is how you get off of it. In the end, the dockershim was hurting our progress, because we were trying to support an extra thing built into Kubernetes that shouldn't be there. If you go to the containerd project or the CRI-O project, you will see they each have ten-plus contributors. And we were asking every person writing a PR or a KEP for the node to say, hey, now you are responsible not just for the things you do in the kubelet but also in that implementation. That was a bad decision we had made, so we evolved out of it, so we don't stay stuck.

So how does this work? How do we do this job in the community? There is a process, and there are groups associated with certain responsibilities. The CNCF is right on top; we are a CNCF project. The Steering Committee is responsible for everything that happens in the project.
If you need funds for running a CI job on, say, Alibaba or Azure, we have to scrounge for resources, talk to the people responsible, and find funding — all those nice things we have to do. Steering also takes care of policies for how we govern ourselves. So Steering is right on top, and the code of conduct committee and the security response team are right there alongside the Steering Committee.

Under those, we have the special interest groups. It's kind of self-evident for some of these. We have SIG Architecture, which is us — the two of us are chairs of SIG Architecture, along with one more person who's not here today. Contributor Experience makes sure we reduce the burden on somebody coming in with a PR. Similarly we have Docs, Release, Testing, Node, Auth — mostly self-explanatory names. If you are interested in Kubernetes, you would typically come and work in one of these SIGs. If you have something to do with storage, you go to SIG Storage. And if SIG Storage and SIG Node are working on something together and need to resolve a disagreement, it bubbles up to SIG Architecture. That's how that works. We also have other groups that help us do our job regularly. So that's an overview of the project itself, and now you know where SIG Architecture fits into this whole Jenga puzzle.

Okay, the scope of SIG Architecture. Let me go back a little and explain what a charter is. One of the statements I made just now was that Steering is responsible for everything in Kubernetes. Now I'm also saying that if a group of people want to do some work, and they want to define the work they are going to do, then Steering will delegate that authority and that area to the SIG.
We try hard to make sure that SIGs don't overlap and are distinct from each other, and the way we do that is by writing down what each SIG is responsible for. When we chartered SIG Architecture, a couple of things we ended up saying were: this will be the group that escalates technical decisions upwards as needed; and also, if there are multiple vendors doing Kubernetes, the look and feel for an application deployed on Kubernetes should be the same. We do that through conformance tests. When folks want to come up with a new API for doing something, we do API reviews to make sure it looks right, it's extensible, it's composable, and it can be updated over a period of time.

Imagine, for example, one problem we had right at the beginning: we had one field for the pod's IP address, and it was IPv4. Then we said, okay, there could be more than one, and they could also be IPv6. So how do we get to the point where you upgrade kubectl, you upgrade the kubelet — and then there are things in the CNI layer we need to change — how do we move everything forward in lockstep, given the backward and forward compatibility we have to maintain? Those are the kinds of things we codify: once we figure out, okay, this is a bad pattern, we write it down in the API conventions. Similarly, we ended up doing many more things, and the last one is very important: how do we change ourselves going forward? That is the enhancement proposal process. In fact, the first proposal through that process was defining what an enhancement proposal is and what it looks like. So this is the scope of the SIG, and part of that scope is that we have a number of processes.
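The pod IP evolution just described follows a general pattern for pluralizing an API field without breaking old clients. Here is a minimal sketch of that pattern (these are illustrative types, not the actual Kubernetes source, though the real `PodStatus` uses the same singular-plus-plural shape):

```go
package main

import "fmt"

// PodIP wraps a single IP address, leaving room to grow the type later.
type PodIP struct {
	IP string
}

// PodStatus keeps the original singular field for backward compatibility
// and adds a plural field that can carry an IPv4 and an IPv6 address.
type PodStatus struct {
	PodIP  string  // legacy singular field; old clients still read this
	PodIPs []PodIP // new plural field for dual-stack
}

// syncPodIP keeps the legacy field consistent with the first entry of the
// plural field, so components upgraded at different times agree.
func syncPodIP(s *PodStatus) {
	if len(s.PodIPs) > 0 {
		s.PodIP = s.PodIPs[0].IP
	}
}

func main() {
	s := PodStatus{PodIPs: []PodIP{{IP: "10.0.0.5"}, {IP: "fd00::5"}}}
	syncPodIP(&s)
	fmt.Println(s.PodIP) // an old client reading the singular field sees 10.0.0.5
}
```

The design choice here is that the new plural field is authoritative while the singular field is derived, which is what lets kubectl, the kubelet, and the network layer be upgraded in lockstep without breaking each other.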
We document the API conventions, but we also have to make sure that people comply with those conventions, so there are a number of processes set up by SIG Architecture — and not just by the two or three of us; a whole bunch of other people participate. For example, the API review process: whenever a contributor submits a PR that changes an API, either in the main Kubernetes repository or elsewhere in the Kubernetes org, it is subject to these conventions and to this review process. We have a bunch of people who look at those PRs and make sure they use proper naming, and that they do things like using pointers to strings internally, because you can't differentiate between an empty string and an unset string unless you use a pointer. There are a lot of really minute details that need to be reviewed to make sure all of these APIs work consistently, so that all the tooling on top of them works consistently. We are responsible for that process in this SIG.

Adjacent and complementary to that is the deprecation policy: what happens when you go from a v1alpha1 to a v1beta1 to a v1 API. There is a certain timescale and certain types of notifications we have to give, and this group manages that process as well as the policies themselves. The production readiness review is an example of how SIG Architecture puts in place quality gates for the rest of the development within the organization; we'll talk about that a little more in a moment. Effectively, this set of processes supports the scope that Dims showed on the previous slide.

That said, we tend to be a catch-all. Whenever it's not clear, whenever there's disagreement among SIGs, or within a SIG the tech leads or the leadership can't agree, they'll often come to us.
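The pointer-to-string convention mentioned above is worth seeing concretely. A minimal sketch, with illustrative types rather than actual Kubernetes definitions:

```go
package main

import "fmt"

// Spec shows the convention for optional API fields: use *string rather
// than string, so "field not set" and "field set to the empty string"
// are distinguishable after deserialization.
type Spec struct {
	Hostname *string // nil means "unset"; pointer to "" means "explicitly empty"
}

// describe reports which of the two states the field is in.
func describe(s Spec) string {
	if s.Hostname == nil {
		return "hostname not set"
	}
	return fmt.Sprintf("hostname set to %q", *s.Hostname)
}

func main() {
	empty := ""
	fmt.Println(describe(Spec{}))                 // hostname not set
	fmt.Println(describe(Spec{Hostname: &empty})) // hostname set to ""
}
```

With a plain `string` field both cases would deserialize to `""`, and a controller could not tell whether the user wanted the default or explicitly asked for empty — exactly the ambiguity the convention avoids.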
We're not really an escalation point in the sense that we have the authority to dictate this or that — SIGs are quite independent. However, the people who tend to participate a lot in the SIG are very senior engineers; many of the people who started Kubernetes participate in it. So people tend to come to us for guidance around those kinds of decisions: what falls within the goals and principles that Dims talked about at the beginning, and how to stay within scope. It is within our purview to publish those principles and to help people interpret them, so in that sense we are somewhat of an escalation point. But we don't want people to come to us because they're in a fight — which, really, hasn't often been a problem. That's a big stick, and we don't want to use a big stick too often; it loses its power.

Operationally, the processes we showed are run by different groups of people: those are subprojects. This is a typical structure in Kubernetes: SIGs take responsibility for an area of scope, SIGs have subprojects, and subprojects are run by the individual people who actually make things happen. We have five of them in SIG Architecture, and we'll go through each of them, so I don't need to list them here.

API review — I talked a little bit about this already; feel free to jump in on any of these. Essentially, as I said, these individual PRs need to be reviewed, and it typically ends up being a fairly deep design review as well, because the API is so closely connected with the functionality itself. It's not just the little things I talked about; it's really about structuring the API in a way that will be familiar to users who are using all the other APIs within Kubernetes.
If you're interested in participating in this, it's really great — there's a shadow program. You can go to any of these links, come to SIG Architecture on Slack, or talk to us afterward, and we can hook you up with the right people. Essentially you say, "I'm interested in shadowing," then you go in and watch other people do these reviews, or you take a pass at one yourself and get feedback from the people who are already reviewers and approvers. And I will actually beseech you, if you're interested in this, to do it. While we have a good, broad set of reviewers, we have a very narrow set of approvers: there are four people in the world who can approve Kubernetes APIs, and three of them work at Google. That's not healthy. I'm from Google, but I don't think that's healthy — I want more people. I can't go to other companies and say, hey, bring people up to speed on this. I can do that internally, and we're working on that too, but that's not going to help at other companies.

The best way to engage is to come to our GitHub repository and look at pull requests. There is a label called api-review; that is where API reviews get triggered. If you just watch it for a few weeks, you will get a sense of what is going on, who is doing what, and how they are doing it. Then going through the two markdown documents you see there will give you a sense of: they faced this kind of problem before, so they codified a rule to make sure we don't paint ourselves into the same corner again and again. Then you can appreciate what we are trying to do here. And it's not just new APIs — they should also coexist with the existing APIs. That is very important too. It shouldn't be that somebody looks at an API and wonders, did somebody else write this? Why does this look so odd? It doesn't feel like any other API. Consistency is very important to us as well. So you can see that we are trying to do our best here.
One of the banes of what we do here: Go comes out with a release every three months like clockwork, and only the most recent releases are supported, so we have to keep upgrading Kubernetes to newer versions of Go. And that's not the end of it — we have a lot of dependencies. It's crazy; you can see the picture there, and that's actually only part of the picture, not even the full one. So what ends up happening is that periodically we have to revisit all the Go dependencies we use and see if there are any CVEs fixed in a newer release, or performance enhancements we need, or fixes for bugs that have been troubling our users for a long time, and then we have to upgrade. And when you upgrade, it's not just one dependency: a dependency typically has other dependencies, so we have to manage the upward movement of all of those as well. Typical bottlenecks: when we upgrade the etcd dependency, we run into trouble; with OpenTelemetry we ran into trouble — they had something like ten different packages we needed to upgrade, and it wasn't very clear which one works with what.

So we have a number of people, including me, who dedicate time to the code organization subproject, and we also set policies for dependency management. We have tooling we've developed that helps us look at the dependency graph: what is the depth of the dependencies, what is the breadth, and when a PR comes in, is it adding dependencies or dropping them? We are happier to delete code than to add code, because when we delete code we get to throw away the dependencies that only that code was using. There's a whole team of people who do this mind-numbing work — but it's very rewarding as well, because you end up talking to people from all kinds of projects. I have been able to talk to OpenTelemetry
folks, etcd folks — you won't believe the kinds of folks you end up talking to — and collaborating with Microsoft folks, on hcsshim for example. So there is real effort in some of this: dropping duplicate dependencies, or a fork of one thing that's used differently by two different packages, and things like that. It's very satisfying when we get to the point where we actually reduce them. Why do we want to reduce dependencies? Because all the dependencies get into your binaries, and the binaries bloat up. The more dependencies you add, the harder it is to upgrade them, and the more your binaries bloat. That is why we try to prune dependencies.

All right, the enhancements process. As Dims mentioned earlier, Kubernetes has grown into a very large project — not only from the point of view of the number of contributors and users; changes to Kubernetes have the potential to cause massive damage across the economy, frankly. What that means is that we are way past — years past — the stage of "hey, this would be cool, let's just add it." We need to be extremely cautious and mature about how we implement functionality in Kubernetes, and really deeply understand the implications of those changes. So several years ago this Kubernetes enhancement process was defined to achieve that. Obviously, if you are fixing a bug or something like that, you don't need to go through this process — but it's effectively a design review process for any major change or major new feature, where we work together as an open source community to gain consensus on that feature.

One of the important things we do here is that when a feature comes in, it comes in as alpha; from alpha it goes to beta, and from beta to GA. We track the evolution of the feature, and when you get to GA you can drop the feature gate,
essentially — you don't need the toggle, it's always on. But when a feature is alpha or beta, it's good to have the feature gate to toggle it on and off, depending on whether it's a development cluster where you're trying something new, or whether you're pushing to production and saying, "I don't want to take the risk of an alpha or beta feature unless I really, really need it."

Yeah, exactly. So this is the process by which we come to consensus, and the process by which we track those stages. Every time you go through them, you say, okay, I'm promoting this particular feature — this particular KEP — from alpha to beta, and it gets some additional reviews against criteria that are laid out up front. The KEP author says, before I go to beta, here are the things I expect to get out of the alpha release, for example. We have a subproject team that manages this process in conjunction with the release team. The release team actually runs the process — or rather, encourages the authors of those KEPs to move them through it — but defining the contents and the criteria is done by this team here.

Do you know why this is important? It's important because when you look at the changelog for a release, you can see exactly what changed, and this is the process that gives us that information to write down for you. Otherwise, with the number of PRs coming in, it's very difficult to track who is doing what and what we need to publish to our end users — you know, "if you have this thing on in your cluster, do not ever turn on this feature flag." We yell at you and scream at you in the changelogs, and that comes through this process.

One other area that's the responsibility of SIG Architecture is conformance
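The alpha/beta/GA gate behavior described above can be captured in a few lines. This is a minimal sketch of the lifecycle, not the real implementation in k8s.io/component-base:

```go
package main

import "fmt"

type stage int

const (
	alpha stage = iota
	beta
	ga
)

// gate models one feature gate: its maturity stage plus an optional
// operator override.
type gate struct {
	stage   stage
	enabled *bool // operator override; nil means "use the stage default"
}

// on reports whether the feature is active: alpha gates default to off,
// beta gates default to on, and GA gates are locked on so the toggle can
// eventually be removed.
func (g gate) on() bool {
	if g.stage == ga {
		return true // GA: locked on, the toggle no longer matters
	}
	if g.enabled != nil {
		return *g.enabled
	}
	return g.stage == beta // default: off for alpha, on for beta
}

func main() {
	off := false
	fmt.Println(gate{stage: alpha}.on())               // false: alpha defaults off
	fmt.Println(gate{stage: beta}.on())                // true: beta defaults on
	fmt.Println(gate{stage: beta, enabled: &off}.on()) // false: operator opted out
	fmt.Println(gate{stage: ga, enabled: &off}.on())   // true: GA ignores the toggle
}
```

The "locked on at GA" rule is what lets the gate be dropped later without changing behavior, which is exactly the point in the talk about no longer needing the toggle once a feature graduates.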
testing. If you look at EKS — we all have that "K," and we all get to use the name Kubernetes. The CNCF owns the Kubernetes trademark, and they have a process: each of us who produces either a hosted Kubernetes or a distribution has to submit a set of test results showing that we comply with a certain set of conformance tests. That process is run by the CNCF, and we can only use the name Kubernetes if we pass those tests. But which tests those are is what the conformance subproject decides. So we actually choose the tests that define what Kubernetes is and who can use the name Kubernetes.

It also feeds into the enhancement process, because when a new feature goes GA, it had better have a conformance test — how else will you know it works exactly the same across providers? So we make sure that when people go to beta they already have a set of tests exercising the feature, and when the feature goes GA we turn on a flag that says this test is now a conformance test; it must be run by any vendor that wants to use the trademark.

Yes, exactly. One of the things we have been doing in this subproject for several years now is paying down technical debt. We had a ton of functionality within Kubernetes that was tested but did not have conformance tests — it wasn't considered part of what you must implement in order to be Kubernetes. Conformance tests have to meet certain criteria: for example, they have to be able to run across all these different providers and across different architectures — there's a whole set of criteria — and they can't rely on optional features. RBAC, in fact: we all use RBAC, but RBAC is not part of conformance, because you don't actually have to turn RBAC on. So there are some subtleties there.
So we have had a team — we hired a team of folks from ii, an organization out of New Zealand — and they have been churning away at that debt for a couple of years now. Just in the last week or two we have gotten to 96% coverage — it's right there, 95.7%. This started at around 60% a couple of years ago. We also put in place policies that say: if you go to GA, and you are not an optional feature, and you meet certain criteria, you must have a conformance test. So we stopped digging ourselves deeper, and we have been backfilling tests for a couple of years. We are excited to see that moving forward.

Then the last piece: production readiness reviews. We instituted this a few releases ago — I think in 1.21 it became mandatory. In the KEP process, we added one more little gate. And we get complaints, right? We're in a position where we're putting gates on people and, in their minds, creating bureaucracy. But it's for the reason I said earlier: you check something into Kubernetes and it goes out to half a million clusters, and all of a sudden all the planes are down for three days — that's a real problem. So we've put these gates in to make sure that developers are doing what they're supposed to do.

We're both software engineers; we know how it goes: you're building your feature, you're excited about it, you're thinking about the happy path — how it's going to work, how great it's going to be, all the great stuff it will do for people. But often people lose sight of the unhappy paths. So production readiness is just a simple set of questions we added to the KEP process to make people think about all those unhappy paths. We ask them the simple question: how do I turn this thing off if it's not working? And we make sure that people have actually implemented those feature gates properly.
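In practice, "how do I turn this thing off" usually comes down to a feature gate an operator can flip on a component's command line. A sketch of what that looks like — the gate name `MyAlphaFeature` is made up for illustration, though the `--feature-gates` flag itself is the real mechanism Kubernetes components use:

```
kube-apiserver --feature-gates=MyAlphaFeature=false ...
```

The production readiness questions verify that this escape hatch actually works: that the feature honors its gate, and that disabling it mid-flight doesn't leave the cluster in a broken state.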
I mean, it's not a police force — we trust our SIGs, we trust the people who are building things — but it's there to make sure the feature is documented for users, and to make sure the authors have thought it through. Having had this mandatory for the last couple of years: every year we do a survey, and I'm very happy to say that in our most recent one we asked, "Is Kubernetes more reliable than one year ago?" Three quarters of the operators and cluster administrators who responded said yes, another 20% said they weren't sure, and just a tiny percentage said no. So we're very, very happy to see that kind of result.

Yeah, we're running short on time — the slides will be posted for sure, so check the three slides I skipped over on how to participate. Come to the main SIG Architecture meeting, come to the subproject meetings you're interested in; there are project boards, a mailing list, Slack. There's one URL where you can find the information for all the rest. And we would like you to ask a question — any question, it doesn't matter. All of us started at one point or another, so we don't judge you for the question you're asking. Please speak up, please ask how we are doing things and why we are doing things, and we'll be happy to walk you through them. Another example John already talked about is the shadow programs we run, which are really good both for beginners who are just trying to learn what we do, and for experienced people who will then turn around and use the same approach in their own projects.
So yeah, and I'll say that some of those — API review, for example — will probably require quite a bit of knowledge of Kubernetes and its APIs. But production readiness actually fits the profile of someone working as an SRE: you don't need to know the Kubernetes code, you need to know what it's like to have 10,000 clusters with something broken somewhere, not knowing what it is or how to find it. That's exactly the kind of viewpoint we need, so come to those.

And the same goes for code organization: all you need to know is Go — do you know how to compile? That's it. If you have to update a dependency, we have written down the steps; all you need to do is figure out which dependency to upgrade and what version to upgrade it to, run some scripts, and push the PR. The CI jobs will alert you if something goes red, and then you iterate. So if you just know Go, that's good enough to start messing around with the code organization subproject. Okay, any questions? Any questions? Okay, thanks a lot — we'll be around for a few minutes. Thank you.