Hello, hello. Hi, everyone. Thank you so much for joining us today. I am Layla Jalali. I am an engineering manager at Google. I've been a member of the community for three, four years now, and I'm so proud to be a member of this amazing community. Stefan is here with me. Today we are going to talk about SIG API Machinery, the SIG overview. The first part of the talk is the overview, the introduction, and how you can get involved and help us. And then Stefan will have two topics and a demo, an amazing demo. Let's start.

OK, so where are we on the journey? Kubernetes started in 2013, 2014, and now, after 10 years, we have 83,000 contributors, and 96% of organizations are using Kubernetes or evaluating it. The blue line shows the search term on Google for Kubernetes API server, and you see that it has increased over time. There are many projects in Kubernetes, and API Machinery is one of the important SIGs of Kubernetes.

Here you see the overview of what is going on in a Kubernetes cluster, and at the center is the kube-apiserver. We have the scheduler to schedule the pods that don't have an assigned node yet. We have the controller manager, where the different controllers are all compiled into a single binary. We have the cloud controller manager for cloud-specific logic, and on the node, the kubelet and kube-proxy to implement the Kubernetes service logic, the network rules. What we have for API machinery is highlighted here: some of the controllers are in API machinery, not all of them; the controller manager; the parts of the machinery to create the objects and APIs. And all the communication in the cluster goes through this kube-apiserver.

Now the question is, why do we see API machinery everywhere? It reminds me of this picture from The Lion King: everything the light touches is API machinery. I see it everywhere in Kubernetes. And this is from our charter: SIG API Machinery is responsible for the development and enhancement of the Kubernetes cluster control plane. The scope includes the API server, the persistence layer (the data store for Kubernetes), the controller manager, the cloud controller manager, CRDs, webhooks, and more.

API machinery doesn't mean all the APIs in Kubernetes. That is a misconception. We own some of the APIs, we know about the APIs, and we might be able to answer some of the questions about them, but not all the individual APIs are owned by API machinery. The charter at that link has the detailed list of what is owned by API machinery: the mechanisms to read, modify, and delete objects; parsing, conversions, defaulting, validation; OpenAPI and discovery; CRDs; webhooks; the client and informer libraries; how to maintain a healthy system, which is very important; the controller manager; garbage collection; and the namespace and cluster lifecycle. And together with our new SIG, SIG etcd, and also SIG Scalability, we own that persistence layer, the data store. And what is out of scope? As I already mentioned, the individual APIs, how to work with them, and the applications are out of scope for API machinery.

Now the question is, why is API machinery so complex? The onboarding is so hard. This was one of the most complex projects that I have worked on, and everyone knows that it is very difficult, that it is hard to contribute. Let's take a look at why that is the case. This is a Kubernetes cluster. We have the control plane, and in the center we have the API server.
There are the controller manager, the cloud controller manager, the scheduler, and the etcd layer, and we build extensions for the things in the control plane and the things that are outside the control plane. And there are different users: there might be human users, or there might be clients, like the Python client and others. Each of these components in the control plane has its own process, and that adds to the complexity.

Then, to build a system that is highly available and fault tolerant, we have replicas of this, for example three or more replicas, so that if one control plane fails, another keeps the cluster up and running. So we have different instances of the control plane and the API server, and all the communication goes through the API server. And etcd can run as stacked etcd nodes or external etcd nodes, which adds to the complexity. For load balancing there are cloud-provider-specific load balancers and leader election. You see how this picture shows the complexity when you have different components: they might run different versions of Kubernetes, and now there are questions about what the maximum version skew should be, and about the need for safer upgrades in Kubernetes.

One example I was thinking about for the complexity is aggregated discovery. This picture is from the OpenAPI work in 2022: whenever clients made calls for discovery, the discovery objects were so small and there were so many of them. There were so many API calls to get all those objects, to know what is available on the cluster, like the operations. Then in 2023, another layer of aggregation was built on top of the server, so all the discovery requests from clients go through the aggregated endpoint. That means the latency that we saw in those API calls was removed, and the performance increased. You see how much simpler this picture is just by having that. And there were alternatives to aggregated discovery, alternative implementations, that the team went through and worked on. Jeffrey Ying, Antoine Pelisse, and Alexander Zielenski were part of that. So you see how one thing, one proposal, connects to so many different things in API machinery. That is one of the reasons it is complex.

We had two approaches to see how much code we own. The first one was using SIG labels, but those SIG labels were only introduced around 2018. So we went through another approach, file cataloging. And we got 25% to 40% code ownership for API machinery, depending on which approach we used. This is huge, and it is another reason why we see this complexity.
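To make the aggregated discovery change above concrete, here is a minimal client-side sketch using client-go's discovery client. Recent client-go versions negotiate the aggregated discovery document (a single round trip against /apis) and fall back to the old per-group calls against older servers; the kubeconfig location is an assumption for the example.

```go
// Minimal sketch: list every group and resource a cluster serves.
// With aggregated discovery this is one logical call instead of one
// request per API group/version. Error handling kept short on purpose.
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes the default kubeconfig location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	_, resourceLists, err := dc.ServerGroupsAndResources()
	if err != nil {
		panic(err)
	}
	for _, list := range resourceLists {
		for _, r := range list.APIResources {
			fmt.Printf("%s %s\n", list.GroupVersion, r.Name)
		}
	}
}
```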
This is from the 1.29 enhancement tracking; we have different things going on. There are things related to performance: priority and fairness, by, I think, Mike Spreitzer, generalizing API requests to have priorities and queues per priority; then the CBOR serializer, the work from Ben Luddy; and then the informers, the watch-list work from Łukasz, to get the data as a stream. There is the work on CEL. CEL is a huge project; Cici Huang is working on that, with Jiahui Feng. Cici has a talk tomorrow, around noon, "Declarative Everything". I really encourage everyone to join that talk, and Stefan will talk about some parts of CEL today. And there is the work that Alexander Zielenski is working on.

And then we have things related to maintenance: CLI improvements, the support for list queries, and also the work to support WebSockets that Sean Sullivan is leading. These are the main pieces of work; there is other work going on as well. Aggregated discovery, which I already talked about, was one of them.

Now let me talk about the leadership of the SIG. We have David Eads and Federico Bongiovanni; they are the SIG chairs. And for the SIG tech leads, we have Joe Betz and David Eads. They are working tremendously with the people in the team, on all the projects and the directions of the SIG. They are helping, giving guidance, and are there to work with.

Then, how can you get involved if you're thinking about joining and working on something? We have the SIG meeting, like other SIGs; it is a 60-minute meeting, bi-weekly. Then we have the bug triage meeting, which is twice a week. This is a great way to join us and get involved, not only with the issues that are raised, but with the tasks that help the larger projects. We have the subprojects and working groups, like API Expression and Kubebuilder, and the one for CEL is very active. They have their own meetings, their own Slack channels and communications. There is an upcoming project-based mentorship program: we have a list of mentors and a list of projects, we are going to announce it, and people can apply. It is three to six months of one-to-one mentorship, and it is project-based because we want those larger projects to have more people. But that doesn't mean you cannot suggest other projects; we are there to hear the ideas. We hope to kick this program off next year, in January. And with that, I give it to Stefan.

Yeah, so you have seen all the KEPs, lots of work going on. I want to highlight two topics, two KEPs, which are a bit underrated. Not so feature-heavy: one is more about the philosophy of what we are doing, and the second one is about CRDs. Everybody uses CRDs, and everybody knows the pain of changing validation. So that is the second topic.

The first one, the philosophy one, is KEP-4080. It's not really a feature thing; it's more of a code reorganization. It's about building, or extracting, something from Kubernetes, adding a new layer to the layers we already have. Everybody knows k8s.io/apiserver, for example, as one layer, one repository. And at the end there is, of course, kube-apiserver: kube-apiserver uses the apiextensions server, the aggregator, and then, of course, the actual APIs of kube. What we are doing here is adding another layer called a generic control plane: basically building a cluster, an API server with controllers, which feels like kube, but there are no pods, no networking, nothing that is pod-related. And it feels natural to have that. Today, everything in kube-apiserver is basically hardwired. You can do this, and people have done it, but you have to fork kube-apiserver, basically, to do it. This attempt is to make this cleaner and allow the use of Kubernetes for other purposes, but also to improve the code quality. Kube-apiserver is really a spaghetti ball, with everything connected to everything, basically, so it's very hard to understand, and this layer will help. It's ongoing: there are some changes in 1.28 and some in 1.29. It's not finished yet; it's ongoing work.
So the basic idea, and this is really just a sketch: we might get a new staging repository called generic control plane, but don't take that name at face value yet; it might be called something a little different. This is a sketch of the direction we want to go, the vision. And what we want to put there is basically what we have in certain packages today: there is the control plane package in k/k today, and there are the kube-apiserver options and kube-apiserver admission packages. If you look into them, they are kind of generic already. Interestingly, the control plane package itself is not generic. I mean, we inherited that; we named those packages years ago, when there was no such plan, so they don't really represent what we have put into them. We will change that; that's part of this KEP-4080 work.

And basically, when you think about what a generic control plane is, you come to something like this. CRDs, obviously, are generic. Namespaces are generic. Some of the resources in core, like secrets and config maps. All the authorization is generic; service accounts; the whole admission stack, and we have many things like webhooks and policies nowadays, is generic. Quota is mostly generic; aggregation and APIServices as well. And of course, for all those APIs, there are controllers running. Some of them you basically need for a cluster that feels like kube: garbage collection, for example. Things should go away when you remove the owner of an object, and the children should be removed too. Namespace deletion is something you just expect from a cluster. And if you want resource quota, there's a quota controller plus an admission plugin for that.

So this is basically what you want: to get something like a kube control plane with those features. But some of them, obviously, are optional. Depending on the purpose of your control plane, you might not want authorization, because you have a different mechanism for authorization. Or secrets and config maps don't make sense. Or you don't want extensibility; maybe you don't want admission webhooks at all, because you have everything coded in Go. So all those things can be disabled, and the idea is to make this much easier than it is today.

A quick demo. It's not super extensive, but I just want to give you a flavor of what we are doing. If you have seen or built an API server, there's this very central method called CreateServerChain. What it does is build a chain: it creates a CRD API server, that's the apiextensions-apiserver, and then it puts another API server in front of it, namely the kube-apiserver, the main one of kube, the one which implements config maps and secrets and the other built-in things. And the third one is usually the aggregator. The aggregator is what was shown before: it serves discovery, aggregated discovery; it serves OpenAPI and those things. So it's pretty central, actually; it's not only implementing aggregation, like redirecting to other servers, it's doing much more in Kubernetes. And those three together make up kube-apiserver, basically.
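The chain described here is a delegation pattern: each server handles the paths it owns and hands everything else to the server behind it. Below is a deliberately toy, self-contained sketch of that shape in plain net/http. The names and handlers are made up for illustration; the real CreateServerChain in kubernetes/kubernetes wires three genericapiserver instances in the same order (apiextensions at the back, kube in the middle, the aggregator in front), not raw http.Handler values.

```go
// Toy sketch of the kube-apiserver delegation chain. Illustrative only.
package main

import (
	"fmt"
	"net/http"
)

// newServer serves its own paths and delegates everything else.
func newServer(name string, paths map[string]string, delegate http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if what, ok := paths[r.URL.Path]; ok {
			fmt.Fprintf(w, "%s: %s\n", name, what)
			return
		}
		delegate.ServeHTTP(w, r)
	})
}

func main() {
	// Back of the chain: the CRD (apiextensions) server.
	crds := newServer("apiextensions", map[string]string{
		"/apis/apiextensions.k8s.io/v1/customresourcedefinitions": "CRDs",
	}, http.NotFoundHandler())

	// Middle: the built-in kube APIs (config maps, secrets, ...).
	kube := newServer("kube", map[string]string{
		"/api/v1/configmaps": "built-in kube APIs",
	}, crds)

	// Front: the aggregator, which also serves (aggregated) discovery.
	aggregator := newServer("aggregator", map[string]string{
		"/apis": "aggregated discovery",
	}, kube)

	http.ListenAndServe("localhost:8080", aggregator)
}
```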
And yeah, you can run this thing. This is a prototype; it's not merged yet. Most of you will know the test server we have for integration tests. You can run that and see what comes out: if you run it, it just prints, as a test, the APIs which we have.

So all the API groups and resources are here: config maps, secrets, service accounts, and so on, everything which is part of the API server. And you already notice there are things like priority and fairness, for example, which is kind of essential for a control plane if you want priorities in request handling. There are also things like leases, which is unexpected. So there are some things which belong to a control plane, and we'll come to what a control plane actually is in a second.

And then I tried to take that one step further: what is the minimal control plane we can build this way? So I did the same experiment and removed things, and hoped it wouldn't crash. Authorization, for example: there is no RBAC anymore, so things like that you have to fix. And you see the list is already a lot shorter. This could be, basically, a very minimal control plane, and it would work.

But it's not as simple as I just showed. If you launch that completely and wait for readiness of the API server, you will notice it will not start. You see all those reflector issues here; you see endpoints and services. And you will wonder: why endpoints and services, where are they even used? But you will find out: our aggregation layer, the kube-aggregator, uses services and endpoints. You need those concepts to even run aggregation. So there's more work to do. If you want this, we would need an aggregator which is not aggregating, not redirecting anywhere else, or one with a different mechanism, whatever. There's more work to do than just disabling APIs.

All right, so let's go back. What is a minimal kube control plane? And this is a philosophical topic, right? We have just seen the generic one, the first example. Everybody agrees this is a control plane; it feels like a Kubernetes cluster when you work with it. It has everything. I mean, not the workload resources, but it has everything else, basically. It's extensible: you can have CRDs, you can have admission webhooks and policies; those things are complete in that cluster. You have RBAC and you have authentication, so you can call out to an authorization webhook, an authentication webhook. All those things work. But is this a control plane?

The minimal one is the second example I had. I would say yes. You don't have CRDs anymore; I just disabled the part where I instantiate the apiextensions API server. And what you are left with is basically those hard-coded Golang APIs. If you want to use that, you could, but you have to implement your APIs in Go. But this could be a valid use case, right?

If you look at the ecosystem, people are doing experiments, and they call that Kubernetes; they have implementations which look very much like Kubernetes. One example, and I'm not sure people have seen it, is this tool Tilt; I think it's all about Docker now. If you use the command-line tool and ask for API resources, it will show you kube resources. This is an API server, and to my knowledge it's using Kubernetes; if anybody knows more about that, I'm happy to hear it. But if you look at the API resources, this is super minimal. There is no RBAC anymore; not even namespaces exist. So it's super minimal. And is this Kubernetes? Is this an API server? We could call it Kubernetes. It's an open question. And to my knowledge, as I said, this uses Kubernetes under the hood.
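You can get a first taste of trimming a control plane today with the stock --runtime-config flag, without waiting for the KEP-4080 switches. Below is a sketch using the in-tree integration test server mentioned above; the package path and helper names are recalled from the kubernetes/kubernetes tree and may drift between releases, so treat this as an assumption-laden sketch rather than copy-paste. And, as just described, a truly minimal server needs code changes beyond flags, since the aggregator still wants services and endpoints.

```go
// Sketch: start a kube-apiserver test server with some built-in API
// groups disabled via --runtime-config. Core v1 cannot be disabled.
package apiserver_test

import (
	"testing"

	kubeapiservertesting "k8s.io/kubernetes/cmd/kube-apiserver/app/testing"
	"k8s.io/kubernetes/test/integration/framework"
)

func TestTrimmedServer(t *testing.T) {
	server := kubeapiservertesting.StartTestServerOrDie(t, nil,
		[]string{"--runtime-config=apps/v1=false,batch/v1=false"},
		framework.SharedEtcd())
	defer server.TearDownFn()

	// server.ClientConfig can now feed a discovery client to print
	// the (shorter) list of served groups and resources.
}
```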
And you can go one step further. Acorn was just announced a week or so ago, and Darren, the main author of that tooling, implements APIs which are kubectl-compatible, even controller-compatible. You can run informers against it; they support all the verbs, including watches. But it's not kube-apiserver; it's not API machinery. There are some parts they borrow from API machinery, and it implements the API, but it's not kube anymore. So I wouldn't call that a kube cluster; there's no feeling of a cluster. You use kubectl to do CRUD against it, but that's basically it. So that's a philosophical question: do we call that a minimal control plane? What is this? And of course, there's a question for the SIG, and basically a question for the ecosystem: do we want to support those things? Those experiments could maybe become part of the SIG, or we leave them as contributions outside. Nothing to answer today, but I just wanted to show that.

All right, that's the first topic. The second one is about CRDs. Very concrete: the KEP is 4008. Alex basically showed up a few months ago and said: I hate how we do validation in kube; it's so awkward to change validations. And most of you will probably have experienced that yourselves: when you strengthen or weaken validation for fields, you always have to remember there's something in etcd already, so users might have objects which don't validate anymore. And the consequences of that are pretty extreme, as we will see in a second.

So here's an example: a custom resource definition with a schema, and it has two fields, replicas and ips. Don't ask me about the context here; it's just made up. But replicas is obviously a number, and ips is a string. Many people start like that, usually generated with Kubebuilder, and they get a schema, and they're happy. And then users put something into the fields: a replicas value like 2.5, or IPs which have five components, which obviously doesn't make sense. It's not what we wanted, but people have created those objects in the cluster, so they are in etcd. And if you now go ahead in the next version, you realize: OK, I forgot the minimum. That's the first mistake. And the type number was maybe not good; integer is much better. For the IP, I want a format or a regex or something like that; I want to restrict it to real IPs.

And what happens? Those objects become basically immutable. When those fields have those strange values, you cannot edit the spec anymore, because validation fails. You cannot even annotate: if you have a tool which annotates objects, it will fail, because the spec has invalid values. And there can be machinery issues, like finalizers: if there are finalizers on those objects, you cannot remove them, because the spec is invalid. So it's a very bad situation, and we have been living with that for years.

So, but CEL! CEL solves everything, right? If I want to strengthen validation, why can't I just do it in CEL and basically reference the old self? Doesn't work. No, it works. OK, maybe not. So, you see it here: you reference oldSelf, and you check whether oldSelf, in the example of replicas, was not an integer before. And if it wasn't one before, I'm fine with that; I don't show an error. That is basically what the rule is saying: if it's not an integer in the old self, then I don't care; but if it has been an integer before, then I want to check. And if that's not the case, I show the error: only integers are allowed. And I try the same thing, of course, for the IPs.
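Here is a rough reconstruction of that attempt, written against the apiextensions-apiserver Go types; the ips field and the regex are the made-up example from the talk. The catch explained next: any rule that references oldSelf is a transition rule, evaluated only on updates.

```go
// Reconstruction of the oldSelf attempt: "if the stored value already
// looked like an IP, the new value must too; otherwise let it slide."
// Because the rule references oldSelf, it is skipped on CREATE.
package crdschema

import (
	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
)

// [.] instead of \. keeps the regex free of CEL string escapes.
const ipRe = `^([0-9]{1,3}[.]){3}[0-9]{1,3}$`

var ipsSchema = apiextensionsv1.JSONSchemaProps{
	Type: "string",
	XValidations: apiextensionsv1.ValidationRules{{
		Rule:    `!oldSelf.matches('` + ipRe + `') || self.matches('` + ipRe + `')`,
		Message: "only real IPs are allowed",
	}},
}
```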
And you will realize, when you try that, that this is not what you want, actually. Because the moment you reference oldSelf, the rule becomes an update rule, which means it's not validated on create. And that's not what you want, because you want new objects to follow the rules too. So you quickly find out you cannot express this with the CEL we have. It was kind of an oversight, but you cannot express it. So we need something new.

And what we came up with, and this is mainly Alex's work, he's pushing the topic and doing all the development of the theory and implementation, what we, or mostly Alex, came up with is: we validate per field, and we call it ratcheting. Ratcheting validation is something we have in Go everywhere today: sometimes we introduce some new validation, and we check that if the old object validated, then the new one also has to validate. We do that in code, but we haven't done it for CRDs. So the idea is: when replicas is unchanged, when replicas keeps its value of 2.5 in the example, then this green box is not required. Same thing for the IP. So it's on the field level: basically, you look at the OpenAPI schema, and this applies to the field it's specified on. And we call that, obviously, ratcheting, and you see this graphic here; this is the logic behind it: the moment something validates once, it must validate at all times in the future.

Sometimes this is not enough; sometimes we need more logic, and as we have seen, CEL alone cannot express those things. So CEL gets an extension. It's called optionalOldSelf, and those rules are not update rules anymore; they are checked on update and on create, and you have the hasValue() and value() functions on the old self to express what you want to express. This way you can define basically arbitrary ratcheting validation rules. So this is coming in 1.29; it's available soon. I hope it helps people write validations and makes life easier.

And what I want to point out here: what I just described looks super simple, right? But it took months to get to an understanding of what we actually want, because you can do much more ratcheting validation, especially the automatic kind, in CRDs, and it's not obvious that this is what we want. So there's a lot of work behind it, although it looks super simple. And that's why I said it's flying under the radar: it's something everybody will have soon, and I hope it helps people. So thanks to Alex for all the great work and for pushing and starting this topic.
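Here is the same made-up ips example rewritten with that extension, again as a sketch against the apiextensions Go types; the OptionalOldSelf field ships with the CRDValidationRatcheting work, so treat the exact field name as an assumption tied to recent releases.

```go
// With OptionalOldSelf the rule runs on CREATE and UPDATE, and oldSelf
// becomes an optional value with hasValue() and value(). On create the
// format is enforced; on update it is only enforced if the stored
// value already satisfied it: the ratcheting behaviour described above.
package crdschema

import (
	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	"k8s.io/utils/ptr"
)

const ipRe = `^([0-9]{1,3}[.]){3}[0-9]{1,3}$`

var ipsSchema = apiextensionsv1.JSONSchemaProps{
	Type: "string",
	XValidations: apiextensionsv1.ValidationRules{{
		Rule: `(oldSelf.hasValue() && !oldSelf.value().matches('` + ipRe + `')) ` +
			`|| self.matches('` + ipRe + `')`,
		Message:         "only real IPs are allowed",
		OptionalOldSelf: ptr.To(true),
	}},
}
```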
All right, that's it, I think. We have seven minutes left. Just to point out: on Thursday there is the old meet-the-SIGs meeting; it's called the meet-the-community, contributor community session. Three hours; we are there. So if you have any questions about any of the KEPs or anything else, meet us there. Yeah, that's all. Thank you, and we have time for questions, I think.

Yes. Yeah, so kcp did this work manually, right? Kube didn't support this generic control plane yet. So it's very similar, in the sense that the work I presented here, plus workspaces, is kind of kcp, maybe plus API bindings. But yeah, it's a similar direction, obviously.

Yes. I don't think the controller-runtime library needs anything specific; it just works with the generic control plane. And this is the idea, right? All the tooling in the ecosystem should just work. Controller-runtime, things like GitOps tools, they should just work, because it's a kube cluster, right? But the question is: the other examples I brought up, which are not clusters, are they compatible with the tooling in the ecosystem? That's not so clear, right?

Thank you for your talk, Stefan, Layla. In your estimation, what do you see the future being for these things like Acorn and Tilt that are popping up? Are these built-for-purpose kinds of control planes, where you could see them maybe being used more for compliance reasons, off the map? I mean, it looks a little askew; it doesn't look like it's going to be mainstream to me. But what is your general thinking around it?

For Acorn, that's their API, right? You can use GitOps against Acorn and deploy your applications. So it's their API, and it's even the implementation they use for their controllers. They like the controller pattern internally, so they want to implement controllers via informers and events, everything that you know. So, yeah.

Okay, thank you.

Yes. I have to ask Joe, maybe. Yes: there's something called match conditions on admission webhooks, and we're super excited about it. You can put a CEL expression there that sees the old object and the new object, and it can make any decision it wants about whether to call out to an admission webhook, yeah. Thanks.

Further questions?

Hey, do you envision, with the generic control plane, that people could piece specific Kubernetes APIs into their own control planes? And if so, do you think that modularity would then lead to making the client-go interfaces a bit more modular too?

So basically, you can use client-go, even if it's big, right, you have all the APIs. You can use it against the generic control plane, and it just works. I don't think there are plans to split it up at the moment. Ask David on Thursday. And whether you can pick APIs from kube: I mean, this is the idea here. The code is there, right? You can pick some part. I picked service accounts to be part of this, and you could have, I don't know, just the Ingress APIs. You could do that. Whether that makes sense depends on your use case.

Okay, thanks. All right, thank you. Thank you.