Hi, everyone. Welcome to SIG API Machinery Advanced Topics, our maintainer's track session for KubeCon North America 2022. My name is Federico Bongiovanni. I'm a co-chair of SIG API Machinery in open source Kubernetes. Two very special guests joined me today to deliver two great presentations: David Eads from Red Hat, who is also my co-chair and tech lead in SIG API Machinery, and Jeffrey Ying, who is one of our top contributors in SIG API Machinery and other areas of Kubernetes. The two topics we are going to talk about today are the power and the danger of aggregated API servers, and OpenAPI v3, which recently graduated to beta and is making its way to GA. So I will leave it to David and Jeffrey, and I will give you some information at the end of the talks. See you later.

Hi, my name is Jeffrey, and I'm a software engineer at Google and a contributor to SIG API Machinery. Today I'll be talking about OpenAPI v3. OpenAPI v3 is a feature that went beta in Kubernetes 1.24. To give a quick introduction, OpenAPI is a language-agnostic interface, in a format that is both human and machine readable, for describing the capabilities of a service, the service in our case being Kubernetes. The OpenAPI is provided in both JSON and protobuf formats, and some of the consumers within Kubernetes include kubectl explain, autocompletion for UIs, and documentation: the official Kubernetes documentation is generated from the OpenAPI, and clients are generated from it as well.

This talk is about OpenAPI v3, but let's briefly go over some differences between v2 and v3. OpenAPI v3 is a restructuring of v2 that allows reuse of API components and also provides some new features. The most notable of these is extended JSON Schema support. OpenAPI v3 uses a newer version of JSON Schema that supports four additional, pretty important fields: oneOf, anyOf, default, and nullable.

For those of you already familiar with custom resource definitions, you might know that OpenAPI v3 is already supported in the structural schema. I'd like to highlight one key difference between a schema and a specification: a schema represents the definition of a particular type (or multiple definitions), while a specification refers to the entire OpenAPI document, which includes the schemas but also additional things like paths and so on. So one thing to note is that without this OpenAPI v3 feature, we wouldn't be publishing an OpenAPI v3 specification, which means we'd still be publishing only OpenAPI v2. What this means is that the API server is able to understand OpenAPI v3 through the structural schema, but because it's not able to publish v3, certain fields and objects need to get converted into OpenAPI v2. For most fields that's not a problem, it's a direct conversion, but certain fields are only representable in v3 and not in v2. This includes things like default and anyOf. One example: in Kubernetes we have a type called IntOrString, whose value can be either an integer or a string, and in OpenAPI v2 there's no way to represent that, so it just gets stripped down to no type at all. Some additional examples: if you have anyOf in your validations, the anyOf gets stripped, and the same goes for nullable. One important thing to note is that a lot of these fields are removed recursively, which means that if you have a root object that is, say, nullable, then when that gets stripped, its properties will get stripped along with it.
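To make those lossy fields concrete, here's a small Go sketch of a structural schema using the apiextensions/v1 types; the Foo resource and its field names are hypothetical, but Nullable, Default, AnyOf, and XIntOrString are the actual schema fields involved.

```go
package main

import (
	"fmt"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
)

func main() {
	// A hypothetical structural schema for a Foo custom resource.
	schema := apiextensionsv1.JSONSchemaProps{
		Type: "object",
		Properties: map[string]apiextensionsv1.JSONSchemaProps{
			// "replicas" has a default: expressible in v3, stripped
			// from the published v2 document.
			"replicas": {
				Type:    "integer",
				Default: &apiextensionsv1.JSON{Raw: []byte(`1`)},
			},
			// "port" is an int-or-string: v3 can model it with anyOf,
			// while v2 strips it down to no type at all.
			"port": {
				XIntOrString: true,
				AnyOf: []apiextensionsv1.JSONSchemaProps{
					{Type: "integer"},
					{Type: "string"},
				},
			},
			// nullable is a v3-only field; in v2 it is removed, and the
			// removal cascades recursively through nested properties.
			"lastError": {Type: "string", Nullable: true},
		},
	}
	fmt.Printf("%+v\n", schema)
}
```

With the v3 publishing feature, all of these fields survive into the served document instead of being stripped.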
So on the left, you can see that we have an object with two fields, and by going through the OpenAPI v2 conversion, all of those fields and properties are dropped. Now, this is not really ideal for clients who rely on the published schema as the source of truth, because it's just incomplete information. The good news is that with OpenAPI v3, this is all preserved. What's on the left is basically what you get in the published OpenAPI v3: pretty much a lossless version of what the structural schema defines.

Now, let's talk about the OpenAPI v3 endpoint. Currently, we have two OpenAPI endpoints. One is the /openapi/v2 endpoint, and the new endpoint is /openapi/v3. If you're a user of the Go client library, client-go, then you'll probably know that the discovery client has a function to obtain the OpenAPI v2 schema, called OpenAPISchema. Similarly, we created a new function called OpenAPIV3, and the transition from OpenAPI v2 to OpenAPI v3 should be pretty seamless. Now, I'll go into a bit more of the internal details for non-Go clients, and also provide a couple more reasons to use OpenAPI v3 over v2.

One improvement we made in OpenAPI v3 is that we've basically split up the OpenAPI. Essentially, that means that when you hit the root endpoint, /openapi/v3, you'll get a list of group-versions rather than the entire document, and you can then use these group-versions to access their particular URLs. For example, apps/v1 would be at /openapi/v3/apis/apps/v1, and that path contains the OpenAPI schema for only the apps/v1 group-version. This is obviously much smaller, and clients who require just a small subset of group-versions do not need to download the entire document.

There are a couple of additional reasons, too. Number one, obviously, as we already discussed, the size. An aggregated OpenAPI v2 document is on the order of megabytes, and you might think that's not very large, but it grows with the complexity and the number of CRDs that an API server supports, and with the number of aggregated API servers, so it can add up to be quite large. Also, say there is an incremental update to a CRD. With OpenAPI v2, an update to a single resource basically cascades into a recomputation of the entire spec, and that leads to CPU and memory usage spikes. A spike like that, depending on the amount of resources available, could potentially cause an out-of-memory error and cause the API server to restart, which is not ideal. Finally, there is the complexity of aggregation. OpenAPI v2 has an aggregator that sends requests to all the sources of specs, including built-in types, CRDs, and aggregated API servers, and it downloads and merges all the OpenAPI specs. That obviously takes up resources. OpenAPI v3 basically skips that merging and acts more like a proxy, so it's much faster and less resource intensive.

Now, let's talk about an advanced feature of OpenAPI v3, called cache busting. This is the output that you get when you go to the /openapi/v3 endpoint. You'll notice that we have the list of group-versions, but we also have an additional thing attached at the end of each URL for fetching a group-version, and that is this hash parameter. The hash parameter works similarly to an ETag, but it allows us to skip one pretty critical request.
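To show what this looks like from the Go side before we get into the caching mechanics, here's a minimal client-go sketch, assuming a 1.24-era client-go where the discovery client grew the OpenAPIV3 method; the hashed per-group-version URLs from the root document are handled for you underneath Paths().

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (sketch; minimal error handling).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// OpenAPIV3() is the v3 counterpart of the discovery client's
	// OpenAPISchema() function for v2.
	openapiClient := clientset.Discovery().OpenAPIV3()

	// Paths() returns one entry per group-version instead of a single
	// aggregated document.
	paths, err := openapiClient.Paths()
	if err != nil {
		panic(err)
	}

	// Fetch the schema for just apps/v1 rather than the whole spec.
	if gv, ok := paths["apis/apps/v1"]; ok {
		data, err := gv.Schema("application/json")
		if err != nil {
			panic(err)
		}
		fmt.Printf("apps/v1 OpenAPI v3 document: %d bytes\n", len(data))
	}
}
```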
So I'll first describe what happens with ETags. With ETags, number one, you send a GET request to the /openapi/v3 endpoint, and it returns a 200 with the list of all the group-versions. Now, for each group-version, you first check your cache to see if you've sent a request for that group-version before. Let's say you have; then you'll have an ETag associated with that response. But you're not really sure whether the version you have matches the latest, so you need to send a request to the server with the particular ETag that you have, and the server will either tell you this ETag is up to date, in which case you get a 304 Not Modified and you can just serve your cached version, or the server will tell you that the ETag is outdated and send you the new response.

Now, look at the case where the ETag has not changed, so we get a 304 from the server. This means we save on the size of the download, because we don't need to fetch the entire OpenAPI again. But let's count the number of requests. We send one request to get the OpenAPI root document, and then we send another request only for the server to tell us that nothing has changed. Depending on the number of group-versions, this can amount to quite a lot of requests, and we do this pretty frequently to make sure that clients have the latest version.

So how does this work with cache busting? With cache busting, the beginning is kind of similar: we send a GET request to the root endpoint and get the list of group-versions, but now each URL comes with the additional hash parameter. The benefit of this is that when we check the cache, we know the exact URL that we need to hit. And the server gives us two additional headers that are important for cache busting: Cache-Control set to immutable, and Expires set to a date far from now, say one year. These headers let the client know that if it already has a cached response for a particular hash, it doesn't need to send any request to the server at all to check whether its copy is up to date. We chose to go with the hash instead of, say, having the client reuse the ETag, because this is a feature built in to many HTTP libraries that offer caching, including ones available for Go. There are no additional checks that clients need to implement on their side, as long as they use a library that honors the Cache-Control and Expires directives. And this basically reduces the number of requests needed to fetch the OpenAPI on incremental updates.
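At the raw HTTP level, the same flow looks roughly like the sketch below, assuming kubectl proxy is running on localhost:8001; the exact shape of the root document (a paths map whose entries carry a serverRelativeURL with the hash embedded) is an assumption to verify against your cluster version.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Assumed shape of the /openapi/v3 root document: each group-version
// points at a URL that already embeds the ?hash=... parameter.
type openAPIV3Root struct {
	Paths map[string]struct {
		ServerRelativeURL string `json:"serverRelativeURL"`
	} `json:"paths"`
}

func main() {
	// Assumes `kubectl proxy` is serving the API locally.
	const apiServer = "http://127.0.0.1:8001"

	resp, err := http.Get(apiServer + "/openapi/v3")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var root openAPIV3Root
	if err := json.NewDecoder(resp.Body).Decode(&root); err != nil {
		panic(err)
	}

	// Fetch one group-version via its hashed URL. Because the response
	// carries Cache-Control: immutable and a far-future Expires, a caching
	// HTTP client never re-validates this URL; a content change shows up
	// as a brand-new hash in the root document instead.
	gv := root.Paths["apis/apps/v1"]
	gvResp, err := http.Get(apiServer + gv.ServerRelativeURL)
	if err != nil {
		panic(err)
	}
	defer gvResp.Body.Close()

	fmt.Println("Cache-Control:", gvResp.Header.Get("Cache-Control"))
	fmt.Println("Expires:", gvResp.Header.Get("Expires"))
}
```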
As for future work, OpenAPI v3 is currently in beta and will move to GA very soon. Clients of OpenAPI v2 will also be slowly migrated to OpenAPI v3, starting with kubectl explain. That upgrade also brings additional templating functionality, including things like outputting Markdown, HTML, or even custom templates. Thanks.

Let's talk about the power and dangers of aggregated API servers. We'll start by talking about what API aggregation is and how it works. We'll continue on to what features you can build using it, and then we'll talk about some of the risks in using the feature and what some of the limitations are.

So let's compare a CRD flow to an aggregated API server flow. In a CRD flow, the client sends a request to the kube-apiserver, and the kube-apiserver handles the request locally before writing to etcd and sending a result back to the client. In an aggregated API server flow, the client speaks to the kube-apiserver as normal, but the kube-apiserver identifies the traffic as needing to get proxied to an external API server. At that point, the external API server takes over and decides how to handle the request, eventually storing it in etcd, or anywhere else, or perhaps nowhere, and returning the result back to the client. The CRD flow is constrained, but safe. It's safe because all the processing happens in the kube-apiserver, and it's constrained because you're limited to what the CRD authors added as CRD features. With an aggregated API server, you are fully unconstrained: once your server takes the request, it can do anything it wants with it. But there are some risks in using this approach when you get into the availability of the aggregated API server and how it impacts the cluster.

So let's zoom in on how this proxying works. The kube-apiserver has an API resource called APIService, and an APIService holds the group and version that are going to be proxied. So this one here, saying first.com for v1, means that any request for v1 first.com resources will go to service one. That is a service running on your cluster, and this represents a pod running on the cluster hosting that service. You can have multiple APIServices for different API groups and for different versions of the same group.

Let's work through an example. Here we have a client, Bob, who is doing a kubectl get of foos in first.com/v1. This request goes to the kube-apiserver, and the kube-apiserver authenticates Bob and then realizes that it needs to send the request to this APIService. At that point, the kube-apiserver proxies the request to the service running on the cluster, and the external API server receives the request.

Let's drill into how authentication happens on this external API server. On the external API server, we have a delegated authenticator. This is the authenticator that is built in to the generic API server library provided by Kubernetes, and when it's configured to use in-cluster authentication, which is the default, it will self-discover, inside the cluster, the front-proxy client CA used to recognize the kube-apiserver as a front proxy, and the front proxy will be the first choice in the chain. So if a request comes in with a client certificate that matches this front-proxy CA, it will be treated as coming from a front proxy, and the authenticator will pull the user identity out of the request headers. The second thing in the chain is a client-certificate CA, also discovered in the cluster, which can be used to terminate any other sort of client certificate, one that identifies the user itself rather than a proxy. An example of this would be something like a kubelet. And then finally, it can also handle tokens: a TokenReview will be sent to the kube-apiserver running in the cluster, and the kube-apiserver will return to the aggregated API server which user that token belongs to.
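That last hop, the TokenReview, is an ordinary API call you can make yourself with client-go. Here's a minimal sketch of the delegated token check; the helper function is illustrative, not the actual authenticator code.

```go
package delegatedauthn

import (
	"context"
	"fmt"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// whoIsThis sends a TokenReview to the kube-apiserver, the same call the
// aggregated API server's delegated authenticator makes for bearer tokens.
func whoIsThis(ctx context.Context, client kubernetes.Interface, token string) error {
	review := &authenticationv1.TokenReview{
		Spec: authenticationv1.TokenReviewSpec{Token: token},
	}
	result, err := client.AuthenticationV1().TokenReviews().Create(
		ctx, review, metav1.CreateOptions{})
	if err != nil {
		return err
	}
	if result.Status.Authenticated {
		// The kube-apiserver tells us which user the token belongs to.
		fmt.Printf("token belongs to %s (groups %v)\n",
			result.Status.User.Username, result.Status.User.Groups)
	} else {
		fmt.Println("token rejected:", result.Status.Error)
	}
	return nil
}
```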
So the next question becomes authorization. Do you do it at the kube-apiserver or in the aggregated API server? And the answer here is both. It has to be both, and we're going to zoom in to explain why that is. When we look at the connections through which an aggregated API server can receive a request, we can actually see there are multiple paths. One path is the one we've been talking about: the client goes to the kube-apiserver, the kube-apiserver proxies through to the service running in the cluster, and the pod handles the request. That's the normal path, because that's how clients mostly access APIs. But because we're running as a service in the cluster, fronting a pod in the cluster, anything that has access to the service network or the pod network can actually make a direct request. Because of that, authorization has to happen in both spots.

The first spot is at the kube-apiserver. This is usually an RBAC check; it can be anything, depending on how you have it configured, but generally RBAC. It will check: does Bob have access to run a get on foos in first.com/v1? If the answer is no, the request won't be proxied. If the answer is yes, it will get proxied through. And if a client bypasses the kube-apiserver and goes directly, the aggregated API server will run another authorization check. So now let's look at what it does there.

Here we have a delegated authorizer. Again, this is the one built into the generic API server that Kubernetes provides as a library, and this one has multiple items in its chain as well. The first is a hard-coded set, and this hard-coded set is often used to allow system:anonymous to do health checks, which makes it very efficient for the kubelet to do health checks. Keep in mind, this has to be configured by the aggregated API server; the stock library allows it but doesn't specify it. The next piece is the always-allowed group. Inside of a Kubernetes cluster, system:masters is hard-coded as always being able to run any request that it wants, and this allows the cluster to self-bootstrap. And then, if a request does not match either of these two categories, a SubjectAccessReview is done. That is the API that the kube-apiserver exposes to allow for webhook authorization, and the delegated authorizer uses it so that the rules match exactly: if the kube-apiserver says you can do it, the aggregated API server will believe that you can do it.
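That check, too, is just an API call. Here's a minimal sketch mirroring Bob's request from earlier; again, the helper is illustrative rather than the library's internals.

```go
package delegatedauthz

import (
	"context"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// mayBobGetFoos asks the kube-apiserver the exact question the delegated
// authorizer asks when a request reaches the aggregated API server directly.
func mayBobGetFoos(ctx context.Context, client kubernetes.Interface) (bool, error) {
	sar := &authorizationv1.SubjectAccessReview{
		Spec: authorizationv1.SubjectAccessReviewSpec{
			User: "bob",
			ResourceAttributes: &authorizationv1.ResourceAttributes{
				Verb:     "get",
				Group:    "first.com",
				Version:  "v1",
				Resource: "foos",
			},
		},
	}
	resp, err := client.AuthorizationV1().SubjectAccessReviews().Create(
		ctx, sar, metav1.CreateOptions{})
	if err != nil {
		return false, err
	}
	// The delegated authorizer trusts this answer: allowed here means
	// allowed in the aggregated API server.
	return resp.Status.Allowed, nil
}
```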
So now that we've talked about the mechanics of how we actually perform the proxying, let's talk about what it allows you to do. The first thing we're going to talk about is binary storage formats, and the OpenShift API server is a good example of this. The OpenShift API server needed performance that was greater than we could get using JSON, and to make this work, we added protobuf, the same as the kube-apiserver uses. Now, CRDs do not as of yet have a binary storage format (hint: if anyone wants to write a KEP, I'd be very interested in reading it), but as of now, they don't. So an aggregated API server allowed us to use protobuf serialization in etcd and as the wire format, with the kube-apiserver passing the information back. It let us hit much higher performance targets.

The next option is a no-storage server. You might say, what good is that? Well, if you're familiar with the v1beta1 metrics API, the one that backs HPAs, for instance: the Kubernetes SIGs metrics-server project actually has no persistent storage. It reads and scrapes values from the nodes, holds those values in memory, and replies to API requests that way. There's no persistent storage at all.

And the last feature that I'm going to mention is that you can have multiple implementations. Here, again, metrics is a good example. Metrics in memory lets you get to a certain level of scale, but there's an alternative implementation that fulfills the same API, still v1beta1 metrics.k8s.io, using Prometheus. Prometheus acts as the scraper and stores metric values over time, and then the Prometheus adapter queries Prometheus to get the answer. And the client, that client accessing the kube-apiserver, has no idea which API server is actually servicing the request. So this allows you to drop in different implementations if you want to experiment or if you have different needs in your particular cluster.

So all this sounds really cool: better performance, more storage options. What can go wrong? I talked about some risks at the beginning; well, let's talk about those. HA kube-apiservers are a very common configuration, right? If you don't want disruption, you normally run three kube-apiservers with three etcds. But when you do this, you can get interesting behavior with aggregated API servers. If you have some kind of network disruption, maybe the node that is running API server one cannot reach the service network or cannot reach the pod network for whatever reason, then end clients, those users who are trying to get your metrics or the OpenShift APIs from the previous slides, see a one-in-three chance of that API simply not being available. And that leads to, I'll say, user frustration. It's not fatal, it's just frustration, but it's very important to have reliable access.

The next class of problem is actually a little more severe. If you lose an aggregated API server entirely, so that no kube-apiserver can reach it, maybe you deleted the pod, maybe you're suffering some kind of full outage, that causes API discovery to fail. And if you remember from our presentation, I believe it was last year, API discovery is the set of APIs that are used to build REST mappings, and REST mappings are used to drive behavior in kubectl, garbage collection, and namespace cleanup. So if you cannot get your discovery API to work, you will end up in cases where garbage collection effectively hangs for some resources, and certain types of owner references that you go to create, the ones that are supposed to block owner deletion, can actually fail if discovery doesn't work. The other thing that can happen is that namespaces can get stuck finalizing and won't delete, and there'll be a message in status when this happens, saying that discovery checks have failed. That's a significant risk and a very annoying problem.

But there's one special case for namespaces that I'm about to go into that is very interesting. Here we have a namespace deletion cycle. Yes, a cycle. This came up when we were developing operators for OpenShift v4. We had aggregated API servers, remember, hosted in our cluster, and they existed in a particular namespace. The example here is an "aggregated" namespace holding an external API server, and someone deletes the namespace. Sure enough, they deleted it. Now, as soon as this happens, the namespace goes into its finalizing state, and that means no creates are allowed. You can delete content from the namespace, but you cannot create anything new. And deletion happens immediately, so this pod gets deleted. Now, when this pod gets deleted, the namespace lifecycle controller gets stuck, because it needs to discover all the namespaced resources in the cluster in order to make the delete calls for them. But it needs to make delete calls to the external API server, which was just deleted. So what happens is: the namespace controller can't make progress without the external API server, and the external API server can't start until the controller deletes the namespace.

Now, there is a way out, in the cases where you know what you're deleting and you're going to bring it back, and that is to make sure that everything has actually been cleaned up and then manually clear the finalizer. Once you do this, in the case of OpenShift, for example, an operator will automatically create that namespace again and reproduce the aggregated API server that you need. This gets you out from being stuck here and gets you back up and running again. But you need to be aware that this is something that can happen.
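Clearing the finalizer goes through the namespace's finalize subresource, which client-go exposes directly. Here's a minimal sketch, assuming you've already verified the namespace contents are truly gone:

```go
package nsrescue

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// clearFinalizers empties spec.finalizers on a stuck namespace via the
// finalize subresource. Only do this after confirming everything in the
// namespace is actually cleaned up; otherwise you can orphan resources.
func clearFinalizers(ctx context.Context, client kubernetes.Interface, name string) error {
	ns, err := client.CoreV1().Namespaces().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	ns.Spec.Finalizers = nil
	// Finalize hits the /finalize subresource, the same call the namespace
	// lifecycle controller makes when it finishes its own cleanup.
	_, err = client.CoreV1().Namespaces().Finalize(ctx, ns, metav1.UpdateOptions{})
	return err
}
```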
So now, the limitations. Aggregated API servers can do a lot; they can do things that no other technique we have allows. But the one thing they can't do is coexist in the same group-version with a CRD. So if you have an APIService for v1 of first.com, you cannot create CRDs in first.com. This means that if your thought was, hey, I wonder if I can use this to create subresources for my CRD: you can't. And you also can't mix those special types of storage with a standard CRD either. Maybe someday this will come along, but as of right now, you can't. All right, that is all I have about aggregated API servers. I hope you found it interesting, and I'll turn it back over to Fede.

Hello again. I hope you enjoyed these two talks as much as I did. I'm always impressed by the level of depth and quality of the presentations that my colleagues prepare. Thank you so much, David and Jeffrey, for walking us through aggregated API servers and OpenAPI v3. To close the session, I will just remind you that SIG API Machinery is part of open source Kubernetes. We have open meetings twice a month, we run triage for pull requests and issues twice a week, and we have two sub-working groups, so you can join whichever part feels interesting to you, at any time, and you can contribute. I will leave this slide here so you know where to find us and how to connect with us, by mail, on Slack, through our homepage, or on GitHub directly. Thank you again for coming, and I will now leave it open for questions for the session. Thank you again. Have a great KubeCon.