Hello, Kubernetes community. Welcome to our SIG API Machinery deep dive session. My name is Federico Bongiovanni, and I am a co-chair of SIG API Machinery. Our agenda today has three topics and a small bonus at the end. Our first topic is namespace termination, and Daniel Smith from Google is going to walk us through it using his signature handmade presentation style. Our second topic is REST mappings, and Stefan Schimanski from Red Hat is going to unveil all the secrets of how it works and how to deal with special situations. In our third topic, David Eads from Red Hat will explain the mysteries of garbage collection and how it works in all the special cases. Finally, I will give you a farewell with some quick info about the SIG and our meetings. So without any further delay, let's get to it and dive into the deep waters of SIG API Machinery.

Hi, everyone. Today I'd like to tell you a story. The story is called 60807. I'm Daniel Smith. I co-TL the API Machinery SIG, and I've been working on Kubernetes since before it was open source. I'm lavalamp, the original lavalamp, on Twitter. I'm a software engineer at Google, but I'm not speaking on behalf of Google other than to support my presence in open source. I recorded Robo Daniel drawing my slides today; it turns out I can't draw and talk at the same time, so we're doing this in two steps.

The story begins with a user. We won't be surprised that there's also a Kubernetes cluster in this story, and every cluster needs an API server. Of particular concern in our story today is the namespace, which I'm going to draw with a big, fat line. To give this namespace some sort of name, we'll call it NS2020. This user would like to delete this namespace. Here's the action: the user says delete, and the API server says "sure" (that's the technical HTTP response code). The user POSTs to recreate it, and the API server says no, it still exists; there's a conflict. The API server is the villain. How could this happen to us? This makes the user sad and angry. How could we do this to the user?

Let's talk a little bit about how resources in Kubernetes are deleted. Well, all objects in Kubernetes are deleted the same way, so although we're talking about a namespace right now, there is some background information about deletion that might be useful. As with much of what I'm saying, this is a simplification, because there's more to it for some particular objects. But roughly speaking, all Kubernetes objects have metadata, and inside this metadata there are two fields relevant to deletion. The first one is the deletion timestamp: if that has a non-empty value, then the deletion process has started for that resource. The second is a list of finalizers, and the deletion is not final until all of those finalizers have been removed. Slightly confusingly, the literal name of the finalizer we care about today is "kubernetes". This is the finalizer that is removed when the namespace object is empty; until it is removed, the namespace is not deleted from the system. I'm leaving out some details, such as the fact that namespaces have a special place they keep their finalizers, but roughly speaking, all objects work like this. So let's look at our scenario again. The component in Kubernetes responsible for removing this finalizer lives in the controller manager: it's the namespace lifecycle controller.
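You can see both of these fields on a terminating namespace yourself. Here is a minimal client-go sketch of that check; the namespace name comes from the story, and the kubeconfig location and error handling are assumptions:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig in the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ns, err := client.CoreV1().Namespaces().Get(context.TODO(), "ns2020", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	if ns.DeletionTimestamp != nil {
		// Deletion has started, but it is not final until the finalizer
		// list is empty. Namespaces keep theirs in spec.finalizers.
		fmt.Printf("terminating since %v, finalizers: %v\n", ns.DeletionTimestamp, ns.Spec.Finalizers)
	}
}
```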
The namespace lifecycle controller's job is to watch namespaces that have begun the deletion process, ensure that they are empty, that all the resources inside them have been deleted, and then remove that finalizer. The namespace lifecycle controller is the villain.

To list the resources inside of a namespace, you need to know which types of resources there are to be listed. The process of figuring that out is called discovery. At this point you might be asking yourself, how does discovery work? Well, I'm about to tell you. All API requests in Kubernetes have a URL path; this identifies what exactly you're making the request about. Again, this is a simplification, but for our purposes, all API request paths start with the word "apis". The next thing that comes is the API group. The next thing is a version, and after that is the resource. The resource is the thing that goes in the URL path; the kind is the type, in a type-system sense. So if you make an API request and just say /apis and stop there, you get a list of group versions the API server knows about. If you list a group, you get a list of the versions inside that group. If you list a group and a version, the API server will tell you which kinds are inside that group version. And if you say more than that, then you're making an actual API request, which will do something useful, like listing objects.

So how does the API server fulfill this contract? To answer that, we should talk a little bit about the API server. The API server is actually three API servers in a trench coat. The first one is the aggregator. The second one is the built-ins; it serves pods and services and stuff. And the last one is the extensions API server, more commonly known as CRDs. If a request can't be served by the aggregator, it goes to the built-ins. If that one doesn't want it, it goes to the extensions. And if that one doesn't want it, it gets a 404, which I've conveniently drawn off the bottom of the screen for you. Now, the aggregator may know what to do with a request, and the request may be for an external API server. That request must be proxied; it is not served from the same process as the Kubernetes API server. The aggregator is the villain. The canonical example is the metrics API server, which Kubernetes ships by default.

This means there are various things that could go wrong. Maybe your network is not working, or at least that link of your network. Maybe the metrics API server is co-located with a process that's hogging the node. Or maybe it just has the wrong resource requirements. In any case, the aggregator needs to get the kinds from the target API server; only the groups and versions are registered with the system. If that external API server can't be reached, there's no way for the aggregator to tell you which kinds are in the group version you're asking about. The metrics API server is the villain. It's kind of unfortunate, but it is also unavoidable, because the extension API server is the only thing that knows what it serves. The aggregator cannot know that, and it would be very heavy-handed to require API authors to pre-register that with the aggregator. It also wouldn't help the problem: even if you knew the kinds, if you can't actually go out and list those things, then you also can't confirm that they're deleted. If the lifecycle controller can't discover the kinds that it should be deleting, it is never certain that the namespace is empty.
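This is exactly the view the namespace controller gets through discovery. A minimal sketch of that call with the client-go discovery client follows; ServerPreferredNamespacedResources is the real client-go method, while the kubeconfig location and the printing are assumptions:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc := discovery.NewDiscoveryClientForConfigOrDie(cfg)

	// If an aggregated API server (the metrics server, say) is unreachable,
	// this returns the partial results it could gather AND a non-nil error.
	lists, err := dc.ServerPreferredNamespacedResources()
	if err != nil {
		// Incomplete discovery: we cannot be certain we know every kind
		// that might still exist inside a namespace.
		fmt.Printf("discovery incomplete: %v\n", err)
	}
	for _, l := range lists {
		fmt.Printf("%s: %d resource types\n", l.GroupVersion, len(l.APIResources))
	}
}
```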
And if it's not certain that the namespace is empty, then it can't remove that finalizer. So whose job is it to make sure that this thing is working, anyway? I think it is the job of the system administrator, the cluster administrator rather, and I think the cluster administrator should be performing this job by using some monitoring. The metrics API server is served by a regular Kubernetes pod, so it should be possible to monitor it. Another way you can monitor the overall health, and not just the specific metrics API server, is to run the command kubectl api-resources once in a while. It performs this discovery process and will tell you if some group is not working. Then you can look at the APIService object for that group and figure out which component is not happy. So thank you for listening to my talk and watching Robo Daniel draw these slides. If you think we could do better, I encourage you to come help us improve our error messages or our design. And thanks to all the system administrators out there.

Welcome to the second part of the SIG API Machinery deep dive. Today I want to talk about a concept which is pretty central in Kubernetes but not well known to many people. Those who use client-go and build some non-trivial controllers may have met it, but they might still have question marks around it. Seeing REST mapping in action is very simple: just type "kubectl get pods" in your terminal. "get pods" is a generic command, and generic means it doesn't know anything about pods; it has to work with any and every resource that is available and known to the cluster. To make that happen, it has to query the discovery information of the kube-apiserver; it has to ask it what a pod is. If you increase the verbosity to at least six, you will see what it does: it queries /api to get the versions of the legacy API, it queries /apis to get the API groups and their versions, and then it continues and queries more than 30 API group versions: /api/v1, /apis/apps/v1, and 30 more. After doing that, it has all the information to find out what a pod is. It sees that a pod is a resource which shows up under /api/v1, so it's a legacy, or API version v1, resource. From this information it also knows that a pod is a namespaced resource, so it adds the namespace to the URL: "default", the current namespace. From discovery information alone it knows the URL to actually list the pods in the system. That's what it does.

In the types of Kubernetes there are kinds and resources. Kinds are the uppercase singular words which you usually find in manifests. Resources are components of URLs: the group, the version, and the resource name, plus the information about the scope, is all that's needed to build the URL. The mapping between those two worlds is called REST mapping, and that's our topic today. We call GroupVersionResources and GroupVersionKinds fully qualified, or complete, if the group, the version, and the name are all provided. If the group is empty, it's the legacy core group, as for pods. We have the same for kinds. We can also talk about partial resources and partial kinds. In the first case the version is missing: apps ReplicaSet is a partial kind. And v1 replicasets, lowercase, is a partial resource where the group is missing.
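In Go, these concepts show up as the GroupVersionResource and GroupVersionKind types from k8s.io/apimachinery. A quick sketch of fully qualified versus partial values, mirroring the examples above:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	// Fully qualified: group, version, and resource name are all set.
	full := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "replicasets"}
	// Legacy core group, as for pods: the group is the empty string.
	core := schema.GroupVersionResource{Group: "", Version: "v1", Resource: "pods"}
	// Partial, as in "kubectl get replicasets": group and version are
	// missing and must be filled in by a RESTMapper.
	partial := schema.GroupVersionResource{Resource: "replicasets"}
	// The same distinction exists on the kind side.
	kind := schema.GroupVersionKind{Group: "apps", Version: "v1", Kind: "ReplicaSet"}

	fmt.Println(full, core, partial, kind)
}
```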
And if both pieces of information are missing, it's also partial: replicasets alone has no group and no version. That's the form everybody knows from kubectl; usually you type "kubectl get replicasets" on the command line. It's a partial resource, and the RESTMapper is used to fill in the missing information and produce the fully qualified resource, which can then be used to query the replica sets. There's a kubectl syntax to specify the group: ".apps" tells kubectl that the group is given. You can even use just a prefix, and it is completed by the RESTMapper, so ".a" is completed to apps/v1. And of course you can give a complete group-version-resource: ".v1.apps". If you try prefix matching while also giving a version, it will fail, because the RESTMapper doesn't support that. If you try ".v1" without a group, kubectl's parsing doesn't work, so it's also rejected. Everything I describe here works for singular and plural, so you can give a singular partial resource and it is completed to a complete plural resource.

Where are RESTMappers used? We saw kubectl. The API server itself makes some use of RESTMappers; there's a garbage-collection-related admission plugin there. But the main consumers of RESTMappers, especially the discovery-based ones, are the controllers in the controller manager. Horizontal pod autoscaling can work with every resource that has a scale subresource; this is a polymorphic, or generic, use case, and that's why REST mapping is involved. Pod disruption budgets are also involved. The garbage collector is maybe the most interesting case, and we will spend several slides on it: it uses REST mapping at its core. The namespace controller, surprisingly, doesn't use REST mapping, but it does use discovery. And it's not that surprising after all, because discovery and REST mapping are deeply connected.

A quick look at the RESTMapper interface in Go (sketched below). There are three kinds of functions. The first kind maps partial resources to kinds. The second kind maps partial resources to complete, fully qualified resources. And the third kind takes group kinds (complete, so the group cannot be omitted) plus versions and returns REST mappings. A REST mapping, if you look at it, is the resource and the kind, both fully qualified, plus the scope; so basically that last kind of function is kind-to-resource. The third kind is the most interesting one in the context of garbage collection, because garbage collection uses it: owner references contain group kinds, so everything we talk about here has an influence on garbage collection.

So let's take a look at the functions that accept partial input. Partial input is completed. If we pass in already complete information, fully qualified resources, we of course get fully qualified resources back, not surprisingly. One special thing: if you have all three components and none of them is empty, there is no prefix matching. So if you give just "a" but pass a version as well, you get an error; prefix matching only works if the version is omitted. For the case where the version is omitted but the group is complete, the RESTMapper looks through its information about group versions and finds replicasets, in this case in v1 and v1beta1 of the apps group. And then there is the pure prefix matching, which we have already seen.
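For reference, here is roughly what that interface looks like; this is paraphrased from memory after meta.RESTMapper in k8s.io/apimachinery/pkg/api/meta, so treat the exact signatures as an assumption and check the source:

```go
// Paraphrase of meta.RESTMapper from k8s.io/apimachinery/pkg/api/meta.
type RESTMapper interface {
	// Partial resource in, kind(s) out.
	KindFor(resource schema.GroupVersionResource) (schema.GroupVersionKind, error)
	KindsFor(resource schema.GroupVersionResource) ([]schema.GroupVersionKind, error)

	// Partial resource in, fully qualified resource(s) out.
	ResourceFor(input schema.GroupVersionResource) (schema.GroupVersionResource, error)
	ResourcesFor(input schema.GroupVersionResource) ([]schema.GroupVersionResource, error)

	// Complete group kind (plus optional versions) in, REST mapping(s) out.
	RESTMapping(gk schema.GroupKind, versions ...string) (*RESTMapping, error)
	RESTMappings(gk schema.GroupKind, versions ...string) ([]*RESTMapping, error)

	ResourceSingularizer(resource string) (singular string, err error)
}

// A RESTMapping bundles everything needed to build a URL: the fully
// qualified resource, the fully qualified kind, and the scope.
type RESTMapping struct {
	Resource         schema.GroupVersionResource
	GroupVersionKind schema.GroupVersionKind
	Scope            RESTScope
}
```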
Back to the prefix example: apps/v1 replicasets is found, and so is apps/v1beta1 replicasets. The RESTMapper also looks in authentication.k8s.io and probably some other groups starting with "a", but there are no replica sets in those, so they are not returned. In the next case, only the group is missing, and it's filled in as expected. And if group and version are both missing, the usual case on the command line, then both are added. The order of the results depends on the preferred group-version order, which you already saw in the discovery information; beyond that, resource names are ordered alphabetically.

All right, so discovery and REST mappings are connected. Let's take a quick look at the discovery information. On the /apis endpoint you get all API groups with their versions and their preferred versions. On /apis/apps you get basically the same, but just for the apps group. In both cases you don't get resources. If you want to know about resources, you have to go one level deeper, to /apis/apps/v1, and then you get everything about all the resources there. In this case: replicasets, lowercase plural, is the name of the resource. The singular name is empty, so the RESTMapper will fill in the lowercased kind. The scope is namespaced, so we have to add the namespace to the URL. The kind is ReplicaSet, uppercase, as we expected. The verbs are given, so we know what we can do with the resource. Short names are returned for kubectl, as aliases on the command line, and so are categories, so that you can list "all" resources; replica sets are part of that category. Keep in mind: to get resources, you have to make that third call as well.

To go from discovery info to a RESTMapper, you can use a discovery client and wrap it with a caching layer; there is a disk cache and a memcache discovery wrapper. Memcache is for controllers, because they are long-running, and the disk cache is used by kubectl, for example. If you have a cached discovery client, you can pass it to the constructor for the discovery RESTMapper, and you get a RESTMapper implementing the interface we have just seen (there's a construction sketch below). There's a Reset method on the RESTMapper, so you can invalidate the cache manually if you like. Invalidation also happens when you have a typo: a typo means a cache miss, and the cache is invalidated. That means you get those 30-plus discovery calls again and again, so if you have a controller, there's a risk of hot-looping; keep this in mind, it might be important. One thing I want to highlight about where this is used: the controller manager calls Reset every 30 seconds, and that's how it gets to know about new resources. So if you create a CRD in the cluster, it takes at most 30 seconds until garbage collection knows about the new CRD.

We said we need the group-version discovery endpoint to get resources. The consequence is that if you have aggregated API servers in the system, you need this red arrow here to know about the aggregated resources. If it breaks down, for networking issues, or because the API server redeploys, or something like that, discovery clients and RESTMappers won't see the resources of the aggregated group version. The discovery client will return an error, the RESTMapper will have incomplete information, and this might have consequences, so keep it in mind. The discovery client is kind of graceful: it returns errors directly, but it also gives you the partial information it was able to gather from the cluster.
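Putting the caching pieces together, here is a minimal sketch of the cached discovery client and deferred RESTMapper just described; the client-go packages and constructors are real, while the kubeconfig location and the panics are assumptions:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/discovery/cached/memory"
	"k8s.io/client-go/restmapper"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The memcache wrapper suits long-running controllers; kubectl uses a
	// disk-based cache instead.
	cached := memory.NewMemCacheClient(dc)
	mapper := restmapper.NewDeferredDiscoveryRESTMapper(cached)

	mapping, err := mapper.RESTMapping(schema.GroupKind{Group: "apps", Kind: "ReplicaSet"}, "v1")
	if err != nil {
		// A typo (unknown kind) is a cache miss and invalidates the cache,
		// so the 30-plus discovery calls happen again: beware of hot loops.
		panic(err)
	}
	fmt.Println(mapping.Resource, mapping.Scope.Name())

	// Invalidate manually to pick up new CRDs; the controller manager does
	// something like this every 30 seconds.
	mapper.Reset()
}
```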
So always expect a non-nil first result even though the error is also non-nil, and cope with that, whatever it means in your use case. The discovery RESTMapper is very graceful: it just continues with partial results and ignores errors, but obviously incomplete information might have consequences for your use case. So think about your controller, and wherever you use a RESTMapper, try to fail gracefully, because you have to expect errors to happen, and stay consistent in a way which matches your use case. We saw the namespace controller: rather than doing something stupid, it blocks its work. Maybe staying consistent like that is much more important than gracefully continuing to work. For garbage collection, which David will talk about in a second, that's also important. kubectl can just print a warning or something like that, which is completely fine. The right behavior differs by use case. And with that, I pass over to David.

Now let's move on to garbage collection of Kubernetes API resources. The first thing to consider is which binaries are involved. We have the kube-apiserver, which does basic CRUD on the resource and tracks state: is the resource present? Has it been removed, or does it have a deletion timestamp set but finalizers that prevent it from being removed? And we have the kube-controller-manager, which runs a garbage collector controller; it looks for resources whose owner references point to missing owners and takes action on them. We'll get to the particular actions later in some examples.

To describe the relationship between resources, we have owner references on children. Children list their parents as owner references (you can have more than one), and a namespaced child can only refer to parents in the same namespace or at cluster scope. Having the owner references on the child allows permissions to be checked on the child and not on the parent, which makes it safer to express them via the API. Looking at an owner reference, here's an example in a resource manifest that shows a single one. There is a section that gives the coordinates of where to find the parent; in this case, we're looking for a config map named i3. You'll notice there is no namespace, which means the parent has to be in the same namespace or cluster-scoped. We have a blockOwnerDeletion field, which only does something with foreground deletion of parents; it has no effect in the default case. There is a controller field, which doesn't actually affect GC behavior at all; it's used by higher-order logic to control ownership of items. And then there's a UID, which refers to the UID of the parent. It's necessary to handle name reuse of parents: if you rapidly delete and recreate parents, we need to know whether it's the same parent, in which case the child needs to be preserved, or, in the delete-and-recreate case, a different one, in which case the owner ref is now invalid and the resource needs to be collected. It does mean that you can't hard-code owner references into manifests, which is a pain point, but we don't know of another way to solve the delete-and-recreate use case.

Now, the mechanics of actual deletion. There's an option to delete in the background, and this is the default case. It means that the resource is removed immediately, as soon as it has no finalizers, and the garbage collector controller finds child resources that have no additional owners in the background and deletes them. This is what happens when you run kubectl delete on a particular config map. So in this example, we are going to delete i3.
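Because the UID has to match the live parent, owner references are usually set programmatically. Here is a hedged Go sketch of what the child from this example might look like; the child's name follows the demo's labels, the child's kind is my assumption, and the UID is invented:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func boolPtr(b bool) *bool { return &b }

func main() {
	// A child config map pointing at the parent config map i3.
	child := corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "k3",
			Namespace: "default",
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "v1",
				Kind:       "ConfigMap",
				Name:       "i3",
				// Must match the live parent's UID, which is why owner
				// references can't be hard-coded into manifests.
				UID: "9a3c0e1f-hypothetical-uid",
				// Only consulted during foreground deletion of the parent.
				BlockOwnerDeletion: boolPtr(true),
				// No effect on GC; used by higher-order controller logic.
				Controller: boolPtr(false),
			}},
		},
	}
	fmt.Println(child.Name, "owned by", child.OwnerReferences[0].Name)
}
```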
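And deleting the parent from client-go with an explicit propagation policy might look like the following sketch; deleteParent is a hypothetical helper, and Background is the policy a plain kubectl delete sends:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteParent deletes the config map i3 with an explicit propagation
// policy; the other options are DeletePropagationOrphan and
// DeletePropagationForeground.
func deleteParent(client kubernetes.Interface) error {
	policy := metav1.DeletePropagationBackground
	return client.CoreV1().ConfigMaps("default").Delete(
		context.TODO(), "i3",
		metav1.DeleteOptions{PropagationPolicy: &policy},
	)
}
```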
The delete command sends a propagationPolicy of Background to the kube-apiserver, and i3 is immediately removed from the API. The garbage collector controller notices and then deletes k3, because it no longer has a valid owner reference. Then the garbage collector controller notices that o2 and o3 no longer have valid owner references, and they get deleted, in any order.

Another option for deletion is to orphan. This means the resource is going to be deleted, but instead of cascading through garbage collection and deleting the children, it simply removes the owner references from the children. This is something you might do if you want to replace a parent for some reason: maybe there's an immutable field and you need to change what that immutable field is, so you have to delete and recreate, but you don't want to cascade through everything. To do this you would run kubectl delete with --cascade=false, and it would send a propagationPolicy of Orphan. So if we work through the same example, only this time deleting i3 with an orphan policy: i3 is marked for deletion, its deletion timestamp is set, and an orphan finalizer is added to the finalizers list. The garbage collector controller notices this, finds the children, and removes their owner references. Once the owner references are removed, the orphan finalizer is removed and i3 is deleted. So you can see here we end up keeping k3, and the other owner references are still intact, and now we can recreate i3 and relink it if we wish.

Foreground deletion is one of the more complicated ones, and this is what blockOwnerDeletion is for. It allows for more control over the ordering in which resources are deleted: we're able to have parents wait for children to be removed before they themselves are removed. It's important to note that this only works if the parent is deleted with foreground deletion. If it isn't, then your blockOwnerDeletion doesn't actually do anything, so it's more of a hint. There's actually no command to trigger this, so instead we have an example of using kubectl delete --raw to send a propagationPolicy of Foreground to the kube-apiserver. So here we have our same example; we have owner references with blockOwnerDeletion set for a couple of the cases, and we're going to walk it through. The first thing that happens is that i3 is marked for deletion, and a foregroundDeletion finalizer is added to its finalizers list. The garbage collector controller notices this, goes to the next level, and marks k3 the same way. But k3 cannot be removed from the API yet, because there's a blockOwnerDeletion from o2; instead, we actually have to go and remove o2 and o3 first. Once o2 is removed, it becomes possible to delete k3, and once k3 is removed, we can delete i3. You can see here that the ordering is actually the reverse of the background deletion. But remember, it's optional: if you deleted i3 with background deletion, you wouldn't get this order. So if you need to preserve your resource, you're going to want to set a finalizer.

So, bugs; we do have bugs. One of them is with blockOwnerDeletion: if you have two parents of a child, then it doesn't behave right. In fact, if you delete i3 in this case (I'll skip through the pieces we already know: i3 is marked for deletion), we're now at the point where k3 should not be deleted until o2 has been removed. o2 should not be removed early, but the garbage collector controller removes the owner reference from o2 to k3.
By doing this, k3 becomes a valid target for deletion, and k3 can be deleted. This is a bug: we end up deleting i3, k3, and o3, because the owner reference from o2 no longer exists. We don't currently have a PR to fix this bug. It is less than ideal, but I hit it again while creating this demo, so I figured I'd mention it.

There's another case where cluster-to-namespace references can cause deletion of resources with valid owners. In this case, we have a cluster-scoped resource, n2, which has taken an owner ref against a namespaced resource. Now, you'll recall I said you should never do this: it's not allowed, but the API doesn't prevent it. And sometimes it appears to work, because what happens is that i3 exists in the garbage collector's cache, n2 does an existence check with a UID, it matches i3, and so n2 appears to have a valid owner. But on every restart, the kube-controller-manager effectively has a race: it races to see whether it observes i3 before it observes n2. If you lose the race, n2 is observed first. There's an i3 existence check, but there is no i3 at cluster scope; remember, there's no namespace in the reference, so the cluster-scoped resource can't depend on a namespace-scoped one. That causes n2 to be deleted, and that's probably okay, because it has an invalid owner reference. But k3 was deleted as well, even though it was well-formed, because we saw the UID disappear. This is a bug. We have a PR to fix it that we'll be looking at getting in for 1.20. In a related problem, the same thing can happen across namespaces: if you win the race and i3 is observed before k2, then k2 won't be deleted. But if the race goes the other way, the existence check fails, because there's no i3 in namespace one, and so k2 and k3 are both deleted. And one more common way of seeing this is inside a single namespace, where the kind is set incorrectly. When this happens, you end up with a deletion that causes k2 and k3 in the namespace to be deleted. As I said, we have a PR fixing this in 1.20, and it will prevent good owner references from having their resources removed. That was a whirlwind tour through garbage collection. If you want to ask more questions about it and don't get to in this session, you can find us in Slack, in the community meeting, and on the mailing list; we've got links here.

All right, hello again. I hope you enjoyed the deep dive topics from our presenters; personally, I really did. Before ending this session, I want to remind you that we have regular SIG meetings every two weeks, and twice a week we do our bug and pull request triages for 30 minutes. It's a great way to get involved; join our mailing list to get the invites. We also have the Kubebuilder and API Expression working groups as part of our SIG, and they have their own respective meetings twice a month. We are always in our Slack channels. Finally, I will leave you this slide with contact and material information, in case you want to know more or do more with us. On behalf of the entire SIG, thank you for attending our session, and we wish you a good and enjoyable KubeCon. Bye-bye.