So I think let's get started. First of all, thank you so much for making it. It's the last talk of a very, very long week, so thank you, everyone, for making it, and I hope you've had a nice, fun, productive, successful KubeCon. If nothing else, I hope you've made a new friend. My name is Madhav. I work at VMware. I'm a technical lead of, say, Contributor Experience, I'm a GitHub admin for the Kubernetes project, and I've worked on some really, really fun parts of the Kubernetes storage layer. Because of that, I've had to shed many a tear trying to understand what was going on, so hopefully I can peel that onion of abstraction for you without the tears. So let's get started. Before we start, just a public call for help: we need help migrating prow jobs to community clusters. So if you're a person interested in infrastructure and want to get involved in the Kubernetes community, check that link out. Before we get into the core of it, let's get a few prerequisites out of the way. We'll take a 50,000-foot, super high-level view of what the Kubernetes machine looks like and how it works. We have something called the API server, which clients such as kubectl interact with. And Kubernetes is something called a declarative system: we declare what we want our intended state to be. In this case, we want a deployment which has three pods, and we specify that as something called replicas. So if I do kubectl apply with this deployment, what happens behind the scenes? The API server takes this deployment and persists it into etcd. We also have a bunch of controllers that run on the cluster. The TL;DR is that these controllers are binaries that look for changes that happen in the cluster state. They need to know where we are currently, they need to know where we need to be, and based on those two things they calculate a set of actions to take in order to get to where we need to be.
There are a lot of controllers in a Kubernetes cluster by default, but I've shown three of them here: the scheduler, the deployment controller, and the ReplicaSet controller. We also have a bunch of nodes in the cluster, and each node has a kubelet, which is also a controller. And each of them talks to the API server; no one talks to etcd directly, everything goes through the API server. So the deployment controller sees that there's a deployment and says, okay, cool, but the deployment doesn't have a ReplicaSet. So it creates a ReplicaSet, great. The ReplicaSet controller wakes up and sees, oh, there's a new ReplicaSet. Then it sees that the ReplicaSet has no pods in it, even though replicas is supposed to be three. So it creates three pods for us, great. Then the scheduler wakes up and sees that we have three pods, but none of them are scheduled onto any of the nodes. So it uses its own algorithm to figure out which nodes to schedule these pods on, and schedules them there. Scheduler's job is done. Finally, the kubelets on the nodes wake up and see, oh, there are pods scheduled to my node that I don't have running yet. So each kubelet takes those pods, makes them part of its local desired state, and reconciles its state on the node locally. One kubelet sees, okay, I have pod zero and pod one, and the other kubelet sees, I have pod two, but none of them are running. So they start these pods, and now you have the three pods of your deployment, managed by a ReplicaSet, all via this controller magic, right? The whole point here, and how we got to these three pods running, is that we have this bunch of controllers running.
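That reconcile loop — compare current state to desired state, then take corrective actions — can be sketched as a toy program. This is a minimal illustration of the pattern, not real Kubernetes code; the Cluster and Reconcile names are made up for this example.

```go
package main

import "fmt"

// Cluster is a toy stand-in for cluster state: what we want vs. what is running.
type Cluster struct {
	DesiredReplicas int
	RunningPods     int
}

// Reconcile drives current state toward desired state and returns the
// corrective actions a controller would take along the way.
func Reconcile(c *Cluster) []string {
	var actions []string
	for c.RunningPods < c.DesiredReplicas {
		actions = append(actions, fmt.Sprintf("start pod %d", c.RunningPods))
		c.RunningPods++
	}
	for c.RunningPods > c.DesiredReplicas {
		actions = append(actions, "stop one pod")
		c.RunningPods--
	}
	return actions
}

func main() {
	c := &Cluster{DesiredReplicas: 3, RunningPods: 0}
	fmt.Println(Reconcile(c)) // three "start pod" actions
	fmt.Println(Reconcile(c)) // already converged: no actions
}
```

Real controllers do exactly this shape of work, just with watches feeding them the current state instead of a struct in memory.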
And the fundamental concept here is that a controller needs to know what the current state of the system is, because if it doesn't know the current state, there is no way it can get to the desired state. Second, the controller needs to know what the desired state of the system is, because we need a north star to go to. As soon as it knows both of these things, it can calculate a set of corrective actions to move from the current state to the desired state. But how does it get to know what the desired state is — or rather, what the current state is? To understand that, let's take this sentence: Kubernetes is a declarative, event-driven system. Kubernetes is declarative; what does that mean? It means that we specify intent. We don't specify how we want something to be done; we specify what needs to be done. And the way it works is, as I said, we have the current state, we have the desired state, and the controller takes actions on the current state to drive it to the desired state. So we need to start somewhere, and in order to take actions, we need to know what the current state looks like. Right now we know what the desired state is, because we're the ones who ran kubectl apply, and the controller knows what the deployment was, but we don't know what the current state of the cluster is. To learn the current state, we perform what are called list operations against the cluster. Here, for example, this is saying: give me all the pods in the cluster that are in the default namespace. And you get back a response object like this, which is a collection called PodList, where items is the key that holds all of the pods in the default namespace. But responses can get pretty huge. You might not want all of the pods at once. For example, if you have 100,000 pods across your cluster for some reason and you list all of them, that response can get pretty huge.
So Kubernetes also supports paginating these lists. You can say, give me the first 100 pods, and Kubernetes returns the first 100 pods along with a continuation token. You give that token back to the API server on the next list request, and it gives you the next 100 pods, and the next 100, until you have all of the pods in your cluster. Now, event-driven system — what does that mean? We know what the current state is because of our list operation, but now we need to know what changes occur in the system, in order to react to those changes and take corrective actions. One alternative is to keep listing all the time, which is super inefficient. What Kubernetes gives us instead is an event-driven mechanism where you get events such as added events, modified events, deleted events, and so on. So if I see that a deployment is added, my action is: does it have a ReplicaSet? If no, create one. And the ReplicaSet controller sees, okay, I have an added event for a ReplicaSet; does it have pods? No? Create them. And so on and so forth. Now, you won't find things like message buses inside Kubernetes itself, but as a mental model it's helpful to think of it as an event-driven architecture, and there's a great blog post on that if you want to check it out. So: I have my state of the world through list. I now need to know when events happen that modify the state, so I can take corrective action. You might be seeing this term resourceVersion in all of these response objects. Resource versions essentially give you recency information about the data that you're getting back. Over here, the resource version is 1452. It means that the response I got back was at version 1452 — and we'll talk about what a version means. It basically tells you how recent your data is.
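The pagination loop described above — request a page, get items plus a continuation token, repeat until the token is empty — can be sketched like this. The in-memory "server" here is a stand-in for the API server; the names listPage and listAll are invented for the example, and the real API uses an opaque string token rather than an integer offset.

```go
package main

import "fmt"

// listPage returns at most `limit` items starting at offset `cont`,
// plus the next continuation token (0 means no more pages).
func listPage(all []string, limit int, cont int) (items []string, next int) {
	end := cont + limit
	if end >= len(all) {
		return all[cont:], 0 // last page: empty continuation token
	}
	return all[cont:end], end
}

// listAll is the client side: keep following the continuation token
// until the server says there are no more pages.
func listAll(all []string, limit int) []string {
	var out []string
	cont := 0
	for {
		page, next := listPage(all, limit, cont)
		out = append(out, page...)
		if next == 0 {
			return out
		}
		cont = next
	}
}

func main() {
	pods := []string{"pod-0", "pod-1", "pod-2", "pod-3", "pod-4"}
	fmt.Println(listAll(pods, 2)) // fetched in pages of 2, 2, 1
}
```

The key point, as in the talk: each page is a separate request, and the token is what ties the sequence together.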
So now that I know I have data as recent as 1452, I can tell Kubernetes: give me all changes in the cluster, for all pods, starting from 1452. I don't care what happened before, because I already know everything up to 1452; give me everything after that. As and when these events happen, I get a stream of notifications on a single connection that I can react to. This is the watch mechanism of Kubernetes. Together, list and watch enable this controller pattern, and they're sort of the key to understanding the Kubernetes architecture as a whole. Before we move forward: we talked a little bit about resource versions and weren't very specific, so let's dive a little into what they mean. Resource versions are opaque strings representing an internal version of an object. By internal version, I mean that we surface the multi-version concurrency control that etcd uses internally to keep a history of changes for each object, and we expose that back to users as a resource version. Essentially, in Kubernetes, resource versions form one big global logical clock: you can order all the events in your system globally — you can linearize all the events in the system — through resource versions. Resource versions are backed by etcd revisions, which by design provide a global ordering and increase monotonically any time any change in the cluster occurs. It doesn't matter whether you change a pod or a ReplicaSet or a service; if any change happens, the resource version increases monotonically. Most importantly, they enable something called optimistic concurrency control. What that means is, if I have two controllers running and one of them writes state, the other one won't be able to write state if it doesn't already know that that state was written. It will try, being optimistic that it knows the latest state, but the write might fail. Resource versions enable that as well.
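That optimistic concurrency idea — a write only succeeds if the writer's resource version matches the stored one — can be sketched with a toy store. This is an illustration of the pattern only; the Object and Store types are made up here, and the real API server returns an HTTP 409 Conflict rather than a Go error.

```go
package main

import (
	"errors"
	"fmt"
)

type Object struct {
	Name            string
	ResourceVersion int
	Data            string
}

type Store struct{ objs map[string]Object }

var ErrConflict = errors.New("conflict: object was modified by someone else")

// Update succeeds only if the caller's ResourceVersion matches the stored
// one; every successful write bumps the version monotonically.
func (s *Store) Update(o Object) error {
	cur := s.objs[o.Name]
	if o.ResourceVersion != cur.ResourceVersion {
		return ErrConflict // caller must re-read the object and retry
	}
	o.ResourceVersion++
	s.objs[o.Name] = o
	return nil
}

func main() {
	s := &Store{objs: map[string]Object{"pod-a": {Name: "pod-a", ResourceVersion: 1}}}

	// Two controllers read the object at the same version...
	a := s.objs["pod-a"]
	b := s.objs["pod-a"]

	a.Data = "written by A"
	fmt.Println(s.Update(a)) // A wins: version now 2

	b.Data = "written by B"
	fmt.Println(s.Update(b)) // B loses: its version 1 is stale
}
```

This is exactly why a controller that falls behind can't silently clobber another controller's write: its stale resource version gives it away.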
If you want more insight into how resource versions map onto etcd revisions, there was a great talk by a friend and colleague, Priyanka Saggu, on Tuesday, so you should check that out. So, coming to the storage layer. Let's look at what the storage layer looked like in the past, so we can better understand how it is right now. We have the API server and we have etcd. A client comes up and opens a watch request for pods. Great — the API server opens a watch request against etcd. Another client comes up and opens a watch request for pods. The API server says, okay, cool, I know how to open a watch request, and opens another one against etcd. Another client comes up, opens a watch request, great. This is fine, right? I mean, Kubernetes doesn't really run scalable applications — it's meant to run my personal blog that no one ever visits. But that's not the case. More and more clients might open watch requests for pods and other resources, and the API server is going to open a separate watch connection to etcd every single time. This isn't scalable. And it's not just that it opens a separate watch request every time: etcd has to send all of this data — package it up, compress it if compression is enabled, marshal it, all of that — and send it to the API server. The API server takes this data, unmarshals it, performs conversion, prepares a response, and sends it back. So there's a whole lot of work happening on both the etcd side and the API server side; both are loaded in this case. What this means is that if you run more replicas of your controller for higher availability, beyond a certain point higher availability results in less scalability, because you're opening that many more duplicate watch connections to etcd. So how does the Kubernetes storage layer look in the present?
And these are diagrams I'm very, very proud of, because they look so pretty and they're all pink. So let's take a look. As with any computer science problem, we solved this one with a layer of indirection. We have our API server and all these clients opening watch requests, but let's zoom in a little. Inside the API server we have something called the Cacher. All of these names in brackets that I mention are actual Kubernetes type names, so in case you want to deep-dive into the code later for whatever reason, you can follow along with these slides as a resource — I found that to be a little challenging, and hopefully this helps you. So you have the Cacher, and then you have something called the store. The store is nothing but a Golang map, with some indexing support built on top of it. The idea is that the store is meant to be a reflection of the state in etcd; that's the whole idea. You also have something called the watch cache. This is a FIFO buffer of finite size that exists alongside the store. Now, the watch cache is populated asynchronously from etcd, because the Cacher opens a watch connection against etcd. So all of these added events, modified events, deleted events that happen in etcd get propagated back into the watch cache, and the watch cache populates all of that into the store via callbacks. Now a client comes up and opens a watch for pods. Great, we have a connection open. That watch request can be served either from the watch cache or from the store — as we'll soon see, I'll tell you what each of those means. Another client comes up and opens a watch request; it's served from the store, for example. Another client comes up; it's served from the watch cache, for example.
But the interesting part is that each of these clients has come up and opened a watch request, and we haven't opened a single new watch connection to etcd. We've kept just the one watch connection, and it's helping us serve all of these requests. Now, that's for pods. But what about ReplicaSets, services, all of those other object types? You have one Cacher per object type — per group resource, in Kubernetes terminology. All of these are created at startup time, when the API server is initialized. So you have all of these clients coming in, opening watch requests; great, we have a Cacher for each of those types, and if you have multiple clients for the same type, the same Cacher serves all of them. So: the store component is meant to reflect the state of etcd, as I said, and a Cacher per object type is created at API server startup time. The caching layer can also be disabled altogether if you don't want it, because sometimes you have really, really tight memory constraints and a caching layer might just add to that. If you choose to disable the Cacher, you're essentially saying: I need strong data consistency that I'm always going to go to etcd for, and I don't mind the extra latency that comes with that. We'll talk about what that means soon. But you have the option of disabling caching altogether — and if you want to keep caching for some object types but not others, you can do that as well; there are flags on the API server for disabling caching on a per-object-type basis. Now we've seen what the three layers of storage look like. But to understand how they impact scalability for your requests, and how you can interact with them in a better way while understanding the trade-offs, it's important to look at how different requests interact with the present storage layer. Before we do that, we need to look at this thing called resource versions again.
We need to look at how they're interpreted by Kubernetes. Because — I'm not gonna lie — this gets hairy. The reason for that is that we don't want to break our users. Kubernetes did things a certain way a while ago, and then, in order to scale while maintaining compatibility, we had to make some not-so-neat trade-offs in how resource versions are interpreted. So I'm going to give you a brief overview. It's still not the whole picture, but I'll link to where you can read more. In each type of read request, you can pass a resourceVersion parameter. That is, you can say: hey, give me all the pods starting at this resource version; or, give me all the ReplicaSets at this resource version; or you can even say, I don't care about consistency at all, just give me whatever you know. That's also an option — least latency, but arbitrarily stale data. So knowing how behavior changes with resource version interpretation can be crucial to scalability. This applies to get requests — get, list (a list is essentially a get of a collection), and watch (a watch is essentially a get with a watch parameter attached). If resourceVersion is an empty string — that is, not set — it's interpreted as: give me the most recent data. What that means is that most of these requests with an empty resource version go to etcd by default, because etcd is the source of truth; it has to have the most recent data. If the resource version is the string zero, it means any data is fine — arbitrarily stale data is also fine. And finally, you can give some specific resource version N, which means data at N — and this has two interpretations that I'll get to. Before I do: most recent data is ensured by doing a quorum read in etcd.
Whenever a read request goes to etcd via this empty resource version, etcd does a round of Raft. It makes sure you get the most consistent read, and all reads in etcd — at least all reads that Kubernetes issues to etcd — are linearizable by default. (There is an option in etcd for serializable reads and transactions.) So that's what the empty resource version means. Now, I said that resourceVersion equal to N is interpreted in two different ways. That's via this parameter called resourceVersionMatch, which you have to specify if you specify a resource version N in your request. There are two options: NotOlderThan and Exact. NotOlderThan says: give me data that is at least as new as N. Exact says: give me exactly N; I want the data at resource version N. You either get N, or you get a response saying that N has been compacted — "resource version too old", a 410 error code in Kubernetes, I think. As I said, this still isn't the full picture; please see that link for more information. Okay, so now let's look at how each request behaves in Kubernetes. First, the create request. Create requests go straight to etcd, and the object is created in etcd. Great. But as I said, the watch cache is watching etcd right now, so as soon as an object is created in etcd, that change is propagated back to the watch cache, which propagates it into the store. So now we have a consistent view of etcd, including the created object, back in the watch cache, and when you do reads, it's available there. Delete is the same way. You send the request directly to etcd, but there's an extra step here if you want to read the code: we try to delete the specific version of the object that is present in etcd.
Because, as I said, you can have multiple versions of objects depending on how things are updated, since etcd implements multi-version concurrency control. So we try to delete the version of the object that the API server knows about, and if that's not possible, we just delete the object in etcd. As usual, that change is propagated back to the watch cache via the watch that's open against etcd. Update operations: same thing. You try to update the version of the object that the API server knows about; if not, you just update the object, and all the changes propagate back via the watch. Now, the get request. This is where things start to get interesting, and this is where I'll probably start rambling a little — just a heads-up; if that happens, we can talk offline. If resourceVersion is the empty string, as I said, it means: give me the most recent, most consistent data. So the read goes to etcd; etcd does a quorum read, a round of Raft happens, and so on. If it's zero, it means any arbitrarily stale data is fine. So the request goes to the watch cache, which just does a lookup in the map, and we're like: okay, this is the data, cool, I don't care how old it is, just give it to me. But if you specify resource version N, the watch cache actually waits a little while for the data in etcd to be propagated back to it. As soon as the data is propagated, it returns it — the request does not go to etcd; all of this happens asynchronously. It waits approximately three seconds, and if the data hasn't propagated back in three seconds, it returns an error code. If you're using the default Kubernetes clients, this error is interpreted as a retry, and on retry the request goes to etcd directly. So you don't need to intervene at all.
You just try to do a low-latency read from the cache; if it doesn't work out, cool, you go to etcd, which is guaranteed to have the answer for you. So once that happens, the read is served, and so on. Now, get list. This is where I said things get a little rambly. This is the function that decides whether the request goes to etcd or not, and we're going to walk through it line by line, break it down, and try to understand how you can maximize scalability here. The first check is "consistent read from storage": if resourceVersion is empty, that by default means go to etcd, because we want the most recent data. Straightforward enough. Second, "has continuation". Remember, I told you initially that Kubernetes supports paginating list requests: you can say, give me the first hundred pods; it gives you the first hundred pods plus a continuation token, and on the next list request you provide that token and get the next hundred pods, and so on. But if you specify a continuation token in your list request, it is always going to go to etcd. The reason is that the watch cache does not support pagination; it cannot serve paginated list requests just yet. I actually did some work on the watch cache to paginate it, and it worked pretty well, but it adds a lot of complexity, so I don't know what's going to happen with it — the work was pretty interesting, though. Instead of implementing the store as a map, we implemented it using B-trees, with copy-on-write semantics and a whole bunch of cool stuff. So if you want to talk about that later, we can geek out. Next, "has limit". When I was doing my dry runs, this is the part where I felt like the crazy-CNCF-landscape meme trying to explain it. But I'm going to try to break this down.
Let's look at the first part of the condition: pred.Limit. If I specify a limit, it means I'm going to specify a continuation later, and the watch cache does not support continuations or pagination, so we send that request to etcd. That's the first aspect of it. If no limit is set, we can serve the list from the watch cache itself, because no limit means we're not going to specify a continuation, and the watch cache is perfectly capable of serving a plain list request by itself. However, if we set a limit and also pass a resource version equal to zero, we just disregard the limit altogether. We say: I don't care what you told me, I'm going to serve this list request from the watch cache no matter what. So when I specify resource version zero, the limit is disregarded entirely and that condition evaluates to false. Let's try to understand why. First, resource version zero has "any data" semantics — I don't care about consistency — so serving it from the watch cache makes sense. But why do we disregard the limit altogether? More importantly, it allows us to support requests that we know are going to be extremely large — so large that etcd would face significant load serving them. What I mean is: in a large cluster, you can have nodes on the order of thousands, and each node can have pods on the order of hundreds. Now, if the kubelet or the StatefulSet controller were to list all these pods, that's a huge amount of load on etcd. So whenever the initial list request is issued, we ensure the resource version is set to zero, so that the watch cache is always, always the one that serves it. It does not go to etcd, so we don't overload etcd.
We serve it from the watch cache with arbitrarily stale data, because we just want an initial idea of what those resource versions look like and what the current state looks like. Later on, as and when we get more and more events, we can incrementally build up to the actual current state. And again, you have the "unsupported match" option: the watch cache only supports NotOlderThan. If you specify Exact semantics — because you want data at exactly that version — the watch cache can't guarantee that, because at the end of the day it's a caching layer. So we end up sending the request to etcd again. So the only time we serve a list request from the watch cache is when you specify a non-empty resource version, it is not a paginated list, and you specify NotOlderThan semantics. In other words, if you don't require strong consistency guarantees, it's always beneficial to specify a resource version and do a non-paginated list — if you're okay with that and your cluster can tolerate it — because you're going to get really, really good performance benefits from it. Great — so there are a few gotchas to keep in mind here. We built all this complexity, everything is done, we're good to go — but no. This also has some problems, and that's why I'm here to talk about it. When you need consistent lists and the request goes to etcd, the API server can also see spikes in memory. We saw that we force a resource version of zero, we disregard the limit, all those scalability hacks — but we still see memory spikes, and that's because when we get data from etcd, it's unmarshalled under a lock, type conversions happen, and we prepare a response under a lock — well, not anymore: some parts of it are no longer prepared under a lock, and I have another talk about that; I can send you the links.
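The routing decision walked through above — empty resource version, continuation, limit, and match semantics — can be condensed into one boolean function. This is a sketch of the logic as described in the talk, not the actual Kubernetes source: the ListOptions fields and the delegateToEtcd name are simplifications.

```go
package main

import "fmt"

// ListOptions mirrors the parameters discussed above, in simplified form.
type ListOptions struct {
	ResourceVersion string // "" = most recent, "0" = any data, "N" = at version N
	Continue        string // non-empty means a paginated follow-up request
	Limit           int64
	Match           string // "NotOlderThan" or "Exact"
}

// delegateToEtcd reports whether a list request bypasses the watch cache.
func delegateToEtcd(o ListOptions) bool {
	consistentRead := o.ResourceVersion == ""                  // empty RV: quorum read from etcd
	hasContinuation := o.Continue != ""                        // the watch cache cannot paginate
	hasLimit := o.Limit > 0 && o.ResourceVersion != "0"        // RV=0 disregards the limit entirely
	unsupportedMatch := o.Match == "Exact"                     // the cache only supports NotOlderThan
	return consistentRead || hasContinuation || hasLimit || unsupportedMatch
}

func main() {
	fmt.Println(delegateToEtcd(ListOptions{ResourceVersion: ""}))                     // etcd: consistent read
	fmt.Println(delegateToEtcd(ListOptions{ResourceVersion: "0", Limit: 500}))        // cache: limit is ignored at RV=0
	fmt.Println(delegateToEtcd(ListOptions{ResourceVersion: "1452", Match: "Exact"})) // etcd: Exact unsupported
}
```

Reading it this way makes the scalability advice concrete: only the non-empty-RV, non-paginated, NotOlderThan combination stays in the cache.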
But yeah, you do a lot of work at the API server, most of it under a lock, just to serve the data back, and that's because you're requesting huge amounts of data from etcd. And sometimes paginating responses won't help either, if each chunk of the response is huge in itself. If I have 100,000 pods and I ask for, say, 1,000 at a time, that's okay, but it's going to take me a lot of time; and if I ask for bigger and bigger chunks, I get bigger and bigger memory footprints, and so on. So there is a KEP right now in the Kubernetes community called WatchList. It proposes serving list requests using watch semantics, so you can stream list data back. This gives you predictable memory footprints, and it works around the lack of pagination in the watch cache, because now you can serve every list request, even a small chunk at a time, in a streaming manner. This is in alpha right now as of 1.28, so if it's of interest to you, please try it out and provide feedback to the community. It's set to go to beta in 1.29, which is set to release at the end of November. Another gotcha with get list is time traveling. There's something called stale reads from the watch cache, and it usually happens in an HA setup. Say you have two Kubernetes API servers, each with the watch cache enabled, and each can have arbitrarily stale data. Whenever you do that first initial list — which is now forced to a zero resource version, meaning give me any data — you can get data that you've already seen in the past, because one of those API servers might be arbitrarily far behind. And as soon as you see that data, you start reacting to it and taking wrong actions. So you're vulnerable to something called stale reads in controllers, and this happens in HA API server setups. Externally to Kubernetes, there are a few tools that have come out of collaboration between industry and academia.
Two such tools are Sieve and Acto, which can automatically detect these issues and more through automated reliability testing of your controllers. And we have researchers who work on them in the room right now — we have Tianyin and we also have Thrivik Raman here. So if this is of interest to you, please go talk to them after this. Now, within Kubernetes itself, there are a couple of KEPs attempting to solve this problem. The first one I talked about is WatchList: we stream data back using the watch we already have open against etcd, and we use the same semantics to serve list requests. Essentially, you're serving data as and when it comes; you're not waiting on a lock and serving arbitrarily stale data. Then you have another KEP called "consistent reads from cache." This KEP argues that the data in the watch cache is, most of the time, recent enough — we don't need to go to etcd every time. It works on the basis that if you provide a specific resource version, or if you don't provide a resource version at all, it waits for the cache to be populated — for the cache to be consistent enough (what "enough" means, you can read in the KEP) — and after that, it serves the list request from the watch cache itself instead of going back to etcd. This is also in alpha right now, so if you're interested, try it out and provide feedback, and hopefully we can move it to beta in a future Kubernetes release. You also get some nice performance benefits from these KEPs. For the WatchList KEP — I don't think the numbers are visible, but the left side is before, with CPU averaging around four to five cores, and the right side is after WatchList is enabled.
This is the CPU footprint of the API server: on the right, it's averaging around 1.2 cores. So you get some nice performance benefits, and the same goes for memory as well. And similarly, for the consistent-reads-from-cache KEP, there are some nice preliminary performance improvements as well, not just for the API server but also for etcd. Now we're done with list; finally, we can talk about the watch request. Nothing too special here — we've already covered most of it. If the resource version is an empty string, we delegate the request to etcd, as always. Otherwise, we serve it from the watch cache. The way we do that is by creating something called the cacheWatcher — again, I'm naming these things so that when you go look at the code, it's familiar to you. The cacheWatcher helps serve watch events to each of the clients, and it also helps deduplicate some of the objects that need to be sent back, so it's a pretty nice abstraction. The way the cacheWatcher works is that it statically allocates an input buffer at creation time, sized depending on what type of object is being watched; the sizes of these allocations are determined through heuristics that came out of scale testing in the Kubernetes community. As soon as the buffer becomes full, we terminate the watch, the client gets a 410, the watch is re-established, and things continue. So the cost of not keeping up with watch events is essentially establishing a new connection: as soon as the buffer is full, we stop, and to keep up we establish a new watch and keep going. And establishing a new watch has a cost of its own.
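The buffer-full behavior described above can be sketched with a buffered channel and a non-blocking send. This is a toy model of the idea, not the real cacheWatcher code: the real one carries typed watch events and has a more involved termination path, and the buffer sizing heuristics are elided entirely.

```go
package main

import "fmt"

// cacheWatcher models one client's watch: a statically sized input buffer
// of pending events, and a terminated flag set when the client falls behind.
type cacheWatcher struct {
	input      chan string
	terminated bool
}

func newCacheWatcher(size int) *cacheWatcher {
	return &cacheWatcher{input: make(chan string, size)}
}

// add pushes an event with a non-blocking send; a full buffer means the
// client can't keep up, so the watch is terminated (and must be re-established).
func (w *cacheWatcher) add(event string) bool {
	select {
	case w.input <- event:
		return true
	default:
		w.terminated = true // client sees the watch close and re-watches
		return false
	}
}

func main() {
	w := newCacheWatcher(2)
	fmt.Println(w.add("ADDED pod-0"))    // fits in the buffer
	fmt.Println(w.add("MODIFIED pod-0")) // fits in the buffer
	fmt.Println(w.add("DELETED pod-0"))  // buffer full: watch terminated
	fmt.Println(w.terminated)
}
```

You can see why a too-small static size hurts: every overflow forces the client through the full cost of tearing down and re-establishing a watch.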
And this can easily happen if, for example, you have a lot of events but the statically allocated size is very small, because heuristics are heuristics at the end of the day — they aren't fully accurate. A slow client, a slow server, or a storm of rapid updates is enough to necessitate a new watch connection. So there's some very interesting discussion happening in the community about how we can size this buffer dynamically, and the issue has some really good analysis of how that could be done in the future. The person behind it is also, by the way, the person who implemented the WatchList feature, so if you get a chance to talk to him, you should. Okay, we're done. That was a lot. In conclusion: the list-plus-watch pattern is the central theme of how the Kubernetes machine works. Different requests interact differently with each layer of the Kubernetes storage layer, depending on the resource version specified, the resource version match specified, and the combination of the two. Specifying these parameters carefully lets you make the trade-off between data consistency and latency, and that can majorly impact the scalability of your cluster. Unless you have strict consistency requirements, rely on the watch cache — trust it — but also be mindful of the stale reads problem. Thank you for listening, and for coming to the last session of the day. If you want to reach out, here are a few ways to reach me.