Hello everyone, I'm Madhu CS. I'm a software engineer at Robinhood. We are a financial services company, and I'm the tech lead for a team we call Container Orchestration, which is responsible for running all of compute at Robinhood. Robinhood has chosen Kubernetes as its compute platform of choice, so pretty much all of our compute, except for some legacy stuff, is on Kubernetes.

Today I'll be talking about controllers and how we pushed the limits of controllers. These are production war stories. But before I get into the details of the agenda, I want to say this: we have prepared this talk to be for people at all levels. Everybody is welcome. That said, if you are an experienced cluster operator, some of the overview concepts here might be a repetition for you, but we really hope you can take some of these things back to your company as inspiration, pay a little closer attention to some of these areas, or revisit some of these concepts and see what is working and what is not. If you are a beginner, there is definitely an overview in the beginning, but some of the specifics may go over your head, and that's okay. You don't have to take everything here, but we hope it will give you a solid grounding in the concepts, so that when you're ready to write your own controllers you can use all these lessons and learnings. Hopefully it will open the hood a little bit and show the machinery inside, so you know how exactly these things work.

Coming back to the topic itself: the Kubernetes controller pattern is a highly scalable pattern, right? We have seen how Kubernetes scales. But nothing is infinitely scalable, not even /dev/null, right?
So I'll talk about how we have pushed the limits of this pattern, the incidents and outages we ran into when we tried to do that, the lessons we learned from them, how we solved those problems, and the ongoing work right now to ensure we don't run into these limits again and again. I'll begin with a quick overview of controllers, and then we'll do two case studies; that's where I'll be spending most of my time. The case studies are about where we went too far, where we hit the limits of the pattern, and then we'll talk about the best practices and lessons learned.

So this is the overview. The Kubernetes philosophy is that you, the user of Kubernetes, submit a declarative intent: you submit the intent of what you want the world to look like, which is called the desired state. Then there are controllers reading this intent and taking action to bring the current state of the world to the declared state of the world. The controllers are constantly observing both states of the world: the desired state as it is in the intent store, which is etcd (I'll get to it in a bit), and the current state. They are always, always trying to bring the current state of the world to the desired state of the world. The key concept to remember is that it's always in that direction: the current state is always brought to match the desired state, not the other way around. If the desired state changes, the controller does its work to bring the current state to the newer desired state. If the current state changes for some reason, because somebody went and did something manually in the infrastructure, it again tries to bring the current state to match the desired state. This is the key concept, and this is the philosophy on which everything in Kubernetes is built.

There is official documentation around the concept of controllers, and one of the interesting examples it talks about, the classic example, is a thermostat in the physical world. If you set your thermostat to 70 degrees Fahrenheit, the thermostat is constantly observing the temperature of the room. It does not immediately make the room 70 degrees Fahrenheit; it observes the temperature of the room, and every time it's not 70 degrees, it either turns on the heater to emit heat or the AC to cool the room down, to bring the room temperature to 70 degrees Fahrenheit. There is more about this analogy in the documentation; you should have the link in the slides I'll share.

Bringing this to the Kubernetes world, and to be more specific, the kubelet: although not called a controller, the agents that run on Kubernetes nodes are controllers too. They are node agents, but they are controllers. They watch the API server, which fronts the intent store, to see if they have to run new pods, and if any pods get assigned to the node and those pods aren't running on it, the kubelet's job is to instruct the local container runtime on the node to start those new pods. So kubelets are controllers too.

With that, we'll talk a little bit about the specifics of the implementation of a controller. So far we talked about the semantics of the API; now we'll talk about the semantics of the controller. This is the actual machinery that we use today for writing controllers, right?
If you go look at client-go, which is the client library used for writing controllers — you may be using higher-level libraries as well, like controller-runtime with its builders, but overall they all depend on this machinery — when the controller starts up, it contacts the intent store, via the API server, to get the current desired state of the world. It tries to get the entire snapshot of the desired state when it starts up, to make sure it knows what the current desired state is. This is a very expensive operation, because when it does this list, it asks for a snapshot of the entire desired state of the world for all the resources it is interested in. The key here, and this goes into where we pushed the limits, the one thing I want all of you to remember, is that this list operation is very expensive.

So what does the controller do? Well, it could keep listing once in a while, periodically polling the API server and re-fetching the desired state of the world, but we have established that it's already an expensive operation, right? So you don't want to be doing that often. Instead, the machinery lets you use the watch API, where the controller can say: I now have the entire snapshot of the desired state of the world; from now on, give me only the changes to the desired state. The controller says, I have listed; from now on, send me watch events, which are the changes to the objects or resources I'm interested in. But that's not all there is to it, right? If it were only this, for today's computers it's not a big deal.
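The list-then-watch flow just described can be sketched with a toy in-memory store. This is an illustration of the contract, not client-go itself; `Store`, `Record`, and the integer resource versions are all simplifications of what the real API server does:

```go
package main

import "fmt"

// Event is one change to the desired state, stamped with a monotonically
// increasing resource version.
type Event struct {
	RV   int
	Name string
}

// Store is a toy intent store that remembers every change it has seen.
type Store struct {
	rv     int
	events []Event
}

func (s *Store) Record(name string) {
	s.rv++
	s.events = append(s.events, Event{RV: s.rv, Name: name})
}

// List returns a full snapshot plus the resource version it reflects.
// This is the expensive call: it touches every object.
func (s *Store) List() (snapshot map[string]bool, rv int) {
	snapshot = map[string]bool{}
	for _, e := range s.events {
		snapshot[e.Name] = true
	}
	return snapshot, s.rv
}

// Watch returns only the changes after the given resource version —
// the cheap, incremental path a controller uses after its initial list.
func (s *Store) Watch(sinceRV int) []Event {
	var out []Event
	for _, e := range s.events {
		if e.RV > sinceRV {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	s := &Store{}
	s.Record("pod-a")
	s.Record("pod-b")

	snapshot, rv := s.List() // pay the full-snapshot cost exactly once
	fmt.Println(len(snapshot), rv)

	s.Record("pod-c")               // a change after the list
	for _, e := range s.Watch(rv) { // deltas only from here on
		fmt.Println(e.Name)
	}
}
```

The resource version returned by the list is what lets the watch pick up exactly where the snapshot left off, which is the consistency guarantee discussed next.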
We are replicating data here; that alone is not really expensive. But that said, controllers also have to do some additional work. They're not just replicating or duplicating data from the API server; they do real work for the change events on those resources, and then update the status back. And the order of magnitude of resources each controller watches is also very different. For some APIs there are only on the order of hundreds of resources — if you're thinking about something like CRDs, the CRD controller is only watching hundreds, maybe thousands, of resources — but if you're thinking about things like pods, there can be half a million pods within a cluster, because that's the scalability limit of Kubernetes in large clusters today. So if you are getting tens of thousands or hundreds of thousands of pods and then doing some work on those pods, it gets really expensive if you keep polling and walking through all the pods all the time, even when they haven't changed. That's why we need this complex machinery of a list followed by watches. One other important thing to remember is the Kubernetes API machinery, or rather the client machinery.
It does this so well that after the client gets the snapshot, it can instruct the API server to send all the events since that snapshot. Consistency is maintained, in the sense that the controller is guaranteed not to miss any of the changes that happen to the resources after the list response was sent out. So these are the key concepts.

And here is a hilarious example that we came across. Oh, one other thing I did not introduce: Kubernetes also has this concept of CRDs, custom resource definitions, which is the core foundation of the Kubernetes extensibility model. It lets you, the user of Kubernetes, extend Kubernetes the way you want. You can define your own API as a schema and have your users use those schemas or APIs to get something done. And — sorry, even before I get to the example — you can define your own schemas and then write your own controller, because schemas themselves, as we have seen, don't do anything, right? They just get stored as API resources in the intent store. You need a controller to take some action on top of those APIs. Kubernetes allows you to write those controllers however you want, and the interesting thing about Kubernetes is that it doesn't limit you to writing controllers against just your own CRDs or APIs. It lets you write controllers against existing native APIs as well, and controllers that cross the boundary between your custom APIs and the native APIs, which leads to an extremely powerful pattern.

So, coming back to this hilarious example: this is a real GitHub project, by the way — you can follow the link. There is this pizza controller. There are two CRDs here, a PizzaStore CRD and a PizzaOrder CRD. I don't think anybody is opening a pizza store here, so I'm not going to talk about that. But we're all interested in eating pizza, right?
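For concreteness, a CRD of this shape is just a schema registered with the API server. Here is a rough sketch of what a PizzaOrder CRD manifest could look like — the group, field names, and schema are assumptions for illustration, not taken from the actual project:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: pizzaorders.example.dev   # illustrative group and name
spec:
  group: example.dev
  scope: Namespaced
  names:
    kind: PizzaOrder
    plural: pizzaorders
    singular: pizzaorder
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                size:
                  type: string
                toppings:
                  type: array
                  items:
                    type: string
```

On its own, this registers a new API type and nothing more; as described above, a controller has to watch these objects for anything to actually happen.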
So there is the PizzaOrder CRD. What it does is: if you want a pizza, you can submit a PizzaOrder resource to the API server where the CRD exists, and the controller makes a call to the Domino's API to order a pizza. Unfortunately, that's all it can do today. It can't tell you whether you're full, or whether you're satisfied with the pizza. We keep hearing from the authors that that's coming in v2, but let's see when v2 happens. This shows both CRDs and controllers, and taking the example a little further, you could consider writing your own controller that watches for new service accounts created in your Kubernetes cluster and orders a complimentary pizza for every new person who comes into the cluster. That's the extensibility power here.

With all this introduction, let's dive into our first case study. The title here is a little clickbait, I would say, but hopefully it makes sense toward the end of the case study. I was on call a few months ago, and I got a bunch of alerts, and people started complaining that cron jobs could no longer resolve DNS. That was interesting, because anybody who has worked in infrastructure recognizes this sort of page or alert, right? What is it about cron jobs and DNS? Why would DNS resolution fail specifically for cron jobs? Well, the real answer is that it does not make any sense; there is no correlation. Experienced or not, it really does not make sense. So we started digging deeper.
I gathered an incident response team. We started hearing from more and more application developers and backend service developers that they were seeing this, and we quickly discovered why it looked like cron jobs: it was not just cron jobs. Any new pod, any new workload that came up, had this problem, and the actual problem was not even just DNS. Any new pod that came up would have a total network blackout for a few minutes, anywhere between two and ten minutes, and then it would resolve itself. No packets would go out — I mean, packets would actually go out, but they would not come back in; the return path wasn't working. And the problem was that the blackout time was typically longer than the readiness probes for most of these pods, so the pods, or the containers, would constantly die, come back up, see the problem again, and this would happen over and over and over.

Fortunately for us, running pods continued to work. This is the core tenet of Kubernetes container orchestration design: even if there is a problem with your control plane, the programming of pods and everything, the data plane is supposed to continue to work. So the data plane was healthy, existing pods continued to work; we just couldn't do scaling operations or deploy new versions of software, things like that. So fortunately, we did not have any customer impact.

With that, I want to take a slight detour to introduce you to the networking stack that we run, so that you can understand better what happened here. We are a full AWS shop, and we started our Kubernetes journey about four years ago, or a little longer than four years ago. We have been migrating all our services from raw EC2 VMs to our self-managed Kubernetes clusters, but migrations are multi-year efforts, right?
So we knew we had to live in a dual world where there were workloads on Kubernetes and workloads on EC2. At the moment we have migrated pretty much all our stateless workloads onto Kubernetes, and there are only a few stateful workloads left on EC2, but we still continue to live in this world. In the AWS world, if you want to do authentication or authorization, the easiest way, especially in the raw EC2 world, is to use AWS security groups, right? Security groups are essentially network segmentation as a proxy for your authentication and authorization. It's a poor man's solution, but that's what we have, and we wanted to retain it for our Kubernetes workloads too.

So what we built is an in-house component, a system called Calico cloud controllers, because our networking stack on Kubernetes is based on Calico. What the Calico cloud controllers do is read the security group configuration in AWS: they periodically poll AWS, making DescribeSecurityGroups and DescribeNetworkInterfaces calls, and sync that information into the Kubernetes clusters by creating GlobalNetworkPolicy or NetworkPolicy custom resources in the cluster. In Calico there is this component called Typha — I'll get into that a little later — but overall, the Calico node agents are essentially controllers as well. They watch for these network policy resources, and every time a new pod comes in, they program the iptables rules on the local node to ensure that the security group rules are enforced at the iptables layer: pods can either send packets or not, depending on the iptables configuration. And that iptables configuration is itself a translation of our security group configuration. So this is the networking stack we have, and this is roughly how it looks: if pod A is allowed to talk to pod B in a different application, the traffic can flow; otherwise it cannot.

Now, with that conceptual overview, let's get back to the incident itself. As we spent time digging into things, we looked at all sorts of logs. We run the AWS VPC CNI plugin, which is responsible for IP acquisition for every pod that comes up and for setting up the routes for pods as IPs are acquired and assigned to them. It is supposed to do the full route management, the routing rules management, for the pod. So we looked at the CNI logs, and we looked at the Calico logs, to try to understand what was going on. And then we found this — I don't know if you all can read it — a really interesting log line in the Calico node agent logs. Calico was saying: removing old route. And it corresponded to the pods that had just come up. That was interesting, because what is an "old route"? These pods are brand new, very fresh, right? So the trigger was unclear, but we at least understood what was going on with Calico. So we put some mitigations in place so that we could open for market the following day — yeah, we also have a big stock brokerage product, right?
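To make the detour concrete, the security-group sync described above produces Calico policy resources roughly along these lines. This is a hedged sketch; the names, selectors, and port are invented for illustration, and our real synced policies are more involved:

```yaml
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: sg-app-a-to-app-b    # illustrative: one synced security group rule
spec:
  selector: app == 'app-b'   # pods standing in for the security group's members
  types:
    - Ingress
  ingress:
    - action: Allow
      protocol: TCP
      source:
        selector: app == 'app-a'   # only app-a pods may reach app-b
      destination:
        ports:
          - 443
```

The Calico node agents watch resources like this and translate them into iptables rules on each node, which is how the security-group semantics end up enforced on the data path.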
So we put this mitigation in place to give us some more time to root cause what was going on and fix it. Then we went on a root-causing expedition, which was all about digging deeper and deeper into what was going on, and a lot of code reading. That's how many of our incidents look, and that's going to be one of the lessons learned here.

We looked at some of the dashboards. We have a really nice dashboard for all the Calico operational metrics, and the first two things you see when you open it are the Typha ping latency graph and Typha's connection drop rate. Typha is a component in the Calico ecosystem which acts as a middle layer, an intermediate system that caches the updates from the API server and does connection pooling. It does that because for large clusters like ours, with thousands of nodes per cluster, if every Calico node agent talked to the API server directly, the API server just doesn't scale; it completely falls over. So Typha does both the caching and the pooling so that not every Calico agent has to go to the API server. But Typha also runs periodic checks against the Calico node agents to make sure they are healthy, and one of the things it does is send ping requests to the node agents to make sure they are fine.

So we saw the ping latencies from the Typha side go up, and if you look at this graph, the time at which it started going up perfectly correlated with the time the incident started. This is exactly when I started getting paged and people started complaining. So that was interesting. At this point everything was pointing to Calico being the problem, but we were still not sure what was up with Calico. So we started reading the Calico code. The other question was: why now? We had handled much higher loads during previous black swan events. We looked at pod churn rate.
We looked at the security group change rate. There wasn't enough pod churn or security group change at the time, so it did not make sense. There was one workstream within the incident response group reading the code and trying to make sense of Calico and Typha, and there was another group of people just looking at the metrics. And somebody came across this graph: the QPS rate for service accounts. It's hard to come across this graph, because Kubernetes has hundreds of resources or API types, right? It's just not practical to build separate graphs or panels for every single API type, and if you aggregate everything into a single panel, the problem is that the rate of requests flowing through for pods is so high that it drowns out everything else in the graph; every other API request type looks like a zero line. So it's really hard to see what is going on with which API type. We have tried a bunch of things with anomaly detection, but nothing has really worked for us.

But we found this graph, and if you look at it, this is the service account mutation rate, the update rate, and there was a really sharp increase here, which again perfectly correlated with when the incident started. More interestingly, we saw this across all our production clusters. Then we started digging deeper and deeper, pulling in the on-calls from various partner teams, and we realized that there was a potentially faulty change that had been introduced into our in-house application development framework. I'll take a minute to explain what our framework is.
We have this framework called Archetype, which is an abstraction we provide for application teams to develop their applications, backend services, whatever lingo or terminology you want to use. In Archetype we are completely bought into the Kubernetes style of APIs: the declarative modeling, the declarative style of APIs, controllers, and everything. So Archetype is essentially a bunch of CRDs. Specifically, we have a CRD called the Application CRD, and we have a CRD called the Component CRD. The idea is that a given application is composed of a bunch of components, there are controllers for each of these, different components have different types of controllers, and so on.

So what happened here? As I said, each Archetype application can have multiple components; usually an application is composed of multiple components, right? Different server instances, workers, DaemonSets, etc. And there was a change pushed to the component controllers to add annotations to the Kubernetes native ServiceAccount objects. What this annotation was: the controller added a computed signature of the owning component to the service account. Whatever component the service account was supposed to belong to, a signature was computed for that component, and that signature was added as a Kubernetes annotation on the ServiceAccount object. Interestingly, and unfortunately, at that point in time — there was ongoing work here, but it hadn't landed yet — a single service account could be shared by multiple components.
We were working on, or discussing, designs for splitting that out; we wanted to have one service account per component, but that work hadn't landed. So what happened, and this is where the "dueling" in the title comes in, is that the controllers ended up fighting with each other. One controller would come in and say, this is the owning component, and add the signature for it. Another controller would see that, decide no, that's not the owning component, and rewrite the signature. This generated such a large number of update, or mutation, operations on the ServiceAccount objects that it completely overwhelmed the system.

And this is where things get really interesting. What do service accounts have to do with networking, right? We were confused at that point, and then, as we were reading the code — somebody on the team also knew about this — we realized something. This is where all this incident response and these war stories come in: even if you theoretically know about things, it's hard to remember or recollect them when you are in high-pressure situations like these production incidents. Somebody realized that Kubernetes network policies also have a mode where implementations of network policies can program policies based on service accounts, and Calico does that. We don't use service-account-based network policies at Robinhood, but unfortunately there is no way to tell Calico not to watch service accounts. Even though we don't use the feature, it just continues watching service accounts.
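The fight described above can be boiled down to a tiny simulation: two controllers that each believe they own a shared service account, rewriting the same annotation forever. This is an illustrative model, not our Archetype code; all the names here are made up:

```go
package main

import "fmt"

// ServiceAccount is a stand-in for the Kubernetes object both controllers
// were writing to.
type ServiceAccount struct {
	Annotations map[string]string
}

// ComponentController models one component controller that stamps its own
// computed signature onto the service account it believes it owns.
type ComponentController struct {
	Signature string
}

// Sync writes the controller's signature if the annotation differs, and
// reports whether it issued an update (a mutation against the API server).
func (c *ComponentController) Sync(sa *ServiceAccount) bool {
	if sa.Annotations["owner-signature"] == c.Signature {
		return false // converged, nothing to do
	}
	sa.Annotations["owner-signature"] = c.Signature
	return true
}

func main() {
	sa := &ServiceAccount{Annotations: map[string]string{}}
	a := &ComponentController{Signature: "sig-component-a"}
	b := &ComponentController{Signature: "sig-component-b"}

	updates := 0
	// Each controller reacts to the other's write, so the system never
	// converges: every round produces two more mutations.
	for round := 0; round < 100; round++ {
		if a.Sync(sa) {
			updates++
		}
		if b.Sync(sa) {
			updates++
		}
	}
	fmt.Println(updates) // 200: every single reconcile issued an update
}
```

Note that each controller is individually correct and level-triggered; the runaway mutation rate only appears once two of them run against the same shared object, which is exactly why the shared-service-account situation was the trap.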
So when the mutation rate started going up — in the happy state it's not a problem at all, because there are not that many mutations to ServiceAccount objects, but in this unhappy state it was a problem — Calico's in-memory controller queue, which is based on the Kubernetes client-go machinery, got so backed up that it starved the important updates, and those important updates were new pod creations. Although Calico was not supposed to be doing any route management for us — the AWS VPC CNI was supposed to be doing that — there was a bug in Calico that did not read the configuration correctly. The configuration asked Calico not to do any route management, but the bug kept the configuration from being applied correctly, and we hit that bug. It's now a known bug in both the AWS VPC CNI and Calico communities; we have patched it in our local implementation, and there is some upstream work going on right now.

So Calico would still do route management, but it was starved of new pod updates, so it wouldn't know anything about the pods. The CNI would come in, do the IP assignment and all the IP management, and set up the routes. Calico, not knowing anything about these new pods because it was so backed up, would just go and delete the routes it didn't recognize, and that would cause a total network blackout for a few minutes. Eventually Calico would catch up, anywhere between two and ten minutes depending on the node and what else was going on there, and the routes would come back and the network would work again.

So that's the first interesting incident. I'll switch over to the second incident now, and I'll talk about all the lessons learned together at the end. This other interesting incident happened a while back.
I don't exactly remember when. We scaled up our systems in preparation for something: we scaled up our Kubernetes clusters, applications, everything, for an event we were anticipating. Then we did a production freeze: we said, we are anticipating this event, so we are not going to roll out any new software. Then the event happened, and we wanted to scale down, because it did not make sense to stay at that scale and spend the money. So we started scaling down, and all the changes that had backed up during the production freeze started going out and getting deployed. As we did this scale-down, API servers started dying. An API server would die, get restarted, stay healthy for a bit, and die again. We were losing the control plane.

And this is an interesting situation, because some of our runbooks were based on running kubectl commands when incidents happen, and kubectl has to talk to the API server to run those commands. If the API server is dying, you are in a very difficult spot, because you can't even get basic information like what nodes are running, what pods are running, etc. We had other runbooks for going directly onto the nodes and looking at things, and that's basically what we used, but we couldn't use a lot of our other runbooks. Again, fortunately, this goes back to the core Kubernetes tenet: the control plane was down, but the data plane should continue to function. So there was no customer impact again.

So we continued doing what we did in the previous incident. Since the API server was the problem, we went to the API server dashboard, and these two are literally the first two graphs we see on it: CPU usage and memory usage. And again, same story, right?
The moment the CPU and memory usage start going up correlates with when the incident started. We had a bunch of theories. We started looking at logs again, like in the previous incident, and there was this particular log line in the API server logs. Again, we run self-managed clusters for exactly these reasons: we want access to these kinds of things during incidents, and because it's our self-managed cluster, we have access to all these logs. So we went and started looking, and there was this particular log line that said something about the audit buffer. This is for the audit logs that the Kubernetes API server generates. This threw us off a little bit; it was a red herring. We have an in-house implementation of the audit logging webhook that ships the audit logs to our logging system, so we started thinking that implementation was buggy, that there was some problem with it, that it wasn't receiving the audit logs, and that they were getting backed up in memory in the API server.

We started reading the API server code — I'm personally familiar with it because back at Google I worked on Kubernetes and the control plane — and we realized that the Kubernetes API server does not do any circuit breaking when the audit webhooks misbehave. We had that theory, but it did not fly for long. We were also looking at kubelet logs, and we saw a lot of ConfigMap and Secret mounting log lines there, but that was another theory that did not work out. And then, because the usage had increased on the API servers, we said: well, let's throw a little bit more money at the API servers and make the API server nodes larger than they are right now.
There is a cultural thing on the team: "embiggen" is the term we use for these kinds of things; it comes from The Simpsons. So we said, well, let's embiggen the API servers and maybe it will make the problem go away for now. It did not make the problem go away; it made it less severe. The API servers were still dying, but not as frequently. Then we kept looking into the API server logs, and we saw that we were getting HTTP 410 errors. HTTP 410 Gone is a well-documented semantic in the Kubernetes documentation. I won't read out the entire thing, but before the TL;DR, a little bit of conceptual introduction.

When controllers do a list followed by a watch, they have to get the watch events. In order to send those watch events to all the controllers or clients, the API server keeps a window of the historical versions, the changes, so that it can send them to all the clients. That is stored in an in-memory cache in the API server called the watch cache. There is a debate about whether it should be called the watch cache, whether the name is a misnomer, but that aside, there is a watch cache. If for some reason the API server is not able to send watch events before the objects are removed from the watch cache, the API server gives up and sends an HTTP 410 Gone error status code to the clients. At that point the clients, including the controllers, are expected to restart their watches by doing a list operation, a relist. And remember what I said earlier: lists are super, super expensive.

So we were seeing 410s. To give a pictorial overview again: controllers do a list operation on startup, then they begin watching, and the API server returns the changes. Occasionally the API server can return 410 Gone.
It's not a problem But if it starts happening Frequently it is a problem and as we started looking at the cubelet logs It started becoming increasingly clear to us that the client in this case or the controller in this case was actually Cubelets and because the controller our clusters had thousands of nodes thousands of cubelets right thousands of cubelets were getting HTTP for 10 gone and Then they had to restart their watches by doing this expensive realist operation against the API server. So And this happened this actually happened when we were trying to either run large load tests or when we were trying to redeploy or update the versions of larger services that we have and Because we were in this production freeze and then now we were in the top here There was this bug that was introduced in our frameworks where the number of parts that were created for some of these larger services were in the order of Thousands in a matter of minute. So this our systems would create thousands of parts in a matter of minute. Oh my god, that's not good. 
I can come back to this later. Because the number of pods getting updated was so high, the watch cache would have a very high miss rate, as in, the historical versions would go stale really, really quickly, because the resource versions for given pods were being updated at a very high rate. Every client, in this case every one of thousands of kubelets, was getting 410 Gone because the historical versions were no longer in the API server and they had missed their watch events. Every kubelet was then trying to do a relist followed by a watch, which further overloaded the API servers and sent them into a downward spiral, and eventually to their deaths. So those are the two incidents. Here are the takeaways, and not just from these two incidents; the takeaways are mostly from all the production outages or incidents that we have run into. Hopefully they are some inspiration for you. The number one takeaway for us is that observability is the key to operational success. There are three dimensions to observability: metrics, logs, and, in the Kubernetes case, audit logs. I'll talk about all of them.
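For the metrics dimension: Kubernetes components serve metrics in the Prometheus text exposition format on their /metrics endpoints. As a small illustration of working with that output, this stdlib-only sketch extracts the distinct metric names from a scraped payload; the sample payload is fabricated but follows the real format.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// metricNames returns the sorted set of metric names found in a Prometheus
// text-format exposition payload, skipping HELP/TYPE comment lines.
func metricNames(exposition string) []string {
	seen := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the first '{' (labels) or space (value).
		end := strings.IndexAny(line, "{ ")
		if end == -1 {
			end = len(line)
		}
		seen[line[:end]] = true
	}
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	sample := `# HELP apiserver_request_total Counter of apiserver requests.
# TYPE apiserver_request_total counter
apiserver_request_total{verb="LIST",code="200"} 1027
apiserver_request_total{verb="WATCH",code="200"} 3
process_cpu_seconds_total 42.5
`
	fmt.Println(metricNames(sample))
	// prints [apiserver_request_total process_cpu_seconds_total]
}
```

Dumping the name inventory like this is one cheap way to see just how many series a component exports before deciding which ones deserve a dashboard panel.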
There is tracing as well, but I won't get into tracing, because the upstream support for tracing of Kubernetes components has not been great so far. There is a KEP right now, and I know David Ashpole and others are working on bringing tracing into the Kubernetes components, but because there was no prior art at the time, we haven't done a lot of work on distributed tracing in the Kubernetes components or in the extensions and controllers that we have written. We hope to do that as the upstream work lands. Leaving that aside: metrics. Kubernetes components export a ton of metrics, tons and tons of them. Our lesson is: scrape them all, because you never know which metrics you will need. That said, it's not practical to build dashboards out of all those metrics, because there are just so many of them; if you put all of them in, your dashboards mean nothing. So it takes a little bit of training, or an exercise, to understand which are the most meaningful graphs or panels to put on the dashboard. In our case, having the TCP ping latency and the connection drop rates at the top for the first incident, and the API server CPU and memory usage as the top metrics on the API server dashboard, really, really saved us, because those were the first things we saw when we went to those dashboards. So structuring your dashboards in a way that makes these kinds of important metrics more visible and accessible is really important. And because you never know what you will need, have the ability to construct these graphs ad hoc. If you're using Grafana, it has a really nice way of building ad hoc graphs for all sorts of metrics, so if you're scraping everything, you should be able to build these graphs ad hoc. Coming to logs: logs from Kubernetes components are really useful, at least in our experience. But if you're reading the logs only from the logging system without actually looking at the code, it's quite a feat, because
these are logs written by someone else, the author of the code, the developer. So you need to train yourself to understand what those logs mean in order to understand what the system is doing. I think everybody who has operational experience, not just with Kubernetes, knows about this. So our lesson learned here is: train your teams to make sure they understand what the logs from these particular components are saying; that muscle memory has to be built within the team. API server logs haven't been that useful for us; generally they're noisy and haven't been useful in practice. Don't get me wrong, I have contributed to the API server myself, so I understand how the code is written. At least it feels to me like the API server logs are written more for developers debugging the system than for operators of clusters. So we generally don't turn to the API server logs as the first thing. But that aside, having a good log querying and filtering system, or an aggregating UI, comes in really handy. Coming to audit logs: this is where things are super helpful as far as the API server is concerned. Kubernetes exports audit logs, and they fill the gaps left by the debug logs that the API server provides. The audit logs haven't just helped with security stuff for us.
It has also been an extremely powerful debugging tool. Audit logs tell you who made the calls, how frequently the calls were made, and at what times. This is how we found that there were tons and tons of relists happening, and that those relists were being made by the kubelets during the second case study. Have a similar aggregating, querying, and filtering system or UI for the audit logs as well, because it matters: you need to understand how these audit logs work, and you need to be able to filter and view them during incidents. So that's the first takeaway. The second takeaway is around change management and having visibility into the changes that go into the system. Ensure that you have some way to see all the changes pushed into your cluster, whether that's native Kubernetes cluster components, extensions and controllers, or other things that you write, and have some mechanism to surface all the change notifications somewhere. It could be a very simple system; just throwing out ideas here, it could be as simple as posting a summary of the changes that just rolled out to a Slack channel. But have a dedicated channel where you can just scroll and see what happened, and have the ability to construct a timeline of changes, because things generally break because something changed. If nothing changes, nothing breaks, right? And the Kubernetes extensibility model, as I was saying, is very powerful, but it's also a double-edged sword. It's not just you, the Kubernetes operators, who can write these extensions; anybody who is using Kubernetes can write them.
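The audit-log analysis that surfaced the kubelet relists can be sketched roughly like this. Kubernetes audit events are JSON objects that include, among other fields, a verb and a userAgent; counting LIST requests per user agent across a log is only a few lines of Go. The sample events below are fabricated but shaped like the real schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// auditEvent holds just the two audit-event fields we aggregate on.
type auditEvent struct {
	Verb      string `json:"verb"`
	UserAgent string `json:"userAgent"`
}

// listCountsByAgent counts "list" requests per user agent in a newline-
// delimited JSON audit log, skipping lines it cannot parse.
func listCountsByAgent(auditLog string) map[string]int {
	counts := map[string]int{}
	for _, line := range strings.Split(strings.TrimSpace(auditLog), "\n") {
		var ev auditEvent
		if err := json.Unmarshal([]byte(line), &ev); err != nil {
			continue
		}
		if ev.Verb == "list" {
			counts[ev.UserAgent]++
		}
	}
	return counts
}

func main() {
	log := `{"verb":"list","userAgent":"kubelet/v1.24"}
{"verb":"watch","userAgent":"kubelet/v1.24"}
{"verb":"list","userAgent":"kubelet/v1.24"}
{"verb":"list","userAgent":"kubectl/v1.24"}`
	fmt.Println(listCountsByAgent(log))
}
```

In an incident like the one described above, a disproportionate list count under the kubelet user agent is the smoking gun for a relist storm.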
And that means anybody who's writing these extensions can make mistakes. So ensure that you work closely with your partner teams when they write these extensions: understand the use case, and help them build change-tracking systems similar to yours so that you have visibility into their changes as well. Finally, the last thing here: don't be afraid to read the code. My personal take is that code does not lie. There might be bugs in the code, but when you read the code, you'll see the bugs too, if you're paying attention. Documentation can be out of date or have typos, a bunch of things, but code is exactly what is running in your system. If you're running open source components, at least make sure you have the code checked out at the version that you're running. Don't be afraid to read it. Kubernetes is a very large code base, I understand that. I did some analysis of the API server back when I was at Google, and just the API machinery was around 1.1 or 1.2 million lines of code back then, excluding all the controllers and all the other pieces in Kubernetes. So it's a large surface area; not everybody will become an expert in everything in Kubernetes, it's just not practically possible. But we have tried to do this, and my recommendation would be to build expertise in each of these areas within your team. The areas could just follow the SIG model that Kubernetes has: for SIG Network and so on, you have an analogous expertise group within your team. With that, I'll wrap up.
This is the last slide. So yeah, build expertise, and make sure that the people in those areas are familiar with the code for the components they own, both internal and open source. Also ensure that the people within those groups are comfortable with the debugging tools for their area: tcpdump, ip, and so on for the networking folks; flame graphs and the like for the other components. With that, that's the end of my talk. Thank you. I went three minutes over, but I'm happy to take questions. I'll be around, or if people have questions here, I'm happy to answer now as well.