Hello and welcome to our talk on building resilient services on Kubernetes, and thank you for being resilient and staying till 5 p.m. for the last talk of the day. My name is Anusha Raghunathan, and with me is Todd Akinstam, and we're both principal software engineers working at Intuit. Now, one of the common questions that we get asked is: can I run my business-critical applications on Kubernetes? And is resiliency on Kubernetes a myth? To which I say, of course you can run your business-critical applications on Kubernetes. You will see that we run a lot of our critical applications at Intuit on our Kubernetes-based platform infrastructure. And no, it is not a myth. It is quite possible to eliminate most infrastructure-related issues and run highly resilient and available services on Kubernetes. Now, this is just a 25-minute talk, so we've chosen the top few concerns that we've seen over the years from our service developers, and we're going to share some best practices with you. We'll start with a background of Intuit and its infrastructure at a glance, and then jump straight into health checks, specifically with respect to load balancers. We'll talk about gotchas around pod terminations and pod startups, talk about pod disruption budgets and how you can use them to limit pod disruptions, and finish off with some takeaways. Now, Intuit is a global fintech company that builds several financial products and services for you. So if you have ever used QuickBooks for accounting and payroll, or TurboTax for tax prep, know that it's actually running on our Kubernetes-based platform infrastructure. Here are some numbers to show the scale at which we operate. The one thing that I want to highlight is that we serve about 7,000 internal application developers, and all of them run various different things on Kubernetes. Todd and I are part of the Kubernetes platform team that manages this infrastructure. Now, what do these application developers care a lot about? As far as language is concerned, we primarily focus on Java Spring Boot, and so in this talk we will cover a lot of Spring Boot-specific details, including things that we do during pod startup and termination. Having said that, this is not specific to Java Spring Boot alone; some of these best practices can be applied to other languages like Golang or Python as well. As far as cloud providers go, we are an AWS shop, and so you will hear us throwing around words like application load balancers (ALBs), AMIs, and AZ rebalancing. Again, this is not just an AWS-specific problem or best practice; you could apply it to your cloud provider of choice or any other infrastructure that you might be using. What's not in scope for this talk are client resiliency patterns. We expect that clients will do the right thing and use best practices such as retries, timeouts, caching, warming, you name it. That's a topic for another day. Today, we will be focusing on server resiliency patterns, specifically how you protect your servers from known and unknown infrastructure changes with respect to Kubernetes. Let's talk about health checks and load balancers. Before we delve deeper, I wanted to give an overview of how ingress traffic flows into pods and the setup that we have. An application developer wants to reach an endpoint that is being exposed by their service. In this example, there are three pods backing the service, and the service is fronted by an ingress object.
And we have the AWS ALB and the ALB target group fronting this from the cloud provider side. Notice that the ALB target group has the pods registered as targets so that traffic can actually flow to the pods. And there is the ALB ingress controller, a cluster add-on that runs on all of our Kubernetes clusters and does the delicate dance of orchestrating traffic and registering and deregistering pods with the ALB target group. Now, we all know that just because a pod is up and running doesn't mean it's ready to accept traffic. The pod might have to set up database connections, warm up its cache, and do a lot of housekeeping before it actually becomes ready to take traffic. So what are some of the issues that we've seen with respect to pod readiness gates and readiness probes? Readiness probes are fairly common knowledge: you can add one to your pod spec and you expect it to return a 200. In Kubernetes, a passing readiness probe means the pod is ready to take traffic, and any failure to pass the readiness probe means that after a certain amount of time the pod is taken out of rotation; once it becomes ready, it is put back in rotation. But what is just as important are pod readiness gates. This is a primitive that, in our case, is specific to the AWS ALB; you could have your own implementation for a different load balancer. The general idea is that pod readiness gates are needed to achieve zero downtime during a new application rollout. When you have a new version of your application being deployed, the time it takes for that set of pods to come up is a lot less than the time it takes for the same set of pods to be registered by the ALB ingress controller with the ALB target groups. Why is this a problem? The rollout runs ahead of the registration: the new pods end up in a state that is not yet healthy in the ALB target group, such as initial or draining, but traffic is being sent to the load balancer on the assumption that these pods are ready, and this ends up causing a service outage. Pod readiness gates solve precisely this problem, and the way they solve it is by not deregistering the old pods before the new pods have been registered and are healthy. So this is a gate to make sure that you don't get a service outage. The way the AWS ALB controller implements this is with an annotation, as you see here: a key-value pair applied to your namespace, and your controller has to respect it and implement whatever is needed in the background. So again, the key takeaway: look at your load balancer, how it is directing traffic and what it tolerates, and implement your solution accordingly.
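To make that a little more concrete, here is a minimal sketch of what the namespace opt-in might look like. It assumes the AWS Load Balancer Controller's readiness-gate injection mechanism, where the key-value pair is expressed as a namespace label; the namespace name is a placeholder, and you should check your controller version's documentation for the exact key.

```yaml
# Hypothetical namespace opted in to ALB pod readiness gate injection.
# With this label in place, the controller injects a readiness gate into new
# pods in this namespace, so a rollout only proceeds once the pods are
# registered and healthy in the ALB target group.
apiVersion: v1
kind: Namespace
metadata:
  name: payments                  # placeholder namespace
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
```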
So here are some of the problems and solutions we've seen with readiness probes and readiness gates. If your pods are not in a ready state, go check your readiness probes. Typically this is a pretty straightforward check: look at your readiness health checks and make sure they are passing. New pods never become ready: it's not that all pods are bad, only the new pods are bad and the old pods are just fine. This could be a case where the readiness probe for the new pods has changed and there is a problem with it. Again, go look at the new pods' readiness probes and make sure they are passing. Pods are restarting unexpectedly: here's where I want to highlight liveness probes. Liveness probes are different from readiness probes in that liveness probes check that your pod is alive, and if the kubelet determines that your pod is not alive, it's going to restart the pod with the expectation that it will come back alive. This is different from readiness probes, where Kubernetes will just isolate the pod and take it out of rotation from serving traffic, and it will rejoin the rotation once it becomes ready again. What we've seen, at least a hint of it, is that developers just use the same probe for both liveness and readiness without understanding the difference, and end up shooting themselves in the foot. So our recommendation is: hey, go remove that liveness probe and rely on better checks to make sure your pod is working well, or, in your case, you might just fix the liveness probe so that it is written well; we'll show a small sketch of how the two probes can differ right after this list. Pod is in crash loop back off: this is mostly an application issue from what we have seen, so go inspect your container logs and fix your app issue. New pods are slow to take traffic from the ALB: this is a variation where old pods are fine but new pods aren't. Again, it could be something related to a new rollout; check your app code. And finally, pods are bouncing in and out of ALB target groups, which means a pod will be ready at one point and then not ready, and it just keeps flipping back and forth. We've noticed that this is mainly an issue during a gate rush situation, where the pod is taking a lot of traffic and then gets overwhelmed, runs out of threads, and becomes not ready. Then it gets taken out of rotation, it is able to drain its requests, becomes ready again, and is put back in rotation, where it again gets overwhelmed and is taken back out, so it goes back and forth. Todd will explain this situation in detail, but in general the issue is that the pods are not pre-warmed, and you need some good pre-warming strategies to take care of it.
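Here is the sketch mentioned above: a minimal, hedged example of how readiness and liveness probes might be configured differently, so that a dependency hiccup takes the pod out of rotation without also getting it restarted. The endpoint paths, image, and timings are hypothetical.

```yaml
# Hypothetical pod: readiness and liveness are not the same check.
apiVersion: v1
kind: Pod
metadata:
  name: payments-api                   # placeholder names throughout
  labels:
    app: payments-api
spec:
  containers:
    - name: payments-api
      image: example.com/payments-api:1.0.0
      ports:
        - containerPort: 8080
      readinessProbe:                  # "can I serve traffic right now?"
        httpGet:
          path: /health/ready          # hypothetical endpoint; may check caches, DB connections
          port: 8080
        periodSeconds: 10
        failureThreshold: 3            # failing this only removes the pod from rotation
      livenessProbe:                   # "is the process stuck beyond recovery?"
        httpGet:
          path: /health/live           # hypothetical endpoint; cheap, no downstream dependencies
          port: 8080
        periodSeconds: 20
        failureThreshold: 6            # be conservative: failing this restarts the pod
```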
Let's look at pod termination and the gotchas around it. So why do pods terminate? Well, in Kubernetes you almost have to expect pods to terminate, mainly because a lot of normal operations bring down pods, bring down nodes, bring down clusters. So as a service owner, you have to expect your pods to go down. Some of the common reasons a pod would go down: during an application rollout, you have a new version of your pod, the old versions go down and the new version comes up. Application scale-down: your horizontal pod autoscaler might determine that for the current load you don't need as many pods. You might have started with, let's say, 50 pods and then realized you don't need that many, and it scales down to maybe 10, so those 40 pods get scaled down and terminated. During cluster upgrade operations (this is definitely one of those love-hate relationships that Todd and I have), every few weeks we have to rotate out our 315-plus clusters because of compliance reasons, AMI rotations, Kubernetes cluster upgrades, new features that we built into the platform, what have you. This will bring down pods and nodes and terminate a whole bunch of things, so that's again a normal, routine maintenance operation you have to account for. And finally, cluster scale-down operations: the cluster autoscaler might evict pods to consolidate them onto fewer nodes, and that can bring down pods as well. Now, let's look at the pod termination lifecycle to understand what could go wrong if you don't have the right checks and balances. There is a request to delete a pod, and after some point in time a SIGTERM is sent to the pod. Then, after some more time, a SIGKILL is sent to the pod, and finally the pod gets deleted. In the time between the request to delete the pod and the SIGTERM, you have the ability to add what is called a pre-stop hook. The pre-stop hook is part of the pod specification, and you can execute handlers there. What we recommend people do during this pre-stop hook is treat it as a cool-down period for your ALB traffic; in other words, stop taking new traffic and start draining the in-flight requests that you have. Then there is the time between the SIGTERM and the SIGKILL where you can handle your graceful shutdown. If you are writing a serious Kubernetes service that maintains persistent database connections, WebSocket connections, and so on, this is a good time to close those connections and drain those requests. And finally, the pod gets deleted. The one thing that I want to call out is that at some point between the SIGTERM and the SIGKILL, your pod's IP is still in use and there may still be in-flight requests hitting it, so you want to make sure you drain those requests and then exit gracefully. If you need more time in your service to perform those operations, know that the graceful shutdown period is 30 seconds by default, so the kubelet will send the SIGKILL 30 seconds after the SIGTERM. If you need more time to take care of your in-flight requests, you can override that using the terminationGracePeriodSeconds option that I will show in a bit, and you will still be able to gracefully shut down your pod. What if you don't do this? Well, your clients will end up getting 500s, and you don't want that. So here is what we do at Intuit. We have what is called the Java starter kit, which is basically a wrapper for Spring Boot services that are vended out of what we call the paved road or golden path. You don't need to do exactly what we do, but in essence the wrapper makes use of a Spring Boot feature, server.shutdown=graceful, to implement a graceful shutdown in your Java Spring Boot service. What this does is make Tomcat or Jetty stop accepting new requests while in-flight requests are drained, and there is also a way to override the wait time here. If you are not a Java or Spring Boot shop, you can still handle your own SIGTERM handler in your service code. And for that, again, this is the key I was talking about: terminationGracePeriodSeconds. You can add it to your pod spec if you want to buy more time from the kubelet to perform your graceful operations. Let's say you have Golang code; you can add a SIGTERM handler there too. And finally, adding a lifecycle pre-stop hook in your pod spec will help you get that cool-down period, like I spoke about earlier. I'm going to quickly show an example of how you actually do this in your pod spec with the lifecycle pre-stop hook. In this case, it's just doing an exec of a sleep command, buying some time so that you have the cool-down period for the ALB traffic, and the graceful shutdown period is slightly more than that.
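Roughly, that pod spec might look like the sketch below; the durations are purely illustrative, and the only real requirement is that the termination grace period comfortably covers the pre-stop sleep plus your drain time.

```yaml
# Illustrative pod spec: pre-stop sleep as an ALB cool-down, with a termination
# grace period longer than the sleep plus the in-flight drain time.
apiVersion: v1
kind: Pod
metadata:
  name: payments-api                    # placeholder names
spec:
  terminationGracePeriodSeconds: 60     # kubelet waits this long after SIGTERM before sending SIGKILL
  containers:
    - name: payments-api
      image: example.com/payments-api:1.0.0
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "20"]    # keep serving while the ALB deregisters this pod
```

And the Spring Boot side mentioned a moment ago is just a couple of properties; shown here as an application.yaml sketch, with the timeout property taken from Spring Boot's lifecycle settings.

```yaml
# Illustrative Spring Boot application.yaml for graceful shutdown.
server:
  shutdown: graceful                    # Tomcat/Jetty stop accepting new requests; in-flight requests drain
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s     # how long to wait for in-flight requests before giving up
```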
Here is an example of Java code implementing an addShutdownHook handler so that you can perform your own custom SIGTERM handling and graceful shutdown. So that was health checks, load balancers, and pods going down. But what about pods coming up? For that, I'd like to hand it over to Todd. Yeah, thanks Anusha. Just like pods must be terminated in a resilient manner, it's also important to introduce new replicas of your application correctly; otherwise you can have resiliency problems with the new pods. Usually this all revolves around the pod being sent traffic, or too much traffic, before it's truly ready to receive it, and it can manifest itself in a number of different problems. New pods must only pass their health checks and readiness probes when they are truly ready to begin handling load. However, there are some cases where an application may be ready to accept some requests but may not be prepared to handle its full share of the load. In some cases, it may be possible for the application to synthetically pre-warm itself. For example, a Java application may implement custom class loaders to load the required classes proactively. In addition, an application with a local in-memory cache might restore or pre-populate that cache at startup time instead of lazily when it receives cache misses. Applications may need to establish database connections or connection pools with multiple databases when the pod first starts, and establishing these connections may take some additional time based on the load of the database and other factors. If only a few connections are available in the connection pool when a pod starts and begins servicing heavy load, some of the requests may be blocked while waiting for a connection. You can also generate synthetic load using the postStart container lifecycle hook to send no-op or internal loopback requests back to your pod, and those requests themselves will pre-warm the pod. A common misconception is that you can just increase the initialDelaySeconds of the readiness probe to address these kinds of warming issues. The problem is, unless the application is performing proactive pre-warming like what I've just described, it doesn't help, because the actual servicing of requests is what warms up the service. Without any load to trigger classes to be loaded or caches to be populated, the application will never really get warm, no matter how long you wait on that readiness probe. If you can't easily pre-warm the application actively, then what's needed is to allow new pods to be registered as targets and pass the readiness probe, but only begin receiving a relatively small number of requests at first to allow this warming up. For the ALB, this is called slow start mode, where the ALB will go ahead and register those targets but ramp up requests, slowly incrementing the level of load to the new targets in the target group. Again, your load balancer may have similar functionality. This might cause some requests to have high latency, but those requests will warm the application, and subsequent requests will be faster.
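As a rough sketch of what enabling that could look like with the AWS Load Balancer Controller, slow start is set as a target group attribute on the Ingress. The annotation key and attribute name below are from that controller, but the duration and resource names are hypothetical, so verify the exact syntax against your controller version.

```yaml
# Illustrative Ingress enabling ALB slow start for newly registered pods.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payments-api                   # placeholder names
  namespace: payments
  annotations:
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/target-group-attributes: slow_start.duration_seconds=120
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: payments-api
                port:
                  number: 80
```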
Now let's talk about pod disruption budgets. Using a pod disruption budget allows an application to specify how many replicas must remain in service at any given point during voluntary disruptions. For instance, this is very useful during cluster upgrades, which can be disruptive as nodes may be cordoned, drained, and removed during the process, resulting in pods being evicted, terminated, and then rescheduled on different nodes. Some common problems that we've seen with PDBs at Intuit: sometimes the PDB is either missing or misconfigured in a way where it doesn't even select any valid pods. This is a problem where the label selector is incorrect, and in that case it's just as if there were no PDB, and really any pods of the application could be disrupted at any time. The other common problem is a PDB configured for an application that's in a continual crash loop back off state. That crashing pod actually counts as a disruption, so other pods will not get terminated, and if you are trying to do cluster upgrades or other maintenance activities on the cluster, it can block those. The other case we've seen at Intuit is multiple PDBs configured to target the same pods, so you essentially have two PDBs for the same application. This is also bad, because both of those PDBs evaluate the situation independently, they end up fighting each other, and they don't allow cluster upgrades and other activities to go forward. So what are some of the best practices for PDBs? First, obviously, create alerts for pods in crash loop back off and quickly fix those problems; that usually means something is wrong with your code. You shouldn't let pods stay in crash loop back off for a long period of time, especially in a production environment. If you don't really care about those pods, if they're still under development, then go ahead and delete the PDB or delete the deployment object for the application itself, because otherwise you might be blocking some critical operations. You also want to make sure the PDB is properly configured with the correct selector and has a reasonable maxUnavailable setting. In general, I recommend using percentages instead of specific replica counts. That's because as your application starts scaling up, if you get more customers and you have more and more replicas, you might forget to also increase the absolute number of disruptable replicas in your PDB. Using a percentage allows the PDB to grow along with your app. So I'd say keep the PDB configuration simple: don't specify both minAvailable and maxUnavailable, it'll just make things confusing. Pick one strategy and go with that.
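Putting those recommendations together, a minimal PDB might look like the sketch below; the names are hypothetical, and the 25% figure is just one example of a percentage-based maxUnavailable.

```yaml
# Minimal PDB sketch: one PDB per application, a selector matching that app's
# pods, and a percentage-based maxUnavailable that scales with the replica count.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api                   # placeholder names
  namespace: payments
spec:
  maxUnavailable: "25%"                # at most a quarter of the replicas can be evicted at once
  selector:
    matchLabels:
      app: payments-api                # must match the Deployment's pod labels
```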
So now I'd like to cover a few takeaways from the talk. The first is that you really can eliminate most infrastructure 500 errors by following a few resiliency best practices. One, which we're not covering in this talk but which you should really do, is to use common client resiliency patterns like retry and circuit breaker. There are lots of resources on how to do this, and that's really the first step: if your server is also a client to other dependencies, you need to implement those patterns so your clients are resilient too. You need to understand the pod creation and termination lifecycle and use readiness probes and readiness gates to control the traffic to your pods. Also realize that pods can be terminated at any time, so you need to handle SIGTERM and gracefully shut down those pods, clean up after yourself, and if needed use the termination grace period and a pre-stop hook. You also need to realize that there can be a gate rush situation, especially when you're under heavy load and trying to scale up new pods, so you need to think about the resiliency of new pods: reducing pod startup time as much as possible will definitely help, but you should also use techniques like slow start and pre-warming to avoid new pods getting more load than they can handle. And lastly, use pod disruption budgets to protect your pods and make sure that you always have as many replicas as needed to serve your application. That's it for our talk.