Thank you for joining us for this talk on health checks. We'll begin with quick introductions. I'm Venil Noronha. I work on the service mesh platform at Stripe. I've been a contributor in the Istio and Envoy communities, and I really enjoy distributed systems. My name is John Murray. I'm an engineer on the service networking team at Stripe, building our internal service mesh. I'm an occasional Envoy contributor when the need arises, and outside of work I'm a C++ enthusiast. In this talk, we'll cover a conceptual overview of health checks, the options within Envoy for health checking, how we perform health checks at Stripe, some problems we've encountered at scale, and finally our own thoughts on health checks.

Let's begin with the conceptual overview of health checks. What is health checking? To understand that, let's take an example of two services, service A and service B. Service A simply makes a request to service B to get a cat GIF, and if service B is down, the request results in an error. That error goes unnoticed until several such requests fail and someone notices a spike in errors on a dashboard somewhere, or someone gets paged for the high error rate. This is pretty bad, because the time to recovery from a failure scenario is quite high in this system, and several requests will have failed in the meanwhile.

What can we do better? We can introduce a new component called the health checker, which does two things. First, it queries the status of the server, the one hosting service B in this case, and then it reports that status to the client, service A. Service A can then decide whether to actually make a request to service B, if it's healthy, or hold back and act differently if service B is down.

This pattern is very useful when you have redundancy in the system. Let's see how. In this example, we have three instances of service B, two of which are down and only one of which is healthy. Because we now have the health checker in the system, it knows exactly which instance is healthy and which ones are not, and it reports this information to service A. Service A can now make a wise decision about where exactly to route the request, and because it routes only to the healthy service B instance, it gets the correct response on the first try.

Now that we know what health checks are, let's look at the options within Envoy for health checking. There are mainly three: active health checks, passive health checks (also known as outlier detection), and externally sourced health.

What are active health checks? Active health checking is where you configure Envoy to make explicit calls to upstream services to figure out whether they're healthy. These calls are made in addition to the regular data plane traffic that the Envoy is servicing. On the right, you see an example of how to configure health checks within Envoy. In this example, we've configured Envoy to call mycluster.service.envoy every five seconds at the health check path. If the request from Envoy fails with a 5xx error, or times out after four seconds, Envoy considers that upstream unhealthy.
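In Envoy's configuration, the active health check just described might look roughly like the sketch below. The cluster name, five-second interval, four-second timeout, and 5xx behavior come from the talk; the endpoint address, port, thresholds, and the /healthcheck path are illustrative assumptions.

```yaml
clusters:
- name: mycluster.service.envoy
  connect_timeout: 1s
  type: STRICT_DNS
  load_assignment:
    cluster_name: mycluster.service.envoy
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: mycluster.service.envoy, port_value: 8080 }
  health_checks:
  - interval: 5s             # probe every five seconds
    timeout: 4s              # a probe taking longer than four seconds counts as a failure
    unhealthy_threshold: 1   # assumed: failures before a host is marked unhealthy
    healthy_threshold: 1     # assumed: successes before a host is marked healthy again
    http_health_check:
      path: /healthcheck     # assumed health check path
```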
The next option is passive health checks, also known as outlier detection. In this case, Envoy doesn't have to make extra calls to upstream services to figure out whether they're healthy. It relies on the regular data plane traffic, inspecting properties of each request and response, such as response codes, latency, and other attributes. On the right side, you see example configurations. The first says that Envoy should consider a host unhealthy if it returns 20 consecutive 5xx responses; Envoy then ejects that host from the healthy pool of hosts for two minutes. The second configuration is for detecting timeouts and connection issues: Envoy considers a link to a service unhealthy after five consecutive locally originating errors. (Sketches of both settings follow below.) The primary difference between the two approaches is that active health checks require Envoy to make extra calls alongside the regular data plane traffic, whereas with passive health checks Envoy doesn't use any extra network bandwidth to figure out whether an upstream is healthy.

Let's now look at the final option, externally sourced health. In this case, Envoy doesn't use any traffic between itself and the service to determine the status of upstream services. Instead, Envoy relies on the xDS server to inform it about the status of each upstream. An external health checker component knows the exact status of each upstream service and feeds this information to the xDS server, and Envoy, using the EDS mechanism, that is, the endpoint discovery service, learns whether a given endpoint is healthy. In this example, we've marked the first upstream as healthy, the second as unhealthy, and the third as draining. Envoy will therefore route to the first, healthy instance and refrain from routing to the second or third, because they're unhealthy or draining.
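The two outlier detection settings described above might look roughly like this on an Envoy cluster. The consecutive counts and the two-minute ejection come from the talk; splitting out local-origin errors is an assumption needed for the second counter to apply, and note that Envoy scales the ejection time up on repeated ejections.

```yaml
outlier_detection:
  consecutive_5xx: 20                    # eject a host after 20 consecutive 5xx responses
  base_ejection_time: 120s               # keep an ejected host out of the pool for two minutes
  split_external_local_origin_errors: true
  consecutive_local_origin_failure: 5    # timeouts and connection failures seen locally
```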
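Externally sourced health, in turn, typically arrives through EDS: the xDS server attaches a health status to each endpoint in the ClusterLoadAssignment it serves, roughly as below. The addresses and port here are made up.

```yaml
cluster_name: mycluster.service.envoy
endpoints:
- lb_endpoints:
  - endpoint:
      address:
        socket_address: { address: 10.0.0.1, port_value: 8080 }
    health_status: HEALTHY     # Envoy routes here
  - endpoint:
      address:
        socket_address: { address: 10.0.0.2, port_value: 8080 }
    health_status: UNHEALTHY   # excluded from load balancing
  - endpoint:
      address:
        socket_address: { address: 10.0.0.3, port_value: 8080 }
    health_status: DRAINING    # being taken out of rotation
```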
Let's now look at how we do health checks at Stripe. For that, I'll hand it over to John.

Thanks, Venil. Let's jump into health checking at Stripe. We do active health checking via Envoy, but our setup is a little different from the default active health checks. We use cached values that are proxied via an internal cluster. This means we have a cluster configured on the server-side Envoy that health checks the local application at a fairly frequent interval, and a standardized health check endpoint that Envoy exposes where we return the health status of that cluster. In other words, we never actually pass health check traffic through to our backends; it's all handled within Envoy. Our health checks also aggregate host health with service health. The local cluster performs health checks against the local backend, but we also have a separate set of sidecars and processes that determine whether the server itself is healthy, and our health checks look at that value too when deciding whether to report healthy. This is useful for things like host lifecycle management, where we need to gracefully shut down hosts, or during an incident, when we can simply turn off all traffic to a host; our service mesh is integrated into these systems. We also implement local active draining: if we detect that a host has become unhealthy, we can set some flags in Envoy to proactively drain connections and then fail health checks.
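Stripe's exact setup is internal, but one way to approximate the pattern with stock Envoy is the HTTP health check filter, which answers a standardized health check path from Envoy itself, based on the health of a local cluster, so probes never reach the application. This is a minimal sketch of that idea, not Stripe's actual configuration; the local_app cluster name and the /healthcheck path are assumptions, and local_app would need its own health_checks configured against the local application.

```yaml
http_filters:
- name: envoy.filters.http.health_check
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
    pass_through_mode: false            # answer probes in Envoy; never forward them
    headers:
    - name: ":path"
      string_match:
        exact: /healthcheck             # assumed standardized health check path
    cluster_min_healthy_percentages:
      local_app:
        value: 100                      # report healthy only while the local cluster is healthy
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```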
Historically, this has worked really well for Stripe. We have some low-volume, latency-sensitive connections. Think cross-region: maybe one data center in the US and another in Europe or Asia. For regulatory reasons, we might have to locate some of our data processing within those regions, but it may periodically need to request something back from a US data center. These connections are very latency sensitive, because the cost of each request is very high, and they're relatively low volume. Having active health checking means we don't have to rely heavily on retries; we generally land on a healthy host because of active health checks. We also have traffic shifting controls. Think blue/green deployments or things of that nature, where hosts go from no or very little traffic to a high amount of traffic in a very short period of time. Active health checking ensures that when we switch over, we don't see the burst of failures we might get if we relied only on passive health checking.

That's all I'm going to say about Stripe's health checking, but I do want to talk quite a bit about problems we've seen at scale, some solutions, and my concerns. An overview of the problems. First, slow time to detection. This is problematic for high-volume connections: if a lot of traffic is flowing and your health check interval is fairly wide, there's a decent amount of time in which failures can happen before you react. Second, high health check volume. You have these shared, foundational services with a ton of downstreams all actively health checking the same upstream, and that volume can add significant overhead on the proxies and backends themselves. And lastly, network costs. Health checks traverse all the paths in your graph, regardless of all the effort and work you've put into optimizing your network for efficiency and cost.

Okay, so let's talk about slow time to detection, starting with an example. We have active health checks set to an interval of five seconds, meaning the downstream polls the upstream for its health every five seconds. A network link, by which I mean a single downstream-upstream pair, processes 25,000 QPS. The potential impact of that server becoming unhealthy is essentially the detection duration in seconds times QPS. So in this case we might see up to 5 × 25,000 = 125,000 failed requests, relying solely on active health checking, before realizing there's an issue. That's a fairly large impact.

So what can we do to mitigate this? One option is to reduce the active health check interval: instead of five seconds, maybe one second. That reduces the worst case from 125,000 failed requests to 25,000. It's not great, but it's an improvement. The drawback is the additional steady-state cost: even when no traffic is flowing, you're sending active health checks at that new interval, which may put more load on the server, and there's only so far you can drive the interval down. A better strategy is to implement passive health checking, that is, Envoy outlier detection. This is faster than active health checking: as we saw earlier, we can look at consecutive 5xx responses and eject a host after, say, 50 failed requests, which at this volume happens very, very fast. It also has lower overhead compared to active health checking, meaning we only pay the cost when actual traffic is flowing from a downstream to an upstream. And if we want a holistic setup, we can still configure it alongside active health checking, to satisfy the traffic shifting scenario we talked about earlier and those low-volume, latency-sensitive links. A sketch of such a combined cluster follows.
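A minimal sketch of that combination, with one cluster carrying both active health checks and outlier detection; the cluster name, address, intervals, and thresholds are all illustrative assumptions.

```yaml
clusters:
- name: upstream_service          # hypothetical cluster name
  connect_timeout: 1s
  type: STRICT_DNS
  load_assignment:
    cluster_name: upstream_service
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: upstream.service.internal, port_value: 8080 }
  health_checks:                  # active: protects traffic shifting and low-volume links
  - interval: 5s
    timeout: 4s
    unhealthy_threshold: 1
    healthy_threshold: 1
    http_health_check:
      path: /healthcheck
  outlier_detection:              # passive: reacts within a handful of bad responses
    consecutive_5xx: 20
    base_ejection_time: 30s
```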
Okay, so let's talk about high health check volume. The typical diagram you see for active health checking is one Envoy checking a single upstream. But what happens if that upstream is a core, foundational service in your platform? In that case it likely has a lot of downstreams, all performing active health checks against each upstream host. So what are some of the issues we've encountered? First, high backend load: think about your backend application actually serving all of these health check requests. This load can impact normal traffic as well if your backend application becomes overloaded. Second, high connection load: we have to have a connection in order to make an active health check, and if we have a lot of downstreams checking a single upstream host, like in that diagram, you're just going to have a lot of connections dedicated to health checking. In fact, the health check load can easily exceed your actual traffic load in terms of the connections created to service it. Third, high CPU load: both the application and Envoy still have to do work every time they make or serve a health check request.

Anecdotally, what are some things we've seen? We've seen a host burning two-plus CPU cores exclusively on serving health check traffic, for one of these central services with a lot of downstreams health checking it. We've had to deploy larger instance types just to account for the overhead of health check volume. We've also observed a general slowdown of Envoy in the past: even with caching implemented, and downstreams hitting only those cached endpoints, Envoy could take 30-plus seconds to respond simply because it was so overwhelmed with health check traffic. And we've seen hosts reach their file descriptor limits because of the number of connections created for health checking, which has a real impact on live traffic as well as a bunch of other functions of the host.

All right, mitigations. What can we do to make this better? One, we can cache the backend health. There are two types of caching here. One is the traditional type, where the response from the backend health check endpoint is cached for some duration of time; once it expires, active health checks are passed through to the application again, and we cache the value again. This reduces the load on the backend application, but we may still see spikes of activity as the cache expires. The other is the internal cluster caching I described earlier, which is what Stripe does; it has better semantics in that it never overwhelms the backend service, and the traffic doesn't spike as a cache expires. A sketch of the pass-through flavor follows.
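For contrast with the internal-cluster sketch earlier, the traditional pass-through flavor might be expressed with the same Envoy filter, roughly as below; the five-second cache duration and the path are illustrative assumptions.

```yaml
http_filters:
- name: envoy.filters.http.health_check
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
    pass_through_mode: true   # probes are forwarded to the backend...
    cache_time: 5s            # ...but the result is cached, so at most one probe
                              # per five seconds actually reaches the application
    headers:
    - name: ":path"
      string_match:
        exact: /healthcheck
```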
Another thing we can do is increase the health check interval: maybe instead of five seconds we make it ten. That reduces our load because it reduces the number of requests we send. The caveats here are that this may violate your business constraints; maybe your traffic shifting controls or your deployment systems rely on that interval value for how they operate, and as we saw in the previous example, it competes with faster time to detection. Next, reduce reachability. Note there's no asterisk here; there's no drawback to this. This essentially means you should be looking at your service graph and the active traffic that is flowing across it, and for links that are defined in the reachability graph but aren't being used, you should prune them. There's really no drawback to making sure your reachability graph is properly pruned. Next, control plane subsets. This is the idea that the control plane delivers only a subset of your service graph to each Envoy, artificially shrinking the apparent size of your fleet to reduce the health check volume being created; each Envoy node will health check fewer upstreams because it simply doesn't know about the others. This, however, makes it very difficult to correct for errors: having the full set of nodes in the reachability graph available to an Envoy is really important for reliability and error recovery. And lastly, we can move to a centralized health checking system. This is the most interesting option because it's the most flexible: we can account for volume and for the interval. With centralized health check nodes, we control the number of nodes performing those active health checks, and therefore the number of connections, so we open far fewer connections to any individual service. We could even drive the interval down to something really small, like one second or half a second, because we also control the volume of both requests and connections. Big asterisk here, though: this is a lot of work if you're going to go this route, and there are likely a ton of hidden landmines with this approach.

Okay, next is network costs. Here you can see we have three different services running in three different regions or availability zones, and that thick bar represents a cost boundary. We can see that we're doing health checks across these cost boundaries. Similarly, if we have another data center, it's likely doing the exact same thing, health checking services across these cost boundaries, and, worst of all, health checking across data centers. This active health check data path really just mirrors your reachability graph, and that can get expensive. It also likely ignores any optimization you've done on your typical data path, whether for efficiency or for cost.

So what can we do? Well, we can increase the health check interval, going from, say, five seconds to ten. Again, this may violate some of your business principles or properties, so it may not work for everyone. You can exclusively use passive health checks, so you're not incurring that additional cost, and they automatically take advantage of all the optimizations you've made for efficiency and cost in your network data path. For Stripe this doesn't work, because we still need active health checks for those traffic shifting controls and for those low-volume, latency-sensitive links. Again, we can reduce reachability; you should do this, it's a definite win with very few drawbacks. Again, we can use control plane subsets: we reduce the number of edges in the reachability graph and therefore reduce costs, though this is still hard to correct for errors. And again, we can move to centralized health checks. The interesting part of this design is that with something like it we could optimize for locality, putting the health check nodes local to a region or data center, and for the health check results, we can do smart things like sharing them across those cost boundaries to reduce the impact.

So what conclusions can we draw from this? First, all strategies are context dependent. Depending on your business use cases and needs, the particular approach, or combination of approaches, you take will really depend on the outcomes you're trying to drive. Another takeaway is that just one strategy may not be enough; some combination of active and passive health checking is likely the best approach for the most holistic health checking strategy. However, it's important to keep in mind that the data path is never 100% safe: retries and hedging will continue to be relevant and useful tools that you should be using. And lastly, let scale drive your design. Don't over-optimize early on; I think you may be surprised, as your organization grows, by the use cases your customers bring to you. And with that, thank you for attending the talk today. I should mention that Stripe is hiring, so feel free to reach out. Now we can open up for Q&A.