Hello, my name is Chris Dutra, and I lead site reliability engineering efforts within the Markets technology organization at JPMorgan Chase. Today's presentation demonstrates how the Istio service mesh, combined with observability add-ons like Prometheus, Grafana, and Jaeger, can provide SREs with rich data to define, track, and manage the SLOs of their applications.

As a refresher, let's review what service level objectives are and why they are useful. SLOs give engineering teams a baseline customer expectation for the services they offer. They're often described as a measure of the customer experience, which makes sense if you think about it. If an app on your phone crashed frequently, would you keep trying to use it, or would you find an alternative? And if you were on the engineering team for that app, would you want to know that the experience was bad, and could you correct it quickly enough to avoid irreparable harm?

There are many kinds of SLOs, most involving availability, as we mentioned above, latency, and correctness. For example, a mobile app's ability to run cleanly without crashing can be thought of as an availability SLO. Similarly, a web page's ability to load and present content, or an API returning 200 status codes rather than 500s, is another way of thinking about availability SLOs.

Latency refers to how quickly a service responds to a user. This one varies a lot based on the tolerances of a given medium, and SLOs are often constructed to meet those demands. Content loading on a mobile phone, for example, is usually held to stricter latency tolerances than a traditional web page, for a variety of reasons: the smaller form factor sets an expectation that content loads quickly, and you're often on the go. If you're trying to catch a train and want to look up the train times, your tolerance for slowness is much stricter than it would be sitting at your desktop waiting for a web page to load.

Another SLO that's often defined is correctness. This one is harder to define and track, because it involves whether you're giving the customer the right data. For example, if you open your platform and your checking account shows the wrong balance, or, unbeknownst to you, you see another user's information that you weren't supposed to see, that's a correctness problem. Likewise, if you buy shares in a particular company and the trade confirmation that comes back isn't yours, or isn't for the securities you wanted to buy, that's also a correctness problem.

Ultimately, all of this ties back to the customer experience. Whether the app won't load, the service takes forever, or the data is wrong, it's a bad experience, and customers will respond negatively.

Let's think about SLOs in a real-world example: going out to eat at a restaurant with your friends or family. You order a dish, but then something happens in the restaurant. Maybe your plate takes over an hour to come out because the kitchen was backed up, and it arrives cold. Perhaps the server delivered your dinner to another table and gave you the wrong dish. Or worse, it never came out at all. Regardless of the outcome, it was a bad experience for you as the customer, and you might vow never to return, or leave a negative review on social media.
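To make those three flavors concrete, here's what hypothetical SLO statements might look like. The specific targets and windows below are illustrative, not taken from any real service:

```
Availability: 99.9% of API requests return a non-5xx response over a rolling 30 days.
Latency:      99% of mobile content loads complete within 500 ms over a rolling 30 days.
Correctness:  99.99% of trade confirmations match the order the customer actually placed.
```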
At the end of the day, the customer doesn't care about what caused the bad experience. Their food wasn't delivered. It doesn't matter whether the server made the mistake or the kitchen was just backed up; they'll simply remember that it was bad. But as SREs, we should care. After all, restaurants are constructed very similarly to how we build our platforms today: an orchestration of services working together to produce a positive outcome. Host staff, servers, kitchen staff, and management work in concert, alongside technology like point-of-sale systems and inventory management solutions. All of these are interlocked and interdependent, and they all need to work together so that patrons get their food, just like microservices do in a large-scale enterprise architecture. If one component is unreliable, the entire experience is at risk, and certain transactions or flows may result in financial, reputational, or other impact.

So understanding that flow, the user journey, is often part of the SLO definition process, and it's where we usually start: detecting dependent services so we can define an accurate SLO. Traditionally, that's not an easy problem, because in large enterprises and very large platforms there are complexities that make it harder to gather enough data to create that SLO.

Part of it is complexity itself. We now benefit from increasingly polyglot services, thanks to the container revolution of the past decade. That's all great, but it makes it more challenging to observe interactions between services, because they're not all written in the same language or built with the same tools. There has been progress in the OpenTelemetry space with distributed tracing, but that requires every application to opt in, which may mean code changes, which is difficult for legacy applications. It can also be hard to pair all the traces and spans together; you end up with a lot of disconnected, asynchronous traces, and it takes significant engineering work to tie those flows together.

And sometimes you don't know who the customer is. You may have many different transactions moving through the same system. Think of Uber or Lyft and the platforms they've described: one service requests a car, while on the other side a driver accepts that request. Different transactions may use the same services at the same time, but in different ways, much like a trading system where a bond trade and an equities trade share some of the same components but follow different flows. And oftentimes your upstream and downstream consumers aren't humans at all. You may be one piece of a hundred- or thousand-step flow, dependent on upstream data, while the entire flow depends on you to provide the level of service needed to make that transaction successful. This is where it gets really tricky.
Fortunately, Istio gives us a way to surface those flows and user journeys and help define those SLOs without much intrusion, and we're going to get into that now. So let's talk about how we can define SLOs using Istio, its observability add-ons, and the metrics it exposes through the mesh.

There are several kinds of tooling that can be installed alongside Istio: Prometheus, Grafana, Jaeger, and Kiali. They all serve different purposes, but each is a key cog in the wheel of gathering the data we need to define SLOs. First, the mesh itself generates metrics: availability, latency, and saturation metrics for mesh endpoints. Think of services in Kubernetes, like your product page calling a downstream service; we get service-to-service metrics we can use to inform SLOs. Second, we have tracing. Remember the opt-in problem with traditional applications? By enabling Jaeger, we get mesh-level tracing: traces and spans are generated for the communication between services in the mesh, without manually setting up tracing or doing any integration work. Istio takes care of that for you (applications do still need to forward the trace headers so spans can be stitched into a single trace, though the sample app we'll use already does this) and provides data that is meaningful, especially from a dependency-diagram perspective. Then there's Kiali, which covers similar ground, giving you near-real-time metrics and helping you understand which service is calling which. The Jaeger UI shows the dependency graphs we mentioned. And Grafana pulls all of this together into dashboards that help drive better SLO creation.

We're going to explore this using our favorite example app: Bookinfo, which, if you're not aware, is the sample application used throughout the Istio documentation. It's a good choice because it includes multiple services and multiple versions of those services. We're going to construct an SLO for the product page, which is what the end user sees, look at how the details, ratings, and reviews services inform that SLO, and craft SLOs for them so the whole transaction flow stays healthy. For technical context, I'm using Kubernetes 1.27, running on EKS in AWS with the AWS Load Balancer Controller. Istio was installed with istioctl using the default profile, along with the following add-ons: Prometheus, Jaeger, Grafana, and Kiali.

So here's our Bookinfo application running at an externally facing website, and this is what we typically expect to see. There's a login, you can interact with the page, and as you refresh, you should see reviews served by different deployments: reviews v2, then v3, then v1. That's how Bookinfo works by design. For demonstration purposes, I've had Locust running for a while now, sending a steady stream of requests, so we have a good baseline of data to construct these SLOs.
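For reference, here's a minimal sketch of that setup, assuming you're working from an extracted Istio release directory and deploying into the default namespace. The ALB ingress wiring on EKS is omitted:

```sh
# Install Istio with the default profile
istioctl install --set profile=default -y

# Enable sidecar injection and deploy the Bookinfo sample
kubectl label namespace default istio-injection=enabled
kubectl apply -f samples/bookinfo/platform/kube/bookinfo.yaml

# Install the observability add-ons bundled with the Istio release
kubectl apply -f samples/addons/prometheus.yaml
kubectl apply -f samples/addons/jaeger.yaml
kubectl apply -f samples/addons/grafana.yaml
kubectl apply -f samples/addons/kiali.yaml
```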
So let's dive in and start looking at some of the metrics. Here is the default mesh dashboard you get, showing a number of HTTP and gRPC workloads. You can see saturation, requests per second, the median, 90th-percentile, and 99th-percentile latencies, and the success rate, which is effectively an availability metric. Now, this is nice, but it doesn't tell the whole story. I could theoretically start building SLOs off these metrics, but I don't fully understand the dependencies yet, because this shows the product page's metrics in isolation, without showing where those requests come from; likewise for ratings, details, and reviews, the microservices that the product page calls. So we want to dig deeper, and we have other tools at our disposal.

Let's jump over to Kiali. Kiali immediately starts giving me insight: my product page workload is receiving all of its traffic from the Istio ingress gateway. Everything hitting the product page is coming from outside; no service inside the cluster is calling it. That's perfectly fine, and it makes sense, considering Locust is hitting the external URL. What I also see is that the product page v1 deployment behind the service is making three different kinds of requests: to reviews, details, and ratings. So now I understand that the product page depends on these three services. Any SLO I set for the product page requires that the SLOs for reviews, details, and ratings support it, so the flow is preserved and the customer experience stays good. If, for argument's sake, I set an availability SLO of 99% on the product page, I can't have reviews, details, and ratings be anything less than 99%. As a matter of fact, I probably want them a little higher, say 99.1 or 99.2, so I have a buffer; if failures start creeping into one of those services, I can still maintain the customer expectation.

But even this doesn't tell the whole picture. When we look at Jaeger and examine some of the traces, we start to see nested dependencies within this call graph. The Istio ingress calls the product page; the product page calls the details API, which responds back to the product page; and in parallel, the product page calls reviews, and reviews in turn calls ratings. This is also evident in the trace graph, where we can see the product page calling details, and the product page calling reviews, which then calls ratings. That level of nesting wasn't obvious in the Kiali view, but now I have a nested dependency to account for.
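Under the hood, that success rate comes from Istio's standard istio_requests_total metric in Prometheus. As a sketch, an availability SLI query for the product page might look like this; the workload name matches the Bookinfo deployment, and the 5-minute window is just an illustrative choice:

```promql
# Fraction of non-5xx responses into productpage over the last 5 minutes
sum(rate(istio_requests_total{reporter="destination",
    destination_workload="productpage-v1", response_code!~"5.."}[5m]))
/
sum(rate(istio_requests_total{reporter="destination",
    destination_workload="productpage-v1"}[5m]))
```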
So when I define SLOs for ratings, I not only need the ratings targets themselves; I also need to respect the reviews service's upstream SLO dependency, and the product page's upstream SLO dependency above that. As microservices architectures grow, this becomes an interesting challenge. A principle I like to apply is essentially a least-common-denominator rule: if the product page target is 99 and the reviews target is 99.5, then ratings has to be at least 99.5 or higher, because it needs to satisfy both the reviews and product page upstream SLOs. (Remember, too, that serial dependencies compound: if three services in a chain each hit 99.5%, the best the composite can achieve is roughly 0.995³, about 98.5%, which is why downstream targets need headroom.) A chain forms here, and it's super important, because it's what keeps us adhering to customer expectations.

From there, we'll take a deeper dive. I've taken the liberty of simplifying some of the metrics so we can talk about them; this is where we get into the mesh metrics Istio exposes and the granularity you get from them. What we have here is the product page workload: on one side, the incoming requests arriving at it, and on the other, the outgoing requests it makes. Now I can see the different calls being made to downstream services, and the different upstream calls requesting that the product page serve content. These mesh metrics are stored in Prometheus, and Grafana is doing the heavy lifting of putting it all together. The top-level panels look very similar to the mesh dashboard, where we viewed the service in isolation, but now I can break it down. I can see that essentially all of the incoming requests come from the Istio ingress gateway. Perfect; I know everything is coming from outside, so this is my starting point for an SLO. However, I can now also see differences in saturation across the reviews, details, and ratings services. Along with the response codes coming back, this tells me I'm making many more calls to reviews, which means it may be a more critical service, something to keep an eye on or adjust targets for.

The next piece is latency. The Istio ingress gateway is effectively acting as the customer here, since it's the outside traffic entering the product page. But behind the scenes, as we saw in the Jaeger dependency graph, there's a slew of calls happening to different services within Kubernetes, and for that intra-mesh traffic I now have latency data as well. What I'm going to do here is filter by the P90, because I think it's a good balance: it's not overly sensitive to individual failures, but it tells you when things are starting to go bad, when 10% of your user base is having a degraded experience. If latency is above a certain threshold, you want to react before it affects half your users. So I typically use the 90th percentile for histogram and quantile metrics; it gives me a good understanding.
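That P90 comes from Istio's request-duration histogram in Prometheus. A sketch of the query, again using this demo's workload names and an illustrative window:

```promql
# 90th-percentile request duration into productpage, split by caller
histogram_quantile(0.90,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",
      destination_workload="productpage-v1"}[5m]))
  by (source_workload, le)
)
```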
So now I can see details, ratings, and reviews, and I have an understanding of the request durations between them. These will help inform the SLOs I need so the product page can serve content to the customer. Looking at the incoming 90th percentile, it's between five and seven seconds, give or take. So, for argument's sake, we'll say: 99% of requests coming into the product page will have content served within six and a half seconds. That will be the latency SLO I define and track for the product page.

Knowing that, I know there are SLOs I need to create underneath it to support it. For example, the details service has some spikes in it. So maybe I say: in order to achieve the 99% target on the product page latency SLO, I need 99.5% of the details calls coming from the product page to be served in an appropriate time. Looking at the details 90th-percentile latency, it's spiking between 30 and 50 milliseconds. In some instances I need to be mindful that it doesn't spike to 100, because that would likely push on the product page P90 that's sitting around five and a half seconds; something going wrong in the details service could have real impact upstream.

One way to investigate is to switch the workload filter to details. Now I'm looking at the details service and all of its incoming requests, and the incoming side looks very similar to the outgoing side of the product page. As we go through this thought process, walking the dependency graph, I can start constructing SLOs on the details service and make sure we adhere to them.

This gets especially interesting with reviews. Looking at reviews v2, the product page is making a lot of requests and the 90th percentile sits around four milliseconds for the most part. But when I jump to reviews v3, it's a bit more spiky. I'd flag that as something I need to understand: what changed in reviews v3? Is it a tolerable change? Do we expect product page latency to increase because of a feature that ships with v3? Or do we need to dig in and improve the reliability of the v3 service? These are the questions SREs will work through, but the data is all here, thanks to the Istio mesh partnered with Prometheus, Kiali, and Jaeger. Using all of these tools in tandem gives us a rich canvas on which to build SLOs and drive better outcomes for our platforms and our end users.

Now, I will make one comment here: there is proprietary software that can do a lot of this. There are SaaS vendors with integrations into Istio that can pull some of these metrics in and, as a matter of fact, layer SLO features on top. If you're using those in your enterprise, I would check whether you can get these metrics sourced in and use them to help craft your SLOs.
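To compare versions side by side, a sketch of the same histogram query broken out by which reviews deployment is being called; the workload names and regex follow the Bookinfo naming convention:

```promql
# P90 latency of productpage -> reviews, per reviews version
histogram_quantile(0.90,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="source",
      source_workload="productpage-v1",
      destination_workload=~"reviews-v[123]"}[5m]))
  by (destination_workload, le)
)
```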
But I think this shows a way for SREs who are building with open source to have that same experience: to craft those SLOs and achieve that reliability. To summarize: service level objectives are one of the best tools an SRE can use to help deliver the best customer experience possible. They give better insight into the health of services and their interdependencies than we get from traditional APM or other kinds of metrics. And Istio can help with a lot of the SLO definition and tracking, because not only does it provide the golden-signal metrics we're usually seeking, it also gives you an understanding of the dependency graph within a transaction flow. What's nice about this, too, is that microservices architectures evolve; they may become more or less polyglot. Customer expectations evolve as well. Six months from now, your tolerance for latency may be tighter, and you may need to revisit those SLOs and go through this process again. But if you're armed with this data, SLO creation and definition becomes orders of magnitude easier, because you'll have broad insight into what's going on inside your cluster, thanks to the Istio mesh and the metrics and visualizations from the supported add-ons.

I hope you enjoyed this talk. I want to make a quick shout-out to Jason Kallner from Capital One. We worked together, and a lot of the restaurant metaphor came from that; it's how we used to teach SLOs in workshops, and it was very effective at getting the point across: what the customer experience should be, and how that ties to microservice architectures. Also, a shout-out to Adobe Firefly, as some of the artwork here was AI-generated. If you have any questions, my GitHub handle is here, as well as my LinkedIn profile. I look forward to talking with you about this, look forward to connecting, and take care.