I'm Lee Calcote, founder and CEO of Layer5. So Nick, you've been service meshing for a while. Do you ever wonder if you're doing it well, if you're doing it right? How do you know? Do you have circuit breakers that are configured too sensitively? Is your security doing what you think it's doing? Is your deployment performant?

I think it's a really good question, and one I'm hesitant to answer, because, as with a lot of industry tooling, you build knowledge from using it. Service mesh is new in an odd sense: it's been around for a while, and the concepts aren't necessarily new, but the implementation, having this thing which interacts with every piece of traffic in your application, is pretty new. So how do you know whether you're using it correctly? Part of the answer is that there are standard patterns from network and application development which you can still apply and which are very relevant with service mesh. But there are also very specific use cases which you've got to think about. I'm hoping that today we can break some of this down, share it with the good folks watching, and demystify some of these things.

Before we do, we should take a look at the problem statement, and that leads us here. Distributed systems management is hard. That continues to ring true, and as we increase the complexity of our systems, it becomes more and more of a problem. The core thing we want to talk about in this presentation is the network, how the network has a major effect on your application, and, of course, some of the things you can do about it. Networking has never been more significant for developers, because it's now a necessary part of the job: with the move to microservices, everything connects over the network where previously it would have been an in-process procedure call. Let's take a look at a little demo, which Lee is going to show us now, of how the network can actually affect our system.

Let's refresh on how a fallible network can be a real pain to your workloads. This is Bookinfo, a sample application from the Istio project. Let's dive in deeper to understand the challenge it's having and take a look at its performance. We can use a performance profile to analyze its behavior. Here's the troubled Bookinfo's performance profile. We'll use this profile to generate load and analyze how it's running. We'll apply a short test, watch the system as the test runs, and through statistical analysis come to understand that there's about an 18% error rate for this particular application at this time.

This isn't an uncommon phenomenon; there are patterns of behavior here. The example of the failing system that you've just seen in that little demo is actually pretty typical, and we're going to see how we can apply standard patterns to remedy those things. Ultimately, that's why we wrote this book.
The concept behind it was that we wanted to take all of that industry knowledge around networking and how it applies to the service mesh, and distill it down into about 30 or so core patterns that can then be used and reused within your own applications.

As suggested, the book isn't about a specific service mesh; it's about service mesh technology at large. It's about enabling the use of repeatable architectural templates, or patterns. These patterns have best practices imbued in them, and those best practices can evolve over time. There are a few key characteristics of service mesh patterns to be aware of, the foremost of which is that a pattern can describe behavior, not just service mesh configuration. To give an example of behavior, take a canary rollout. Today's service mesh control planes are capable of facilitating traffic splitting between multiple versions of the same service (we'll show a sketch of that traffic-splitting piece in a moment). What they generally don't do is facilitate analysis of the criteria for stepping between the various stages of your rollout, whether that criteria is performance-based, time-based, error-based, or something else. Those behaviors, that criteria, can be captured in a pattern and applied to different meshes.

Service mesh patterns are mesh-agnostic, clearly; that's part of the goal. To that extent, they're intended to be beneficial to anyone running a service mesh. These patterns are reusable and exchangeable. They're declarative in nature and described in YAML, and the syntax is about as simple and concise as I think you can get. There are any number of patterns, and as they're being defined, they're being catalogued into categories.
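To make the canary example concrete, here's a minimal sketch of the traffic-splitting half, the part meshes do handle today, expressed with the Service Mesh Interface's TrafficSplit resource. This is an illustration under assumptions: the v1alpha2 SMI API, and hypothetical reviews-v1/reviews-v2 service names in the spirit of Bookinfo.

```yaml
# Sketch: the traffic-splitting half of a canary rollout via SMI (v1alpha2).
# Weights and service names are illustrative.
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: reviews-canary
spec:
  service: reviews            # the root service that clients call
  backends:
  - service: reviews-v1       # stable version keeps most of the traffic
    weight: 90
  - service: reviews-v2       # canary receives a small slice
    weight: 10
```

A pattern would capture, alongside a split like this, the criteria for promoting the canary, say, an error rate below some threshold for some duration, which the mesh alone doesn't evaluate.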
Okay, so now you know what we're trying to achieve with the patterns. Let's take a look at a couple of examples.

The first one I want to show you is probably the most common pattern you're ever going to find inside applications, and also, if misconfigured, probably the one most responsible for bringing your application to its knees. In the previous demo, what we saw was an upstream connection to a service that was intermittently failing. The reasons behind the intermittent failure could be a number of things: something at the pure network level, or something at the application level. The key thing about the retry pattern is that it's smart enough to handle both. The retry pattern operates at both layer 4 and layer 7. It understands the concept of the connection, can we connect to an upstream service?, but it also understands what success looks like with regard to the layer 7 protocol you're using.

For example, the payment service here is going to try to connect to the currency service. At layer 4, it tries to establish a TCP connection. If it can't, it gets a connection failure and goes into its retry loop, retrying against another instance of the currency service. We see that happening here. Now it connects successfully, so the retry analyzes the request and asks: based on my knowledge, does the response I've been given match my criteria for success? In this instance, it's an HTTP request, and the measure of success is an HTTP 200 status code. But here we haven't got a 200, we've got a 500.

So again, the retry goes into its loop and tries another endpoint, and it'll do that for a configurable number of times until it gets success. Now, this isn't really healing your system; what it's doing is protecting your end user from exposure to these problems, and that's very important. A retry only buys you time until you fix an underlying problem; it shouldn't be the fix itself. That's a key thing when you apply a retry: it shouldn't be something you use as a permanent fix. You should always treat it as a temporary measure, because quite often problems get worse, and then you reach a point where the retry can't help you.

Another important point about retries is idempotency. By idempotency, we mean that if you attempt a request multiple times, you should get exactly the same response. For example, if I fetch a product's details, then every time I make that request, it shouldn't change any state in the database; it just returns the same data. That's idempotent, and therefore safe to retry a number of times. Something that might not be idempotent, however, is taking a payment. Looking deeper at something that might not be idempotent, consider this order service. It has three steps: first, it processes the payment with an external payment gateway; it then updates the stock level to mark that the item has been sold; and it then sends a confirmation email to the person who ordered. In terms of idempotency, the payment might be processed correctly, and updating the stock levels may also succeed, but when it comes to sending the email, maybe the order service fails. Assuming it fails at that stage, the order service returns a 500 error. If a retry catches that, the retry says, well, this call failed, so I'm going to retry it. The problem is that because this isn't an idempotent service, when we go back into the flow again, I'm going to take the payment twice. So idempotency is an incredibly important facet to consider when you're dealing with retries. It's not as simple as applying a retry to absolutely everything.

Then there are some nuances around calculating retry counts. For example, say you've got five instances of a particular service exhibiting an error rate of 2%, evenly distributed across all of the instances. Given the application of a single retry, I'll retry once and then give up, what you end up with is a probability of 0.04% that both the original request and the retry will fail. That gives you a fairly high level of confidence that, in this instance, a single retry is okay. But what if the error level is much higher? Given a 50% failure rate this time, the probability of both requests failing is now up to 25%. You can look at that and say a single retry isn't applicable.
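As a quick sketch of that arithmetic, assuming failures are independent and evenly distributed across instances, the chance that every attempt fails is just the error rate raised to the number of attempts:

```latex
% One retry gives two independent chances to fail (assumed independent failures).
P(\text{all attempts fail}) = p^{\,n+1} \quad \text{for error rate } p \text{ and } n \text{ retries}
\\ p = 0.02,\ n = 1: \quad p^{2} = 0.0004 = 0.04\%
\\ p = 0.50,\ n = 1: \quad p^{2} = 0.25 = 25\%
```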
Given numbers like that, it can be really tempting to say, hey, maybe I'll just do five retries. But the problem that comes with a high retry count is that whenever you're retrying in a system, your system is running at reduced capacity. Does your system have the capability to handle this load? Because if it doesn't, what you might actually be doing with your retries is not reducing the failure rate from 25% toward zero; you might be increasing it to 100%, because the system is now all backed up and can't handle any requests at all. The important thing to remember is that you can't just put arbitrary numbers around something like a retry, or any of these patterns, when it comes to your networked application. You've got to take a measured, informed approach to how it affects the system and how different error rates change things.

Meshery is the service mesh manager. It's a CNCF-hosted project, and it helps you deploy and operate any service mesh. Meshery implements and validates service mesh standards, both Service Mesh Interface (SMI) and Service Mesh Performance (SMP). In addition to those management features, Meshery is designed to be an extensible platform: not only can developers write plugins for Meshery, but a number of ecosystem projects have cropped up around it. As the openly governed service mesh manager, Meshery and the projects around it have maintainers from Layer5, VMware, HashiCorp, Red Hat, Citrix, Rackspace, and so on, and it's through their collaboration that service mesh patterns have become one of these ecosystem projects.

Meshery natively incorporates support for the service mesh patterns project. Using Meshery's CLI, you can import and apply your service mesh patterns; or, using Layer5's MeshMap plugin, you can visualize and design your service mesh patterns, export them, and share them. So let's head back into Meshery and look at how it uses these patterns to help you overcome those distributed systems challenges. Recall that in Meshery, you can use its CLI to import a pattern, and in the UI, you can then edit that pattern. In the context of the retry pattern that Nick was discussing earlier, let's dig into that one a bit more. To do so, let's use MeshMap, a plugin for Meshery. In MeshMap, you can visualize your patterns and customize them, using any service mesh that Meshery supports. Because Meshery adapters support any number of versions of a service mesh, you'll have version-specific components to choose from while you're building your MeshMap. Now, with MeshSync at work and your patterns loaded, you can create a MeshMap to apply the retry pattern. The pattern here has been set for a retry count of two. We'll go ahead and connect the retry pattern to our troubled Bookinfo, and once we have it connected up, we can go ahead and deploy the pattern and apply it to our infrastructure.
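For reference, the kind of configuration a retry pattern like this ultimately drives might look roughly like the following, sketched here as an Istio VirtualService. This assumes Istio as the mesh; the host name mirrors Bookinfo's ratings service and the field values are illustrative, not what the demo literally deployed.

```yaml
# Sketch: a retry policy as an Istio VirtualService (values illustrative).
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings-retry
spec:
  hosts:
  - ratings
  http:
  - route:
    - destination:
        host: ratings
    retries:
      attempts: 2                    # matches the pattern's retry count of two
      perTryTimeout: 2s              # bound each attempt so retries don't pile up
      retryOn: 5xx,connect-failure   # retry on L7 errors and L4 connection failures
```

Note how `retryOn` covers both the layer 4 case (connect-failure) and the layer 7 case (5xx) that Nick described.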
Now let's head over to performance management, back to our Bookinfo performance profile, and initiate analysis. Recall that without the retry pattern, we were seeing about an 18% error rate. With the pattern applied, there are no errors in our analysis. Things look pretty positive. There is, however, a slight hidden cost to this pattern, so let's compare our two performance analyses and examine them. In the comparison, we can see that there's a latency cost to the use of retries: responses from our second set of requests returned without error, but they also incurred some delay as those requests were retried.

So we've just seen another demo, and Lee showed us how we can use Meshery both to do that exploratory configuration of our system and to do methodical performance testing, so that we can be confident the levels we've set for our retry are appropriate. But the thing about a lot of the patterns you apply in architecture, and specifically in service mesh, is that they don't necessarily exist in isolation. For example, a common pattern you would apply alongside a retry is the circuit breaker.

What the circuit breaker attempts to do is remove a service from the load-balanced rotation when it continually fails. There are a number of reasons why you'd want to do this. The foremost is that if a service is failing just because it's busy, breaking the circuit and not sending any requests to it will allow it to recover. In the instance that it's completely faulty, you're removing it from the process and failing fast. You can handle that in a number of different ways, but it ultimately protects the system downstream from any slowness that occurs upstream.

The way it works is this. A request comes into the system. If the circuit breaker is open, you immediately fail, but let's assume it isn't; we're just getting into a situation where we're starting to experience errors. The circuit breaker is closed, the request is passed upstream as normal, and if it's successful, the response is returned as normal. If it isn't successful, the circuit breaker starts keeping an internal count: it's had a number of failures, which could be consecutive failures or failures over a period of time, measured against a threshold that you configure. When that threshold is exceeded, the circuit breaker opens, and no more requests get sent to this service. But then how do you determine when it recovers? After a certain period of time, the circuit breaker enters what's known as a half-open state. In the half-open state, an exploratory request is sent to the upstream service. If it succeeds, the circuit breaker closes again and we go back into normal operation. If it fails, we keep the circuit open and continue to assume the service is still in its healing phase.
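To ground this, here's roughly how circuit breaking is approximated in Istio terms, with connection pool limits and outlier detection on a DestinationRule. This is a sketch under assumptions: Istio as the mesh, Bookinfo's details service as the target, illustrative values, and field names that vary somewhat across Istio versions.

```yaml
# Sketch: circuit breaking via an Istio DestinationRule (values illustrative).
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: details-circuit-breaker
spec:
  host: details
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1            # more concurrency than this trips the breaker
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutive5xxErrors: 3        # failure threshold before opening the circuit
      interval: 10s                  # how often hosts are evaluated
      baseEjectionTime: 30s          # how long a host stays ejected (the "open" period)
      maxEjectionPercent: 100        # permit ejecting every unhealthy host
```

Ejected hosts are periodically given another chance after the ejection time elapses, which roughly plays the role of the half-open state described above.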
Like all of these patterns, the circuit breaker looks simple on the surface, but to use it effectively, there are a number of considerations and caveats to think about. Health checks are a common one: when you're configuring thresholds for a circuit breaker, you've got to do it with your health checks in mind. For example, if your circuit breaker threshold is greater than that of the passive health checks in your system, the circuit breaker will have no effect, because the health check will detect the failure before the circuit breaker does.

A circuit breaker is designed to detect failures quickly, using live requests as its method for doing so. Health check endpoints also typically don't report on things such as database locks. Sometimes they do, but more often they don't. So you can have a failure, for example, that is due simply to a lock on a database write. If you ping the service with a health check, the health check might just be asking: hey, are you alive? Are you up? You're not going to get a report on that lock. Where the circuit breaker is really effective is that it experiences the inability to write directly: it either gets an immediate response from the upstream service that it can't perform the work, or it times out. Either way, you're doing active health checking of the service by continually looking at the service's traffic, and in the instance of failure, you remove it from the rotation so it doesn't affect other areas of the system.

When we're working with our circuit breaker, we also need to consider things like route-independent circuit breaking. What do I mean by this? Say we've got a service that allows me to get a product's details and also allows me to update a product's details. We just said, when talking about health checks, that something like a database write lock can cause a service to fail without necessarily being picked up by the health check. Now, in the instance of circuit breaking, if we open the circuit because we're unable to write the update to the product's details, we're also opening the circuit for reading the product's details. In terms of the proportion of reads to writes, reads are generally far more numerous than writes in many services, so we'd be removing a service from circulation just because we can't write to it. What we really should be doing is treating those routes independently, each with its own circuit breaker. For a write, yes, we should absolutely open the circuit when we can't write to a product, but we shouldn't if we can continue to read. We should try to keep as much of our service in circulation as possible. So we've got to think about these things at quite a fine grain: ideally, we want to apply our circuit breaking per HTTP path and verb, or per gRPC method, and not at a whole-service level (we'll sketch one way to do that after the demo).

Heading back into Meshery, let's apply the circuit breaker pattern to our troubled Bookinfo. We'll pull up our workload, add the circuit breaker pattern, get it connected to our trouble spot, and then, once the design is complete and our MeshMap is ready, we'll go ahead and deploy the pattern. We'll return to the performance profile, which has two concurrent threads configured, and that's more than our circuit breaker pattern is going to allow. So as we perform the analysis, we should see a tripped circuit. There it is: a tripped circuit.
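On that fine-grained point from a moment ago: one way to approximate route-independent circuit breaking in Istio, and it's only a sketch of one possible approach, is to route reads and writes to separate subsets and give each subset its own outlier detection. The service and subset names here are hypothetical, and whether this fits depends on your mesh and version.

```yaml
# Sketch: per-route circuit breaking by splitting reads and writes into
# separate subsets, each tracked by its own outlier detection.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: product-routes
spec:
  hosts:
  - product
  http:
  - match:
    - method:
        exact: GET               # reads go to the "reads" subset
    route:
    - destination:
        host: product
        subset: reads
  - route:                       # everything else (writes) goes to "writes"
    - destination:
        host: product
        subset: writes
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: product-per-route-breaker
spec:
  host: product
  subsets:
  - name: reads
    labels:
      app: product               # both subsets target the same pods,
  - name: writes                 # but each carries its own breaker state
    labels:
      app: product
    trafficPolicy:
      outlierDetection:
        consecutive5xxErrors: 3  # a failing write path ejects hosts only for "writes"
        interval: 10s
        baseEjectionTime: 30s
```

The idea is that read traffic keeps flowing even while the write path's circuit is open.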
We've been through a couple of patterns now. The thing is, there are quite a few more, and a number that have yet to be defined. I chair the CNCF Service Mesh Working Group, a group in which we do a few things, one of which is advancing the thinking behind service mesh patterns. So Nick and I want to invite you to come participate. Service mesh patterns are for everyone: whether your organization is a member of the CNCF or not, you're welcome to come participate, engage, and help define these patterns.

Lee and I have been working on a book called Service Mesh Patterns for O'Reilly, and it's currently in early release. We would love your opinion, and if you've got any comments, or maybe some real gems of information or some esoteric gotchas that you think would make great reading, we would love to hear all of those things from you. You can check the book out, and also find us through the CNCF; we're around on Slack. Any questions, we would be overjoyed to hear from you all. We hope you've enjoyed the talk, and thank you so much for watching.