As is tradition in Germany, we try to be on time, so we're rushing through the schedule today. We'll continue the show with Steve and John, who are talking about service resiliency patterns. This is definitely something you all want to know a thing or two about. So John and Steve, take it away, please.

Hey, am I still connected?

Yep, you're good. Let it roll.

All right. Hello and welcome, everyone. Thank you all for joining our session about building resilient services. With the growth of Kubernetes and microservices, or just building distributed systems in general, our applications need to be more fault tolerant, because a lot can go wrong. So what does it mean to be fault tolerant? We need to program our applications to handle and recover from partial and transient failures. Today we'll talk about some tactics we can use to make our applications more resilient. While these apply in any situation, they're especially important in a cloud native context. For today, we'll pretend we have a weather service that we're integrating with and calling. But before that, let's introduce ourselves. Steve, who are you?

Yeah, so my name is Steve Trent, and I've been a principal consultant at Red Hat for almost 11 years now. In my typical day-to-day I'm very hands-on-keyboard, working with customers to actually write code and architect solutions. I'm very passionate about writing clean code and keeping things simple, and I love learning about new technologies, frameworks, and patterns and seeing how I can apply them to my everyday work.

Yeah, I'm John Keem. I'm a solutions architect, which I think is a fancy way of saying I figure out what customers and enterprises need, what they're trying to solve, and then pair it up with technology, usually open-source technology. I've been developing for a number of years and I love it. So clean code is definitely on my list today too, and that's why we're talking about resiliency: how to write clean and resilient code. So Steve, tell us a bit about our use case today.

Yep. Before we get started, you might see a little timer up there. This is a topic we could get very, very in-depth and verbose about, so we've got a little timer to keep us on track, because every one of the things we talk about today could fill an hour by itself.

All right, so today let's pretend we've got a sample app deployed with external dependencies on a bunch of other services. These can be services you write and own yourself, or something managed by a third party. This is pretty much the standard pattern for all the microservices you're probably building today. You've probably already encountered some of the issues we're going to talk about and handled them in whatever form or fashion you usually do, whether that's writing code to handle it with try/catches and for loops, or using a library to abstract some of that away for you. Today we're going to walk through some of the most common patterns and see how we can make our apps more resilient using them. So John, where do we start? What can go wrong for applications, or where do we start first?

Yes, I think the most logical, most common approach to error handling that a lot of people reach for is the mechanism known as a timeout.
That's something we use when the service we're calling doesn't respond, or we're not sure what's going to happen, or it might take a long time. The pro is that we shield ourselves from getting into too much trouble if the downstream service takes a long time, isn't there, or hangs forever; that way it doesn't cause us to also hang forever waiting for it. So in the picture here, we wait three seconds, it doesn't respond, we assume it's dead, and we move on. We allow our users to continue using our app.

What are some of the cons? Well, maybe we're waiting too short a time, and that extra millisecond we could have waited is when the service would have returned. We prematurely gave up and didn't get the answer we were actually looking for, so that might be bad. Or maybe we're waiting too long: maybe we know the service takes two seconds on average and we're waiting two minutes. That's entirely too long; the service has probably failed by then. So really dialing that in is important.

Can you tell us about some factors to consider for setting proper timeout values? And when do I know I need to change from the default values?

Yeah, so in Firefox, for example, there are a bunch of default request timeouts you can take a look at and toggle, and the default HTTP request timeout is 30 seconds. But really the answer, putting my consulting hat on, is that it depends. It depends on your particular application, it depends on your context, and only you really know what's best. Let's say, as I mentioned, a request normally takes two seconds: dialing that in somewhere around that time frame makes a lot of sense. The other consideration is your end users. If you have end users, like on a web app, you can't have them sitting around for a long time, because they just won't use your app. So you want to dial it in so the wait time is appropriate. However, maybe you don't have live users; maybe you have two systems talking to each other. In that case, waiting a couple of seconds or even minutes might be all right, because there's no human at the other end tapping their foot waiting for the response. So again, it's really contextual, really dependent on your application, and you'll get a feel for the right value once you start using it.

Now, talking about timeouts, the next logical thing to think about is the idea of a retry. Piggybacking off the last idea: if the service dies, what can we potentially do? Well, one, we can fail and say we didn't get anything. The other thing we can do is retry the call. This is really nice for a class of errors we call transient failures: a temporary failure, maybe a blip in the network, some packets were dropped or something. In those cases, a good practice is to retry, because that weird anomaly probably won't happen again.

So what are some pros of this? The good thing, again, is we shield ourselves from these weird network issues and make them much reduced or go away entirely. The con is for non-transient errors. These are not the network blips; these are permanent: the service is down, or it can't talk to something else it depends on.
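Stepping back to the timeout for a moment: here is a minimal sketch of what it might look like using the MicroProfile Fault Tolerance annotations the speakers name later in the session. The `WeatherClient` interface and `WeatherGateway` class are hypothetical stand-ins for the weather-service integration, and the `jakarta.*` imports assume a recent Jakarta EE or Quarkus 3 stack (older stacks use `javax.*`):

```java
import java.time.temporal.ChronoUnit;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import org.eclipse.microprofile.faulttolerance.Timeout;

// Hypothetical client for the downstream weather service,
// wired up elsewhere (for example as a REST client).
interface WeatherClient {
    String fetchForecast(String city);
}

@ApplicationScoped
public class WeatherGateway {

    @Inject
    WeatherClient weatherClient;

    // Give up after 3 seconds instead of hanging forever on an
    // unresponsive downstream service; the caller gets a
    // TimeoutException to handle, and users can keep using the app.
    @Timeout(value = 3, unit = ChronoUnit.SECONDS)
    public String currentForecast(String city) {
        return weatherClient.fetchForecast(city);
    }
}
```

The three-second value is exactly the tuning trade-off discussed above: too short and you give up prematurely, too long and your users sit and wait.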
Those permanent failures are something somebody has to go rectify, and for those types of errors a retry doesn't really help; it's not going to bring the service back up. And just retrying could also make the problem worse. Let's say the service you're calling is under load: repeatedly calling it is not going to help the situation. We might use something like an exponential backoff so we're not hammering the service; maybe we delay one second, then two, then four, then eight. That can be a good mitigating approach. But either way, for these non-transient classes of errors, the timeout and retry mechanisms might not be so good.

Do you have any recommendations on how many times we should retry, or what are some of the defaults we could rely on?

Yeah, good thing I've still got my consulting hat on: the answer is, it depends, again. You really have to know the characteristics of your application, and then you'll know what a sensible number of retries is for your context. Think about the service you're calling: if it's idempotent, or fails fast, no problem, call it multiple times. But maybe the service does something weird where it creates half-written records, or it's in the middle of processing and moves some files around but doesn't cleanly quit, and then you're in this weird state. If you understand what the service is doing, you'll know the best approach: how many times to retry, or whether to retry at all. One more comment, which is that it's similar to the timeout in that you have to think about the end-user experience. Retries take time, so if you have users waiting, you might want to retry fewer times, compared to backend-to-backend systems where you don't have that constraint.

Now, these were all pretty simple patterns, the ones I reach for pretty immediately to mitigate errors that could occur. But Steve, I wanted to ask you about some more complex, advanced patterns.

Yes, so a circuit breaker is one of the more advanced patterns. Just like an electrical circuit breaker, it's used to prevent damage to the system by immediately detecting some sort of fault condition; when it does, it trips the circuit to prevent additional faults from degrading the entire system. But unlike the electrical circuit breaker in your house, you're not going to need to manually go back to the panel and flip the switch back on, because a software circuit breaker keeps a pulse on the health of the calls going through it, so it can be configured to turn back on automatically.

It works by having three states. The first state, normal operation, is closed, which means it's a complete circuit. Your application tries to call out to the weather service, checks the circuit breaker first, and the breaker says, hey, the circuit's complete, it's good. That's business as usual. On the other end of the spectrum, the circuit is open, meaning the breaker has been tripped. Whenever a request from the application tries to go to the weather service, the circuit breaker says, hey, I can't complete that request. It throws a circuit breaker exception, which your application can handle really quickly.
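Before the circuit breaker discussion continues, here is a sketch of the retry with exponential backoff that John describes, continuing the hypothetical `WeatherGateway` from above. Note that `@ExponentialBackoff` is a SmallRye-specific companion annotation, not part of the portable MicroProfile spec:

```java
import java.time.temporal.ChronoUnit;
import jakarta.inject.Inject;
import org.eclipse.microprofile.faulttolerance.Retry;
import io.smallrye.faulttolerance.api.ExponentialBackoff;

public class WeatherGateway {

    @Inject
    WeatherClient weatherClient;

    // Retry up to 3 times. By default any exception triggers a retry;
    // in real code, narrow this with retryOn/abortOn so only transient
    // failures (network blips) are retried and permanent errors fail
    // immediately.
    @Retry(maxRetries = 3, delay = 1, delayUnit = ChronoUnit.SECONDS)
    // SmallRye extension: double the delay on each attempt
    // (1s, 2s, 4s, ...), capped at 8 seconds, so a service that is
    // already under load doesn't get hammered.
    @ExponentialBackoff(factor = 2, maxDelay = 8000)
    public String currentForecast(String city) {
        return weatherClient.fetchForecast(city);
    }
}
```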
The purpose of that circuit breaker exception is to fail quickly, because you don't want to continually hit something that's down. In between those two states, though, you've got a half-open state. When it's half-open, some of your requests have failed and some have passed, and you're in this watch phase to determine what the breaker should do next: should it stay open, should it close, should it stay closed, should it open up, and so on. This behavior is fully configurable, so you can change things like how many of the last requests you want to monitor (that's your rolling window), what your success and failure thresholds are for deciding when to open or close the circuit, and how long the state stays open or closed.

So, the pros of using a circuit breaker. For non-transient errors, it's really about failing fast and returning an error immediately. We don't want to spend any cycles sending a request to something that's known to be down; if it's been down for the last five or ten requests, don't even try it. For transient errors, it's really great because you're monitoring the status of the calls, so you can automatically recover once your services come back online. This makes for a really robust system, because it's actively trying to prevent damage to itself, but it's also trying to heal itself too.

Obviously the con is a bit of complexity. You're opening yourself up to potential misconfiguration or mistuning of the circuit breaker: maybe you're tripping the circuit too often, or maybe not often enough, and the overall performance or latency of your system is affected by it. The other trade-off is complexity of error handling and troubleshooting. If you're testing and getting inconsistent results, or your users are reporting intermittent issues, you need to be aware of how your circuit breaker is behaving. Most of the libraries we'll talk about today have methods or ways to inspect the status of the circuit breaker; that's just an additional task on your troubleshooting checklist.

I see a great comment in chat, so I'm going to ask it: you talked about this being complex, so how do we implement it?

Yeah, so the patterns we've talked about, and the ones we'll continue to talk about today, are all readily available in libraries that are already widely used. If you're a Quarkus developer in Java, we've got the SmallRye Fault Tolerance package, which is an implementation of the Eclipse MicroProfile Fault Tolerance specification. For the Spring Boot developers in the crowd, we've got Spring Cloud Circuit Breaker, which actually uses the Resilience4j library behind the scenes. And if you're using neither Quarkus nor Spring Boot, you can always use the Resilience4j library directly.

The next technique is the fallback pattern. When your primary service is down, sometimes it makes sense to call a backup or fallback service. This other service gives you partial results, or maybe a subset, or something compatible with the original result, so that your application can still provide some level of service instead of just returning an error, a null, or an empty dataset. The pro is that it works really well for services prone to intermittent failures, and for the cases where retrying won't make a difference because the service is down for good.
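Steve's three states and rolling window map directly onto annotation parameters. A hedged sketch, continuing the hypothetical `WeatherGateway`; the numbers are illustrative, not recommendations:

```java
import jakarta.inject.Inject;
import org.eclipse.microprofile.faulttolerance.CircuitBreaker;

public class WeatherGateway {

    @Inject
    WeatherClient weatherClient;

    // Rolling window: watch the last 10 calls. If at least half of them
    // failed, trip the circuit; callers then fail fast with a
    // CircuitBreakerOpenException instead of waiting on a known-dead
    // service. After 5 seconds the breaker goes half-open, and 2
    // successful trial calls close it again; no trip to the panel needed.
    @CircuitBreaker(requestVolumeThreshold = 10,
                    failureRatio = 0.5,
                    delay = 5000,          // how long the circuit stays open, in ms
                    successThreshold = 2)
    public String currentForecast(String city) {
        return weatherClient.fetchForecast(city);
    }
}
```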
Coming back to the fallback pattern, the cons of this approach: if your main service is always prone to failures, maybe you need to reevaluate why it's on the critical path, because it's not good practice to use exceptions for control flow. The second con depends on what your backup service is doing: it could be adding additional complexity that you'll have to maintain. You're essentially creating another endpoint that could itself fail if you build too much complexity into it. And if your backup service is also prone to failures, that's a bad thing and could cascade into other failures.

Yeah, so can you have more than one fallback? Should you have more than one fallback?

So the fallback method is essentially your backup plan, and I'd recommend that whatever your backup plan is, it should always be guaranteed to return something. You could make your backup as simple as logging an error and returning a canned response saying, hey, the service is unavailable, try again later. If you decide to call another service that could itself fail, you're just going to have cascading failures, and that's going to make troubleshooting really difficult. So whatever your app needs to do, go ahead and do it, but I recommend keeping it simple.

So John, what can we do to reduce or contain the fallout of failures?

All right, the next pattern we can use is called the bulkhead. This pattern gets its name from the nautical term, where a ship is broken up into compartments: if a breach occurs, the entire ship doesn't get flooded and sink. Instead, that one particular compartment floods, and the breach is contained, assuming it doesn't breach all of the compartments. In the same way, we can take this approach in software, where we want to limit the damage a particular breach, or in this case a failed service, can cause.

So in our example here, we have an overloaded application that we're making HTTP calls to, some weather service, and let's say it's completely exhausted its resources and is unresponsive. The pro is that that overloaded component, component A, doesn't have to take down the entire system, because component B is still able to provide some level of service. The application itself should theoretically stay up even if a single component becomes unusable. The con, though, is that this is a bit more complex: you have to think about how to break your application into such components. And each additional bulkhead has a cost. If you look here, component A is potentially making its own HTTP call, and there's a database on the other end, so it might have its own connection pool or something. There may be more resources involved, because each component has to be able to act and stay alive independently. But again, the pro is that they can live independently, which is a good thing.

All right, so there's a bit of overlap here with Kubernetes and containers. One question I get a lot is: John, if Kubernetes exists and it's doing a lot of this self-healing stuff, why do I have to do anything in my application? Where is that line, where is that boundary? Can I just rely on the platform? And the answer is really: it's both.
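Before the platform discussion, the fallback and bulkhead patterns just described also have direct annotation equivalents. A sketch on the same hypothetical gateway; per Steve's advice, the fallback is a simple canned response that cannot itself fail:

```java
import jakarta.inject.Inject;
import org.eclipse.microprofile.faulttolerance.Bulkhead;
import org.eclipse.microprofile.faulttolerance.Fallback;

public class WeatherGateway {

    @Inject
    WeatherClient weatherClient;

    // Bulkhead: allow at most 5 concurrent calls into the weather
    // service, so one slow dependency can't soak up every thread
    // in the application.
    @Bulkhead(5)
    // Fallback: if the call fails anyway, return a canned response
    // instead of an error, a null, or an empty dataset.
    @Fallback(fallbackMethod = "forecastUnavailable")
    public String currentForecast(String city) {
        return weatherClient.fetchForecast(city);
    }

    // The fallback must match the guarded method's signature and
    // return type, and it deliberately cannot fail.
    String forecastUnavailable(String city) {
        return "Forecast temporarily unavailable, please try again later";
    }
}
```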
So, to unpack that "both": while the platform does a lot, and using containers is a great way to create a completely isolated unit of deployment, a completely isolated app, and while Kubernetes does have a lot of great self-healing properties, as developers we can't be lazy. We still have to use best practices and design our applications in a way that takes advantage of those things.

One thing we can do as application developers is make sure we isolate our services. Again, with this component approach, we don't want different components to affect each other, where one can take down another; we have to design things so that doesn't happen. Also, if you have long service chains, lots of calls being made, we don't want a failure in one particular link to bring the whole chain down. We want failures to happen in such a way that all the services remain up and can recover, instead of cascading down and destroying the entire chain. The other thing, thinking about the platform a bit, particularly if Kubernetes is our target: we don't want to put a lot of containers in a single pod, because we want to be able to scale them independently and have Kubernetes heal them independently. That's a level of isolation we can get, so really break them up into different pods.

Now, on the platform side, there's a lot we can do there too. The first thing is network policies: if one service goes nuts and starts hammering all the services around it, we want network policies in place so we can control what that service can actually call. The next thing, talking about apps going crazy: sometimes you'll see a runaway app start to eat more and more memory or CPU. For those occurrences, we want to set sensible resource requests, so we request a reasonable amount of memory, and also good limits, so that as the app starts to grow it doesn't get too big. We want to avoid the noisy neighbor problem, where a runaway app eats all the resources on the machine. Those are some good techniques.

Moving on: we've talked about a lot of different error handling techniques. What are the big takeaways from all of this? The first one is to use the right approach. We've talked about a lot of different tools in the toolbox, and you want to use the one that fits your application and your specific scenario. Right tool for the right job; I think that's a mantra a lot of developers can get behind, because there are so many options out there, and you want to pick the one best suited to your use case. That brings us to the second point, which is that these techniques can build upon one another; using one doesn't preclude you from using another. A lot of the techniques we talked about today are additive: you can layer them on top of each other and mix and match as you need. The last one is that it's not magic. There is no magic bullet that will solve all your resiliency issues. All the techniques we've talked about are ways to handle errors, and as mentioned before, if exception handling is part of your normal control flow, you're probably just masking an underlying issue with these solutions.
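On the point that these techniques are additive: with the annotation style sketched earlier, composing patterns is literally stacking annotations on one method. Under the MicroProfile Fault Tolerance spec the behaviors nest roughly as fallback around retry around circuit breaker around timeout; the parameters below are illustrative:

```java
import java.time.temporal.ChronoUnit;
import jakarta.inject.Inject;
import org.eclipse.microprofile.faulttolerance.CircuitBreaker;
import org.eclipse.microprofile.faulttolerance.Fallback;
import org.eclipse.microprofile.faulttolerance.Retry;
import org.eclipse.microprofile.faulttolerance.Timeout;

public class WeatherGateway {

    @Inject
    WeatherClient weatherClient;

    // The patterns compose: each attempt is capped at 3 seconds, failed
    // attempts are retried up to 2 times, the circuit breaker watches
    // the overall call stream and fails fast once tripped, and the
    // fallback is the last resort when everything else gives up.
    @Timeout(value = 3, unit = ChronoUnit.SECONDS)
    @Retry(maxRetries = 2)
    @CircuitBreaker(requestVolumeThreshold = 10, failureRatio = 0.5)
    @Fallback(fallbackMethod = "forecastUnavailable")
    public String currentForecast(String city) {
        return weatherClient.fetchForecast(city);
    }

    String forecastUnavailable(String city) {
        return "Forecast temporarily unavailable, please try again later";
    }
}
```

This is also where the "mixing medications" warning at the end of the talk applies: the settings interact, since retries multiply timeouts and both feed the circuit breaker's failure counts.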
To that point about masking issues: if you're retrying or falling back on the first call every single time, then whether it's a flaw or bug in your code, or the service is just too unreliable to use, it's going to affect the overall performance of your system, and you don't want that.

So, we've got two minutes left. I don't know if we've got any Q&A questions, but we covered a lot of good questions throughout the presentation. Can we check the chat box? Do we have anything there?

I don't think so. Nope, I haven't seen a lot. So it looks like I can let you guys go early, which is a dream come true. Before we close out, John, do you have any last words for us?

Yes. Do use these techniques, but do them thoughtfully. And also make sure to add tests. We haven't talked much about testing, but testing is a critical part of this: test your code to cover not only the happy path but also the failure cases, to make sure you're able to handle these types of failures.

Yeah, that's a good point too. Testing is so important, and everything we've talked about today should be testable using regular JUnit or whatever test frameworks you commonly use. One last thing I'll add: like you said, these techniques are additive, and you can use them on top of each other in combination. But be careful too. It's like taking medication: each medication on its own may be good for you, but mixing medications can produce really bad results. So be mindful, analyze, design, understand what your latency and performance characteristics are. That's how you build resilient systems.

So, we've got 30 seconds left. I just want to say thank you all for tuning in. If you have any other questions, feel free to reach out; our emails are right here. And if you just want to chat, post something in the chat box. Thank you all.

Thank you for attending. John and Steve, you're my heroes of the day: you're on time. Thank you so much for that, and thanks everybody for joining. Please make sure to join the main stage for the keynote in about five minutes. So we'll be back on the main stage in five minutes.
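On the closing point about testing the failure path, here is a plain JUnit 5 sketch of the shape such a test might take. The fallback is hand-rolled inside a tiny stand-in gateway so the test runs without a container; with the annotation-driven versions sketched earlier, you would run the same assertion under a container-aware test (for example Quarkus's @QuarkusTest) so the fault-tolerance interceptors are active:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class WeatherGatewayFailureTest {

    // Hypothetical downstream client, as in the earlier sketches.
    interface WeatherClient {
        String fetchForecast(String city);
    }

    // Minimal stand-in gateway with a hand-rolled fallback.
    static class Gateway {
        private final WeatherClient client;
        Gateway(WeatherClient client) { this.client = client; }

        String currentForecast(String city) {
            try {
                return client.fetchForecast(city);
            } catch (RuntimeException e) {
                return "Forecast temporarily unavailable, please try again later";
            }
        }
    }

    @Test
    void returnsCannedResponseWhenServiceIsDown() {
        // Stub the weather service as hard-down: every call blows up.
        Gateway gateway = new Gateway(city -> {
            throw new RuntimeException("connection refused");
        });

        // Exercise the failure path, not just the happy path.
        assertEquals("Forecast temporarily unavailable, please try again later",
                gateway.currentForecast("Berlin"));
    }
}
```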