On today's Visual Studio Toolbox, Carl continues showing us Polly, which is a way we can control for transient errors in our apps. Hi, welcome to Visual Studio Toolbox. I'm your host, Robert Green, and this is part two of our look at Polly, which is a tool that gives us the ability to handle transient errors. So you have services in the cloud talking to each other, they fail, how do you handle that? We're going to continue with Carl Franklin. Hi. Hey, Carl. Hey. In part one, we did an overview of Polly, what's the issue it's trying to solve, and we started looking at how it works, and we're just going to pick up where we left off in this episode. That's right. So if you don't know what this is, go back and watch the last episode. But the short answer is Polly is an open-source project that my company, App vNext, took over and enhanced. It's now in the .NET Foundation, and it's also part of .NET Core. If you use the HTTP client factory, you can configure policies automatically with that. It's very popular, too; it's getting about 150,000 downloads a day. So it's good, it's clean, a lot of people use it. In the last episode, I did three demos. Now we're moving on to the next one, which is wait and retry a number of times, with enough retries that it actually works. In the last demo, we had wait and retry three times with a 200 millisecond delay in between each retry, and that wasn't enough; now we're going to retry enough times that the call eventually succeeds. Again, we're making a call which is always going to fail at first, but Polly gives us the ability to specify what to do about it, other than sit there forever without providing any information: try a number of times, wait and retry. So this is really, you've got an error, a service isn't talking to another service, and you're setting up policies to say, well, what am I going to do about it automatically? Exactly.
So we're calling this web API where we're passing a value and getting a reply with that value back, but it's programmed to fail after the third request within a five-second window. So requests one through three within five seconds work; after that, they don't. So what we're doing here is using a wait-and-retry async policy handling all exceptions, and of course we can be specific about the exceptions we want to handle. We're going to retry 20 times, and every time there's a retry, we're going to wait 200 milliseconds, and this is what happens in between those retries. So this is sort of a mini exception handler where we can tell the user what's happening. We're basically showing those messages in yellow in a console application. So again, in the last demo this number was three; we were retrying three times and it was failing. Now we're retrying 20 times and it will succeed, because 20 retries is enough time to cover the failure window. So essentially what's going to happen is we just make an HTTP request, we wait for a while, and then it comes back that it worked. So in the previous examples, we never saw requests four, five, six, seven, eight, nine, 10, 11, 12, they just never came back, right? Right, yeah. Now we're waiting long enough and retrying long enough that they eventually succeed. Right, and again, this code isn't prescriptive, this isn't saying this is what you should do. We're just exercising all the different policies so that you can get a handle on what they do and how they behave. All right, so the next one is wait and retry forever. Now why would you do that? I mean, that's asking for trouble, don't you think? However, there is a really good use case for this one, and it's just wait and retry forever, I think with 200 milliseconds between each retry. I was actually writing a WPF app that was sort of like a wizard, and in one page you were gathering data from a device. It was a Kinect actually, a Microsoft Kinect.
And then when you hit next, it took all that data and submitted it to an API in the cloud, and it couldn't actually go to the next part of the wizard until we got a response from that server with the magic numbers and the results, because the server had some kind of magic sauce that it did. The application could not continue; there was nothing you could do in this application until we got a response. So wait and retry forever is good there. And we're basically putting up a thing that says, retrying, and if it doesn't work, check your internet connection and just come back later. And we were saving the state, so you could just go back in and run it and it would pick up. So there might be some situations where you don't want to fail; you just want to wait and retry forever. So that's what it does. And it's going to look exactly the same as the last demo, which retried 20 times because we knew that was enough. Now we're not specifying 20, we're just retrying until it actually succeeds. Okay. Yeah, a little bit different. And then you could add code into that which asks the user, do you want to keep retrying? Sure. It's cancelable, right? The user may not want to wait forever. Right, we have a cancellation token. Yep. Okay. So at any time they can press the button that fires that cancellation. Right. So the next one, this is an interesting one, and this is all code. It's a wait and retry with an exponential back-off. I think this is as close to prescriptive as we get; this is a really good way to do wait and retry. And the whole idea is that every time through the retry, we exponentially increment the timeout. So the first time it's 200 milliseconds, then it's 400 milliseconds, then it's 800 milliseconds, et cetera. Right. And this is all just done with a little bit of code.
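To make these first variants concrete, here's a rough, language-neutral sketch of the wait-and-retry idea. The real demos use Polly's C# API (WaitAndRetryAsync, WaitAndRetryForeverAsync, and a CancellationToken); every name in this Python version is invented for illustration, and it's a sketch of the pattern, not Polly itself:

```python
import time

def wait_and_retry(action, delay=0.2, max_attempts=None,
                   is_cancelled=lambda: False):
    """Retry `action` with a fixed delay between attempts.

    max_attempts=None keeps retrying forever (until cancelled), like the
    wait-and-retry-forever demo; a number caps the retries, like the
    retry-20-times demo.
    """
    attempt = 0
    while True:
        try:
            return action()
        except Exception:
            attempt += 1
            if max_attempts is not None and attempt >= max_attempts:
                raise  # out of retries: let the failure surface
            if is_cancelled():
                raise  # the user fired the cancellation: stop retrying
            time.sleep(delay)  # wait (200 ms by default) before the next try
```

With a service that fails for its first few calls, `wait_and_retry(call, max_attempts=20)` keeps going until the failure window passes, exactly the behavior described above.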
So this is a wait and retry, and I think we have a maximum of six retries here, but you could do a wait and retry forever with an incremental back-off. You can see how it slows down: 200, 400, 800, 1600. The idea being that this thing's not responsive, so rather than just nag, nag, nag, nag, you start slowing down, because if it hasn't come back in 800 milliseconds, it's not going to come back in 900. Correct. That's the theme there. Okay. Yeah. So this is a really good strategy, but you can also combine this with other things like a circuit breaker, which I think is the next thing that we're talking about. Yeah. So the circuit breaker is what I was talking about before with the I Love Lucy scene, Lucy and Ethel handling the chocolate-covered strawberries coming down the conveyor belt. They're coming too fast and they're throwing them over there; you know, what are they doing with all those requests? So the idea is that when you have a downstream service that's struggling for whatever reason: Azure, AWS, whatever might have rebooted, or there's a problem with your service, or hey, maybe your credit card expired and they decided to shut it off on you. I don't know what it was, right? There's some problem with that service. So if all these other services start hammering that service, it amounts to a denial of service attack. It'll never recover, right? Right. So rather than continuing to send, even with a timeout, even with an exponential timeout, you can break the circuit. And what that means is that when the circuit is open, no calls go through. But remember, this is happening at the policy level, right? So it's not going to fail to the client, but it is going to just wait. And you basically tell it how long it needs to wait before it closes the circuit again. And so that's what happens.
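Before the circuit breaker demo, the exponential back-off arithmetic (200, 400, 800, 1600 milliseconds) is worth seeing on its own. In Polly you'd express it by passing a delay-calculating function to WaitAndRetryAsync; this standalone Python sketch, with invented names, just shows the doubling schedule:

```python
import time

def backoff_delays(base_ms=200, retries=6):
    """The escalating wait schedule: base * 2^i milliseconds per retry."""
    return [base_ms * (2 ** i) for i in range(retries)]

def retry_with_backoff(action, base_ms=200, retries=6, sleep=time.sleep):
    """Retry `action`, doubling the wait after each failure.

    After the last wait, one final attempt is made and any exception
    is allowed to surface.
    """
    for delay_ms in backoff_delays(base_ms, retries):
        try:
            return action()
        except Exception:
            sleep(delay_ms / 1000.0)  # back off before trying again
    return action()
```

The `sleep` parameter is just there so the waiting can be swapped out in tests; the point is the schedule, which slows the nagging down instead of hammering an unresponsive service at a fixed rate.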
Now it gets a little more complex demo-wise, but the circuit breaker is a very, very powerful tool, and it's a well-known pattern too. So if you think about it, this is a great example of nesting policies. We've got a wait and retry policy, right? That waits 200 milliseconds and then keeps retrying. And then we have a circuit breaker policy, and we're going to break if the action fails four times in a row, and that's what this four here is, right? And then we're going to wait three seconds after that, and then we're going to do a test run and see if it works. If it doesn't work, we're going to keep the circuit open, and if it does work, we're going to close the circuit and allow everything to go through. Now, this is a different metaphor than database connections, which are exactly opposite. When a database connection is open, you can use it; when it's closed, you can't. When a circuit is closed, that means electricity is running through it and it works. When a circuit is open, that's like a tripped circuit breaker: there's nothing going through it. So it's a little bit of a different metaphor, but you get the idea. So now look at this. In our try-catch, we have our wait and retry policy execute, and then inside that, we have the circuit breaker policy execute. So they're nested. And yes, I'll show you how to clean this up in a minute. But that's the whole idea: you can nest these policies, right, from outside to inside. So watch this. Again, we're going to get some different colors here, but I'll explain what happens. All right, that's enough; we can just look at what happens here. So the first three work, okay. Now we have our wait and retry policy, too many requests. And then, after one, two, three, four failures, the circuit breaker kicks in and says, breaking the circuit. This is the power of async with console applications.
So after four, it says, logging, breaking the circuit for three seconds, right? And then we get these exceptions that fail. And then the next one, it's called a half-open circuit: we're making a trial call. And that call worked, so we're closing the circuit again and everything works. So during this whole time here, we are not sending any requests through. Right. Even though the wait and retry is trying to resend, the circuit isn't allowing them through. That is cool. And of course, for your particular app and your particular scenario, you can play around with the actual policies, how many times you want to retry, and how long you want to wait. Exactly. Based on the application, based on user preferences, really, right? Yeah, exactly. It'd be an interesting way to do it. You can ask people, well, how long do you want to sit there twiddling your thumbs before we give up? And the cool part is, there's a way that you can update those variables while the application is running, because there's a configuration store that you can just change and it will repopulate. So you don't have to stop the application just to change the policy. All right, so the next one is, I told you we would clean this up, right? So this is a thing called a policy wrap. And PolicyWrap is a part of Polly where we have our two policies. This is exactly the same as the last one, our wait and retry policy and our circuit breaker policy, but now we're using a policy wrap. We're saying Policy.WrapAsync, here's the outer policy and here's the inner policy. Okay, cool. And now instead of having these nested, we just call the policy wrap's ExecuteAsync. Got it. Isn't that cool? Yeah. Very cool. So the result is exactly the same as the last demo; it's just cleaner code. It's cleaner code, it's easier to read. Think if you had three of them nested. Whew. Yeah. Exactly. Who wants to do that? All right, junior programmer, you get to debug that.
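As a rough model of what the circuit breaker demo just showed, not Polly's actual implementation: track consecutive failures, reject calls outright while the circuit is open, allow one half-open trial once the break window elapses, and close again on success. The class, the tiny retry policy, and the `wrap` composition helper below are all invented names sketching the pattern in Python:

```python
import time

class CircuitBreaker:
    """Break after N consecutive failures; stay open for break_s seconds,
    then allow one half-open trial call. Success closes the circuit."""
    def __init__(self, failures_before_break=4, break_s=3.0,
                 clock=time.monotonic):
        self.failures_before_break = failures_before_break
        self.break_s = break_s
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def execute(self, action):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.break_s:
                raise RuntimeError("circuit open: call rejected")
            # break window elapsed: half-open, let one trial call through
        try:
            result = action()
        except Exception:
            self.consecutive_failures += 1
            if (self.opened_at is not None
                    or self.consecutive_failures >= self.failures_before_break):
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        self.consecutive_failures = 0
        self.opened_at = None  # success closes the circuit
        return result

def retry(times):
    """A minimal fixed-count retry policy as a function, for composing."""
    def execute(action):
        for _ in range(times - 1):
            try:
                return action()
            except Exception:
                pass
        return action()  # last attempt: let the exception surface
    return execute

def wrap(outer, inner):
    """PolicyWrap-style composition: run the inner policy inside the outer."""
    return lambda action: outer(lambda: inner(action))
```

With `combined = wrap(retry(3), breaker.execute)`, the retry is the outer policy and the breaker the inner one, mirroring the nested-execute demo; while the circuit is open the retries still fire but never reach the downstream service.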
All right, so now we've got a wrap with three policies: a wait and retry, a circuit breaker, and a fallback. And a fallback policy is sort of like the last resort. That's when you throw up your hands and say, I'm done, this didn't work. We're finally going to report an exception to the user, but we want to do that in a nice way. We want to control the message that goes to the user, rather than whatever error our infrastructure gives us. We want to tell the user nicely that this failed: sorry, try back in an hour, nobody knows why it didn't work, right? So everything works, but you see the fallback catch is filled with, let me see, the circuit is now open and not allowing calls. And then the response is, please try again later, and you can substitute whatever message you want here. So that's what the fallback is. And let me just show this real quick. So here's our wait and retry, here's our circuit breaker, and now here's our fallback policy. Okay. So we're handling a broken circuit exception, and we're saying, please try again later; we substituted that message. And then essentially we have a fallback for any exception. So we have a fallback for the circuit breaker and a fallback for any exception, which is just the catch-all, right? So now get this, we have two wraps. We have a wrap that wraps a wrap. The resilience strategy, which is wait and retry and circuit breaker. Okay. And then we have another one that wraps the fallback for the circuit breaker. Is there any limit to the number of wrappings you can do? No. Cool. So we have the fallback for any exception wrapping the fallback for the circuit breaker, wrapping my resilience strategy, which is these two. So there are essentially five policies going on here. And then we use the policy wrap just like before, in one call. Very cool. Cool stuff, yeah. So, I mean, there's more to it. How much time do we have? Well, we're just about out of time. We should probably wrap up at this point.
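The fallback idea boils down to: the outermost policy catches whatever escapes everything nested inside it and substitutes a friendly message. A hedged Python sketch of the pattern; `BrokenCircuitError` here stands in for Polly's broken-circuit exception, and all other names are invented:

```python
class BrokenCircuitError(Exception):
    """Stands in for the exception an open circuit breaker throws."""

def with_fallback(action, fallback_value, handled=(Exception,)):
    """Fallback policy: if the action ultimately fails with a handled
    exception type, return a friendly substitute instead of the raw error."""
    try:
        return action()
    except handled:
        return fallback_value

def call_service():
    # Pretend the circuit breaker deeper in the stack is open.
    raise BrokenCircuitError("The circuit is now open and is not allowing calls.")

# Two nested fallbacks, like the demo: a specific one for broken
# circuits, wrapped by a catch-all for any other exception.
reply = with_fallback(
    lambda: with_fallback(call_service, "Please try again later.",
                          handled=(BrokenCircuitError,)),
    "Something went wrong. Please try again later.",
)
```

Because the inner fallback only handles broken circuits, any other exception bubbles up to the outer catch-all, which is exactly the two-layer arrangement described above.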
All right, so all I want to tell people about these two demos right here is bulkhead isolation; I just want to explain what that is. So, a bulkhead on an ocean liner. Let's not use the Titanic as an example, because it didn't have bulkheads, or maybe it did, I don't know. Modern ocean liners separate their hull into compartments, so think of them like sealed-off rooms. If it was just one big hull and they hit an iceberg, or got a torpedo anywhere in the hull, the whole ship would sink, right? But if it's cordoned off into these sections, one section might get filled up with water, but it wouldn't sink the whole ship. So that's the metaphor. The metaphor is, if you've got a service which is calling two downstream services and one of those services goes down, you don't want that to affect the other service. And how it can is that all the resources go to retrying the service that's down, and then the service that's actually not down is a victim of that, because you're using all these server resources for this guy and this one gets none of them. Okay, so that's what that demo does, and you can explore that in the samples on your own time. Think of it sort of like multi-threading, but in the context of a service call, and it's a lot easier than multi-threading. So it seems like Polly is very easy to use and extremely powerful. I think, in a nutshell, it gives us the ability to handle these transient errors when a service is failing and do something way more useful to the user than just show a spinning circle while everybody waits for the service to come back. Is that a good summary of what it is? That is probably the best summary I've ever heard. I mean, that's essentially what you want to do. You don't want to just let failures happen, especially when you're in the middle of service-to-service communication and something happens with service A; nobody's there to press the retry button. You have to account for those things.
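The bulkhead metaphor translates to capping how many concurrent calls one downstream dependency can consume, so a struggling service can't soak up every thread and starve the healthy one. A minimal sketch of that cap in Python, with invented names (Polly configures the same idea with a bulkhead policy's maximum parallelization):

```python
import threading

def make_bulkhead(max_parallel=4):
    """Bulkhead isolation: at most `max_parallel` calls through this
    compartment at once; extra calls are rejected immediately rather
    than queuing up and starving the rest of the system."""
    slots = threading.Semaphore(max_parallel)

    def execute(action):
        if not slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: call rejected")  # fail fast
        try:
            return action()
        finally:
            slots.release()  # free the slot even if the call failed
    return execute
```

Each downstream service gets its own bulkhead, its own sealed compartment, so flooding one never drains the slots reserved for the other.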
And the other thing that I didn't mention is the whole idea of the Chaos Monkey, right? The Netflix Chaos Monkey idea is available for Polly. There's another project that you can use with Polly, called Simmy, that's sort of a chaos monkey: you can use it to inject random delays and failures, just to test out the resiliency of your system. So it's like a complete package. It's really good stuff. Awesome. All right. So thanks so much for coming on and showing this to us. We'll have links to all the demos and the repo where you can get it. I highly recommend everybody start playing around with this. This is really cool stuff. All right, hope you guys enjoyed that, and we will see you next time on Visual Studio Toolbox.