Just to set expectations, this is a sort of 101-level talk, okay? We're not going to go super deep into some of the patterns, we're not going to go super deep into some of the config, but hopefully you walk away with a good understanding of resilience, a good understanding of reliability, and some tips and tricks that you can use with Linkerd and with Emissary Ingress as well. And also a good understanding of what each of these things actually brings to the table: what rate limits can accomplish, what retries accomplish, et cetera. Perfect.

I'm Daniel. This is Flynn. We have worked together for five-plus years, across different companies and communities, and we're very active in the open source community. We'd love to engage on Twitter; you can jump on our respective Slacks and engage with us there as well. Feel free to come up at the end of the talk, but don't worry, you can always reach us online too. I'm always happy to chat afterwards. It's also worth pointing out, for those of you who might know me from the past: that "at Buoyant" is not a typo. I am at Buoyant now, not at Ambassador any more.

Fantastic. So, resilience. I always like to think about why we're doing this stuff, right? I'm a technologist, Java programmer by heart, by background, but I always like to ask why we're thinking about a thing. Why should we care about reliability? Why should we care about resilience? The user wants that reliable end-to-end experience, but they don't really want to know about the details, right? They just want the web app to work; they want a good user experience.

You've got to handle north-south and east-west traffic when you're dealing with microservices. Back in my Java monolith days, north-south was traffic coming from the user to the monolith. Happy days, right? As we split up into microservices, I started to get Ruby, a bit of Go, all these other things, and suddenly I was implementing libraries to manage all the comms between the services. We've since pulled that out of the application: things like observability, security, and resilience can now be done via a service mesh, east-west. So I've got to think about both north-south and east-west, and as Flynn will demonstrate, the patterns do differ depending on whether you're at the edge or within the mesh. There are some little gotchas and takeaways we'll go through there.

And despite what we're hearing at the conference, DevOps is not dead, right? And self-service is required, is my pitch. As a developer (and I've been an operator too) I want to be able to do my work without unnecessary tickets, Jira, all that kind of stuff. I just want to do my work, basically. Probably also worth pointing out that north-south and east-west have different patterns, but they are all required. You don't get to just say, "oh, I've got resilience at the north-south level, I'll just ignore east-west." That does not work. You have to think about it all the way through.

Perfect. If you have not bumped into this book by Michael Nygard, Release It! from Pragmatic Programmers, it is amazing. Check it out. It goes through resilience patterns like our standard design patterns, right? The second edition of the book is out now. I learned so much from the first edition, like ten years ago, and it's been updated with microservices and cloud in mind, although it is a technology-agnostic book.
It goes through retries, timeouts, and rate limits, which we're going to cover today with Emissary Ingress and with Linkerd. It also goes through things like bulkheads; if you think in Kubernetes land, that's like your resource quotas, right? You don't want things overwhelming or consuming too many resources. Circuit breakers you can do in the mesh: in terms of, say, Envoy (I do a lot of work on Envoy), there is proxy-level circuit breaking. And if you're from the Java world, and I'm sure other worlds as well, there's application-level circuit breaking, things like Hystrix from the Netflix stack; Spring has a very similar pattern. We're not going to cover bulkheads and circuit breakers today, but like Flynn said, they are an important part of the toolbox. This is kind of a 101 talk, but do bear in mind there are other patterns out there.

Excellent. Quick intro to Emissary Ingress. It is an open-source CNCF project at incubation level. If you haven't heard of it, do check out the website and the GitHub repo, and pop along to the active Slack channel as well. It's all about getting your user traffic into your back-end services. It is an edge gateway, an API gateway, call it what you will, but the main thing is getting that user traffic into the back end. Why is it a good option? Flynn, you've been coding on it since inception; you were the lead engineer, right? Since 2017. So it's battle-tested, right? That's one of the things: there are many ingresses out there, but this one is battle-tested, used at massive scale at various different companies. It's built on Envoy, and we've wrapped that with the north-south use case in mind.

If you think about the primary role of an ingress, of an API gateway, it is literally mapping a request (we've got Jane in this example) to a relevant back-end service, right? We've also got Mark here, going through to the back-end service. From a security perspective, we might want to stop that; we might have some auth in place to stop that happening. Or the X might actually represent something breaking within the application, and that's where you need observability, rate limiting, resilience, and an app-development focus. We're mainly focusing on the middle two today, the rate limiting and the resilience, with the app-development mindset as well. But this is the role of an ingress: you're getting that user traffic from front to back.

In terms of configuration, if you've configured any kind of proxy or API gateway before, this should not be a big change for you. This is our custom resource, the Mapping: route to back-end service. You can also get fancy: you can do canarying with weights, say 10% of traffic. You can also add timeouts, which we're going to cover. And you can go even deeper with labels, which we'll use for rate limiting, as I'll explain a bit later. You also need to set up things like the Listener, listening on ports, and you'll be setting up the Host, probably with some TLS and so forth as well. And of course you want to set up the rate limit service, which we'll walk through later. All of this is just Kubernetes YAML, custom resources, right? Whatever your pipeline is, GitOps or a CI/CD pipeline, you can push your deployments and your services down it.
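To make that concrete, a minimal Mapping along those lines might look like this. This is a sketch only, assuming Emissary's getambassador.io/v3alpha1 API; the service names and values are made up for illustration:

```yaml
# A basic Mapping: route requests for /face/ to the face Service.
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: face
spec:
  hostname: "*"
  prefix: /face/
  service: face        # the back-end Kubernetes Service
  timeout_ms: 4000     # optional: give up if the back end takes longer than 4s
---
# A fancier canary Mapping: send 10% of the same traffic to face-v2.
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: face-canary
spec:
  hostname: "*"
  prefix: /face/
  service: face-v2
  weight: 10           # percentage of traffic this Mapping receives
```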
You can push this config down that pipeline as well, which is what I love about this, both from the Linkerd and from the Emissary perspective: it's Kubernetes-native, which is great. You don't even have to think about it, right? But there are different personas at work here. Jane is our DevOps persona. You can split the workloads, right? The operator sets the guardrails, sets the defaults, and then me as a developer, I go in and do my day-to-day work without needing to constantly bug them. So think about this: whether it's security, observability, or in our case resilience, the two personas work hand-in-hand, but there are often different ways of configuring the tools, which we'll show a bit of today as well. And with that, Flynn, over to you.

It's always interesting to hear someone else give that intro rather than me doing it. Nice job. So this is the Linkerd shock-and-awe slide: lots of logos and numbers and things like that. This is the real takeaway: Linkerd is a service mesh. It's all about mediating, monitoring, and adapting the east-west traffic from service to service. Linkerd's purpose in life is to give you what you need to create highly secure, reliable, observable, cloud-native applications, and to make those tools available free of charge. All of the logos back there thought that this was a good idea. All of the people starring the GitHub repo thought that was a good idea. I'd also like to point out that we are currently the only CNCF graduated service mesh, which means that we are considered as mature as Kubernetes itself is.

Emissary Ingress and Linkerd work out really nicely together. From Linkerd's perspective, Emissary is just another workload in the mesh. From Emissary's perspective, Linkerd is just another chunk of the Kubernetes networking layer. So you can literally just stick the two of them in the same cluster and they start working together, which is really lovely. I think... oh, yes, sorry, I should point out: the way Linkerd works is by injecting proxies next to each of the services, and then arranging things so that Kubernetes routes all of the service-to-service traffic through those proxies (a sketch of how that injection gets switched on follows in a moment). There's also a control plane that keeps track of what the proxies are supposed to be doing. And that's really kind of all there is to it. I think you were going to talk about the demo architecture?

So we'll now dive into some code. Flynn's going to be doing some live coding, so I hope the demo gods are with us. Just to set this up, this is the very simple architecture we've put together to provide a realistic kind of use case, right? We have our Faces GUI... actually, let me back up. Apologies. The smiley service is responsible for sending back smileys, okay? Deep in our call graph, the color service is responsible for sending back colors. The face service aggregates these together, so you get a green smiley face, right? When everything is working well, that's what we get, and in theory we should see lots of those in the browser when we do the demo. And then going back in again: imagine we're calling through our Faces GUI (it's a single-page web app), through Emissary, into the back end, and you'll notice there are lots of lovely Linkys, our lobster from Linkerd, acting as sidecars next to the services. That's the service mesh part. And with that, I think it's back to you.
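For the curious, getting those proxies injected is typically just an annotation. A minimal sketch, assuming the demo's workloads live in a namespace called faces (the namespace name is our guess here):

```yaml
# Auto-injection: annotate the namespace and Linkerd's proxy injector
# adds the sidecar proxy to every pod created in it.
apiVersion: v1
kind: Namespace
metadata:
  name: faces                  # hypothetical namespace for the demo workloads
  annotations:
    linkerd.io/inject: enabled
```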
This is what you should always see when we're running this demo: lots and lots of smiley faces on a green background. This is not what we're going to see at the start, let me just be clear about that. There are a lot of different things that you might see. The cursing face when the smiley service is broken. A gray background when the color service is broken. The exploding head if the face service itself has been completely overwhelmed. And this sort of thing for timeouts. One of the things that was deeply fascinating was that we have three services, and legit, even though I'm the one who wrote all three of these things and put the whole thing together, there were points where I had to use all of the tools, both on the Emissary side and the Linkerd side, to figure out what was going on in my own demo, even with just three applications. This stuff gets very, very complex very quickly. Which is what we're going to look at now.

Oh, yes, one other thing: the web app is actually willing to show the user old data if the services aren't responding. For our purposes in the demo, you'll see a little counter ticking up when it has received something that's a little confusing but has decided: it's okay, I can keep showing this data, it'll be all right.

All right, let's run the demo. I'm not actually going to show the installation of Emissary and Linkerd and all that. We have a cluster running; in this case it's a k3d cluster running on this laptop, but it could be anything, it doesn't really matter. We've got our single-page web app running here. We would like to see a grid of smiley faces on green backgrounds, and you'll notice that we are not, in fact, seeing that. Places where you see a cell just vanish are where something took way too long and the web app gave up. The red background is where we couldn't even talk to the face service.

So I guess let's go ahead and get started here. The first thing we can do is see if we can get rid of that bit with the frowny face on the red background. We're going to do that by telling Emissary to retry automatically if an error comes back from the face service. The lines in green there are adding a retry policy to the Mapping for the face service, so we're basically telling Emissary: if you get any 5xx response, retry it. Only retry once, but retry it. Let's go ahead and apply that one. We're telling Emissary to only retry once just because that makes it a little bit simpler to reason about what's going on. We don't necessarily expect that this will get rid of all of these errors, because we only get one retry; it's possible to get a failure, retry, and then immediately get another failure. But you can already see, from applying this, that we have a lot fewer of those, simply by telling Emissary to go ahead and do the retry. Also notice we haven't changed anything at all in the application. All we have changed is a bit of config in our API gateway.

So let's see if we can get rid of the cursing face from the smiley service. We'll do exactly the same thing: we'll tell Emissary to do a retry. Any guesses as to what should happen here? Anybody? All right, let's do that differently: raise your hand if you think this will work. Nobody. Okay, raise your hand if you think it won't work. Uh-oh, nobody's really committed at all. Let's find out.
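For reference, the retry policy we've been applying looks roughly like this. Again a sketch, assuming the v3alpha1 Mapping schema, with an illustrative Mapping name:

```yaml
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: face
spec:
  hostname: "*"
  prefix: /face/
  service: face
  retry_policy:
    retry_on: "5xx"    # retry when the back end returns any 5xx response
    num_retries: 1     # a single retry keeps the behavior easy to reason about
```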
In point of fact, it does not work. It has absolutely no effect. And the reason for this is that the call from the face service to the smiley service, the way the app was put together, doesn't go through Emissary at all. It happens past Emissary; Emissary has no ability to affect it. So instead, we will tell Linkerd to retry it. Things are a little bit different in terms of the configuration. On the Emissary side (sorry about that), we had to tell it: retry on a 5xx, and you get to do one retry. On the Linkerd side we just say: yeah, this is retryable. You can control this more; we don't need to in this case. That basically tells Linkerd that any of the 5xxs will be retried (we'll show a sketch of that config at the end of this section). Linkerd has a concept of a retry budget, where it will keep retrying until you go over the retry budget. By default I believe that's 20%: as long as not more than 20% of the traffic is coming from retries, Linkerd will keep retrying. You can tune that, but often you don't really need to. Now we're getting lots more smiley faces and no cursing. No cursing is good.

We can repeat the same thing and see if we can make the gray colors go away. The answer is: they go away from retrying things. You'll also notice that every so often we still see that red background because, again, Emissary is only doing one retry. We could go through and play with the number of retries; we could tell Emissary it's okay to do two retries, or five, whatever. Some failures will probably always be able to sneak through a little bit, and that's a thing to be aware of.

We still have a bunch of cases where things are taking too long and the cell is just fading off, so let's go through and do some timeouts here. We're going to do timeouts from the other direction: with retries we started at the top of the call graph and worked our way down; with timeouts, just to demonstrate that you can do this the other way, we'll start at the bottom and work our way back up. The other thing that's worth pointing out here is that adding timeouts is not really about protecting your services. Adding timeouts is about giving your client the ability to do something when things take too long, and making it easy for the client. It's giving your client some agency to decide what to do. In this particular case, if the face service can't talk to the color service in time, it'll start showing a pink background because it took too long. We also see that there are already fewer cells vanishing, because now there's less stuff taking too long. We can do exactly the same thing for the smiley service, where we'll see the sleeping face if things start taking too long, and even at this point we don't really see any more of those cells just vanishing and staying vanished. We can, of course, do the same thing at the Emissary layer, so we can just tell Emissary: hey, whatever's going on deep in the call graph there, if it takes you too long to hear back from the face service, just give up on that. And in this particular case, this is where we'll start seeing those counters appear: everywhere the GUI has decided, I'm going to go ahead and show some old data here. That counter is really just there for demos. Did you also notice the one up there in the upper... sorry, some of these where you'll see the sleeping face that's sort of grayed out: that's where the client itself has decided this is taking too long in general anyway.
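On the Linkerd side, the "this is retryable" bit, the retry budget, and the per-route timeout can all live in a ServiceProfile. A minimal sketch, with the namespace, route, and values made up for illustration (newer Linkerd releases also offer annotation- and HTTPRoute-based ways to do this):

```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  # ServiceProfiles are named after the FQDN of the Service they describe.
  name: smiley.faces.svc.cluster.local
  namespace: faces
spec:
  routes:
    - name: GET /
      condition:
        method: GET
        pathRegex: /
      isRetryable: true    # tell Linkerd this route is safe to retry
      timeout: 300ms       # per-route timeout (illustrative value)
  retryBudget:
    retryRatio: 0.2            # retries may add at most 20% extra load
    minRetriesPerSecond: 10    # floor so low-traffic services can still retry
    ttl: 10s
```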
But the point here is that the client, the web app, now has the ability to decide what to do when things take too long, rather than just stalling and giving the user something weird and confusing from the user's point of view. That leaves us with rate limits, and we will let Daniel talk about rate limiting. Yeah, we've got the code up on screen... not there. So, well, all right, I'll say one more thing here. Imagine, if you will, that one of your developers has deployed something where suddenly your service gets a lot slower somewhere down in the call graph. We've just done that: we've just modified this deployment, and you'll probably see a bunch of red pop up here as the deployment recycles, but now it will start exploding if things get hammered too much. Like that. Perfect. And now it's over to Daniel for showing the rate limiting. Yeah, go ahead.

Super. So the rate limiting looks a little bit different than perhaps you might expect. The timeouts and the retries are pretty self-explanatory, correct? But the way we've exposed the functionality for rate limiting is based on the original Lyft rate limiter. Lyft as in the company; we probably all jumped in a Lyft to get here, right? The way they exposed their rate limiter was very much based on labels. The idea being: you can apply a bunch of labels to a Mapping, that data then gets passed down to the rate limit service (the one we created, though there are out-of-the-box options too), and with that data being passed through, you can make rate-limiting decisions. I've got some quite basic examples here; I think I'll just use a generic key, right? Yeah. You can pass, by default, properties such as the remote IP, the source and destination... a bunch of useful information gets passed down, and you can say: hey, I want to rate limit based on this IP, or I want to rate limit based on these headers being injected, for example. Very customizable. I do advise you to check out the Emissary docs, because it is quite rich, and when I say rich, that can mean complicated sometimes, right? But it's not, once you take a pause and go through it; it's just not quite as obvious as what Flynn's already shown with the retries and the timeouts. (See, this is why I'm making Daniel talk about the rate limits.) Yes, thanks for that, appreciate it. Anytime.

The good thing is, look, I've got a sample rate limit app I wrote in some Go. There is an old Java one I wrote back in the day that's still out there on the interwebs as well. We'll show the link at the end; we've put it in the repo, so pop along. You can see I've actually used Honeycomb.io's leaky bucket algorithm. So it's a very simple Go service: there's a protobuf API you have to implement for the rate limit service, so I generated my stubs from that and used Honeycomb's leaky bucket. I think it was 8 RPS we settled on in the end, Flynn, wasn't it? Yes, and if you read carefully on the previous screen here, you would have noticed that the environment variable we set on the face service has it blow up at 8.5 requests per second, so setting the rate limit at 8 per second should give us some relief. Fingers crossed. So everything is highly customizable: you can look at the labels being passed in, you can look at the headers, and make those rate-limiting decisions. And I'm happy to chat at the booth later on, happy to walk you through some of the options there for the rate limiting.
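To give a feel for the shape of that config, here's a sketch. The label schema varies between Emissary versions and the names here are illustrative, so treat this as a rough outline and check the docs for your version:

```yaml
# Point Emissary at the external rate limit service
# (which speaks the Lyft-style gRPC rate limit protocol).
apiVersion: getambassador.io/v3alpha1
kind: RateLimitService
metadata:
  name: ratelimit
spec:
  service: "ratelimit:5000"       # illustrative Service name and port
  protocol_version: v3
---
# Attach labels to a Mapping; Emissary passes them to the rate limit
# service, which makes the allow/deny decision.
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: face
spec:
  hostname: "*"
  prefix: /face/
  service: face
  labels:
    ambassador:
      - face_request_group:
          - generic_key:
              value: "face-backend"   # the key the Go service counts against
```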
But without further ado, let's now give this a try, and we should see a lot fewer explosions. Now, by the nature of this, we have to go through and refresh all these cells, so they're not going to change immediately, but if we give it a few seconds we should see the exploding heads go away. It's always good when the demo works, right? It's always nice. Fantastic.

The last thing that we should point out here... oh, by the way, sorry, there are two last things we should point out. One is that if we turn off the counters, that gives a better example of what an end user using an SPA like this would see: okay, every so often they'll get a delay, but for the most part they're just going to see smiley faces on green backgrounds, which is what they're supposed to see. So we have here an application that's composed of, honestly, some pretty badly behaved microservices (these are terrible), but we've been able to use Emissary Ingress and Linkerd to mitigate some of the terribleness and overall give the user an experience that can actually be pretty good, without having to go mess with the code of the application. And we did this in, what, 10 minutes? Just going through and tweaking things. Obviously you're not actually going to fix this without fixing the application code, which we will not show you here (that's a little bit outside the scope of Emissary), but the point here is that even before you fix the application code, there's an enormous amount of stuff you can do to mitigate the effects of badly behaving code and give your users a better experience.

Particularly on that note: we talked earlier about the different personas. I've definitely worked in an ops persona where it was hard to get some of the developers to engage (and that's another problem that is DevOps's to work on), but at the same time I had a limited set of tools available to me as a platform engineer, as an operator, and it got me 80% of the way there. There were a couple of super old, really janky heritage apps, and I could protect them in a bubble somewhat. To Flynn's point, clients calling in could then interact with them, for the most part, as though they were reliable. Please do document these things, right, "here be dragons" kind of thing, and ideally fix them, to Flynn's point. But this super old service, no one went there; we just couldn't fix it, right? So we protected it in a bubble with retries, rate limits, and timeouts. With that set of tools in my toolbox, I then went and checked in with the developers: what can we do to actually make this service more reliable, right?

And if anybody's curious, these services down in the mesh are actually hard-coded to fail 20% of the time, which is just godawful in terms of a terrible, terrible application, and they're also hard-coded to take too long to respond something like half the time (a sketch of how that misbehavior gets wired in follows below). So: terrible metrics down there at the application level, but we're still able to deliver a decent user experience just by messing with things in the API gateway and the service mesh. Please do not take this as any endorsement of a 20% failure rate being acceptable or anything like that.
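The misbehavior is just configuration on the demo Deployments. A hypothetical sketch; the variable names, values, and image are illustrative, not the demo's actual ones:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: color
  namespace: faces
spec:
  replicas: 1
  selector:
    matchLabels:
      app: color
  template:
    metadata:
      labels:
        app: color
    spec:
      containers:
        - name: color
          image: example/faces-color:latest   # placeholder image
          env:
            - name: ERROR_FRACTION    # hypothetical: fail 20% of requests
              value: "20"
            - name: DELAY_FRACTION    # hypothetical: respond slowly half the time
              value: "50"
```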
But yeah, back to the slides for the final wrap-up. Do we have more on the slides? Yeah, just a final bit of conclusion. There we go, we do. Perfect. So, coming back full circle: hopefully you've taken away from this that users want reliable software; they don't necessarily care about the internal details. That's our job, right, as developers, as QA, as platform engineers. You do need to think about this end to end. Ideally you'd start with the services, but in reality you're probably going to be looking at the communications, like we've talked about. You want to fix things if they're broken, but this is a nice tool to have in your toolbox. And with cloud-native communications, north, south, east, west: think about the whole thing. It's very tempting, if you're really into your meshes, to try and solve everything in the mesh, or if you're really into API gateways, to solve everything there. But something like Emissary Ingress and something like Linkerd together, hopefully you've seen, give you a lot more options.

There are a few dragons. It's very easy to mess up some of the timeouts sometimes, with timeouts at the edge versus timeouts in the mesh, and I'm happy to talk about that afterwards if anybody wants to know. Some of the dragons we ran into while we were doing this... there were a lot of them. But for the most part, I really do like the combination of the two separate things, ingress gateway and mesh. It gives me the options; the combination is very powerful. Love it.

And the mix of retries, timeouts, and rate limits does genuinely go a long way. I've seen some folks get a bit too into circuit breakers too early, because they were kind of the hotness a few years back; everyone was loving Hystrix. At the application level that gives you more options again (here we're at the wire level, the service-comms level), but honestly: retries, timeouts, rate limits. I'm a big fan of keeping it simple.

And make sure your solution is developer-focused and self-service. That's why I'd build on the tools that are out there. I've seen some folks try to hack all this into libraries, and that can work, but I'm a big fan of: some other folks have already done this stuff. There's an open source project, there's a CNCF project; contribute, get involved. Contributions can be docs (we love docs contributions!) but code as well. So rather than doing your own thing, rather than pushing it super ops-focused or super dev-focused, think about that end-to-end experience: developer-focused, self-service. The CNCF tools are all about that; Kubernetes and cloud native are all about declarative, self-service config.

One thing we didn't put on this slide that occurs to me we probably should have: remember that the app we were demonstrating is very, very simple, and it shows some extremely complex behaviors as you start looking at how those different things interact with each other. We didn't talk much in this slide about the observability and debugging part, but there are a lot of tools to help with that as well, and it can be very, very helpful to think about this from the beginning of developing your app, rather than waiting until things are going wrong.

So yeah, final slide: references. The main demo you can find... oops, we'll need to update the top link; pay no attention to the top link. And then my rate limiting service is below as well, which I moved into the Ambassador Labs repo. Hopefully it's self-explanatory; we've tried to document it as well as we can, but we're always happy to take questions on our Slacks
or on the CNCF Slack as well. We also have the docs for Linkerd and the docs for Emissary Ingress. Do let us know how you get on with these things: we love feedback, and we love PRs on the docs as well if you spot something that's a bit incorrect or not super obvious; both projects very much appreciate docs PRs. There we go, now there's nothing to hide on the slide, and I think everyone's taken their pictures; I saw folks with a few cameras, which was great. We have five or so minutes for questions if anyone would like to ask anything. I think a roving mic might come around (thank you), or, you know, just wander up afterwards, but we're always happy to take questions now.

So, we saw in your demo how you came to your initial values for some of these things. In an incident situation, is it appropriate to proactively change some of those initial values? Sorry, which ones? For example, your timeout that you initially set to 250 milliseconds. Oh yeah. I am not going to lie: the timeout values that we showed in this demo have gone through probably a dozen different iterations while Daniel and I played with things. It took a while to find combinations that gave really good results, and that's another case where the observability part plays a huge role in this one. We'd go change something, then look at the demo and go: huh, that's funny, that's not quite what we expected to happen. Then we'd go poke around through Emissary's logs and the Linkerd dashboards and things like that, realize what was going on, and then iterate, and just keep going. So yeah, 100%, it's a very iterative process, all this stuff. This is like watching a tennis match. We genuinely are scanning the room. Any other questions at all? If not... oh, perfect, yeah.

In the retry policies, is there some way to do a backoff, so it doesn't retry immediately, or an exponential backoff? I'm going to have to go look that up, to be candid. Same as me. If I remember right, I believe there is, but I'm going to have to go look it up. We kind of deliberately didn't want to get too deep into that for this one, because we'd run out of time if we did. Excellent question, though, thank you. And for the folks that perhaps haven't bumped into this: if you've got microservices and you're doing lots of retries and things, it's very easy to get what's called a thundering herd.

Something we talked about showing, and ended up deciding we didn't have time for: if you look at the Linkerd Viz dashboard when the demo app is running in the first place (I wonder if I can show a bit of that; I need to go reset everything here), it will show you how many requests per second are happening to a given service. When we start this particular demo off, it hovers right around 8 per second: there are 16 cells, and each cell refreshes every 2 seconds. As soon as you turn on retries, that number goes to 9.6, because 20% of the requests immediately get retried because of the failure rate (the arithmetic is spelled out below). So it is really interesting to watch how some of these changes, retries and rate limits and timeouts and such, have differing effects on what happens to the services. Retries increase the amount of load on your service, not decrease it. But again, it gives your user a better experience, and so overall it is a good trade-off to make. Michael Nygard's book that we put up, Release It!, is fantastic from that perspective as well. It talks about thundering herds; it talks about resonance in the application.
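To spell out that arithmetic, using the demo's stated numbers: 16 cells each refreshing every 2 seconds is 16 / 2 = 8 requests per second. If 20% of those requests fail and each failure is retried once, retries add 8 × 0.2 = 1.6 RPS, for 8 + 1.6 = 9.6 RPS total: a 20% load increase from retries alone, which is exactly the jump the Viz dashboard shows.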
You have got to think about some of these things. I have definitely worked on systems where retries have actually taken down the system, because it got overloaded and you get this thundering herd. So that is a good question, good shout; do check out that book, it is fantastic for educating you about that.

Hello guys, thank you for the presentation. When you look at the presentation, you can clearly see the focus is towards the end user, but I am wondering, when it comes to service-to-service communication, have you tried that, and if so, what are the dos and the don'ts? Sorry, what was the question? When it comes to service-to-service communication, what should I do or not do? What should you do or not do, exactly. I will prefix Flynn's answer by saying I always recommend thinking about the end-to-end. Definitely when I was doing platform work, I was just like, hey, service-to-service is really important, but always think about that goal, the end user. I have seen folks do timeouts and stuff in the service mesh that were completely ignored at the ingress, because the ingress was too tight on the timeouts. Always think about end-to-end. But Flynn probably has a better answer in terms of service-to-service dos and don'ts. Actually, I think your answer was pretty good. Nobody runs Kubernetes just for the sake of running Kubernetes; everybody runs Kubernetes because there is something they want to accomplish for their end user, and so starting out by thinking about how things play out for the end user is the critical bit. I don't know that there are any hard and fast rules in terms of what to do for service-to-service. The way that we've been approaching it, as we were working through this, is all about thinking about how it's supposed to look to the end user watching this web app, and then, you know: okay, well, we know that this bit is slow, what can we do about this bit down here? Wait a minute, if we make this bit time out like this, we still have the problem that this bit behind it is too slow, so maybe we can start there and focus on that bit, right? Again, it's very iterative. It's a lot of watching and observing and tweaking, but pretty much always starting out from the perspective of thinking about the end user is, I think, the way to go. One final comment I had on that as well: if something looks too complicated, it probably is. I've definitely done that, where I put all these things around the edge to protect my services, and then I was like: you know what, I think I should just fix this service, right? Ultimately, again, no amount of stuff you can do in the ingress or the service mesh is going to 100% compensate for a service that's terrible, like these are. So ultimately you have to fix the service.

We've got 30 seconds left, so it's a super fast question. Really quick question, I guess we do. Thanks for the insights. Would it be helpful to have some sort of tools to run or simulate traffic, to be able to figure out the right mixture of those parameters you've been playing with? And maybe we could throw AI at it in the future, to adjust all those parameters automatically? Fantastic question, and I reckon there's a whole talk we could do on the answer to that. It's a great one: load testing, synthetic monitoring, definitely look at those kinds of things. Great question; we might unfortunately have to chat afterwards. There's so much room for tooling around all this. We should have mentioned that: we talked about experimentation, we talked about iteration, and having a firm benchmarking setup is well worth investing in. Awesome.
I think we'd better wrap up. We'll be around. Thank you very much for your time. Absolutely, come find us if you need more.