Hey, thanks so much for coming, everybody. This is great. This is a bigger room than I expected, at least personally. We are going to talk about building resilient apps the easy way today. I don't know if anyone actually read the abstract for this talk, but we're trying to do a lot in 35 minutes, so hopefully we'll show you at least one or two of the things we promised. Yep, and our clicker works. Good to go.

All right, I'll go ahead and do a quick introduction of myself. My name is Kendall Rodin. I am a senior product manager at Microsoft, and I work on a platform called Azure Container Apps, if you've heard of it. Happy to be here.

Cool, thanks, Kendall. My name is Alice Gibbons. I'm a customer success engineer at Diagrid, and if you haven't heard of us yet, maybe you should look us up. I'm just kidding. We do a lot of cool things with Dapr, which is one of the things we're going to be talking about today.

OK, so I'm going to kick it off here, talking about some developer challenges. We're all here at KubeCon talking about distributed systems and the challenges we face as developers and in IT today, and every single person in this room probably has a different idea of what that means; every dev team faces its own challenges. Some common ones: how do I encrypt the traffic between my services? How do I use distributed tracing to get traces not only between my applications, but all the way through to my infrastructure? How do I do consistent retries in my apps, so that when something fails or there's network latency, the request still makes it all the way through to the back-end call instead of just stopping? These are just a few of the problems that motivated the Dapr project.

So what is Dapr? A lot of you might know already, but here's a quick introduction. Dapr stands for Distributed Application Runtime. It's a set of distributed systems APIs that codifies microservices best practices. It can be used today with any application code and across any platform. It's a set of nine building block APIs, each of them abstracting away infrastructure and providing a way for developers to increase their productivity. Today we're going to focus on a subset of the Dapr APIs: publish and subscribe, distributed lock, and service-to-service invocation. Dapr can run on any platform you'd like; today we'll be using an implementation running on Kubernetes because, hey, it's KubeCon, and that's why we're all here.

And how does Dapr do this? It uses an idea called Dapr components. These are YAML manifests that hold the connection information for your infrastructure, so it doesn't have to live in your application code. Instead of importing an SDK for something like accessing a state store, you can just call the Dapr API and access the data that way. There are over 100 Dapr components today, all open source and online, so we'd definitely encourage you to check them out. We also have the opportunity to create our own components with Dapr. These are just some of the contributions that have come from the huge open source community.
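(For reference, a component manifest is just a small piece of YAML. Here's a minimal sketch of what a Kafka pub/sub component might look like; the component name, broker address, and consumer group are illustrative placeholders, not the exact values from our demo.)

```yaml
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: order-queue           # the name applications use to reference this component
spec:
  type: pubsub.kafka          # swap the type to change brokers without touching app code
  version: v1
  metadata:
    - name: brokers
      value: "my-kafka.kafka.svc.cluster.local:9092"   # assumed broker address
    - name: consumerGroup
      value: "orders-group"
    - name: authType
      value: "none"
```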
But there are a ton more out there across all the major clouds and on-prem. So I'm going to hand it over to Kendall to talk specifically about service-to-service invocation and how it works in our end-to-end solution.

All right, awesome. I'm going to do a quick overview of the architecture we're going to walk through today, just so you have a little bit of context on what we're trying to accomplish and how Dapr is going to make it easier for us to get this solution up and running on Kubernetes. Ooh, I'm going to get the clicker right, I promise. Here we go. All right, it all starts with a front-end application. This front end is going to contact a back-end API directly via a service call and pass an order payload. Pretty simple. The order API receives that payload from the front end and publishes it to a topic; in this case, we're using Kafka as the message broker, but this could be swapped out for any number of backing components that are available and supported via Dapr. Once the message arrives on the topic, it's received by a subscriber called the archive service, whose objective is to take new orders as they come in and store them in an Azure storage account. We also have a job running in the background on a schedule: every 15 seconds, this job picks up the new orders that have come in, does some processing on them, and stores the output in a state store. In this case, we're updating loyalty points for people who are active users of our product.

Now I want to dive specifically into the service invocation aspect of this architecture. The big thing here is that we're using Dapr to get a lot of benefits when it comes to service invocation. So let's take a look at how Dapr actually interacts with and helps coordinate the service-to-service calls within my Kubernetes environment. Within the front-end application, instead of calling the back-end service directly, we call a Dapr sidecar. On Kubernetes, Dapr runs as a sidecar in the pod; however, you can deploy Dapr without containers, and frankly without Kubernetes, so we want to make it very clear that this is just one way Dapr can enable you via your Kubernetes deployments. Now, we talked about the Dapr APIs: they're available in both HTTP and gRPC, so you can choose whatever makes sense for you. Once you hit the Dapr sidecar that's running local to the front-end pod, it communicates with the sidecar for the back-end service. The sidecars always communicate over gRPC, secure by default, and you can also get mTLS here. What's also exciting is that in addition to mTLS, we can use SPIFFE identities, which allow us to set access control policies. For example, we could restrict the front end from POSTing to the back end and only allow GET operations. This is just one of many examples. In addition, we can enable distributed tracing not just among our services, but all the way to the backing infrastructure for our Dapr components. Sound good? Make sense? Awesome. I'm loving Dapr; hope you guys are too.
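(To make that concrete: instead of calling the back-end service's own address, the front end calls its local sidecar's invoke endpoint. A minimal sketch in Python, assuming the sidecar's default HTTP port of 3500; the app ID and method name are illustrative, not our demo's exact values.)

```python
import requests

DAPR_HTTP_PORT = 3500  # default HTTP port of the local Dapr sidecar

order = {"orderId": 42, "customerId": 9}

# POST /v1.0/invoke/<app-id>/method/<method-name>: the sidecar resolves the
# target app, applies mTLS and any access policies, and forwards the call.
resp = requests.post(
    f"http://localhost:{DAPR_HTTP_PORT}/v1.0/invoke/order-api/method/orders",
    json=order,
    timeout=5,
)
resp.raise_for_status()
print(resp.status_code, resp.text)
```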
So moving on: service invocation is great, and a lot of customers do use direct service-to-service calls somewhere in their architecture. But obviously, when we're building distributed systems, there's typically a need for some kind of communication mechanism that gives our services a little more independence. We want to decouple them via some kind of message broker or eventing system, for example if we're using a competing consumers pattern. That's where Dapr's pub/sub API really comes into play. Once that order API has been invoked from the front end, it publishes a message via Dapr to whatever the backing message broker is. Dapr provides us with a lot of cool capabilities here, including at-least-once delivery, which you definitely get out of the box. And what's nice is that on the receiving side, the Dapr sidecar for the subscribing service handles the subscription on our behalf and makes sure the message gets delivered to whatever method on that back-end service we want to invoke. One thing to call out that I didn't mention on the previous slide: by default, Dapr uses the CloudEvents message format, but you can disable that, for example if one Dapr app is talking to a legacy app that hasn't been Dapr-enabled. And you also get distributed tracing across this experience, not just on the service invocation calls.
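(Publishing through Dapr looks much like invoking a service: one call to the local sidecar, with the broker details living in the component YAML. A hedged sketch; the pub/sub component name and topic are made up for illustration.)

```python
import requests

order = {"orderId": 42, "customerId": 9}

# POST /v1.0/publish/<pubsub-name>/<topic>: the sidecar wraps the payload in
# a CloudEvents envelope (unless disabled) and publishes it to the broker.
resp = requests.post(
    "http://localhost:3500/v1.0/publish/order-queue/orders",
    json=order,
    timeout=5,
)
resp.raise_for_status()  # delivery to subscribers is at-least-once from here
```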
All right, has anybody heard this phrase at any point in their career in cloud computing? Anybody? Failure is inevitable? Okay. I remember, at the beginning of my career at Microsoft six years ago, somebody gave a presentation where they said, "Azure never fails." This was funny because I work at Microsoft. It was a bit of a sales pitch, and the advice this person got was: that's the absolute opposite of what you want to say. You want to say it is inevitable that any cloud provider could fail at some point. It is inevitable that some call, at some point, will fail. A pod can go down, a node can go down, network latency can get introduced, you can have transient failures in your network calls. So you basically have to build your distributed applications knowing that faults are inevitable, knowing that failure is inevitable.

So it's great that Dapr provides us this service invocation capability and these pub/sub APIs, but what happens when we run into one of these inevitable failures? We have to have a way to recover from that, and Dapr now has a capability that enables exactly that, called Dapr resiliency. So let's talk a little bit about how this works for service invocation, and then, what's great is, I'm going to stop talking for a bit and we're going to dive straight into a demo and get rid of these slides. Sound good? Are you with me? Okay, just checking. I'm a validation girl; I need to know you're still here.

All right, so we've got a front-end application. Let's say this front end is not Dapr-enabled; in this case, it's just a regular deployment running in Kubernetes. It's typically going to communicate with some kind of back-end API, in our case the customer order service API. Because I'm not using any kind of Dapr here, I'm just calling the back-end service directly. Nothing crazy happening here, no mTLS, nothing like that.

So let's say this call fails, for any of the reasons I gave earlier. That request is likely going to fail and return an error to the front-end caller, which is, one, not a great experience for end users. And because the failure is likely transient, it would be a lot better if we had some kind of retry capability. These are not new concepts, but it is nice that you can get them without instrumenting your application code, and in a consistent way across languages. That's what Dapr provides.

So let's switch the scenario and have a Dapr sidecar running alongside our order API; we've essentially Dapr-ized it, as we like to say. Our front-end caller will also now be Dapr-enabled. When I make the call this time, instead of posting directly to the back-end API's service, I call the service invocation endpoint via Dapr. You can see here I'm calling the v1.0 invoke API, passing in the name of the Dapr app ID I'd like to target, which in this case is the order API, and specifying the method I'd like to invoke on that app. You can see I also have this small YAML file called resiliency. What's nice is I can apply a resiliency spec, or policy, that's loaded at runtime when the Dapr sidecar starts up. That resiliency spec says, hey, for a particular behavior, say a transient failure, I want to retry a certain number of times. What's nice is you can also set a timeout policy in the resiliency spec: in this case, I'm saying that after 500 milliseconds I want Dapr to return an error, which then kicks off the retry policies I've specified. And in addition, I can have a circuit breaker policy that trips once a certain number of consecutive requests have failed. Ultimately, the goal is that the call is retried until it eventually succeeds, and that's what's returned to the front end.

Y'all ready to see this in action? Okay, okay, cool. Before we do that, the last thing I want to call out is that in addition to resiliency policies for service invocation calls, I can set resiliency policies for pretty much all of the Dapr components. If you think about that, it means I can have pub/sub retries, subscription retries, even retries for actors, for retrieving secrets from a secret store, or for retrieving state from a state store. That's just something I wanted to highlight.

All right, we did not record our demos as backup, so everybody clap, because we're doing this live. Yes, we love that. Okay, here we go. All right, can we still see? Yeah, it looks good. So here's where I'm going to start. In order to inject the network latency I need, I'm going to use something called Azure Chaos Studio, which is essentially a wrapper around Chaos Mesh, the open source project. This is really critical, because we obviously want to be able to test our applications and their resiliency policies against the types of failures that can actually happen. In our case specifically, we're introducing a delay, and this delay will kick off the timeout policy that we've set. So I'm going to go ahead and start this.
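(If you were driving Chaos Mesh directly rather than through Chaos Studio, a delay fault like this one is a small custom resource. A sketch under assumed names; the namespace, labels, and latency value are illustrative, not the demo's exact experiment.)

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: order-api-delay
spec:
  action: delay                 # inject latency rather than packet loss
  mode: all                     # apply to every pod the selector matches
  selector:
    namespaces:
      - dapr-resiliency
    labelSelectors:
      app: order-api
  delay:
    latency: "2s"               # comfortably above a 500ms Dapr timeout policy
```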
I should have started it before I did that spiel; that would have been good, but you know, we're okay. So let's see if we can get this started. All right, it should be up and running here soon. What I'm going to do now is switch over to my terminal and close a few things out. What we can see is that we have several pods already deployed to our AKS cluster; obviously, this could be a cluster of your choosing. But what I really want to call out is that we're going to start with the non-Dapr-enabled scenario, so that's going to be my front-end pod, and I can easily see the front-end logs here. I'm going to use Ddosify, which is basically a command line load-testing tool, to hit my front end with 100 GET requests, which will generate 100 orders that are sent to the order back-end service. So a quick Ddosify run, thanks for your patience here, against the IP address of the front end that is not Dapr-ized.

Okay, so what we should see is that we're sending some orders in, and we're seeing the orders being received. What we should ultimately see is a bunch of 500s, right? What essentially happened here is that our network latency was injected, and our front end is now receiving 500 errors because it never got the successful response it was waiting for. So we can see that's not an ideal scenario. I was also going to show you some live metrics with the 500s, just for visualization purposes, but oh, that may not have actually been running. That's okay; I think you get the point, the logs are good.

So now what I'm going to do instead is flip the scenario over and focus on the Dapr-enabled pods. You can see here I have a front-end application called frontend-dapr. This is the exact same application code; nothing has been changed, I want to make that very clear. I have not instrumented the code with Polly; I'm not using any kind of resiliency libraries or retry policies within my application code. I'm just setting a resiliency policy via Dapr. If you want to see what this looks like, I'm going to zoom in just to make sure everybody can see. You can see here I have a resiliency spec, and I'm applying it to the namespace where my application workloads are running. I want to highlight a few different sections. The first is the policies section: you can see I can set policies around timeouts, around circuit breakers, and around retries in general. The timeout I've set is 500 milliseconds; this is what triggers Dapr to tell my front-end application, hey, the timeout's been hit, and return an error. Once that timeout has been hit, the resiliency policy applies one of the retry policies I've specified. In this case, the retryForever policy basically says I'm going to retry the call forever, pretty clear. The retryMax policy says I'll retry a maximum of 100 times, and once that's been met, I stop retrying. We can also set a circuit breaker: in this case, once 20 consecutive requests have failed, we open the circuit breaker, then wait 60 seconds, and once the circuit breaker becomes half-open, we let a maximum of one request come through. What's neat is we can then specify targets, and targets and scopes are really used together.
So in this case, our frontend-dapr app, which you can see here, hopefully, will use a specific target when talking to the customer order service: whenever it targets the customer order service, it will use that general timeout policy and the retryForever retry policy. In terms of components, we also set up a retry policy for publishing to the Kafka topic. When my customer order service talks to the order-queue target, which is our Kafka pod, it's going in the outbound direction and will use the following policies: the general timeout policy; retryMax, which basically says it will only retry 100 times; and a circuit breaker, which opens once 20 consecutive failures have occurred. All right, make sense? All good?

All right, so our chaos test should still be running, which it is, so I'm going to switch back over. The resiliency policy is already running in our cluster, which is amazing. I'm going to use the Ddosify tool again, and this time target the front-end IP for the Dapr-enabled service. Before I do that, I do want to pull up the logs so we can see what's happening in real time. So on your left, you're going to see the logs for the front-end application pod, and on the right side, you're going to see the daprd log, so that's the Dapr sidecar. We're going to go ahead and turn wrap on. Okay, so here we go. What we should see, which you can hopefully see, is that on the left side the orders are getting sent, but we don't see any 500s, right? Because they're never getting returned to the front end. Instead, what's happening is that the Dapr sidecar is working through that retryForever policy, so the front-end user never sees the failure. In an ideal world, if we set up the jitter and all that stuff, we'd eventually see the call succeed; in this case it can't, because we're applying this network chaos to every single call.

The last thing I want to show before I hand it over to Alice is that we also set that retry policy on the publisher as well. Sorry, got a lot going on here; let me bring this down so maybe you can see a little better. If we take a look at the customer order service, if you remember, it had a target for that Kafka pod. So when we're publishing orders and the timeout hits, it should kick in that retry policy and eventually the circuit breaker. So we're going to take a look at the customer order service, and we can see here that we're attempting POSTs and getting back 500s. I'm going to stop this from auto-scrolling so we can see: the application code is getting a 500 saying we were not able to successfully publish this order. And when we look at the order service's daprd logs, we should see that the Dapr API is being called and that it's going through the circuit breaker process at this point. The circuit breaker has now been opened, and it's not letting any of those calls through.

All right, awesome. Well, for the sake of time, I'm going to pass it over to Alice. Alice, had you already switched this over? So while she's doing that, I want to talk a little bit about what Alice is going to cover in the next segment. We've talked about how we can make our applications more resilient without changing our application code.
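(For reference, the full resiliency spec we just walked through would look roughly like this. It's a sketch reconstructed from the description above; app and component names like order-api and order-queue, and the scoped app IDs, are stand-ins for the demo's real ones.)

```yaml
apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: demo-resiliency
scopes:
  - frontend-dapr             # only these Dapr app IDs load this policy
  - order-api
spec:
  policies:
    timeouts:
      general: 500ms          # Dapr returns an error after 500 milliseconds
    retries:
      retryForever:
        policy: constant
        duration: 5s
        maxRetries: -1        # -1 means retry indefinitely
      retryMax:
        policy: constant
        duration: 5s
        maxRetries: 100       # give up after 100 attempts
    circuitBreakers:
      simpleCB:
        trip: consecutiveFailures >= 20   # open after 20 consecutive failures
        timeout: 60s                      # stay open for 60 seconds
        maxRequests: 1                    # allow one probe while half-open
  targets:
    apps:
      order-api:              # service invocation calls to this app ID
        timeout: general
        retry: retryForever
    components:
      order-queue:            # the Kafka pub/sub component
        outbound:
          timeout: general
          retry: retryMax
          circuitBreaker: simpleCB
```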
Alice is going to go into another important and critical aspect of building reliable applications, enabled through the distributed lock API that was made available in Dapr 1.8. Alice, take it away.

Is this on? This is on? Do you want me to switch to the podium? I'm okay. Okay, that's good. All right, cool. As Kendall mentioned, we're showing a lot here, so I want to make sure everyone's still with us. We're still trying to do a lot for resilient applications, and this is just one more step in that. Up here on the screen, you can see our architecture, and right now we've made it through the back end: we've done some distributed calls, and they are all resilient, as Kendall showed. What we have now is a loyalty job, and the job of this service is essentially to pick up all of the orders that have been processed and add the loyalty points for each customer. This is important, right? We put our customers first, so we want those loyalty points to be right. And we have decided to use the distributed lock API in Dapr to ensure strong consistency in the database. We have multiple replicas of the loyalty job running, and if there were no lock on this data, you might have overwrites: times when one of the instances clobbers another instance's write to the database. So that's what we're about to show today, and I'm going to walk you through an example.

So what exactly does the distributed lock API give us? The purpose of a lock is to ensure that, among several different application instances, only one instance at a time is accessing that data and updating it, or doing a put on whatever the data might be. In this case, again, it's the loyalty job updating those loyalty points. There are two main reasons you'd probably use a lock: one is efficiency, and the other is correctness. We're focusing on the correctness scenario today, because we want that strong consistency within our database. This is a very powerful building block that allows you to build a ton of different features on top of the Dapr APIs, things like leader election scenarios or whatever else your developers might come up with.

On the screen now, you can see we have those two replicas of the loyalty job. The first replica to request the lock acquires the lock on the state, and the second one, when it tries to come in and update that order, won't actually be able to. We are running, again, two replicas of this app, and you can see it's using the Redis implementation of the Dapr lock API. In the next step, once the work is complete, the same replica releases the lock so other replicas of the loyalty job can come in and do the work. I should also mention that Dapr's lock is lease-based, so there's a time-based expiry: after 60 seconds, a replica's lock is released so other replicas can take over. This also helps in failure scenarios, because if you run into an exception in the replica that holds the lock, the lock is still released after 60 seconds.
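(Conceptually, the loyalty job's lock-then-update flow looks something like this minimal sketch against Dapr's alpha HTTP lock API. The lock store name and the update function are illustrative assumptions, not the demo's actual code.)

```python
import uuid
import requests

DAPR = "http://localhost:3500"
LOCK_STORE = "lockstore"      # name of a lock.redis component (assumed)
OWNER = str(uuid.uuid4())     # unique per replica; identifies the lock holder

def update_loyalty_points(order_id: str) -> None:
    # Stand-in for the real work: read points, add to them, write back.
    print(f"updating loyalty points for order {order_id}")

def process_order(order_id: str) -> bool:
    # Try to acquire a lease-based lock on this order. The lease expires
    # after 60 seconds even if this replica crashes while holding it.
    resp = requests.post(
        f"{DAPR}/v1.0-alpha1/lock/{LOCK_STORE}",
        json={"resourceId": order_id, "lockOwner": OWNER, "expiryInSeconds": 60},
        timeout=5,
    )
    if not resp.json().get("success"):
        return False          # another replica holds the lock; skip this order
    try:
        update_loyalty_points(order_id)
    finally:
        # Release early so other replicas don't have to wait out the lease.
        requests.post(
            f"{DAPR}/v1.0-alpha1/unlock/{LOCK_STORE}",
            json={"resourceId": order_id, "lockOwner": OWNER},
            timeout=5,
        )
    return True
```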
And one more thing I should call out about the distributed lock: if you have two different applications that both access the same data, Dapr will allow them to access it concurrently, so both applications can take a lock on that data at the same time, maybe because they're updating different pieces of it, or because your system allows both to update that data at once.

Okay, so let's see this in action. And as Kendall mentioned, we are doing these live, so another round of applause for the live demos. Is it going to run? Yeah. All right, so I'm switching over to k9s here as well, and this screen looks a lot like Kendall's did. This is the same cluster, surprise, surprise, and the same applications. On the left here, we're going to make this slightly bigger, you can see I have multiple instances of this customer loyalty job running, and this is that loyalty job I was talking about, right? The one updating those loyalty points that we want to be consistently right for our customers. So we have the customer loyalty job, two instances of that, and then we have the customer loyalty job with no lock. As you can imagine, the one with no lock does not use the Dapr distributed lock API, and it is exactly the same code apart from that fact. So again, very similar to the previous demo: the same code deployed multiple times, and we're going to do a load test.

So let's see what that looks like. I'm going to run exactly the same load test Kendall did, using that front-end IP for our Dapr-ized app, and this is going to send about 100 orders over 10 seconds, across 10 different customer IDs. So let's run this guy and see what happens. All right, successful. First, let's take a look at the scenario with no lock, okay? I'm going to come into the no-lock pod here, which is going to be attempting to process some of these orders, and then into the other instance on the side here. Lots of orders, great, great customers. Okay, so this is my no-lock code, and in a lot of cases both instances are actually processing these orders simultaneously, right? There's no mechanism telling the app, hey, don't touch the same data at the same time. So a lot of these orders will be exactly the same in the left pod and the right pod, and you can see from the GUIDs that some of the timestamps actually match up. We're updating loyalty points for customer ID nine on the left, and we're updating nine on the right here. So again, this is the scenario where the lock is not being used, and we have multiple instances updating the loyalty points and then saving them to the database.

Okay, cool. Now let's check out the version that has a lock; that's the more interesting one, right? So I'm going to come up here, come in there. As I mentioned earlier, this loyalty job checks every 15 seconds to make sure there are no orders left to be processed; in this case it's looking within Azure Blob Storage and checking, hey, are there orders for me to process, what's going on? So let's run another load test and see what happens. Okay, we see some orders, right? There you go, things are happening.
Okay, so what we have on the left-hand side here is, again, the multiple replicas of the same application: replica one on the left, replica two on the right. And you're going to see some lock failures, and this is good, this is what we want to see. Here, one replica is attempting to process a given order ID and successfully locks it, and you can see on the right-hand side that the other replica failed to lock that same order. For each one of these, the order on the left-hand side gets processed and the order on the right does not, or vice versa. So one pod updates those loyalty points and the other one doesn't. This essentially ensures that each one of these customer orders gets updated correctly and the replicas don't overwrite or clobber each other in the database at the end of the day.

Cool, so let's check out what that looks like in the Redis database. If you remember from my architecture, these were writing to that Redis DB, which today is just running on my AKS cluster. So let's check out those orders, starting with the no-lock one first, okay? Because I want to see what it looks like in the case where I did not use a lock in my code. If you remember, I sent two different load tests, each with 10 orders in them, so there should be around 20 orders, and on the right-hand side here we have this order count. In the no-lock case, I'm getting different numbers every single time: sometimes up to 35, sometimes 37, here 33, all very different, because the no-lock app is writing over and over to this database.

But if I come in here and look at the case with the lock... oh really? We're tearing up the orders, yeah. Okay. They didn't come through. Well, we have 19; 19 came through, Kendall. Okay, and now we have 20 orders that came through for the customer loyalty service with the lock. We had a bit of network latency, Kendall said, but otherwise there would be 20 orders every single time. There you go, it's coming through now actually, Kendall. Yeah, that's the chaos kicking in when I try to delete the keys for my demo. Okay, I think it's still coming through. So we have a few more orders coming through, and at the end you're going to have 20 orders for every single instance in this Redis DB. So again, this is two copies of exactly the same application, one using the lock and one not, both updating the loyalty points, and with the lock you do get the correct order count at the end of the day. Awesome. All right, well, we really appreciate all of you coming today.
I think we covered a lot, but at the end of the day, Alice and I really want all of you to walk away with one central message: we need to find easier ways to enable developers to get their applications up and running quickly. Having to focus on plumbing code and solving these distributed application problems on a developer-by-developer basis is never fun. Finding a way to use Dapr to give our developers this core functionality without losing their focus on business logic is really the key takeaway here. So we would love for you all to dive into the Dapr project, contribute, give feedback, and hopefully you'll be empowered to develop your applications with ease using Dapr. Awesome, and then yeah, if you have any questions, comments, or concerns, please come ask us. Anything's open, even why we decided to do live demos, things like that. And yeah, check out Dapr. Thank you so much.