Hello everyone, and welcome to the Dapr maintainers track session. My name is Yaron Schneider; I'm a co-founder and CTO at Diagrid. And my name is Hal Spang; I'm a senior engineer at Microsoft.

Okay, I'm going to be sharing the screen, so Hal, please let me know when you can see it. (Looks good.) So today we're going to talk about Dapr and how it can help make your application's infrastructure services more resilient and fault tolerant. But before we dive deep into that, let's talk about what Dapr is. Dapr is essentially a set of APIs for application developers to help them write their applications faster and more reliably. Instead of building things like state management, pub/sub, event-driven architectures, triggering their code based on events coming in from different systems, and fetching secrets or configuration, developers get these building blocks as APIs to consume, so that they're free to focus on their code, their company's IP, and their business logic.

Dapr runs on any infrastructure. It runs particularly well on Kubernetes, which is the recommended production platform for it, but it will also run on plain virtual machines. Dapr has a sidecar architecture: you have your application, and next to it the Dapr sidecar, and the application calls Dapr over HTTP or gRPC. So literally any programming language that understands HTTP or gRPC can be used to talk to Dapr.

Here are some examples of what those calls look like. In the first, the application calls localhost and tells Dapr: "hey Dapr, please invoke the cart application for me on the method neworder." The next is an example of state: the application is telling Dapr, "please fetch the state for item 67 from a key/value store I've configured you to work with." Then there's Dapr pub/sub, publish and subscribe, and finally an example of how an application might fetch a secret through Dapr from a component called keyvault.

So what are these components for state, publish, and secrets? Dapr at its very heart has the concept of components, which are the actual implementations behind those APIs. For example, when a developer talks to the Dapr state API, an operator or developer can configure Dapr, based on the environment it runs in, to talk to different databases. On AWS they might configure Dapr to use DynamoDB; on Azure it might be Cosmos DB; on-prem or locally it might be Redis or Cassandra; on Google Cloud, Firestore. The same goes for every one of the Dapr APIs. Whether you're using bindings to trigger an application, pub/sub to connect systems asynchronously, state management to save state, configuration management to fetch configuration items, or any of the other APIs, components are really at the heart of Dapr, and they run both locally and on Kubernetes.
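To make components concrete, here is a hedged sketch of a state store component that points the state API at a local Redis; the name "statestore" and the connection details are placeholders, not anything from the talk:

```yaml
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: statestore          # the store name your app references in the state API
spec:
  type: state.redis         # swap for state.azure.cosmosdb, state.aws.dynamodb, etc.
  version: v1
  metadata:
  - name: redisHost
    value: localhost:6379
  - name: redisPassword
    value: ""
```

Swapping the backing database means swapping this file; the application keeps calling the same state API and never changes.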
On Kubernetes, Dapr has a pretty simple architecture: a control plane that is used to configure the data plane. The control plane runs four main pods. The sidecar injector injects the Dapr sidecar into your application once you've annotated your deployment YAML for it. The Sentry service issues certificates for your applications that contain a SPIFFE-compliant identity, so you can use that identity to apply authorization policies in Dapr, such that one app can only call the apps it has been configured to call. The operator runs in the cluster and listens for new components; for example, if you apply a component of a state store type or a pub/sub type, the operator detects that and updates the sidecars with the latest metadata. And the actor placement service is used for a very specialized Dapr building block called actors, which distributes very small units of compute and state throughout the cluster, millions of them, which Dapr can fail over and rehydrate with their state. If you're not using actors, you don't actually need to run that pod. The Dapr runtime makes sure it listens to all of these components, and your application can just talk to it using regular HTTP or gRPC.

Let's talk about resiliency. So far we've covered how Dapr gives you APIs to talk to state stores, publish and subscribe to messages, or retrieve secrets. But at the heart of all of these operations is something your application depends on; it's not alone in the world, it's not isolated. Your application might make a direct method call to another service via service invocation. If you're fetching application configuration, there has to be a backend storing and serving those configuration items for your Dapr application. Same with state management: when you get or save state, most of the time the call isn't served from an in-memory store; you're probably talking to something like DynamoDB, Firestore, Redis, or Cassandra. And input bindings are events that trigger your system, like the Twilio API or a Twitter API stream. Each of these dependencies might become unreliable at some point, and that's where Dapr introduces resiliency, available since the 1.7 release.

What's special about Dapr's resiliency policies is that you define policies that make your infrastructure more reliable using circuit breakers, retries, and timeouts, which Hal is going to dive into in a second. Unlike service meshes, which give you these capabilities only for service-to-service calls, Dapr lets you apply these policies holistically and globally throughout your entire application: they cover service-to-service calls, but also your application's other external dependencies, like databases, caches, and secret stores. And with that, I'm going to hand over to Hal to continue diving into Dapr resiliency.

All right, I will share my screen; let me know when you can see it. (Yep.) All right, thank you. As a Microsoft employee, I'm more of a PowerPoint person. So here's resiliency as a whole. Inside of Dapr it's a YAML file if you're running standalone, or a CRD, which is also YAML, but either way it's a piece of configuration that allows you to set up several different policies across a series of targets in your application.
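As a hedged sketch, the overall shape of that configuration looks roughly like this (the policy and target names here are hypothetical; the real fields are walked through below):

```yaml
apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: myresiliency
spec:
  policies:            # named, reusable policies: timeouts, retries, circuit breakers
    timeouts:
      general: 5s
  targets:             # where those policies get applied: apps, components, actors
    apps:
      appB:
        timeout: general
```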
Compared to doing this with an SDK, or any other code where you're manually wrapping each of your individual calls, Dapr lets you define a whole policy and apply it across a bunch of different targets at once. Instead of worrying about your individual calls and where each one is set up, you have one big, easy way to define a global resiliency policy, kind of a one-stop shop for all of your retries, timeouts, and circuit breakers.

I'll go to the CRD first. The YAML file is what you run standalone; the CRD is the Kubernetes concept, and as Yaron mentioned, Dapr does run on Kubernetes. In that case we also allow multiple CRDs to be defined, and they all get merged together. You can see here we have two very short policies, not complete by any means, where one specifies a timeout and one specifies a retry, and they get merged into one big policy. This also lets you define multiple CRDs across different teams and organizations: a bunch of different people can provide the resiliency they care about, the pieces that work with their app, without needing to merge it all together at once. You can work in your own independent manner while still cohabiting in the same Kubernetes cluster.
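A hedged reconstruction of that slide's idea: two small CRDs, one owning a timeout and one owning a retry, that Dapr merges at runtime (all names here are made up):

```yaml
# Team A contributes only a timeout policy
apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: team-a-resiliency
spec:
  policies:
    timeouts:
      general: 5s
---
# Team B contributes only a retry policy; Dapr merges both into one policy set
apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: team-b-resiliency
spec:
  policies:
    retries:
      pubsubRetry:
        policy: constant
        duration: 5s
        maxRetries: 10
```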
So what does every part of the resiliency structure mean? There are three main policies: timeouts, retries, and circuit breakers. Timeouts are the easiest of the three. They simply let us specify a duration, and after that duration has expired, the request times out and fails. Really straightforward stuff, but it lets you set the timeout at one level, instead of worrying about it per HTTP call or per gRPC call in whichever client you're using, and it can apply across all your different components and all your apps. Retries are exactly what they sound like: generic retrying of requests or operations. We support two types of retries right now, constant retries and exponential retries, and inside the retry, internal to Dapr, an error can be flagged as retryable or as a more permanent error. And finally we have circuit breakers, one of the more complicated sections of resiliency. Essentially they let you cut systems off from traffic, or reduce traffic, to allow for recovery time, for situations where a component or service is simply not working appropriately or is causing a large number of errors.

With these three policies, I want to highlight how they work together: they essentially all wrap each other. We have the retry at the outermost layer, followed by the circuit breaker, followed by the timeout, and finally the actual wrapped call that the resiliency policy protects. The reason for that order is that it allows the timeout to signal an error to the circuit breaker, and it allows the circuit breaker to signal to the retry policy when we've gone too far. If the circuit breaker is now open, we don't want to keep retrying and spinning our wheels, so the circuit breaker reports a permanent error back to the retry policy, stopping all the retries for whatever timeout period we've set.

Going into the policies a little more: retries, like I said, come in constant and exponential flavors. Constant policies are really simple. They have a maximum number of retries and a duration: max retries is how many retries we'll make, and duration is how long to wait between each attempt. Exponential policies are a little more complicated because they have more advanced behavior. You can set the max retries again; the initial interval, which is the starting point for the exponential backoff; a randomization factor, which introduces jitter so that everything isn't calling at the same time (otherwise you can get swarms of requests); a multiplier, which is the growth rate of the interval, so the bigger the multiplier, the faster we back off; a maximum interval, the cap on the time between retries, which is important because without it an exponential policy can leave you waiting forever; and a maximum elapsed time, the overall cap across all the retries, so you can control the policy at a global level.
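In YAML, the two flavors look roughly like this. The constant fields match the Dapr resiliency spec; for the exponential policy, the field names below mirror the knobs just described, so treat the exact spellings as an approximation rather than gospel:

```yaml
retries:
  generalRetry:
    policy: constant
    duration: 5s              # fixed wait between attempts
    maxRetries: 10            # omit (or use -1) to retry indefinitely
  backoffRetry:
    policy: exponential
    maxRetries: 15
    initialInterval: 500ms    # where the backoff starts
    randomizationFactor: 0.5  # jitter, so callers don't retry in lockstep
    multiplier: 1.5           # growth rate of the interval
    maxInterval: 60s          # cap on the wait between attempts
    maxElapsedTime: 15m       # cap on total time spent across all attempts
```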
Now we're getting into circuit breakers, which, again, are how you stop traffic from one system to another. Their settings are a little more intricate. First we have maxRequests, which is the maximum number of requests handled in the half-open state of a circuit breaker. A circuit breaker has three states: closed, open, and half-open. Closed, similar to an actual circuit breaker in your house's electrical system, means everything is running normally. Open means we have flipped the circuit breaker, stopping all traffic. And half-open is a state where we let through up to maxRequests requests and look for a success. If we get a successful request, the circuit breaker can close again and normal traffic resumes; if we don't, the circuit breaker goes back to being fully open. The next thing is the interval, the cyclical period in which errors are evaluated. That means we have a rolling window in which we look for a certain number of errors, or for a certain condition to be met, and if the condition is met, the circuit breaker opens. If you don't specify an interval, errors just aggregate forever. The timeout is how long the circuit breaker remains open before going back to the half-open state. So if you set a timeout of 60 seconds, all traffic to that target is denied for those 60 seconds; after 60 seconds we go half-open, let through up to maxRequests, and again, if those succeed we're back to closed, and if they don't, back to open.

And finally there's the trip, the actual condition we evaluate in the circuit breaker. Generally these are fairly straightforward cases. The default you see here is a consecutive failure count, and since it's consecutive failures we're looking for within that interval, the errors need to happen back to back; error, success, error, success won't trip it, because we're looking for an unbroken run of failures. You can also set it to look at the total number of failures over the course of the interval instead. One note on going from half-open to closed, or back to open: we're not evaluating the full trip condition there. At that point it comes down to a single success or failure of the probe requests, not the trip expression again.
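A sketch of a circuit breaker definition with those four fields, using illustrative values:

```yaml
circuitBreakers:
  serviceCB:
    maxRequests: 1                  # probes allowed through while half-open
    interval: 30s                   # rolling window for counting failures; unset means never reset
    timeout: 60s                    # how long to stay open before going half-open
    trip: consecutiveFailures > 5   # condition that opens the breaker
```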
Now we also have the targets, the other side of the resiliency policy. These can be applications, actors, or components, and what they do is map policies onto the systems you're calling. The whole resiliency configuration is basically setting up a string-to-value map, the name of a policy to the policy itself, and inside a target it's the name of the target to the policy names that target uses.

First we have apps, with app A as the example. This means that when any application calls into application A, including itself (if application A invoked its own method, these would apply), it uses these policies: the timeout would be general, the retry would be serviceRetry, and the circuit breaker would be serviceCircuitBreaker. None of those are defined in the snippet, but that's how it works out: you would expect to find, elsewhere in your resiliency policy, a timeout named general, a retry named serviceRetry, and a circuit breaker named serviceCircuitBreaker.

Components are a little different from apps because they have a slightly different set of behavior. Here you notice we're defining this for a pub/sub component, and it has two directions, outbound and inbound. Pub/sub and bindings are actually the only component types that have an inbound policy. The reason is that you can look at components as, generally, calling an external source. An outbound policy covers calling into that component: for example, if I have an Azure Service Bus component and I want to publish a message, I call that pub/sub component, and its outbound policy applies, because I'm calling the component. The other side of that pub/sub component is when you are subscribing to, or listening for, messages, and we call that the inbound policy because in that case Dapr is calling into your app. In this example they're set up very similarly, but they are two separate things, and you can handle them separately, because in most cases publishing a message to your pub/sub system won't have the same resiliency requirements as actually processing that message, which is also handled by Dapr when it receives a message and calls into your application. They're two different scenarios, so in resiliency they get two different policy definitions.

And finally we have actors. These are very similar to apps, except they have two extra fields: the circuit breaker scope and the circuit breaker cache size. You index into this section by actor type; here we have targets, actors, and then myActorType, so if we called an actor of type myActorType, we would use this policy. The scope can be type, id, or both. When you use both or id, the circuit breaker keys on the individual actor, so you can go all the way down to failing a single actor. Maybe one actor host is failing while the others are fine: you don't want to stop all traffic to all actors, just the traffic to that host while it recovers from whatever is happening, or while it's deprovisioned and a new host is provisioned. The circuit breaker cache size matters because, as Yaron stated earlier, actors in Dapr are a very specific development paradigm designed to distribute load across tens of thousands, even millions, of actors. We can't keep every circuit breaker in memory, because the cache could balloon infinitely, so we bound it.

It's also worth calling out, going back for a second, that circuit breakers, as you can see from this diagram, are actually stored locally in your sidecar. Each sidecar has its own circuit breakers, its own circuit breaker cache, I should say. The diagram shows the flow of an app invocation: if app A wants to call app B, app A calls into its Dapr sidecar, which handles all of your resolution, networking, things of that sort, and knows how to call app B. But the first thing the sidecar does is check its circuit breaker cache, look up the circuit breaker for app B, and put the request through it. The reason is that if app B is broken, we need to know about it on app A's side, so we keep that data there, and each app interacts individually with its own circuit breakers. That way failures aren't spread wider than they should be: an app A to app B failure isn't going to impact app C to app B if that path is okay. You can imagine this happening if you're doing multi-AZ cloud work, where maybe AZ1 to AZ2 is down but AZ2 to AZ3 is fine; you don't want an AZ1-to-AZ2 failure to stop your entire traffic set. So circuit breakers live on the individual hosts, and that way we have more fine-grained control of how traffic is managed.
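Pulling the three target types together, a hedged sketch (every policy name referenced here would need to exist under policies):

```yaml
targets:
  apps:
    appA:                            # applies whenever anyone invokes appA
      timeout: general
      retry: serviceRetry
      circuitBreaker: serviceCircuitBreaker
  components:
    myPubsub:
      outbound:                      # us calling the broker, e.g. publishing
        retry: publishRetry
        circuitBreaker: pubsubCB
      inbound:                       # Dapr delivering a message into our app
        retry: deliveryRetry
        timeout: general
  actors:
    myActorType:
      timeout: general
      retry: actorRetry
      circuitBreaker: actorCB
      circuitBreakerScope: both      # type, id, or both
      circuitBreakerCacheSize: 5000  # bound on breakers kept in memory
```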
So, fine, let's do some target examples, just to make sure everything is sinking in appropriately. I have a really simple policy defined here. We define a few timeouts: fast at two seconds and slow at ten seconds. We have a general retry, which is a constant policy with a duration of five seconds between requests and a maximum of ten retries. We have an appBRetry, which is exponential with a maximum interval of 20 seconds, but you'll notice it doesn't have anything else set, including max retries. When you don't set max retries, we essentially do infinite retries; maybe for app B we don't care how long it takes, we just want the operation to eventually succeed. We also define a circuit breaker, appACB, the app A circuit breaker. And then we have just our two targets down here, app A and app B. You can see that app A references the general retry, which is the retry policy right there, and then it references the fast timeout, so requests into app A time out after two seconds, and it also uses the app A circuit breaker. App B, however, uses appBRetry, so calls into app B will retry forever, and it has the slow timeout, a ten-second timeout, because maybe we know the application has longer-running functions and we want to allow more time per attempt. So these two apps have their own policies defined, but it's all in the same place, all very short, all right there.
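Reconstructed as YAML, that example might look roughly like this (the appACB settings weren't read out in the talk, so those values are illustrative):

```yaml
apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: example-resiliency
spec:
  policies:
    timeouts:
      fast: 2s
      slow: 10s
    retries:
      general:
        policy: constant
        duration: 5s
        maxRetries: 10
      appBRetry:
        policy: exponential
        maxInterval: 20s              # no maxRetries: retries forever
    circuitBreakers:
      appACB:
        maxRequests: 1                # illustrative values, not from the slide
        timeout: 60s
        trip: consecutiveFailures > 5
  targets:
    apps:
      appA:
        timeout: fast
        retry: general
        circuitBreaker: appACB
      appB:
        timeout: slow
        retry: appBRetry
```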
So now I can show everyone a quick demo of how circuit breakers work. Let me set that up. What I have here is two .NET apps, one of which I've called a generator and one an analyzer. The generator puts data every ten seconds into a Cosmos DB state store component I have set up, and then sends a message via Azure Service Bus over to this analyzer app, basically telling it to do its analysis work. In this case the analyzer isn't really doing anything; it's actually here to show off a bit of a bug. The analyzer right now doesn't have any resiliency set up on it, and I've just made some new changes to it and I'm pushing it out for the first time, so now we're going to see how it runs.

We can start the generator, which is going to start generating some data for us, and we should see in just a little bit... there, it started generating data. You can see we're hitting our items: we've found 11 items to process, that's great, there was some data in there already. Oh, but look: now we notice we have 13 instead of 12, even though we only added one. So maybe something weird is happening with this application, but we can revisit it in just a moment, because first I want to show off my resiliency policy. It's going to look very similar to the one in the slides, but what you can see here is I have a timeout, which is a fast timeout, a retry, and then some circuit breakers. The one we're interested in here is actually the state circuit breaker, because down here my cosmosdb state component has an outbound policy that just uses the stateCircuitBreaker. It has a trip of consecutiveFailures > 1, a timeout of 60 seconds, an interval of 30 seconds, and a maxRequests of 1. What this means is that after more than one consecutive failure, my app is going to trip its circuit breaker, and hopefully we'll see that happen; you can see we're getting up to 23 items in here now. Shortly my application is going to hit the point where, because, say, we have an incorrect filter on our query scan, we're reading too much data and we're going to start putting too much pressure on our Cosmos DB instance.

Without the circuit breaker, what we're going to see is that we just start failing a lot in there... and right on cue, it started failing. It gives us this big message, and if we scan through it for a little bit, you can see we have "too many requests: the request rate is too large, please try again after some time." So that means, yes, we have accidentally overloaded our Cosmos DB, and look, there it is again. Right now we're just hitting Cosmos DB and getting 429s, which is of course the error code for going too fast, and it's just going to keep happening over and over again, because our app is broken, essentially. We pushed out a bad deployment, and we're now causing too much traffic. What this means in our day-to-day life is that we're going to start running up unintentional cost on Cosmos DB, and if this gets big enough, the thing that's generating data, which is the more important of our apps, is also going to start seeing errors as it generates its data.

So let's kill this really quick, and then what we're going to do is restart my application, but this time with a config enabled that turns on the resiliency feature. So now we restart it, and we see that it fails immediately, which is expected. But then, notice: we failed our request, but now we get "failed to query the state store: the circuit breaker is open," and you'll see that we keep failing on the circuit breaker. What this means is that we're no longer putting pressure on Cosmos DB. It means our first application doesn't need to worry about there being too much traffic for it to get its job done. And because the circuit breaker is applied at the granular level of the Cosmos DB component, you'll notice we're actually still processing messages in this application: we're still getting our pub/sub, and this app is still open for service invocation if you needed other things from it. It's isolating just the bad portion of the code. And now, with this circuit breaker, we're giving ourselves time to investigate, time to find the bug, time to roll back or fix it, without having to sit in the failure state; we don't have to take the overage charges or risk a bigger outage. You can see here that we failed again, and the reason for that is, of course, our circuit breaker went from open to half-open, the probe failed, and now we're back to open, and that's how it's going to keep going from here. Of course, you could tune this to have a longer outage period if you wanted, or a shorter one, or to let through more requests, but at the end of the day what's happening is that our circuit breaker is stopping us from overloading the database and stopping us from causing further impact with our bad deployment. So that is the demo. Any questions before I stop sharing?
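For reference, the config Hal flips on is Dapr's Configuration resource. In the 1.7-era releases resiliency shipped as a preview feature behind a flag, and assuming the preview feature name from that period, enabling it looks roughly like this:

```yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: appconfig
spec:
  features:
    - name: Resiliency   # assumed preview feature name from the 1.7-era releases
      enabled: true
```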
Now, I just want to say that this is really great, because it removes a bunch of boilerplate code that developers would otherwise have to write inside their apps, hard-coding these circuit breakers, which are very difficult to get right, and retries and timeouts. And it's also great because it can be used across any state store Dapr supports, and any pub/sub or binding system. Exactly, and both of these applications can even use the same policy, so you can get the exact same behavior across multiple different applications without having to change either application's code, because it's all done with that configuration. Yeah, this is really great. Thanks everyone for tuning in, and we'd love to take your questions in the Q&A section. Thanks everyone, bye. (Bye, thank you.)