Thank you. Werner Vogels, the CTO of AWS, famously said: everything fails, all the time. So for Werner Vogels, the question was not how to avoid failure, but how to handle failure.

I am Dominic Tornow. I'm a principal engineer at Temporal, and I focus on systems modeling, specifically formal modeling and conceptual modeling. Temporal is an open platform for durable executions, and durable execution is to a distributed system what a transaction is to a database: an abstraction that enables you to build an application as if failure doesn't even exist. To provide this guarantee, Temporal has built a lot of expertise in failure handling, and today I'm super happy to be here at Asia Summit to talk about handling failures from first principles. A first-principles approach breaks a domain down into its basic principles and then builds an understanding from those basic principles, instead of relying on unspoken assumptions or on conventional wisdom.

In this presentation we will think about failure and failure tolerance holistically. Failure is an event in a system: it refers to an unwanted but nevertheless possible event. Failure tolerance is a guarantee of a system: it refers to the guarantee that the system behaves in a well-defined manner even in the presence of failure. In other words, if a system is failure tolerant, then the system guarantees total correctness in the absence of failure, but it also guarantees at least partial correctness in the presence of failure. And if the system is able to guarantee total correctness even in the presence of failure, we speak of failure transparency.

Failure transparency is obviously the most desirable property, but it's not always possible. Think, for example, of the CAP conjecture; I think this is a database-heavy crowd. The CAP conjecture states that for a replicated data store you have to choose between consistency and availability in the event of a network partition. In the absence of a network partition (the network partition being the failure, the unwanted but nevertheless possible event), the system is able to guarantee total correctness: it is able to guarantee both consistency and availability. But in the presence of a network partition, the system is only able to guarantee partial correctness: we have to choose between consistency and availability. The good news is that at least you get to choose. As the designer of your system, you get to choose what failure tolerance means to you. You get to choose the guarantee that is important to you; you get to choose whether you prefer consistency or availability. Failure tolerance is a design decision.

Now, in order to talk about failure holistically, about what kinds of failures we expect, what kinds of failures we need to tolerate, and what guarantees we need to make in the presence of a failure, we first need to look at the underlying system model in which the failure actually plays out. A system model is a set of assumptions about a system, and algorithms and protocols that are correct under one system model may not be correct under another system model; any deviation may render an algorithm or protocol incorrect. You can think of a system model a bit like a board game: the game sets the stage and the game sets the rules, and as a player you have to devise a strategy to achieve the game's objective within the constraints that the game sets for you.
And even a slight change to the rules may render a player's strategy completely ineffective. That happens a lot with expansion packs.

For this presentation we will think in terms of a very popular system model in a cloud environment, in a microservice environment, and that is service orchestration. A system is a collection of processes, a process is a sequence of steps, and a step is a networked call to an upstream service. Here, a single service call has transaction-like semantics: it is atomic, and it either happens completely or it doesn't happen at all. However, the sequential composition of service calls does not have transaction-like semantics; the sequential composition is not atomic out of the box. But we want the sequential composition to be atomic as well. We want total application, not partial application. So in the event of a failure we need to ensure that the process executes in one of two ways: either observably equivalent to exactly-once total application, or observably equivalent to not at all, no application.

The classic example is certainly travel booking, and many of us traveled to be here, so our credit cards were charged when we booked hotels and flights. Each step is in itself atomic. However, we also require the composition of the steps to be atomic: we expect exactly one charge, one room reservation, and one ticket. To keep things simple, for the rest of the presentation, whenever we need a concrete example, let's talk about charging the credit card: the charge-credit-card service call.

So what failures do we need to tolerate? What could go wrong? Well, it's a microservice environment, it's a networked call. The request may be lost in the network. The service may crash before the computation takes effect, before the credit card is charged. The service may crash after the computation takes effect, after the credit card is charged. Or the response may be lost in the network. And in the absence of a response, we don't actually know whether the intended effect happened or not: we cannot distinguish whether the failure occurred before the computation took effect or after the computation took effect, so we may end up in an inconsistent state. Additionally, the computation may simply return a failure, raise an exception, like an insufficient-funds exception. So there is a response, and the response itself indicates a failure.

Okay, now what are we going to do? How are we going to handle that failure? Failure handling always consists of two components: failure detection and failure mitigation. The first component of failure handling is failure detection. It refers to the mechanism that detects whether a failure has occurred. Generally, I struggle a bit with the notion of failure detectors in distributed systems: most authors focus on detecting component failures or omission failures, crashes. I like to cast a wider net. When I think about failure detection and failure detectors, I generally think about witnesses: a predicate that confirms the presence or the occurrence of a failure. A very common example of a witness is an exception: the system itself tells me something went wrong. But also timeouts: we are waiting for something and it doesn't happen. That is also a pretty good indication that a failure has occurred. It is not certain, but it is a good indication.
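To make the service-orchestration model concrete, here is a minimal Python sketch. The call_service helper, the service names, and the exception types are hypothetical placeholders, not any real API; the point is that each call is atomic on its own, the sequence as a whole is not, and timeouts and exceptions act as the failure witnesses described above.

```python
# Minimal sketch of the service-orchestration system model.
# call_service, the service names, and the exceptions are hypothetical.

class InsufficientFundsError(Exception):
    """There is a response, and the response itself indicates a failure."""

class ServiceTimeout(Exception):
    """No response arrived: a witness that a failure may have occurred,
    but we cannot tell whether the intended effect happened or not."""

def call_service(name: str, payload: dict) -> dict:
    """Stand-in for a networked call to an upstream service.
    The call itself is atomic: it either takes effect completely or not at all.
    But the request or the response may be lost, and the service may crash
    before or after the effect takes place."""
    raise NotImplementedError("transport-specific: HTTP, gRPC, a queue, ...")

def book_trip(order_id: str) -> None:
    # The process: a sequence of steps, each step a service call.
    # Each step is atomic, but the composition is NOT atomic out of the box:
    # a crash between steps leaves a partial application.
    call_service("book_flight", {"order": order_id})
    call_service("book_hotel", {"order": order_id})
    call_service("charge_credit_card", {"order": order_id})
    # The last call may raise InsufficientFundsError (the response indicates
    # a failure) or ServiceTimeout (no response at all).
```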
The second component of failure handling is failure mitigation. It refers to the mechanism that actually addresses, or resolves, the suspected failure. Broadly speaking, especially for our scenario, there are two failure mitigation techniques: forward recovery and backward recovery. Remember, the process is a sequence of steps and any partial execution is undesirable. Therefore, in the event of a failure, we need to ensure that the process executes in one of two ways: observably equivalent to successful total application, which is what forward recovery is responsible for, or observably equivalent to no application, which is what backward recovery is for.

Let's look at forward recovery first. In the case of a failure, we move the process forward; more formally, we transition the system from an intermediary state to the final state. As a rule of thumb, we need to repair the underlying failure, because we are trying to push past it; we need to resolve that failure. Forward recovery is a very common platform-level failure mitigation strategy: we simply retry. Something goes wrong? Let's do this again. Something goes wrong? Let's do this again.

Next, let's look at backward recovery. In the case of a failure, we roll the process backward; more formally, we transition the system from the intermediary state back to its initial state. As a rule of thumb, we don't have to repair the underlying failure, because we are not trying to push past it. Backward recovery is a very common application-level failure mitigation strategy: we compensate, we undo what we already did, we reverse the charge on the credit card.

In order to choose the ideal failure handling strategy, that is, what we are going to do when a failure occurs, we also need to take the class of the failure into account. We need a classification. Obviously, there are hundreds of different ways of classifying failure, but here I want to focus on two orthogonal dimensions: the spatial dimension and the temporal dimension.

On the spatial dimension, we can classify a failure as an application-level failure or a platform-level failure. In order to do so, we need to think about a system in layers. Components at a higher layer make down-calls to components at a lower layer, and they generally expect a response. The end-to-end argument states that in a layered system, failure handling should be implemented in the lowest layer, looking from the top down, that is able to correctly and completely handle failure detection and failure mitigation. So a failure can be classified as either an application-level failure or a platform-level failure, depending on the lowest layer that is able to detect and mitigate the failure. For instance, an insufficient-funds exception indicates an application-level failure: the application layer is the lowest layer capable of correctly and completely resolving that failure, and that failure is completely meaningless on the platform level. But a could-not-connect exception indicates a platform-level failure: although the application itself could potentially mitigate that failure, the lowest layer that is capable of correctly and completely mitigating it is the platform layer, which can simply retry the network call.

On the second dimension, the temporal dimension, we can classify a failure as transient, intermittent, or permanent.
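As a rough illustration of the two mitigation techniques described above, here is a sketch that builds on the hypothetical call_service helper and ServiceTimeout exception from the previous sketch. Forward recovery pushes the process toward the final state by retrying; backward recovery compensates, in reverse order, whatever already took effect. The reverse_charge, cancel_hotel, and cancel_flight operations are made up for illustration: only the application knows the correct undo for each step.

```python
import time

# Builds on the hypothetical call_service / ServiceTimeout from the sketch above.

def forward_recovery(order_id: str, attempts: int = 3) -> None:
    # Forward recovery: transition from the intermediary state to the final
    # state, typically by retrying. The underlying failure has to go away
    # (auto-repair or be repaired) for a retry to ever succeed.
    for attempt in range(attempts):
        try:
            call_service("charge_credit_card", {"order": order_id})
            return
        except ServiceTimeout:
            time.sleep(2 ** attempt)  # wait a bit longer before each retry
    raise RuntimeError("forward recovery exhausted")

def backward_recovery(order_id: str, completed_steps: list[str]) -> None:
    # Backward recovery: transition from the intermediary state back to the
    # initial state by compensating, in reverse order, every step that has
    # already taken effect. The undo operations are application-defined.
    undo = {
        "book_flight": "cancel_flight",          # hypothetical undo operations
        "book_hotel": "cancel_hotel",
        "charge_credit_card": "reverse_charge",
    }
    for step in reversed(completed_steps):
        call_service(undo[step], {"order": order_id})
```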
A failure is transient when we can reasonably assume that the probability of a second failure after a first failure is not elevated. Formally, a transient failure is defined by two characteristics. First, the probability of a failure F2 occurring after a failure F1 has already occurred is the same as the probability of F2 occurring on its own: P(F2 | F1) = P(F2). Second, transient failures are auto-repairing: they repair themselves, otherwise they would by definition not be transient, so we do not need any intervention. In our example, if the cause of the could-not-connect exception is, say, a router restart, then that is a transient failure. The failure repairs itself quickly: once the router has restarted, the connection can be made.

The second class is the intermittent failure, where we can reasonably assume that the probability of a second failure is elevated. Formally, an intermittent failure is also defined by two characteristics. First, the probability of a failure F2 occurring after another failure F1 has already occurred is higher than the probability that F2 occurs on its own: P(F2 | F1) > P(F2). Second, intermittent failures are also, by definition, auto-repairing; they resolve themselves without any intervention. In our example, if the cause of the failure is an outdated routing table, then the could-not-connect exception may be an intermittent failure. That type of failure auto-repairs, but with some delay: as soon as the router updates its routing table, the connection can actually be made.

And a failure is permanent when we can reasonably assume that a second failure is certain. Formally, a permanent failure is defined by two characteristics. First, the probability of a failure F2 occurring after a failure F1 has occurred is 100%: P(F2 | F1) = 1. Second, also by definition, permanent failures require manual intervention; they require manual repair. In our example, if the cause of the failure is an expired certificate, then the could-not-connect exception is a permanent failure. That failure doesn't auto-repair: somebody has to come and install a new certificate, otherwise it doesn't go away.

Now, with all of this together, for this particular system model and this particular failure model, what could an ideal failure handling strategy look like? And what is ideal? What is ideal for me may not be ideal for you; it's a design decision. But let's look at one that I think is a reasonable failure handling strategy.

In the event of a failure, first, let's assume it's a transient platform-level failure. So we retry immediately; let's not wait, we immediately retry. If the retry succeeds, we have successfully mitigated the failure. Done. If the immediate retry does not succeed, we need to upgrade our understanding of the failure. Let's now assume that the failure is intermittent, yet still platform-level. It is auto-repairing, so we retry a bounded number of times, typically with exponential backoff. Again, if one of the retries succeeds, done, we have successfully mitigated the failure. If none of the retries succeeds, we once again upgrade our understanding of the failure. We can still assume that the failure is a platform-level failure, but we now assume that it is permanent. So the process suspends and awaits the repair of the underlying failure; we need manual intervention. If that does the trick, once again, we have mitigated the failure. If nobody repairs the failure in time and manual intervention doesn't happen, we need to upgrade our understanding of the failure once again: we now assume it's an application-level failure, and we are ready to compensate, to roll back whatever already happened.
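Here is a minimal sketch of that escalating strategy, again in terms of the hypothetical call_service helper from the earlier sketches. The suspension step is a simplified stand-in (a real platform would persist and resume the process rather than poll in a loop), and the thresholds are arbitrary.

```python
import time

# One possible escalation ladder, using the hypothetical call_service helper.

def handle_step(step: str, payload: dict,
                backoff_retries: int = 5, repair_deadline_s: int = 3600) -> None:
    # 1. Assume a transient platform-level failure: retry immediately.
    for _ in range(2):  # the original attempt plus one immediate retry
        try:
            call_service(step, payload)
            return
        except ServiceTimeout:
            pass

    # 2. Upgrade: assume an intermittent platform-level failure.
    #    Retry a bounded number of times with exponential backoff.
    for attempt in range(backoff_retries):
        time.sleep(2 ** attempt)
        try:
            call_service(step, payload)
            return
        except ServiceTimeout:
            continue

    # 3. Upgrade: assume a permanent platform-level failure.
    #    Suspend and await manual repair of the underlying failure.
    #    (Simplified: a real platform would suspend durably, not busy-wait.)
    deadline = time.monotonic() + repair_deadline_s
    while time.monotonic() < deadline:
        time.sleep(60)
        try:
            call_service(step, payload)
            return
        except ServiceTimeout:
            continue

    # 4. Upgrade: treat it as an application-level failure.
    #    The caller is now expected to compensate (backward recovery).
    raise RuntimeError(f"step {step!r} not completed in time; compensation required")
```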
If the compensation is successful, we have successfully mitigated the failure. If the compensation is not successful, we're in the worst place to be, and we basically have to escalate to a human operator. If we charged the credit card but we cannot roll back the charge, we cannot undo the charge, then somebody has to sit down, write a check, and mail the check. Or however we're going to resolve it, but we have to resolve it outside of the system, at the level of a human operator.

As a conclusion: failure and failure handling, guaranteeing failure tolerance, and working towards failure transparency can be a super intimidating topic. But it helps me a lot to take a principled approach, to think about failure holistically, and then to implement the failure handling strategy with confidence. If you want to get a head start, check out temporal.io. Temporal takes a principled approach to failure handling and implements the concepts that we have explored today on a platform level, guiding you towards an ideal failure handling strategy for your distributed systems. And with that, thank you very much for joining. I'm super happy to answer any questions you may have in person, or feel free to reach out to me online, for example on Twitter at Dominic Tornow. Thanks again for being here.

Yeah, so Temporal is an open source project, and it is a platform for durable executions. I like to contrast durable executions with volatile executions. Volatile executions are just plain function executions: think of any ordinary function, it only provides you weak execution guarantees. The function may crash or the function may time out, and that may lead to partial modification. Temporal gives you durable executions, and durable executions are function executions with strong execution guarantees: the function execution cannot fail and the function execution cannot time out. And yeah, it's an open source project; check it out at temporal.io, please.

Yes, correct, it cannot; it cannot know. It is an impossibility result in distributed systems that a failure detector cannot be both complete and accurate. In this case, the approach that we take is that we suspect a failure even if the upstream component is just slow, and therefore any stragglers, any delayed responses, will be discarded. But you are entirely correct: you still have to deal with the fact that the computation of that request may actually have happened. So your system must be able to either roll forward or roll backward, whether the computation took effect or didn't take effect, which is actually not easy to do. It is basically impossible to do on a platform level unless you know the semantics of the operations, like a database does: I know what a write does, so I can undo a write generically. But for an arbitrary service call, I don't know what the undo operation of a credit card charge is. We don't know that on a platform level, so it requires the cooperation of the application programmer. And that's actually quite a feat, quite a challenge.
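To show what that cooperation between the platform and the application programmer can look like, here is a rough sketch in the style of the Temporal Python SDK (temporalio). The class and parameter names are from memory and may differ between SDK versions, and charge_credit_card, reverse_charge, and InsufficientFundsError are hypothetical: the idea is that platform-level retries (forward recovery) are configured declaratively, while the application supplies the compensation (backward recovery) for the failures that retries cannot resolve.

```python
# Rough sketch in the style of the Temporal Python SDK; names may differ slightly.
from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy

class InsufficientFundsError(Exception):
    """Application-level failure: retrying will not make funds appear."""

@activity.defn
async def charge_credit_card(order_id: str) -> None:
    ...  # hypothetical call to the payment service

@activity.defn
async def reverse_charge(order_id: str) -> None:
    ...  # hypothetical, application-supplied undo operation

@workflow.defn
class BookTripWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> None:
        # Platform level: transient and intermittent failures are absorbed by
        # the retry policy (forward recovery); the workflow code never sees them.
        retries = RetryPolicy(
            initial_interval=timedelta(seconds=1),
            backoff_coefficient=2.0,
            maximum_attempts=5,
            non_retryable_error_types=["InsufficientFundsError"],
        )
        try:
            await workflow.execute_activity(
                charge_credit_card, order_id,
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=retries,
            )
        except Exception:
            # Application level: the failure escaped the retries, so we run the
            # application-defined compensation (backward recovery) and re-raise.
            await workflow.execute_activity(
                reverse_charge, order_id,
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=retries,
            )
            raise
```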
So, on the system model that Temporal takes into account, there are certain failures that we can handle on a platform level, and those are transient failures and intermittent failures: failures that can be resolved by a retry, by forward recovery. We can handle those completely on a platform level, and they do not require the cooperation of the developer, although since we are retrying, we do require idempotence of the upstream service calls. But as soon as we talk about application-level failures, like the insufficient-funds exception, there is no way that we can push through that on a platform level. At that moment we escalate to the application level, and these are the exceptions that you have to take into account in your code. So the compensation can still be found in the code; compensations are often called sagas. But on a platform level, we can take care of transient and intermittent failures.

We can also help you deal with permanent failures by not abandoning the function execution. Usually, when a function execution encounters an exception, it just goes poof, it goes away, and if in doubt, you don't even know that it happened. Temporal's durable executions do not go away: they suspend at the failure point. The execution just sits there and waits, so I can go in, as long as I'm still within the timeout of the overall durable execution. Some of our durable executions run for hours, days, weeks, months; we have actual users that run them over the course of years, servicing 30-year loans with a single durable execution. The execution suspends at the failure point, you come in, you fix it, and then it just resumes as if the failure never happened. It's transparent to your code. It's actually pretty slick.

Thank you. Thank you again, and thank you very much for being here. Please come find me; we can talk downstairs in the hangout area, and I also have Temporal stickers if anybody wants one. Thank you.