So I'll talk through an example, and the example will be based on order processing, which is close to my heart because I spent eight and a half years at Amazon, starting in ordering and then working on different projects there. I'll introduce the notion of durable execution, which solves the problems of the current approaches; then I'll explain how it can be implemented using asyncio; then I'll talk about Temporal, which is a specific open-source implementation of that idea; and we can talk a little more about other possible use cases for this technology.

So imagine you are building the next Amazon.com and you need to do ordering. Obviously it's much more complicated in reality, but if you look at this code, it's pretty straightforward, right? You just want to run a sequence of steps. In a very simplified form it would be: check fraud, prepare shipment, charge the customer, ship the order, and then, for example, send a confirmation email. So how would you implement that in real production — can you write code like this? The problem is that you cannot, because this process can crash at any time, you can have failures in the APIs or services you're calling, and you can have long-running operations — prepare shipment can take a day or two if you're out of stock.

So how do we solve that? The standard approach, which probably all of you use or at least learned at school, is some sort of event-driven architecture, right?
So you ship events: one service publishes an event, other services listen to events and reply to them, and you stitch these things together. In real life you usually use queues, because queues are nice: they persist state, and they help you if your process crashes, because the message will be redelivered. But the problem is that this approach is still not good. And I'm saying this as somebody who was the tech lead for Amazon's pub/sub infrastructure for about five years out of my eight and a half at Amazon. A lot of teams and groups at Amazon came to our team and said, okay, let's do a design review of our backend architecture, and we quickly learned that queues are a very bad way to do large-scale microservice orchestration. They don't help with a lot of problems: error handling is bad, because practically all you have is dead-letter queues; long-running operations are not supported, because you cannot have a task from the queue take one day to execute; and your logic ends up spread out across multiple services. Imagine you need to implement a saga using choreography: if something fails, it's like that movie where everything depends on everything. With choreography, every service depends on every other service, because all they do is listen to each other's messages.

The other approach, which is pretty well known, is orchestration: there is a central place which executes the transaction and says, okay, do this first, this second, this third, and so on. Orchestration has a lot of benefits over choreography, mostly around visibility of the process, having it defined in one place, and very clear service APIs. In this case, if you have fraud checking, all that service needs to define is an API for fraud; it doesn't need to know anything about any other service. The same goes for fulfillment and payment, while with choreography every service needs to know about the other services in the ecosystem. So the only service which needs to know about the others is the one which uses those APIs — in this case, the order service.

The question is: if orchestration gives us so many benefits, why don't we use it? There are various solutions out there, but if you go to an average developer and say, implement, I don't know, a deployment or infrastructure-automation use case, they won't even consider an orchestrator — they'll start building choreography-type solutions. And there are various answers why: one is scalability, another is developer experience. But in general I think it's the programming model, because these days, if you say "I want to use an orchestrator," you're practically immediately forced into a new model of drawing diagrams or writing XML or JSON files. You cannot write normal code anymore. Like my example here — process order, just five lines of code — you'd need to draw a diagram, or write a bunch of XML or JSON. You're practically writing your normal procedural code in config files. Think about it: it's still procedural, but you're writing it in config files, or you build a DAG, or whatever. And this is very limiting, because it's very hard to express complex control flow in business logic that way.

So what would be the alternative? Let's look at the original example. Imagine you could just write code like normal, like in this example, and it would deal with all these problems — process crashes, intermittent failures, long-running operations — out of the box. That would be the ideal solution. And that is what I call durable execution.
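Written as plain code, that ideal is just the straight-line sequence from the slide. A hypothetical sketch — every helper below is a stand-in for a real service call:

```python
# The "ideal" order-processing code: a plain sequence of steps.
# Every helper here is a hypothetical stand-in for a real service client.

def check_fraud(order_id):             return "ok"
def prepare_shipment(order_id):        return "prepared"   # may take days!
def charge(order_id, amount_cents):    return "charged"
def ship(order_id):                    return "shipped"
def send_confirmation_email(order_id): return "sent"

def process_order(order_id: str, amount_cents: int) -> str:
    # Looks trivial -- but the process can crash between any two steps,
    # any call can fail, and prepare_shipment may run for days.
    check_fraud(order_id)
    prepare_shipment(order_id)
    charge(order_id, amount_cents)
    ship(order_id)
    send_confirmation_email(order_id)
    return "completed"

print(process_order("order-42", 1999))
```

The whole point of durable execution is letting you keep exactly this shape of code in production.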
It's a fairly new concept, so I'm pretty sure many of you have never heard about it — it's a new paradigm. The idea is very simple. Let's talk about it in the abstract first; don't think yet about how we would implement it. Imagine I have code which is guaranteed to complete in the presence of any failure. It means that if the process crashes after prepare shipment, this function will just be migrated to another physical process and keep running, with all its variables preserved. So you don't need to think about crashes. If some API call fails, it will be retried automatically, and as the function is durable, you don't need to think about it crashing, so you can retry forever if necessary. And in the same way, since this function is guaranteed to keep running, an operation can take a hundred milliseconds, or ten hours, or five days — the API is the same. You just call a function like prepare shipment, and prepare shipment can take three days; it's still a blocking call. Why can you do this? Because, again, this function cannot crash — it's not tied to a specific process.

So that's the idea, but how would you implement it in real life? Obviously you can imagine some magical RAM of the future which survives process crashes, but that still won't help you in a distributed system — imagine the machine completely burns, or the EC2 instance goes down. So how would you do it? The idea we used is actually very simple: can we just remember what this code did, and replay it from the beginning without re-executing those functions?

Let me give you an idea of how that would work. Imagine we have this function, process order, and we start running it. We call check fraud, and after check fraud returns, we record the result of that call in a separate persisted history — a log in some storage somewhere. Then we call the next function and record its result in that log; we go to the third one and record its result in the log again. And then imagine our process crashes. We detect that it crashed, go to a different machine, and just start running the function from the beginning. But this time, when we call check fraud, we're not going to actually make an API call: we look at the history and say, oh, check fraud is there, it already executed, so we can give its result back to the function directly, without making the actual API call. We go to the next one and get its result from the log, and the same for the one after that. We keep going through the log until we hit the end of it. At that point there is no more log, which means that when we call ship, we will actually make a real API call to the shipping service. Very, very simple: record results in a log, and when you replay the function from the beginning, serve results from the log until you run out of log. At that point your function is recovered and can make real forward progress again.

Obviously, when you start implementing this, there are a lot of details to take care of. The first one is: how do you record the result of a function? You can go pretty low level and use something complex — WebAssembly, or hacking the Python interpreter — because in principle you need to intercept all I/O. But if you want to be practical, there's a very simple idea: every time you invoke an external API, you go through a specific function — let's call it execute — and this function runs the API call, maybe retries it as long as necessary, and then captures the result and records it into the log. Yes, there's some extra mental overhead in having this function, but the nice thing is that it's very explicit: you know that if you call through execute, the result is captured, and if you call some function directly, it runs inline and doesn't have this capture capability.

The other problem you will run into: if you're replaying this code, it has to be deterministic. What does that mean? Deterministic code is code which produces exactly the same result and takes exactly the same code path given the same set of external inputs. In this case, if I call some function, it should always return the same result, because then I can replay it as many times as I want. If you put random in your code, you know you may take one branch or the other depending on the execution. So code which implements durable execution cannot use random anywhere — it's simply not allowed, because you'd take a different code path on replay. You have the same problem with time: if you have some condition based on the current time, you run the code now, then you try to recover it 30 minutes later — time has passed, so the condition evaluates differently and you take a different branch. So to make your code deterministic, besides random, you also need to provide deterministic time.
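The replay-from-the-log idea — the execute capture function plus log-recorded time — can be sketched in a few lines of plain Python. Everything here is illustrative: the class and function names are made up for this sketch, and a real system would persist the history and handle retries:

```python
import time

# Minimal replay-from-log sketch. Side effects and clock reads are recorded
# in a history list on first execution; on replay the recorded values are
# returned without re-executing anything. All names are illustrative.

class Replayer:
    def __init__(self, history: list):
        self.history = history   # persisted log of results, in call order
        self.position = 0        # how much of the log replay has consumed

    def _record_or_replay(self, produce):
        if self.position < len(self.history):
            value = self.history[self.position]   # replaying: reuse the log
        else:
            value = produce()                     # first run: do it for real
            self.history.append(value)
        self.position += 1
        return value

    def execute(self, fn, *args):
        """Run a side-effecting call once; replay its result afterwards."""
        return self._record_or_replay(lambda: fn(*args))

    def now(self) -> float:
        """Deterministic time: real clock first, recorded value on replay."""
        return self._record_or_replay(time.time)

calls = []
def check_fraud(order):  calls.append("fraud");  return "ok"
def charge(order):       calls.append("charge"); return "paid"

# First run: both calls really execute; then the process "crashes".
history = []
r = Replayer(history)
r.execute(check_fraud, "order-42")
t1 = r.now()
r.execute(charge, "order-42")

# "Recovery" on another machine: same code, same history.
r2 = Replayer(history)
assert r2.execute(check_fraud, "order-42") == "ok"   # served from the log
assert r2.now() == t1                                # same time, same branch
assert r2.execute(charge, "order-42") == "paid"      # served from the log
assert calls == ["fraud", "charge"]                  # each ran exactly once
```

On the second pass nothing external runs: every value, including the timestamp, comes back from the log, which is exactly what makes the recovered code take the same branches.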
What would deterministic time be? A time which returns "now" the first time you run the code, but then returns that same value every time you replay. That means you technically need to record the time in the log as well: not just the results of functions, but the moment each such line of code was executed.

The other problem is concurrency, because concurrent code is practically non-deterministic: thread context switches can happen at any time, and they are controlled not by your process but by the environment, by the operating system. So if you want to allow concurrent code, you need to do something about that. You could write your own cooperative thread implementation — in Java we actually had to do something like that, because in Java we write blocking code. But the nice thing about Python is asyncio. Asyncio is very nice because it gives you full control over the execution of tasks: Python allows you to implement your own event loop, and when you implement your own event loop, you have full control over the order in which tasks are executed. So if you want a deterministic event loop, what do you do? You implement practically a single queue of tasks and run all of them on the same thread — you make sure it's single-threaded, and you're very careful not to use randomness in your event-loop implementation. That means you can run a program as complex as you want, with as many parallel tasks as you want, with all the usual asyncio awaits, and still guarantee that the tasks run in exactly the same order every time. Then you get full determinism, which means you can recover with this very simple replay approach.

One thing about event loops: they're not only about executing tasks, they're also about I/O. An event loop runs tasks, and when there are no tasks left to run, it blocks until an external event comes in. Since we capture the execute calls anyway, we can convert them into commands and send them to an external service for execution. Then, when those commands complete, or some other event happens — a timer firing, for example — we feed those events back into our event-loop implementation. It applies the events and runs the newly unblocked tasks, because applying events makes new tasks eligible for execution. So you practically have two levels of loops: one loop runs tasks until it runs out of them, until all of them are blocked on some external invocation. At that point you've generated a bunch of commands; you send those commands out, and then there are new events.
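That single-queue, single-thread scheduling idea can be illustrated with a toy scheduler. This is not asyncio's real event loop — just a made-up minimal sketch of why one FIFO queue on one thread gives a reproducible task order:

```python
# Tiny deterministic "event loop": one FIFO queue of coroutines, all driven
# on one thread, so the interleaving is the same on every run.
from collections import deque

class Yield:
    """Awaitable that yields control back to the scheduler once."""
    def __init__(self, tag):
        self.tag = tag
    def __await__(self):
        yield self.tag

def run_deterministically(coros):
    order = []                       # observed interleaving of steps
    queue = deque(coros)             # the single task queue, FIFO
    while queue:
        coro = queue.popleft()
        try:
            step = coro.send(None)   # run the coroutine until it yields
            order.append(step)
            queue.append(coro)       # reschedule at the back: round-robin
        except StopIteration:
            pass                     # coroutine finished
    return order

async def task(name):
    for i in range(2):
        await Yield(f"{name}{i}")    # cooperative suspension point

# Same inputs give the same interleaving on every run: a0 b0 a1 b1
runs = {tuple(run_deterministically([task("a"), task("b")]))
        for _ in range(5)}
assert runs == {("a0", "b0", "a1", "b1")}
```

Because suspension only happens at explicit await points and the queue order is fixed, replaying the program walks through exactly the same schedule — the property the replay log depends on.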
You apply those events to the blocked program, run the event loop again, and this double loop can go on forever, until your program completes.

So, putting it all together: if you want to recover asyncio Python code, you need to remember function results. You need to separate out the tasks which execute I/O, because I/O is not deterministic by definition — it can fail one time and succeed the next — so you move it into separate functions and capture the results. You need deterministic random and deterministic time. And you need a deterministic asyncio event loop, and you prohibit any code that uses threads directly.

But is that enough? For a practical system, it's not. For example, you need to identify these functions: if you have multiple of them running at the same time, you need IDs — for an order, the order ID — and you may have millions of them in parallel, so you need a system which maps those IDs to these function invocations. You also need to detect failures, because as I said, if your process crashes, nothing happens by itself; you need a system which detects that, finds another process to run on, recovers the state, and continues executing. And since we capture those I/O tasks as commands, we need to be able to run them somewhere and retry them if necessary, at the system level. And there are a lot of other things — for example durable timers: if you want to say "await sleep for 30 days" in your code, you don't want to keep that timer inside the process, right?
You want to be able to move it out, so there is a durable timer in your system which fires 30 days later, finds an appropriate process, recovers the state, and continues running after those 30 days.

So here we come to Temporal. What is Temporal? Temporal is an open-source project we started at Uber seven years ago. In its first three years in production at Uber it went from zero to a hundred use cases; I think right now, seven years later, it's over a thousand. And it's used by a lot of companies — early adopters were HashiCorp, Coinbase, Airbnb, DoorDash, and so on. It's what we call a durable execution system, because it implements this idea of durable execution, but it also takes care of all the other stuff I talked about — failure detection, recovery, and so on. Besides Python we have SDKs for a bunch of other languages, so you can mix and match them. It scales practically linearly with the underlying database, and it can run on top of pretty much any persistent store. We tested it up to 300,000 actions per second, but it can go higher — we just didn't want to pay a bigger AWS bill. And it's used by a lot of companies; I was just lazy and didn't put the slide with a bunch of logos here.

So how does it look as a physical deployment? Temporal is a service: it has a backend, the backend runs on top of a database, and it exposes a gRPC interface. The service doesn't run your code. Your code is just linked against the SDK library: you take your Python, include the Temporal library, write code like you normally would, and when you run your process, it connects to the backend through the gRPC interface to get tasks and execute them. So from a topology point of view, your code runs outside of the service. And we separate activities from workflows: the durable execution code is the workflow, and activities are just code which does I/O. They can run in different processes if necessary, or you can collocate them in one process. Unfortunately I don't have a demo here — come to our booth and I can give you a real demo.

So this is how we would rewrite this order-processing workflow with Temporal. First, you put a decorator on the activity — that's all you need to do; it's just a normal function, but it can also be a method on a class. You write the workflow as a class, which you decorate as well, and you annotate the main workflow function, run. Then, for the capture, we call execute activity — this plays the role of that execute capture function — and you pass parameters, because you need to specify timeouts: technically it's an RPC call, so you need to make sure the activity is retried after a certain timeout. But otherwise it's just normal code. You can use any constructs you'd use in Python — conditions, loops, everything — and asyncio as well. So that's the whole thing: the five lines of code I showed you, written as a Temporal workflow.

So how do we deal with the problems we described before? Determinism: you have to use workflow.random — you cannot use plain random. We actually have a special component which detects most of those cases and fails your code if you try to use random directly. The same for time: use workflow.now instead of datetime.now, or the asyncio event-loop time, which also works. And the same for concurrency — the nice thing about concurrency is that, because we use our own custom event loop, you don't need to change anything: you just use normal asyncio to do any concurrent thing. There is nothing workflow-specific there. If a function is decorated as a workflow, it already executes on Temporal's asyncio event loop, which means you use asyncio in the normal way — await, tasks, and so on.

So why would you want to use this? I just gave you the ordering example, but think about it: it gives you the ability to write code which survives any failure, without any additional code — it just keeps running. Some use cases: infrastructure provisioning — at Datadog, all the internal orchestration of their provisioning is done this way, and HashiCorp Cloud is built using this approach, because they run Terraform as an activity, but in the end you still need to orchestrate a bunch of cloud API calls to provision resources. It obviously replaces business process automation: if you use a BPM engine, you can absolutely switch to normal Python code instead of those diagrams. Customer lifecycle, because workflows can run forever — technically it's a function which keeps state, so it can listen to events, and you can, for example, have a loop which charges a customer once a month. Payments — a lot of banks use it for payment processing. Order processing, customer support, and so on. IoT — you can have a digital-twin workflow for every device. And another one is low-code/no-code: because it's code, you can write an interpreter for any DSL. So if you have a specific use case — say you have customers and you want to give them a special workflow representation — you can create your own DSL and use this approach to interpret it. You get the scalability and all the benefits of Temporal, but you still give your customers the high-level low-code/no-code picture.
This is very common.

So, just to recap: use orchestration, don't use choreography, for the majority of cases where your system is really a workflow. Obviously, if you just fire and forget, it's fine — if you fire an event, someone listens to it somewhere, and you don't need a reply, that's fine. But any time you have steps, choreography is not the right way; durable execution is the best way to do orchestration. Temporal is a way to do it — it's an open-source project under the MIT license — and I think our Python SDK is one of the best, because of asyncio. So go to temporal.io for more info, or find us at our booth and we'll give you a demo and can discuss it more. That's it — questions?

Q: Hi, thank you for your talk. Just quickly, about environment management for the workers: is it the same environment as you set up for the main project, or can you specify a Python environment specific to a workflow or an activity?

A: It's just normal Python code, which you link against the SDK library. You control the workers; the Temporal project doesn't care how the workers run, so you can deploy them any way you like. Obviously, because it's Python, there is probably a minimum Python version — I don't think it works on very old ones — but otherwise you fully control the Python interpreter.

Q: Okay, thank you.

Q: Hey, thanks, it's a good talk. I have two questions, if I may. First: you support multiple different technologies, like Python and Java. Do they also work together — can I have different workers in different languages? And how do I define tasks written in a different language? You use a decorator for Python, but for Java, I think calling a Java task from Python is different.
A: We actually have a sample which has five languages in it. Activities can be in a different language than the workflow, and we also have child workflows, which can be in different languages too. The basic idea is that the server doesn't know anything about the implementation: in the end, when you invoke an activity, all the server sees is a string, and you can invoke by string directly. So if you're invoking something implemented in a different language, you can just use the string name of the activity or child workflow. Or you can create interfaces which match — for example, a Java interface and a Python interface whose names match. You can also specify the name explicitly in the decorator. It just works, because when you invoke it, all it does is take the name of the function, serialize the arguments, and send them out.

Q: All right, thanks. Then my second question: what happens during version upgrades, or during deployments, when a workflow now gets a task or activity inserted or rearranged or something like that? How do you keep track of that?

A: That is a very good question — how do you version long-running processes?
We have two good answers. One answer is that you version the workflow in its entirety: when it starts, it's pinned to the current version, and you keep workers running per version until they drain. We just released worker versioning to make that a pretty good experience. If you need to patch a workflow which is already running, we also support that, in an actually very simple way: you write "if old version, run the old code, else run the new code," and you keep the old code around until the old workflows drain. When the workflow runs this for the first time, it always takes the new code, but if it's replaying and it remembers that it went through this point on the old version, it uses the old branch — as long as the old code is still available. Because running fifteen versions of workers for a workflow which runs for a few months is not very practical.

Q: Exactly, yeah. Okay, thanks.

Q: Hi. Yeah, just quickly: if it's database-backed, how do you manage performance? I can imagine that for every function you have to run, you make multiple calls to a database. How do you stop that becoming slower than just running it synchronously?

A: Think about it this way: if you care about durability, you have to talk to the database anyway. So if you compare it with keeping everything in memory, then yes, it will be slower. If you had to implement it yourself, it's comparable performance, and we do a lot of optimizations there. One thing is that the server itself scales linearly with the database.
It can run on practically any database — we've run it on hundred-node Cassandra clusters. And we offer a cloud service where we run this backend cluster for you, with our own custom persistence; as I said, we were able to run 300,000 actions per second, and again, we could go higher with a bigger AWS bill.

Q: Thank you for your talk. My question is: say we fail in the middle of a workflow — is the whole workflow wrapped in a transaction? Because I can imagine that if we add a new customer in one service and it fails in another, we want to roll back and not have any customer.

A: So again, there are two types of failures. There are infrastructure failures — process crashes, deployments, all of these — and the workflow doesn't even notice those; it just keeps running. And there are business-level failures: for example, in a money transfer you withdraw the money and then the deposit account doesn't exist — that's a business-level failure. In the Python case you probably want to throw an exception, have a try/except, and then run compensations. This is what people call a saga — and there are better, more complex ways to do sagas as well — but if you need sagas, a lot of people use the system just for that. I know quite a few very big banks whose payment systems use it as the core system for sagas.

Q: Okay, thank you.

Host: Okay, and it's going to have to be very, very quick.

Q: I'll ask a quick question — I'm not sure whether it will be quick to answer, sorry. In this talk you gave an example based on Python and the asyncio loop, but that's a very language-specific solution. So my question is more general: how much of the logic is the same across the different languages? Because obviously Java works differently, and Go's concurrency — I don't know a lot of Go, but I imagine it's quite different from Python.
So how much work went into actually doing all of those things in the different languages?

A: We try to be as close to the language as possible — we want a language-native experience. That's why we do every language differently: in Go you have a blocking call; in Java you also have a blocking call — there's no asyncio there; .NET is async/await; TypeScript is also async/await. So yes, it's a lot of work. Underneath we have a Rust-based library which hides 90% of it, because there is a pretty complex state machine behind the scenes, and then we have a relatively thin language-specific layer which does this type of integration. In Java we practically built our own asyncio, but on top of threads — it's insane; talk to me and I can explain it.

Host: Perfect. Well, thank you so much.