Next speaker, Yosh Riyadi, what kind of name is that? Indonesian? It's Indonesian. Michael, I think, was Vietnamese. Anyway, it's great to have people from the whole region. And I like Yosh a lot because he's a serverless fanboy, like myself. I think you're even running a serverless user group of sorts, right? Yes. Who's into serverless? Woohoo! Okay, great. Nice. Nice one. Were you joking? Isn't that just other people's servers? No, I genuinely think so, which is amazing. And it'd be great to, I can't remember who put their hand up, we should talk. But Yosh is definitely leading the charge with serverless stuff. And Step Functions is definitely very interesting. So you've got 15 minutes. All right, thank you. Evening, everyone. So today, I'd like to share with you sagas with Step Functions. So we've moved on from monoliths to microservices, right? And in this new world, there's no single source of truth anymore. Each service has its own data store, and processing a single business action involves communicating with multiple discrete services all over your infrastructure. So let's imagine we're building a travel booking platform. We have a travel agent service, which coordinates the process of booking a trip. And behind the scenes, this service might communicate with different services. For example, we have a car service that we rent a car from, a hotel service where we book a hotel, and an airline service where we book a flight. And each of those service calls combines into a single business process known as booking a trip. But it's not so simple. Let's imagine that two of our calls completed successfully: the car was booked, the hotel was booked, but the airline service didn't complete the flight booking. So how should we handle cases like this? Normally, we would have application-level mechanisms that enforce some invariant across all our services.
For example, if the flight booking fails, perhaps the car rental service does some logic to unbook the car. Or this logic could live in the travel agent service, or somewhere else within our application. This is fine for four services, but how about 500 services? With all these concurrency control mechanisms living across different services all over the place, it can become seriously unmanageable. So that's where sagas come in. Sagas were described in a 1987 paper, meant as an alternative to long-lived database transactions. And recently, this pattern has been applied to distributed systems as distributed sagas; the first half of this talk is mostly from that particular paper. The link is available in the slides, which I will share later. So to summarize, a saga represents a single business process, and within the saga there could be many service calls. A distributed saga is a collection of requests, where each request could be a single call to another microservice. So Book Hotel is a request and Book Flight is a request. And each request has a compensating request that's executed when a request fails. So we have Cancel Hotel, which compensates for Book Hotel, and Cancel Flight, which compensates for Book Flight. What a compensating request does is semantically undo a request: it tries to roll back and restore the state of equilibrium that existed before the request. Now, some things you can't undo. For example, if you send an email, there's no way to unsend it. But what you can do is send another email saying, please disregard that email. So essentially, a compensating request tries to restore your application state to the state of equilibrium before the request. And for distributed sagas to work, both requests and compensating requests need a few attributes.
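To make the rollback idea concrete, here is a minimal sketch of a saga as a list of request/compensating-request pairs. This is not the speaker's code; the service calls are hypothetical in-memory stand-ins, and the airline call is hard-wired to fail so the compensation path runs.

```python
# Each saga step is a (request, compensating_request) pair.
# These functions are stand-ins for real microservice calls.

def book_car(trip):     trip["car"] = "booked"
def cancel_car(trip):   trip["car"] = "cancelled"
def book_hotel(trip):   trip["hotel"] = "booked"
def cancel_hotel(trip): trip["hotel"] = "cancelled"
def book_flight(trip):  raise RuntimeError("airline service unavailable")
def cancel_flight(trip): trip["flight"] = "cancelled"

SAGA = [
    (book_car, cancel_car),
    (book_hotel, cancel_hotel),
    (book_flight, cancel_flight),
]

def run_saga(saga, trip):
    completed = []
    try:
        for request, compensate in saga:
            request(trip)
            completed.append(compensate)
    except Exception:
        # A request failed: run the compensating requests for every step
        # that already completed, restoring the original equilibrium.
        for compensate in reversed(completed):
            compensate(trip)
        return False
    return True
```

Running `run_saga(SAGA, {})` books the car and hotel, fails on the flight, and then cancels the hotel and car, which is exactly the travel booking scenario above.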
So I'm just going to quickly go through some of these so that we're on the same page. Requests and compensating requests must be idempotent. Idempotent means that whether you apply it once, twice, or however many times, the result will be the same. This is because we may receive the same message more than once. What if, for example, our first request to the car service times out? We don't know what happened; we waited too long. So we send another request to the car service, and the second request is successful. But what if it turns out that the first request arrives after the second request? We have to handle that case by making our requests idempotent. The second key thing is that both requests and compensating requests must be commutative, meaning that regardless of the order in which they arrive, they have to produce the same result. So even if the Cancel Car request comes first and another Book Car request arrives later, it should still result in a cancelled car booking. This is because messages can arrive in any order. So these two are the requirements of saga requests. If you have those attributes, the distributed saga guarantees that either all requests in the saga are successfully completed, or a subset of the requests and their compensating requests are executed. If you look at this diagram here, the first two steps completed successfully, but the third failed. So in this case, the compensating request for the failed request is executed, as well as those for all the previous requests. The point of having a distributed saga is to ensure consistency and correctness across all your microservices. Because state is spread out all over your application, we need a way to keep everything consistent and correct. And that's what a saga is, basically: a failure pattern to handle failures within your microservices.
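One common way to get idempotency, sketched below with a hypothetical in-memory store (a real service would use a durable one), is to deduplicate by a client-supplied request ID: a duplicate delivery of the same request returns the original result instead of creating a second booking.

```python
# Stand-in for a durable store keyed by request ID.
bookings = {}

def book_car(request_id, details):
    """Idempotent booking: the same request_id always yields the same result."""
    if request_id in bookings:
        # Duplicate delivery (e.g. a retry after a timeout): return the
        # original booking and change nothing.
        return bookings[request_id]
    booking = {"id": request_id, "car": details, "status": "booked"}
    bookings[request_id] = booking
    return booking

first = book_car("req-42", "compact")
retry = book_car("req-42", "compact")  # the late-arriving duplicate
```

Here `first` and `retry` are the same booking, and only one car was ever reserved. Commutativity needs similar care: for instance, recording a cancellation under the same request ID so that a Book Car request arriving after Cancel Car cannot resurrect the booking.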
So how do you define a distributed saga? You can define it as a state machine, basically. On the left is the state machine for our example; we'll look at it in more detail next. So who executes and manages these sagas? You need something called a saga execution coordinator, which is essentially a standalone service that stores and interprets your saga state machines. It's responsible for executing each step in your state machines, so this service is what actually talks to your microservices. It also handles failure recovery by executing any compensating requests, should there be any failure. The benefit of using distributed sagas is that instead of having all those concurrency control mechanisms scattered across your individual microservices, you can isolate all this logic in a single place: your saga execution coordinator. And creating a new workflow is just creating a new state machine; you don't need to create a whole new service to support a new business process. But some of you might be thinking, wow, building this saga execution coordinator sounds really difficult and time-consuming. That's where AWS Step Functions comes in. I think of AWS Step Functions as basically a saga execution coordinator as a service. It's not officially called this, but that's how I think of it. It supports a push model using Lambda and a pull model using applications hosted on EC2 and ECS. It's fully managed, retries are handled for you, and the cost is reasonable, I think. The way Step Functions works is you define state machines in JSON using something called the Amazon States Language, AWS's own language for defining state machines. You can then visualize your state machines in the AWS console, and you can execute them, monitor them, and view logs from your executions in the console as well. We'll see this in a demo soon. So this is an example of a state machine written in the Amazon States Language. It's just JSON.
To produce the state machine on the right, we have this JSON, where we say we start at a state called Hello World. Then we define a series of states, and Hello World is one of them. This state points to a Lambda function, and we tell Step Functions that it's also our end state. And this JSON produces this state machine. There are many different types of states that you can use to define your state machines. Task is the basic one; requests and compensating requests like Book Hotel and Cancel Hotel are tasks. The other state types are more for flow control. The Choice state type lets you perform conditionals: if the outcome of the previous task is A, call this task, otherwise call this other task. Parallel lets you execute tasks in parallel. And you have many other building blocks that you can use to build your state machines. So here's an example of modeling a business process as a state machine. Let's say that, given an image, we want to analyze the image and create a thumbnail of it. We can define that process as a state machine: we start with a task that extracts the image data, we use a Choice state to perform conditionals, then we have a parallel execution. This is just an example; we'll have a hands-on soon. Another thing you can do in your state machines is configure retries. For example, let's say our request to the car service times out; then you want to retry it. You can tell Step Functions to retry certain steps. In this case, we retry if there's a specific class of error thrown by the previous task, and we specify how long after the previous failure we retry, how many attempts to make, and whether to use exponential backoff. We can also catch errors in our state machine, and we can map different errors to different steps. So this is just a screenshot of the console. Now let's look at the saga in Step Functions. You remember this diagram.
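The Retry and Catch configuration described above looks roughly like this in the Amazon States Language. This is a sketch rather than the speaker's demo code: the ARNs and state names are placeholders, while `Retry`, `Catch`, `ErrorEquals`, `States.Timeout`, and `States.ALL` are real Amazon States Language constructs.

```json
{
  "Comment": "Sketch: retry a timed-out task, then compensate on any remaining error",
  "StartAt": "BookCar",
  "States": {
    "BookCar": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-southeast-1:123456789012:function:BookCar",
      "Retry": [
        {
          "ErrorEquals": ["States.Timeout"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "CancelCar"
        }
      ],
      "End": true
    },
    "CancelCar": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-southeast-1:123456789012:function:CancelCar",
      "End": true
    }
  }
}
```

The `Retry` block covers transient failures (here a timeout, retried up to three times with exponential backoff), and the `Catch` block routes everything else to the compensating task.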
So basically we're trying to build a state machine, a saga, for this. And this is what my saga looks like. Did I write all of it by hand? Unfortunately, yes. I'm waiting for, I think AWS should really create a GUI where you can drag and drop, like Cloudcraft. But you have to write it by hand for now, I'm afraid. So let's try executing this state machine. Here I have, basically, the request that the trip service would be sending to our travel booking platform. So here we're executing. We start by executing all three requests to the different services in parallel. And it just completed, so it's successful, and it went to the end state. Using this console you can see the inputs and outputs at each step, and any exceptions. So this is the happy path. Let's look at the not-so-happy path. In this case I've basically just added a flag to make it fail. Okay, it already failed. Wow. Oh no, it's in progress, sorry. So we've executed the three requests in parallel, but the flight booking actually failed. The annoying thing about the Parallel state is that if one branch fails, all of them fail, and you can't tell which one. But anyway, if you execute things in parallel and one of them fails, we have to cancel all of them, because the others might be in flight and arrive later. So in this case we go down this path in the state machine, we call our compensating requests, and complete the execution. And from this console you can also see a log of all the executions, which is pretty helpful. Alright, so we've seen distributed sagas.
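The demoed saga could be written along these lines in the Amazon States Language. This is a reconstruction from the description above, not the speaker's actual definition, and the ARNs are placeholders: the three bookings run in a Parallel state, and a Catch on that state routes any failure to a second Parallel state that runs all three compensating requests.

```json
{
  "Comment": "Sketch: book in parallel, compensate everything if any branch fails",
  "StartAt": "BookTrip",
  "States": {
    "BookTrip": {
      "Type": "Parallel",
      "Branches": [
        {"StartAt": "BookCar",
         "States": {"BookCar": {"Type": "Task", "Resource": "arn:aws:lambda:ap-southeast-1:123456789012:function:BookCar", "End": true}}},
        {"StartAt": "BookHotel",
         "States": {"BookHotel": {"Type": "Task", "Resource": "arn:aws:lambda:ap-southeast-1:123456789012:function:BookHotel", "End": true}}},
        {"StartAt": "BookFlight",
         "States": {"BookFlight": {"Type": "Task", "Resource": "arn:aws:lambda:ap-southeast-1:123456789012:function:BookFlight", "End": true}}}
      ],
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "CancelTrip"}],
      "Next": "TripBooked"
    },
    "CancelTrip": {
      "Type": "Parallel",
      "Branches": [
        {"StartAt": "CancelCar",
         "States": {"CancelCar": {"Type": "Task", "Resource": "arn:aws:lambda:ap-southeast-1:123456789012:function:CancelCar", "End": true}}},
        {"StartAt": "CancelHotel",
         "States": {"CancelHotel": {"Type": "Task", "Resource": "arn:aws:lambda:ap-southeast-1:123456789012:function:CancelHotel", "End": true}}},
        {"StartAt": "CancelFlight",
         "States": {"CancelFlight": {"Type": "Task", "Resource": "arn:aws:lambda:ap-southeast-1:123456789012:function:CancelFlight", "End": true}}}
      ],
      "Next": "TripFailed"
    },
    "TripBooked": {"Type": "Succeed"},
    "TripFailed": {"Type": "Fail", "Error": "BookTripFailed", "Cause": "A booking failed; all compensations ran"}
  }
}
```

Note how the Catch sits on the whole Parallel state, which is exactly why you can't tell which branch failed: all you see is that the Parallel state as a whole errored, so `CancelTrip` has to compensate everything and rely on the cancels being idempotent and commutative.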
It's a pattern for handling failure in microservices. We've learned about the role of the saga execution coordinator. We looked at Step Functions and how we can use it for sagas. And we had a brief look at the states language and the console. That's all I have to share for today. Thank you. How do you ensure commutativity? You have to write your requests, so the booking steps and your compensating requests, to be commutative. It's not Step Functions' responsibility. Exactly. Yep, exactly. Basically your requests and your compensating requests have to be idempotent; that's just part of the requirements of a distributed saga. Exactly. So can you compensate only the requests that actually failed, rather than all of them? Okay. So I mentioned that the annoying thing with the Parallel state is that you really can't know which branch failed. I actually tried a few different approaches. There's another version where I basically tried to find which step actually failed, and it works, but it's not as elegant as the other design, because anyway, if one of the requests fails and we executed the requests in parallel, we have to cancel all of them. The other approach is to execute them sequentially. In that case we know exactly which request failed. For example, if we execute Book Hotel and it fails, we immediately cancel only that request; and if we get to Book Flight and it fails, we immediately cancel the flight and the hotel. So that's another approach. But when you execute in parallel, you have to cancel all of them; you have to compensate all the requests. Are there any guarantees from Amazon? What happens if Step Functions itself fails? Yes, so you can actually handle that. There are built-in failures you can catch, as well as your own errors, and one of the built-in error classes covers Step Functions failing to execute a state. So you can handle that.
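The sequential variant mentioned in the answer above could be sketched like this in the Amazon States Language (again a reconstruction with placeholder ARNs, not the speaker's code). Because the steps run one at a time, each step's Catch can jump to exactly the compensations its predecessors need.

```json
{
  "Comment": "Sketch: sequential saga, each Catch runs only the needed compensations",
  "StartAt": "BookCar",
  "States": {
    "BookCar": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-southeast-1:123456789012:function:BookCar",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "TripFailed"}],
      "Next": "BookHotel"
    },
    "BookHotel": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-southeast-1:123456789012:function:BookHotel",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "CancelCar"}],
      "Next": "BookFlight"
    },
    "BookFlight": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-southeast-1:123456789012:function:BookFlight",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "CancelHotel"}],
      "Next": "TripBooked"
    },
    "CancelHotel": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-southeast-1:123456789012:function:CancelHotel",
      "Next": "CancelCar"
    },
    "CancelCar": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-southeast-1:123456789012:function:CancelCar",
      "Next": "TripFailed"
    },
    "TripBooked": {"Type": "Succeed"},
    "TripFailed": {"Type": "Fail"}
  }
}
```

The trade-off is exactly the one the speaker describes: you gain precise compensation (only the steps that completed get rolled back) but lose the latency benefit of booking the car, hotel, and flight in parallel.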
But in terms of guarantees, I mean, I don't know exactly; I'm pretty sure it's guaranteed. Any more questions? Is it possible to implement CI or CD for this? Ah, you want, ah, CI/CD. That's an interesting question. Yeah, exactly. I don't think the tooling is there yet. I guess one way you can do it, the way I created my sagas, is basically using this console. So I just write my state machines and work through them using all possible inputs. But in terms of automating testing of your state machines, I think the tooling is not there yet. Really good, thank you. Yeah, yeah. So for example, in serverless, to run integration tests you kind of need the infrastructure live. But at the same time, there have been projects that let you emulate parts of the AWS infrastructure locally, like DynamoDB Local. But I don't think there's a local Step Functions you can run at the moment, which would let you test it locally. Right now the tooling is not yet available. Yeah. Alright, I think that's really interesting. I personally use promises to do most of that sort of orchestration, or whatever it's called. But that looks more explicit, and more helpful for business people, I guess, and their team members, because promises are quite difficult to read, actually. Wouldn't you agree? I mean, have you implemented state machines like this in code before? I think it's a pain in the ass. It's a pain. So it's good when it's all explicit, I guess. Cool. Thanks again, Yosh.