Hello, everyone. Welcome to our session on transactional solutions for microservices. I'm Adam Ruzicka, I work for Red Hat. I'm a developer on the Foreman project and I program mostly in Ruby, although I've dabbled in all sorts of other languages. And if you want to take a look at what I do, there's my GitHub. Go there, check it out. I'm Ondra Chaloupka. I also work at Red Hat, on the WildFly project, more specifically on the Narayana transaction manager. And my links are mentioned here as well. All right, so what's on the agenda for today? First, we'll take a look at transaction management: what it is, why we want it. Then we'll take a look at microservices and how they fit together with transactions. We'll introduce the saga pattern as one possible solution, and we'll take a look at two implementations of the saga pattern, namely Long Running Actions for MicroProfile and the Dynflow framework. First, what is a transaction? A transaction is an atomic unit of work where all the parts of it either finish or fail. So the transaction is either fully done or not done at all, nothing in between; it has just these two states. Let's take a look at what transactions look like in your everyday applications. We'll start with a monolithic example application, where we have some application receiving a call and doing some updates to different databases. In such a situation we expect the application needs to do several business tasks: it could be creating an order, linking a shipment, saving the data to a different database to be received by some third party, and sending a confirmation. So we have three actions, but the system could crash, and now these three different steps are broken. So we would like to have some way of defining those operations as a single unit of work, coupled together so that either all of them finish or none of them do. That's where a transaction manager, or transaction handling, helps us.
And normally that's some third party, or some service other than the application itself, which guarantees the consistency of the data that comes from those three actions. So here the application makes some changes in the databases, and then at the end it's up to the transaction manager to finish the whole task, to fulfill the consistency, to finish everything either with success or with failure. Let's take a look at the same thing, but in a kind of microservice deployment. Of course microservice deployments come in all sorts of shapes and sizes. In this example we have one gateway which calls all the other microservices, which do the separate steps; or we could, for example, drop the gateway and let the microservices call each other. And what happens then? The problem is that there are now a lot of parts that could crash, many more than before. Well, one could say that we traded a single point of failure for many single points of failure. Because of that, people started to implement all kinds of frameworks and experimental ideas, let's say. To paraphrase Martin Kleppmann on this: every sufficiently large deployment of microservices contains an ad hoc, informally specified, bug-ridden, slow implementation of half of transactions. So, all right, maybe we can just use the usual standard ACID transactions in the microservice world? But that could be trouble for the microservices as a distributed system. Normal transactions bring locks, which cause the microservices to become coupled together. Give me a short time to explain this. A normal transaction manager works with XA transactions, which is a specification for running two-phase commit over the different participants of a transaction. For example, here we have two applications inserting something into databases, and then it's up to the transaction manager to manage the completion.
If we zoom in and take just one of these applications and one database, we can see the transaction manager starting the transaction; this is the transaction started at the database. And when the application invokes some insertion into the database, the insert is done in the scope of this transaction. This is a kind of in-progress state which is not visible to other queries or insertions running in parallel. Then at some point the transaction manager starts to run two-phase commit (two-phase commit being a consensus protocol to provide consistency over the multiple participants in the transaction): it runs the prepare call, as the first phase, against the database, and the database then locks the row affected by that insertion. At that time, other applications which somehow depend on that record are blocked from processing. For example, if this is booking some title, and some other service would like to cancel that booking, it depends on the first insertion finishing. Only when commit is called is the lock released and the row made available for other participants to run any data changes. So now we are coupling the different operations together, chaining them, which is maybe not what we want in microservices. So to me this looks an awful lot like a big pick-two-out-of-three, where the three options are: use microservices; use transactions; and don't implement these transactions ourselves. As we've seen, people tend to drop the last point, pick the first two, and then implement the transaction handling themselves. So we've seen why transactions are a good thing, why we want them, and why they don't really work out of the box for microservices. We could always fall back to the monolithic approach, but we want to use microservices.
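To make the two-phase commit idea concrete, here is a minimal sketch in plain Ruby. This is not tied to any real XA implementation; the `Participant` class and its methods are made up for illustration. The coordinator asks every participant to prepare (phase one), and only if all of them vote yes does it commit; otherwise it rolls everything back.

```ruby
# Minimal two-phase commit sketch. Class and method names are illustrative,
# not taken from any real transaction manager API.
class Participant
  attr_reader :name, :state

  def initialize(name, prepare_ok: true)
    @name = name
    @prepare_ok = prepare_ok   # whether this participant will vote "yes"
    @state = :active
  end

  # Phase 1: the participant persists its work, takes its locks, and votes.
  def prepare
    @state = :prepared if @prepare_ok
    @prepare_ok
  end

  # Phase 2: make the work permanent and release the locks.
  def commit
    @state = :committed
  end

  # Abort path: undo the prepared work and release the locks.
  def rollback
    @state = :rolled_back
  end
end

def two_phase_commit(participants)
  # Phase 1: collect votes from everyone.
  all_prepared = participants.all?(&:prepare)
  # Phase 2: commit only on a unanimous "yes", otherwise roll back.
  participants.each(&(all_prepared ? :commit : :rollback))
  all_prepared ? :committed : :rolled_back
end
```

Note that between `prepare` and `commit` the participant holds its locks, which is exactly the window in which other services are blocked, as described above.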
We may want to do so because smaller teams focused on one single piece of functionality can develop and deliver it much faster. We also may want independence, where we can pick the right technology for the job and not be bound by a single language or framework for an entire application doing lots of different things. We also may have scaling problems, and we may want to address those by scaling up just the service which has the problems. Also, microservices are usually easier to understand, since they reduce the scope a lot and deal with just one thing, and they also isolate failure in a way: a failure in one single service doesn't necessarily have to bring down the entire application. So we need some kind of distributed transaction for this domain. Luckily for us, it already exists. It's called the saga pattern, and they say a picture is worth a thousand words, so let's take a look at the picture. Here we could consider the entire thing a transaction: we want a trip. For that we need to book a flight to get to some other country, let's say. Then we need a bus to get from the airport to a hotel, and then we need the hotel itself, to book a place to stay. But these providers don't have to have anything in common, and implementing a traditional transaction across them could be difficult. So we need a way to manage failure, and that's the saga pattern. Basically, we attach to each of these steps a procedure for how to undo its changes. So for example: we book a flight, it goes well, we have a flight. We try to book a bus, we get a bus ticket, it's fine. We try to book a hotel and it fails; the hotel is full, let's say. Now we need a way to make the whole thing appear as if nothing ever happened. So we need to cancel the bus ticket and we need to cancel the flight booking, and the saga pattern handles that for us: it tracks the state and then tries to undo the changes using these procedures.
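The trip example above can be sketched in a few lines of plain Ruby. The service calls are stubbed out as lambdas (in reality each step would be a call to an independent provider); the point is only the shape of the pattern: each step carries a compensation, and when a step fails, the saga runs the compensations of the already-completed steps in reverse order.

```ruby
# A saga step: a forward action paired with a compensation that undoes it.
Step = Struct.new(:name, :action, :compensation)

# Run steps in order; on failure, compensate the completed steps in reverse.
def run_saga(steps, log)
  done = []
  steps.each do |step|
    begin
      step.action.call
      log << "booked #{step.name}"
      done << step
    rescue => e
      log << "failed #{step.name}: #{e.message}"
      done.reverse_each do |s|
        s.compensation.call
        log << "cancelled #{s.name}"
      end
      return :compensated
    end
  end
  :completed
end

log = []
steps = [
  Step.new("flight", -> {}, -> {}),
  Step.new("bus",    -> {}, -> {}),
  Step.new("hotel",  -> { raise "hotel is full" }, -> {})
]
run_saga(steps, log)
```

After the run, `log` records the flight and bus being booked, the hotel failing, and then the bus and flight being cancelled, in that order.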
So to rephrase it, the basic idea is: we break down the overall transaction into smaller steps. These steps may then be performed as atomic transactions internally; they may, they may not, we don't really care. And the saga ensures that the whole thing either completes fully or not at all. Let me just ask, to wake you up for a moment: who has heard of the saga pattern? Okay, so it may be a surprise to some of you: it's not a new thing. It was originally published in 1987. Back then it was intended to solve the issue of long-running transactions in databases, and luckily for us it's quite a good fit for microservices nowadays. There are two main approaches to sagas. One could be called orchestration, which provides a good way of controlling the flow: which services are called, how the rollback is handled. It's called orchestration because there needs to be an orchestrator, a single node which tells all the others which local transactions to execute. The other approach is choreography, where there is no central coordinator or orchestrator; all the services call each other and basically pass the state among themselves. So now we are getting to the implementation part. That was the theory, and now we would like to present the two approaches, the first one by me and the second by my colleague. The first is Long Running Actions, which is a specification in development under the MicroProfile umbrella. It's mostly Java based and is expected to provide the transaction capability, the transaction handling, for Java applications written with MicroProfile. But it's not limited to just that: we have a generic API that could be used in any Java application, and on top of that, since it runs over HTTP with the context passed in headers, it could be adapted to other languages as well. We wanted to show you that later, but probably won't, because we have some technical troubles with our demo.
And this implementation, our implementation of the specification, is done under the Narayana transaction manager project. As the saga needs some lifecycle management, we use the orchestration approach: we have the LRA coordinator, which provides a REST endpoint where you can manage transactions, and it lets participants get the idea of how the whole saga runs and what its state is. On my side, the project is called Dynflow. It's a workflow engine written in Ruby; you can say workflow engine, you can say background processing, whatever. Currently it is in use by the Foreman project, and it can do all sorts of stuff which is sadly out of scope for this talk. Just to give you a sneak peek: it can run independent steps concurrently, it can poll external tasks, and much more; come to us after the talk and we can show you more. Support for sagas was introduced there in the form of a rescue strategy for execution plans, which will start to make sense after a couple of slides. So let's describe a bit the difference between these two implementations. I'll start again with our Long Running Actions. As I said, there is the LRA coordinator, where the services enlist themselves; it's the responsibility of each service to provide some endpoints where the coordinator can call back to the service. We usually expect there is some request where the service starts the LRA, the saga, by a call to the LRA coordinator, and enlists itself by providing the information about where its endpoints stand. The LRA coordinator then provides an ID which this service can take and pass to the other services in the chain, where they can use it to enlist themselves with the coordinator under the same saga.
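A rough in-memory model of what the coordinator does, again in plain Ruby as a sketch: the real Narayana implementation speaks HTTP, persists its state, and retries callbacks, and the class and method names here are invented for illustration. Services enlist with complete/compensate callbacks under an LRA ID, and closing or cancelling the LRA invokes the right callback on every enlisted participant.

```ruby
require "securerandom"

# Toy LRA-style coordinator: in-memory, no HTTP, no persistence, no retries.
# All names are illustrative, not the real Narayana API.
class Coordinator
  def initialize
    @lras = {}   # lra_id => list of enlisted participants
  end

  # Start a new LRA and hand back its ID (in the real protocol this ID is
  # propagated to the other services in an HTTP header).
  def start_lra
    id = SecureRandom.uuid
    @lras[id] = []
    id
  end

  # A participant enlists by providing its callbacks; these stand in for the
  # complete/compensate endpoints the coordinator would call over HTTP.
  def enlist(lra_id, complete:, compensate:)
    @lras.fetch(lra_id) << { complete: complete, compensate: compensate }
  end

  # Close = everything succeeded: call complete on every participant.
  def close(lra_id)
    @lras.delete(lra_id).each { |p| p[:complete].call }
    :closed
  end

  # Cancel = something failed: call compensate on every participant.
  def cancel(lra_id)
    @lras.delete(lra_id).each { |p| p[:compensate].call }
    :cancelled
  end
end
```

The key design point this tries to show is that the participants never talk to each other about the outcome; they only ever enlist, and the coordinator alone decides whether their complete or their compensate endpoint gets called.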
Then at the end it's the responsibility of the LRA coordinator to call back to the services and inform them about the outcome; in this case the information is a confirmation, meaning that the processing of the actions was successful, so the confirm callback is called on those services. On the Dynflow side of things: as I mentioned, Dynflow originally started as a background processing engine, and that kind of shaped the way sagas are implemented there, so you can see the diagrams are a bit different. The thing that is similar is that the request comes from a client, but it doesn't reach a service first: it reaches the executor, and tells the executor to do a thing. The executor then knows how to perform the entire booking: it knows it has to book the flight and it has to book the hotel. So it makes a call to the hotel booking service and waits for it to finish, then it makes a call to the flight booking service and waits for the operation to finish, and when everything succeeds, it's done. There is no complete callback or anything like that, because in our use case the services we call usually don't know they are parts of something bigger, and we can't (well, we could, but we don't) go and change their code, their API endpoints, and how they behave. This approach fits our use case better: we just use all these other services the way they are. For that reason we need a bit of a paradigm shift, let's say: in our case the executor has to know upfront the entire thing it has to do, so it can track it and possibly undo it if something goes wrong, compared to the LRA case where all the parts are more involved. That's it. Now, this was the happy path, how things go when everything goes smoothly and all the reservations can be made. Next we have a sequence diagram of the failure path, let's say.
So I will take the explanation of the LRA part. It's again the same example with the two services, one calling the other, but now a failure happens. The user calls the first service; the service joins the saga by calling the LRA coordinator, providing the endpoint information where the coordinator can call back; some business operation succeeds; and it passes the call to the next service. The next service again joins the saga here, but its business processing fails. So there is information from the second service to the coordinator that the saga should be cancelled, should be compensated, and then it's up to the coordinator to call compensate on all the services that were registered with it. That's the sequence. On the Dynflow side of things: again the user calls the executor, the executor starts processing an execution plan, as it's called in Dynflow, and it starts the first external task; it sends a request to service one. In this diagram the triggering is kind of synchronous, so it waits until the request finishes, and we assume that by the time the request finishes, everything is already done on that service's side. But you could just take that "start external task and succeed" block and replace it with "trigger an external task and poll for status", for example; it can operate in several modes, depending on what fits your use case. What usually happens for us is that we call another service which has its own task engine; we just kick off another task there and then poll for it and wait until it finishes. It would be unwise to wait with an open TCP connection until it finishes, so we just poll. So, back to this example: we start one external task and it succeeds, and then we call another service and for some reason it fails. At that point we stop processing that execution plan and create an execution plan which
should undo the changes done by the first execution plan; that's the revert plan below. What we do here is we expect that the services have a compensate endpoint where we can tell them to undo the changes they did. The executor calls the second service first and then it calls the first service, so it undoes the changes in the reverse of the order it made them in. Now I'm coming back to Long Running Actions to say a little bit more about how it's used from the developer perspective. Just to recap: the service needs to provide the endpoints for the coordinator to know where to call back, and the responsibility of the coordinator is to call those endpoints and guarantee the consistency, even across network failures or a crash of the coordinator; the service knows that those endpoints will be called at some point in the end. The saga pattern is eventually consistent, meaning there is no precise timing set; you just know that the callback will be invoked. The LRA, as I said, is being defined (we are trying to get it defined) as a MicroProfile specification, and there are two possible ways the developer can manage the LRA: the programmatic API, and annotations. The annotations are similar to what the JTA annotations in enterprise Java applications provide, just with the wording changed to be LRA-specific. So if you are familiar with Java enterprise, the LRA annotations can be thought of like those annotations defining whether the transaction is required or requires new: an LRA annotation defined on a method which is a REST endpoint starts the saga and enlists the participant, and the operations inside the method are processed in the scope of that saga. Then the developer uses the complete and compensate annotations, which again need to be defined on some
REST endpoint, and which inform the LRA client library that those are the endpoints which will be called back. One thing I didn't mention: this is the API for the application developers; you need to package the client library, which provides this API, with your application. I will leave the other annotations for your own interest; you can check the specification, or you can talk to us after the session about the details. Just mentioning here the programmatic API: the same thing can be done by running calls on the library, where you can start and join the saga and get the information about the status, about what the LRA coordinator knows about the currently running LRAs. Sorry, I wanted to quickly show the code example here. Is this large enough, does everyone see? Okay, cool, thanks. As we don't have a demo running, this will just be a showcase of the code. Here is a simple service, it's called flight booking, and the definition here has this LRA annotation, meaning this method runs in the scope of one LRA. I can say in which cases the LRA should be cancelled, compensated, based on the HTTP status code. Then I do some business logic, getting some matching flight, and as the next step I pass the call to the next service. This next service gets the saga ID as part of an HTTP header, which is automatically provided by the LRA client library: we just call the next service, the HTTP header is packed into the request, and the next service can then use it. Then the application developer defines some complete and compensate methods, as was said, and here the logic is done in such a way that during the LRA processing, the business logic inserts the information that the booking is being processed, so it sets a state like in
progress, and during the complete or compensate call that state in the database is changed to completed or cancelled. The question was what happens when complete or compensate fails. You are ensured that your endpoint will be called back; then it depends on what you want to do. If there is some failure and you expose that failure back to the coordinator with some error code, then you are ensured that you will be called back again, up until the time you confirm that the processing of the complete or compensate method was successful. On the other hand, if there is some failure that you do not report back to the coordinator, it's up to you how you handle it; it's hard to say how you would still do the confirmation then. If I go back to the academic studies of this kind of consensus, I felt the problem is that you never know where to end. So: you are ensured that you will be called back when you inform the coordinator there was a failure. If there is, for example, an error on the network, the coordinator again knows there was some crash, so something didn't happen; but the coordinator does not know whether you were somewhere in the middle of processing, and it's up to you how you handle that. Normally the answer would be that there is some ACID transaction running in the scope of that particular microservice, and when the failure happens, this ACID transaction provides the consistency for that particular part of the code: if there is some crash, you hope that the ACID transaction manager rolls back the changes, and you know you will be called once again by the LRA coordinator at your endpoint to finish
the completion or compensation once again. The question: if there is a bug, do you have to fix it before it can go through? It will just keep calling back, yes. I'm currently not sure if it's done in our implementation, or maybe there's some proposal for the specification that there could be some limit on how many times you will be called back; but in general, yes, that's similar to how a standard transaction manager works. It just tries to finish with consistency, and the transaction manager never knows what's happening on the other side. It's the same in the case of a database: if a database provides wrong error codes or something, the transaction manager doesn't know what's happening and keeps calling rollback or commit in a cycle until something happens; the point is not to stay locked, to get the processing to some conclusion. So again, you need to have some tracing on top of that processing; this is business logic, and that's hard to manage with this pattern. I think that was all from me on this. Now a bit about the Dynflow building blocks. I mentioned actions and execution plans and rescue strategies and whatnot, so maybe I should circle back and actually explain what those are. The core concept in Dynflow is an action. It's a logical unit, a thing you want to be done. Actions have three phases: plan, run, and finalize. Actions can be composed; usually that's the purpose of the plan phase, where as part of an action you can plan another action, and this way you can split the thing you want to do into several smaller, more manageable pieces, and then have a single top-level action which just composes all of these together and basically provides the glue that holds the entire thing. Then we have execution plans, which are
generated by planning actions. So the mental model is: you have an action, you plan it, and when you start it, it creates an execution plan; and all the other actions which are planned from inside that action, and its descendant actions, and their descendant actions, and so on, belong to that same execution plan. So for us, the execution plan is the scope of the transaction we want to do. Execution plans can have rescue strategies (to be correct, actions can have rescue strategies, and these are then combined if something goes wrong), and we use that to determine what to do with an execution plan which failed. For example, we had pause, where we would just stop the execution plan at the time a failure happens and let the user investigate and, for example, try it again, or skip the step that failed and move on; or just fail, mark it as done, nope, not going anywhere. And we added the rollback, or revert, rescue strategy, which implements the saga pattern. The last missing piece are steps, which are the units of work; they are the smallest items Dynflow can process. To put all this into relation: we have an execution plan, which has one entry action, the top-level one; this action has three phases; it has to have one step in the plan phase, and it can optionally have a single step in the run phase and a single step in the finalize phase. It's kind of complex to explain this without a real-world example, so we'll just silently move on and hope it starts making sense when we see some examples. This is an example action. First we have a book-hotel action which inherits from the Dynflow action class. It includes some module, which is omitted here, that does our HTTP calls, parses the response, and handles all that. And then we define the run method; defining a run method inside an action means we want this action to have a step in the run phase. We are omitting the plan step, so it is inherited from the action class, and the default behavior is just
basically to plan itself, which means that this action will be considered for processing in the run and finalize phases. Then, to demonstrate the composability, we have the book-trip action, which plans the book-hotel action five times. So for example we want to book a hotel and we want to make a reservation for five people, but they provide an API endpoint to only make a reservation for one person, so we just make it five times and we're done. As I mentioned, sagas in Dynflow are implemented using rescue strategies for execution plans, because for an execution plan we know how all of its steps finished, and if we know how to undo every single one of those steps, we can undo the entire execution plan; we have all the information we need to do it. To put it another way: you trigger an action, that generates an execution plan, the execution plan gets executed, and if it fails somewhere along the way, we determine the correct rescue strategy. If that is revert, we create another execution plan which should undo the changes done by the first execution plan, and then we execute the second one. Here we added some changes to make the previous example revertible. Basically the key points are: we include the Dynflow revertible action module and we implement the revert run method. And that's all you have to do to go from a single action, "we create a booking", to "we create a booking, and if something else fails somewhere down along the way, we can revert it". We had a demo prepared, but somehow we were missing some information about the available connectors and we can't really present it to you now; if you want, we'd be very glad to show it to you if you just stop by after we finish, I guess. The demo would show this same thing: the same idea of the executor communicating with the services, and if a failure happens, it is capable of calling the compensation, reverting the work. Right, so, to somehow finish the talk, a small summary at least. We believe sagas are
a great solution if you need transactions in microservice deployments, but like everything, they are not a silver bullet: you're doing a trade-off. If you're willing to lose a bit in your requirements and sacrifice strict atomicity for eventual consistency, it's the way to go. Do you have any questions? Anyone? Yes, please; there are three of you, so first in the blue, please. So the question is how it's handled in Dynflow if an execution plan which is undoing changes made by a previous execution plan fails itself. It's basically up to the developer who writes the code for the actions. With it being just another execution plan, we can apply the same set of strategies: we can either pause it and wait for someone to handle it somehow; or we can fail it, and well, we failed, we couldn't undo it, and we're done; or we can have a revert for that, and you can go down that rabbit hole and have reverts for reverts for reverts if you want. Okay, so the question was: it can happen that an execution plan gets stuck in a paused state, and do we have any plans to introduce some kind of timeouts, to provide a window for a user to fix it? Because a plan sitting in a paused state for a long time also creates locks on some of the tasks which would get executed after it. Not yet, but it would be a great addition, so feel free to add an issue to our issue tracker. The next question is what's the underlying storage for both of these solutions. For Dynflow we use a traditional relational database; we have support for SQLite, MySQL, and PostgreSQL. In the case of Long Running Actions, it is based on Narayana, which is the WildFly transaction manager, so it uses the storage which is implemented there: currently that's a file-based system, which is a simple one, or
running with storage based on the ActiveMQ journal, where the logs are stored; that gives better performance, but it's still stored on the file system. And then we also provide SQL database storage, so we can basically support all the databases where the data could be put. Okay, so the question is how we manage the failure of the coordinator, what will be done if it's down. In the case of Long Running Actions, that's an issue we know about; we are working on an HA solution that could handle this, in the way that there will be multiple instances running and it will be possible to handle the requests in HA. Currently it's not available, but we hope it will be soon. With Dynflow, we can run multiple executors which use the same database, so they all see the same data, and you can get some kind of fault tolerance, let's say, by doing that; as always, if all the replicas die, you're done. So the question is: since LRA uses REST endpoints, can you take your Python application and use it? Yeah, that's the idea: the communication over the HTTP REST endpoints is defined, so it could be used. This is not directly baked into the specification, it's kind of dependent on the implementation; but in our case, the Narayana LRA coordinator based on the WildFly transaction manager, yeah, we provide the REST endpoints and define, on an HTTP basis, how you can communicate from whatever application. We now have the guys from Node.js who are thinking about creating a library which would be capable of running as part of our saga from Node.js applications. The point is, the thing you need is this LRA coordinator; this is the service which provides the saga for your microservices. So you can just take this LRA coordinator and put it next to
the services. We run it with Thorntail, which is an implementation of the MicroProfile stuff from Red Hat; so you can just run that one service, get it running somewhere, and that service provides the endpoints that you can use from your other services, and you're done. Just the last thing from me, sorry: if you would be interested in the project, check out our specification; you can join our communication channels, or there are the hangouts where we discuss the issues, so you can join us. We hope that we will have a 1.0 version somewhere in the middle of this year. And I think we had one more question.