Welcome. This is the afternoon track of App Developer Con. My name is Carlos Santana. I'm a Senior Specialist Solutions Architect with AWS, and we have here my co-speaker, Mithun Mallick. We're here to talk mainly about orchestration versus choreography patterns in event-driven applications, and this is on Kubernetes. We'll also talk a little bit about resiliency and scalability. So I'll hand it over to Mithun to get started.

Hello, everyone. So let me start with the whole context around why you would want to build event-driven architectures. Why are we here, and what problem does it solve? So, some of the benefits of building event-driven architectures. Traditionally, we have built apps which use APIs and microservices: the front-end service calls a middle tier, which in turn calls another microservice using an API, and together they fulfill a certain business functionality. But what that leads to is a lot of interconnectivity, a lot of dependencies, and a lot of challenges with maintenance and with building more agile systems. So one goal of event-driven architecture is to make systems more fault tolerant, so that if one of the microservices were to go down, you're still able to operate. It gives you scalability: since you're now building microservices which are connected through events, you're able to scale them independently without having to change everything around them. You have small systems which use asynchronous communication to talk to each other. It is extensible, because you can bring in new functionality as you need it, using what we call an event broker. That event broker is the channel through which all these microservices communicate, using what is called a publish-subscribe pattern. And it makes the architecture more extensible because, as your business evolves, you need to bring in new functionality.
You simply connect new pieces to that event broker without impacting any of the other pieces in your architecture. It gives you agility, because your teams are not dependent on each other: they can move more quickly, bring in new functionality, and build faster. So, as I briefly mentioned, in an event-driven architecture systems and development teams communicate via events. Unlike the conventional API-driven way, where you're making a REST API call, for example, here you use a middleware or event broker to act as the intermediary through which the systems or microservices communicate. As you can see in the picture, you have the producer, the application which generates the event. An event is nothing but a state change. It indicates, for example, that an order has been placed, an order has been canceled, a transaction has been made, and so on. It is something that happened in the past and that you cannot change; you can have a compensating event later, but as such you cannot go back and change the original. That event is transmitted onto an event broker, and based on rules, you have subscribers which consume the event and communicate with each other. The benefit is that these consumers connect to the event broker independently, with no direct dependency on the producer. The only thing a consumer needs to be aware of is the schema of the event. So, going further into choreography and orchestration, the whole idea of this talk is to take you through some of these patterns. What do choreography and orchestration solve? As most of you know, in order to build something customer facing, you need more than one microservice talking to another, and this is where something like orchestration or choreography comes into play.
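The demo later in the talk uses Kafka as the broker; as a minimal sketch of the publish-subscribe idea described here (the class and event names are illustrative, not taken from the demo code), an in-memory toy event bus looks like this:

```python
from collections import defaultdict

class EventBus:
    """Toy in-memory broker: producers publish, consumers subscribe by event type."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        # Consumers register independently; the producer never learns who they are.
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Fan the event out to every subscriber of this event type.
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
received = []
bus.subscribe("order.placed", lambda e: received.append(("accounting", e)))
bus.subscribe("order.placed", lambda e: received.append(("data-lake", e)))
bus.publish("order.placed", {"order_id": 42})
# Both consumers saw the same event without knowing about each other.
```

The key property is the one Mithun calls out: adding a new consumer is just another `subscribe` call; the producer and the other consumers are untouched.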
As you can see in this picture, we have an order that got created, but as part of that order creation the backend needs to do multiple things. It may need to do a pre-authorization on a credit card, just to make sure the customer has that amount available, and then we get a pre-authorization. Once the pre-authorization is done, you actually call the payment gateway to apply that amount for the order. And finally, you apply the same transaction onto an accounting system for the particular business that is the merchant here. All of these would typically be different microservices. Now, given that we're talking about event-driven architecture and you want to make systems as loosely coupled as possible, how do you achieve this? You obviously don't want a two-phase commit, and you don't want a transaction spanning all these microservices. You may have a situation where each of these microservices has a purpose-built database: one may be using a relational database, another may be using, for example, a NoSQL database, and so on. So how do you achieve such business functionality? You may also need ordering, steps that have to be handled in a sequence, and these business requirements may evolve, so you don't want to introduce coupling between these microservices. This is exactly where the saga pattern comes into the picture. As you can see here, a saga is basically a sequence of local transactions. In other words, the first microservice may perform a certain transaction, applying it to a relational database, and then generate an event that is passed on to the second microservice. In the event of a failure, you need to have something called a compensating transaction.
So the picture here shows that each local transaction updates the database and publishes a message to trigger the next local transaction, and for error handling you similarly need a reversing transaction when you compensate. There are two options for implementing a saga. The first one is called the choreography-based saga. In a choreography-based saga, you have an event broker, or message broker as it's more commonly known, and the microservices just independently publish and subscribe to events as they need to. What that means is that when an event occurs, the first microservice puts it onto the bus, the event broker or event bus, and the other microservices simply subscribe to that event bus based on certain rules. For example, when an order is created, you may have multiple microservices which need to act on that same event in different ways. One may handle the financial aspect, meaning it goes and updates the customer's account. There may be one which simply takes the event and puts it into a data lake. There may be one that is more about analytics, identifying whether there is some kind of trend, or whether an alert needs to be generated because of a possible fraud event. All of these just subscribe to that same event broker based on certain filters or subscription criteria, based on the event attributes. An orchestration-based saga, on the other hand, is one where you have a central orchestrator. Imagine the use cases where you need to execute steps in a certain sequence and you want to control the order in which things happen.
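To make the compensating-transaction idea concrete, here is a minimal in-memory sketch of the earlier payment example (the step names and the simulated failure are invented for illustration; in the real pattern each action is a local transaction inside its own service, driven by events rather than direct calls):

```python
def run_saga(steps):
    """Run a sequence of local transactions; each step is a pair
    (action, compensation). If a step fails, run the compensations of the
    already-completed steps in reverse order, then report failure."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for comp in reversed(completed):
                comp()
            return False
        completed.append(compensate)
    return True

log = []

def accounting_update():
    # Simulate the third local transaction failing.
    raise RuntimeError("accounting system down")

ok = run_saga([
    (lambda: log.append("pre-authorize card"), lambda: log.append("release pre-auth")),
    (lambda: log.append("charge payment"),     lambda: log.append("refund payment")),
    (accounting_update,                        lambda: None),
])
# ok is False; log shows the two completed steps, then their
# compensations in reverse: refund payment, release pre-auth.
```

This is the essence of a saga: no distributed transaction, just local transactions plus an agreed-upon way to undo them.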
It also helps you with visualizing what the different microservices are that I'm calling, what happens if there's an exception, how many times I need to retry, and how I log those errors. If your requirements are more around that, then you typically want what is called an orchestration-based saga. Diving deeper: in the case of event choreography, as I briefly mentioned, you have an event broker on which different microservices publish and subscribe to events. There is no centralized logic as such, meaning you can have new microservices, new publishers and subscribers, coming in and subscribing to these events as they need to. The only thing they need is the selection criteria for the event. This is mostly used for publishing events across business domains. If you're coming from a domain-driven design point of view and have gone through some of those concepts, you're likely to have seen how you have different business domains, how you create context maps, and the dependencies between them. Now, as you start building microservices aligned to those domains, you'll realize you need certain patterns for how you exchange events across business domains versus events within the same business domain. Event choreography is typically used when one business domain wants to generate an event that needs to be consumed by another business domain, and in those cases an event bus is the most scalable and most flexible way to publish those events and have event choreography. Event orchestration, on the other hand, involves a centralized orchestrator. As I said, you use another service, another component in your architecture, whose sole responsibility is to control the order in which these microservices are executed.
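A central orchestrator boils down to a loop that runs named steps in order, with retry and error-logging policy in one place. This is a hedged sketch of that control flow (names and retry policy are made up; a real orchestrator such as a workflow engine adds state persistence, timeouts, and visualization on top):

```python
import logging

def orchestrate(steps, max_retries=2):
    """Run named steps in a fixed order, retrying each up to max_retries
    times and logging failures -- the visibility and control that a
    central orchestrator provides."""
    for name, step in steps:
        for attempt in range(1, max_retries + 1):
            try:
                step()
                break  # step succeeded, move on to the next one
            except Exception as exc:
                logging.warning("step %s failed (attempt %d): %s", name, attempt, exc)
        else:
            # All retries exhausted: stop the workflow here.
            return f"failed at {name}"
    return "completed"

attempts = {"payment": 0}

def flaky_payment():
    # Fails once, then succeeds, to show the retry behaviour.
    attempts["payment"] += 1
    if attempts["payment"] == 1:
        raise RuntimeError("gateway timeout")

result = orchestrate([("pre-auth", lambda: None), ("payment", flaky_payment)])
# result == "completed": the payment step succeeded on its second attempt.
```

Contrast this with choreography: here the ordering, retries, and error handling live in one component instead of being implied by which service subscribes to which event.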
Most commonly, you will see this pattern used when you have microservices within the same domain, as I showed in the previous example for payment processing, and as we will see in the demo as well: once you get a certain event, you may have to go through multiple systems and multiple components in your workflow before you generate a final event. That is what orchestration allows you to do. It is easier to visualize, and you're able to control the order. It is typically used for orchestrating events within a business domain. So, as I said, across domains you would typically use choreography, but if you're trying to coordinate, or rather orchestrate, microservices within the same business domain, orchestration gives you that flexibility. If you're using an orchestrator, you can use not only asynchronous communication but also synchronous API calls; you have that flexibility. In AWS, we have a service called AWS Step Functions, which gives you a cloud-native service for orchestrating microservices. It works with over 200 service integrations: you can invoke, for example, a Kubernetes job, you can have a Lambda function invoked, and you can orchestrate them together. So we have a demo for a managed care plan. This shows how EDA works in practice, in a real-world scenario. In this case, I'm showing one domain, provider management. If folks are coming from a healthcare background, this may resonate better. Let's say you have a doctor or a nurse who's publishing their availability in terms of the services they provide. So you have this provider management domain, where they publish their availability, saying "I'm available at this time," and then you have something called a patient care plan.
Think of it as an app you may have on your phone, for example, where you simply go and say: hey, I want to see a doctor, this is the specialty I need, this is the language I want my provider to speak, and these are the times I have available; I would like to schedule it. As you can imagine, these can be two different subsystems altogether, and it gives you a classic case for building a system which is event-driven, showing how events may be exchanged between the two to achieve a certain business functionality. The architecture we've used for this is all based on Kubernetes, and most of the components are open source. We use Spring Boot as the framework on which the core business functionality is built and the APIs are created, and we use Kafka as the event bus through which these events are exchanged. Once we publish those events and need to create a workflow out of them, we use Argo Workflows for that. As I dive deeper into this architecture, there are different parts around the provider schedule and the care plan, and then the part where I show the event choreography: as bookings are made, the same booking may go through different states. For example, a provider says, okay, I'm available at this time; the care plan says, I have booked this particular slot for my appointment; now the patient may decide to cancel it because something else has come up. Once the patient cancels, the state of that event changes, but it's published on the same topic, what you see here as the care plan bookings topic. And based on that attribute of the event, be it confirmed or canceled, you have different handlers.
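The consumers on the bookings topic all read the same stream and route on the event's status attribute. Here is an in-memory sketch of that dispatch (the status values, field names, and handler bodies are made up for illustration; the demo does this with separate Kafka consumers):

```python
handlers = {}

def on_status(status):
    """Register a handler for booking events carrying a given status attribute."""
    def register(fn):
        handlers[status] = fn
        return fn
    return register

@on_status("CONFIRMED")
def handle_confirmed(event):
    return f"reserve resources for booking {event['booking_id']}"

@on_status("CANCELLED")
def handle_cancelled(event):
    return f"free the slot for booking {event['booking_id']}"

def dispatch(event):
    # Same topic, many consumers: routing is decided by the event's
    # status attribute, not by who produced it.
    return handlers[event["status"]](event)

action = dispatch({"booking_id": 7, "status": "CANCELLED"})
# action == "free the slot for booking 7"
```

Adding a handler for a new state (say, RESCHEDULED) is just another registration; nothing else in the flow changes, which is exactly the extensibility argument from the start of the talk.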
So this is how you implement what is called choreography, where events are published on the same topic or event bus, but based on the attributes of each event you have different consumers consuming it, processing it, and maybe sending a result back on the same event bus. Then, if you dive deeper, the other use case, where orchestration makes the most sense, is when a booking has been confirmed and you need certain resources for that booking. In other words, let's say this is at a clinic: you will want to schedule personnel for it, for example the nurses, and you may want a room booked for it. Those may be different steps in a workflow. This is where an event orchestrator comes in, and we use Argo Workflows to execute the different steps in that business process. So for the demo, I'll just open it up; I hope it's visible. This demo walks you through some of the pieces of the solution we've built. As I said, we used Data on EKS. If you're familiar with Data on EKS, these are basically our EKS blueprints, where we have taken a lot of data platforms and shown how to run them on EKS. These are the Terraform templates you get with it, and as you can see here, Kafka and Argo Workflows are the add-ons we have used to deploy both Kafka and Argo Workflows on EKS. This shows the Kafka cluster; it uses the Strimzi operator to deploy it. And then I'll show you the Argo Workflows piece of it. As you can see here, this is the event bus; we use the default event bus, with a Kafka event source, and then there's the sensor we have, which triggers the workflow.
Going through the code, this is the Spring Boot application I was talking about, with the business functionality we have around booking an appointment. It uses a Postgres database to store the bookings and the availability of the providers. These are the deployment artifacts, and as you can see, they use the same Docker images we have for these different components. Next, I'll show you how it actually works. As I mentioned, this is where we invoke the API to submit an availability. As you can see, the provider says that he or she is available from this time to this time, and the encounter type is on-site, meaning whether you're doing a virtual visit or an in-person one. This has now created the availability for that provider. Now, if you go to "get all provider availability", you'll see the host URL has changed. That means I'm invoking it on the care plan side, to see the availability that was just posted in the previous API call. Mind you, these are the two different microservices I showed you earlier. Next, this is where I create my booking: I take that availability ID and say I want to book an appointment, and I'm waiting for confirmation. If you remember from the previous picture, it sends an event to get the confirmation from the provider side. Now, if I go and check whether the booking I just made has been confirmed, you'll see it now shows as confirmed, and the payment mode shows whether it's insurance pay versus self pay. Similarly, as soon as the confirmation has happened, this is where our Argo workflow gets invoked, which runs the different steps, post confirmation, for handling the resources in an actual provider setting.
Once I take the token and go into my Argo Workflows console, you see there is a workflow that just got executed, about 50 seconds ago. Let's go to it. There you go, this is the one, and it shows you the different steps. For example, this is the input, the event I just received with the confirmation; it's making the room reservation that is needed, and then sending a confirmation to the patient saying: hey, your booking has been confirmed, and this is the clinic where you need to show up, and so on. With that, let's get into some of the deployment pieces of this, and I will hand it over to Carlos. Thank you.

So, switching hats here, literally: I'm going to switch this hat from developer; I have one here that says platform engineer. Are there any platform engineers here? Yeah, App Developer Con rocks. So, two components we talked about here. We mentioned Data on EKS, and in that space we also work with the community in the Data on Kubernetes working group, with people from our team working there on the best way to run stateful workloads in Kubernetes. In AWS, for example, we have Amazon MSK, managed Apache Kafka, where you say "give me a Kafka service" and it gives you just the broker URLs; if you're familiar with Kafka, that's basically the only thing you worry about. When you're deploying Kafka on Kubernetes yourself, there are different aspects of resiliency and scalability you have to take into account. One of them is that you want to deploy across different failure domains. In the cloud, you have multiple availability zones; on premises, these are like, for example, different buildings. And the brokers and ZooKeeper need to be close to each other, because there will be a lot of communication between them, so you want to have them in the same availability zone and not crossing over.
The other aspect is the operator Mithun mentioned. Using an operator to deploy makes the setup more scalable in terms of managing the configuration. Using Strimzi is highly recommended; it's what we use in our patterns to run Kafka. And the other aspect is the consumers, the microservices that will consume from these stateful workloads; these will be your producers and consumers. They could be in the same cluster, but most likely you want them in a different cluster, close to each other, so you have low latency between them. Remember that these are TCP connections when you're talking about Kafka, where producers produce events into Kafka topics and consumers take out the messages that are sitting in partitions on the brokers. There are different patterns if you're running on EKS or Data on Kubernetes, but at least for us, we have a project called EKS Blueprints where you can deploy Kafka and get it up and running, and we have helped a lot of users run Kafka efficiently on Kubernetes. So those are a few tips for running Kafka in production. If you have run Kafka in production, you know it takes more than that, but at a high level this is a good starting point for the aspects to consider when you get started deploying Kafka on Kubernetes. The second component is Argo Workflows, but as we mentioned in the demo, Argo Events usually goes together with it, because that's the event source for Argo Workflows: Argo Events has a sensor for Argo Workflows that basically instantiates a workflow. In terms of scaling Argo Events, there are two aspects to it. One is the event source, which in our demo was Kafka, but it can also be something like a GitHub pull request event; there are a couple of them available. Event sources scale by deployment, so there are not that many problems scaling them.
It's just a matter of having enough deployments or pods handling them. The one that will be your bottleneck, and what you need to be more concerned about, is the event bus. In Argo Events, the one we used in the demo is NATS, the one we call the native event bus. There's also JetStream, which is the next step up from that, and then there are folks using Kafka, not saying it's the best one, but they are using Kafka anyway, so they want to reuse it for the event bus, and that is one way of getting the scalability and resiliency from the Kafka you already deployed with Strimzi. When it comes to Argo Workflows, where you have to watch out for scalability is the workflow controller. The way it works, when you create an Argo workflow, that's an instance of a running job, you could say, and any status on that workflow, plus its inputs and outputs, is saved on the CR. So you're writing back to etcd, and as you get to a thousand, two thousand, that many CRs, the Argo workflow controller has scalability problems. You then need to start sharding manually, deciding which workflows map to which controller, and you can have multiple controllers. We started a SIG in Argo, SIG Scalability, where we are looking into helping scale Argo CD and Argo Workflows. And to finish: as Argo Workflows runs jobs in parallel on EKS, the number of workflows you can run in parallel at the same time is basically how many pods you can run on an EKS cluster, and there is garbage collection built into it. In terms of scaling Argo Workflows, you also have to take into account the different types of storage. If ephemeral storage is enough, you can go with local SSD storage on the node; for more resiliency and durability of your data, S3 is the better choice.
In between, there are different aspects of which storage to use, and that will also affect the scalability of your Argo Workflows. And with that, we want to say: give it a try. This is the URL for the example, which works out of the box. If you're a platform engineer, you may know Terraform: you can do terraform apply and it deploys. You can use things like the GitOps Bridge if you want to deploy the Helm charts, and all these patterns are under Data on EKS, where we have patterns for JupyterHub, Spark, Ray, all these stateful workloads that run on Kubernetes. They're kind of complex to run, but we have patterns there that people are using and sharing, and you can open pull requests; Data on EKS is an open source project. So with that, I think we're done. Thank you so much. Thank you.