Okay, so who am I? I'm Mohan Muthukumar. I'm a senior software engineer at Avalara, the largest tech company that you've not heard of. We are in the tax compliance business, and we are part of almost every transaction. So if you've bought something online, there's a very good chance that it went through Avalara's systems. The task my team does is go over the internet, crawl a lot of pages, and look for product information, tax changes, statute changes, all these things. Effectively, we have to run a mini Google. So given the scale of our operation over the years, we've seen all kinds of systems. I mean, if you can name it, we've done it. So the topic at hand: microservices are becoming unmanageable. If you work at a large company, or even a moderately sized company, all this will be very familiar to you. These days there are more and more microservices. You have a single engineer managing 20 to 30 microservices, plus all the plumbing code to make these microservices talk to each other, plus the DevOps surrounding it. All of this seems to take much longer, and a lot of YAML, much longer than the actual business use case. You end up in a scenario where the business use case takes only three or four days to build, but the deployment and the plumbing code, making sure everything can talk to each other, takes several times longer than solving the actual business use case. This is not a good situation to be in. And just to be clear, this is not a "let's return to monoliths like the good old days" kind of talk. That's a sentiment that's becoming popular in certain circles, and I don't blame them, because things used to be much simpler 15 years ago. All you had to do was write some code, deploy it in an application server, and you had a web app.
So there is a reason why we got here, and those reasons don't just go away, so you cannot return to monoliths. Let me give you an oversimplified version of how things were and how we got here. This is how things were 15 years ago: you had a single service with multiple concerns, and all the clients talked to this single service. You had signups, you had multiple resources on different paths, and that was it. But a couple of things happened after that. One, the internet happened. Instead of serving thousands of users, you now had to serve literally hundreds of millions of users. So internet scale happened. And then the most popular runtime, Node.js, which improved developer productivity by a lot, ran on only a single thread. Simultaneously, on the hardware side, you had processors with multiple cores, with 16, 32, and 64 cores on the server side, but a runtime that ran on a single thread. So you were essentially going to be wasting a lot of cores. A few years later, Docker came in, and one of the things it gave us was the ability to run multiple Node processes on a single machine. That suddenly gave you a way to utilize all those idle cores. So people had something like this: multiple copies of the same app running on a single machine, and multiple copies of the same app running across machines as well. It didn't make a difference. You just put a load balancer in front, any client talks to the load balancer, the load balancer directs each request to any copy at random, and you can serve the traffic. So you were able to scale really well. But there was still a problem: rather than running multiple copies of the same app, it made more sense to slice these up vertically and have the slices scale separately. Maybe you don't have a lot of /signup and /login requests.
So how about we take all that and make the quintessential example of a microservice, the user service, and deploy it separately? So people started slicing things up vertically. It also helped that most organizations could have dedicated teams working on dedicated services with a limited scope, and that you can scale these things separately. Let's say one service is getting a lot of traffic: you can scale that service independently without allocating resources for everything else. So it made sense. And if you've ever read about microservices, this is a diagram you've probably seen already. Except this is not true, right? This is true only if your client is the only thing talking to all the services. In reality, more often than not, your services need to talk amongst themselves too. So your diagram starts to look something like this, and you can already see it becoming more complex than the previous diagram. And soon enough, since there is nothing stopping people from creating more microservices, it is going to look like this. This is an actual image from Netflix, and it in fact looks like this. Now imagine working with a system like this. Not only is it unmanageable, because nobody has a complete view of the whole system, but companies also need dedicated platform teams, dedicated SRE teams, and all that, just to alleviate the pain of dealing with this complexity for the business developer. And not only is it unmanageable, it is also unreliable. Imagine working in this environment and you want to deploy a new service: you don't know what is going to break. As soon as you deploy a new service, if something is wrong, it could potentially bring down the entire system.
So developers tend to work in this frightened environment. And one thing that is often overlooked: if a service call fails in a distributed transaction, say in an e-commerce system a call to the payment service fails, you should also revert whatever came before it, right? Revert the dispatch, something like that. But in a situation like this, people tend to miss that, and it's not obvious to them that they should be reverting everything that came before. So this makes things unreliable. And we got here for scalability, right? We wanted scalability, and we sacrificed reliability. This seems generally true: any time you try to increase scalability, your reliability suffers, and any time you try to increase reliability by introducing transactions or locks, your scalability suffers. But this is not actually true. It sounds true, but it's not. You can have a system that's both reliable and scalable, and that's where you want to be. So is it possible to build a system like that? We've done it in the past, but it requires a Frankenstein's monster of systems. You need complex state machines involving Kafka queues, distributed locks using etcd, all those kinds of things. You need a centralized data store for anything that's critical. And it might still break when a new service is introduced; there is no guarantee. It does nothing to improve your situation with respect to rollbacks either. Also, given the number of systems involved, you'll have a lot of different deployment tools and CI tools, and you'll have to write literally ungodly amounts of YAML. I'm pretty sure nobody here enjoys writing YAML files. So this is the situation. Is there a better way? There is a better way.
For an analogy, think about operating systems. Operating systems have solved this. Hardware is unreliable, just as distributed systems are; the network is unreliable, and so is your actual hardware. But your program doesn't need to account for that, because your operating system gives you an abstraction; it abstracts all of this away from the programmer. So what we need in the case of all these distributed systems is an abstraction, and Temporal is just that. To give you a bit of background on what Temporal is: this is a project that's been brewing for nearly a decade. The team behind Temporal was also the team behind Amazon Simple Workflow and Azure Durable Functions, and right before Temporal they were working on an open source project from Uber called Cadence, which is very similar to Temporal aside from some internal differences like gRPC versus Thrift. Temporal treats your entire application as a workflow. Everything is a workflow: a series of steps that executes to completion. If you think about it, that describes any computer program, a series of instructions that run to completion. But what makes Temporal different is that workflows have durable state, unlike an ordinary program. And you can do all this using code, without writing any YAML. Temporal gives you the building blocks to build your distributed system, and it handles all the communication for you, so you don't have to deal with writing to the right queue and all that. It also stores your execution state, such that if something happens to your workflow because of some infrastructure issue, it can recreate that state on a different worker or a different pod. This sounds like the missing piece in the stateless Kubernetes ecosystem, if you think about it.
Creating a Temporal system is pretty easy. You have workflow functions and you have activity functions. In workflow functions, you specify what comes after what; essentially, you create the state diagram using the workflow functions, and the workflow functions need to be deterministic. Then you have activity functions, which can be just about any kind of function. Let me give you a hello world example. Here you can see the workflow function and the activity function. The workflow function has a bit of config, and then you specify that you want to execute this activity. The activity function just returns "Hello" plus the name, your general hello world example. So the activity returns "Hello World" into this result, and the workflow returns it as well. That's basically it. Here is a more complex and more useful example. Let me zoom into the first part of the code. The first part is just me setting up some config for timeouts and retries. We have retry policies for the activities, which I'll come to in a bit, and we're also setting some timeout conditions. The interesting piece is the bottom half, so let me go to the bottom half and throw away the config part. You don't actually have to specify all of this config; there are sensible defaults. Some of these settings are required, some are not, and for the things you don't specify, it'll use the default value. So this is the bottom half. Imagine an e-commerce workflow; this is just a part of one. Let's say the user has already placed an order, and now you need to start subtracting from the inventory.
You need to take the item off the shelf so that somebody else doesn't add it to their cart, and you also need to send the user to the payment page. Let's say you want both of these running simultaneously. So essentially we've invoked both of these activities simultaneously, and here on line nine is where you await the result by doing a .Get call. So both of these get executed simultaneously. And here, if you see, we are using Golang's error handling primitives, the way you would normally handle errors in Golang, to handle an activity failure. Based on the retry policy you saw on the previous screen, Temporal will already have retried multiple times, five times for example, or until you hit the exponential backoff threshold. Only if it reaches the maximum number of attempts allowed does it go into this block. And here you can see that if the payment activity fails, we put the item back on the shelf. So you can handle the compensation logic right inside your error handling, which makes it a lot easier to wrap your head around. And if you think about it, there is a problem with this code, right? We are doing this for the payment activity, but if the inventory activity fails, we're not reverting the payment. It becomes very obvious that we should be, but we're not doing it. You can actually catch this during a code review. Imagine how long it would take to find this in all those YAML-based systems, where chances are things don't even exist in the same repository. So let me give you an example of an activity. Activities can be anything, any Golang function. The only requirement is that it takes a context as its first parameter and returns at least an error.
Here I'm just making an HTTP POST call to this inventory service, and if the call fails, I return the error. Obviously I don't have the service actually running. Okay, before that: both activities and workflows need somewhere to run, and that is your worker code. Here we configure a queue name for the worker, and then we register every workflow and every activity with this worker. Since this is going to run as a Kubernetes pod, we've also set up a health endpoint, and that's basically it. Now, since this worker is going to be part of a deployment, if you want to increase the throughput of your workflow, all you have to do is scale this worker, and depending on the logic of your workflow code, it will scale horizontally. So this gives you reliability, since workflows are durable: anything that you have in the workflow function will actually complete. There is no infrastructure failure that can prevent the workflow from completing. Let's say a workflow function is running on a worker and the worker goes down for any reason, say your spot instance went away. Temporal will know to restore the state of that workflow on a different worker and resume it. That gives you a lot of reliability. Plus, like I showed you, you can handle your failure cases using the error handling constructs of your programming language, so you can't forget to handle them; you'll catch it during code review. You also get retries with exponential backoff. Imagine having to reinvent this every time you send a request to a flaky external service, and pretty much anything across a network is flaky given long enough time. So you want retries, and Temporal gives you that out of the box.
Going back to our activity code: since none of those services actually existed, all my tries at hitting that service were failing. If you look at the standard out, you can see that we attempt it multiple times: attempt two, attempt three. You can also see that we are in fact hitting these simultaneously; we're not waiting for the payment activity to finish before hitting the inventory activity. We try both of these simultaneously and both of them fail, and just by looking at your logs you can tell which activity is failing. Plus, Temporal also comes with a web UI, just like Kubernetes; it's an optional piece that you can install. It gives you a bird's-eye view of all the workflows that are running, all the runs that we have. You can go inside any of these runs and it will give you the entire event history of everything that happened. Here you can see on line 13 that this is the part that is failing, and it gives you the reason for the failure and the number of retries. Here it says attempt five, so this activity is being tried for the fifth time, and since that also failed, it says maximum attempts reached. So that was reliability; now on to scalability. Part of the reason why you end up with hundreds or thousands of microservices is, one, it's convenient to start a whole project from scratch, but also because people tend to prematurely decouple services in anticipation of future scale. People might think: maybe today I'm getting only hundreds of requests, but in the future this could potentially get hundreds of thousands of requests.
So let me decouple this now and make it a separate service, so that I don't have to decouple it later. And because people have burnt their fingers trying to decouple monoliths, people want to prematurely decouple services these days. But you don't have to do that with Temporal. Take this example: you have three activities running, and let's say, on line 15, the payment activity needs to run at a different concurrency; let's say it's also an expensive call. All you have to do is remove this activity from this worker, create a new worker, register that activity on it, deploy that worker as a separate Kubernetes deployment, and scale it independently. And here, if you see, we have also set the concurrency to six; the default is 1,000. So only six instances of this activity can run on this worker at any given time, and that way you won't be bombarding the payment service with a lot of requests. Especially if you have burst traffic, you'll only be handling six requests at a time. And you don't have to prematurely decouple services, because it's as simple as registering an activity on a different worker. If you think about it, I'm sure all of you have services at your company that don't really get a lot of traffic but still need their own deployment, because there is no other way to deploy them. There are services that hardly get 100 requests a day, but they run as dedicated services. With Temporal, you don't need dedicated services for that. It can just be another activity, executing the logic inside the activity function itself, and it can get scheduled whenever it is required. That also reduces the number of microservices that you need. And your application can be made of thousands of workflows, and all these workflows can be executed concurrently.
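A sketch of what that dedicated worker's bootstrap might look like with the Temporal Go SDK; the task-queue name and the activity are hypothetical, and this is deployment boilerplate that assumes a reachable Temporal server rather than something you can run standalone:

```go
package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

// Hypothetical expensive activity; in the real system this would call
// the payment service.
func CollectPayment(ctx context.Context, orderID string) error { return nil }

func main() {
	// Connect to the Temporal server (localhost:7233 by default).
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// A dedicated worker for just the payment activity, listening on
	// its own task queue so it can be deployed and scaled
	// independently of the main worker.
	w := worker.New(c, "payment-task-queue", worker.Options{
		// Run at most six instances of this activity at a time
		// (the default is 1,000).
		MaxConcurrentActivityExecutionSize: 6,
	})
	w.RegisterActivity(CollectPayment)

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```

Moving an activity to its own worker is a code change plus a new deployment of the same binary shape, not a new microservice.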
Let's say you are building a complex application: there is no limit to the number of workflows an application can have. In fact, every day we run 200,000 workflows. And there is no limit to the number of activities your workflows can execute either; the only limit is that a single workflow can have only 50,000 events in its event history, and you can work around that limitation by creating child workflows and making sure that no child workflow has more than 50,000 events. One more thing. If you remember the beginning of my talk, I told you part of the problem is all these infrastructure automation tools for which you have to write config. What if your application could bring its own infrastructure? Imagine that at the beginning of the workflow, you have an activity that talks to the ASG, using the AWS SDKs or CDK, anything, and increases the ASG count; say you start with five and increase it to ten. Then you talk to the Kubernetes APIs to increase the pod count, and then you execute your business logic, the actual logic we built in the previous screen, for example the commerce workflow, as a child workflow. On the way out, you decrease the pod count, decrease the ASG size, and perhaps send a Slack notification indicating workflow completion. And you did all this without writing a single line of YAML. So yeah, your application can bring its own infrastructure. Other use cases: we have a lot of DAG-based systems, like data pipelines, ELT pipelines, and machine learning pipelines. All of these naturally lend themselves to Temporal, because DAG-based pipelines are a subset of what Temporal can do. And CI/CD systems, like I showed you in the previous example, can very easily become Temporal workflows.
And making them Temporal workflows opens up a lot of possibilities: running smoke tests, integration tests, chaos tests, all those kinds of tests. You get a lot of flexibility. Aside from these DAG-based systems, you can also build complex event-driven systems. Temporal gives you a construct called signals. Just like you can wait on a Golang channel, an entire workflow can wait: your workflow execution halts until you receive an input from that signal. One more thing you can do is build your own DSL-based workflows. Let's say you already have some DSL that you want to execute; you can use Temporal to write an interpreter for that DSL and execute it. So, closing thoughts. You don't need me to tell you how powerful an abstraction can be. You already saw that with Linux containers and Docker and the kind of impact they had, and you saw it with Kubernetes and the kind of impact it had. A few years ago we had this serverless moment, and at its peak we had pretty much every single cloud company come up with an offering, and also a promise: a promise that developers wouldn't, and shouldn't, have to deal with all this glue code around their business code, and that ops would be taken care of seamlessly. But that kind of fizzled out, right? Now here is a system that helps you check a lot of those boxes, although in a different way. Temporal might just be the system that fulfills the promise of serverless. And with all this unlocked productivity, imagine the possibilities. So thank you. That was my talk.