Hi, folks. This is Code That Can't Fail, Backed by Cassandra. My name is Lauren, and I'm a Developer Relations Engineer at Temporal. I'm going to introduce a concept called durable execution, and then I'll explain how we designed a system that provides durable execution. Finally, I'll share a few learnings we've had working with Cassandra at scale.

Part 1, Durable Execution. So what even is code that can't fail? Now, I don't mean expected errors, like your code charges a card and gets a card-expired error. What I mean is that you can write a function that will not fail to complete executing; it is guaranteed to finish running. Now, you might be thinking: what's the big deal? Of course functions finish executing. That's how running code works. But there are a number of different cases in which your code might fail to complete. The process could crash, maybe because you divide by zero. The process could be killed by the OS, maybe because it's out of memory. The machine could lose power. You could deploy a new version of your code, and while usually you'd set up a graceful shutdown for the old processes, they may be doing something longer than the grace period, like a sleep statement for one hour, and that would get interrupted by the forced shutdown after, say, a five-minute grace period. The final cause is transient failures, like being temporarily unable to reach a downstream service. Now, of course, you could catch and handle this type of error and retry with exponential backoff, but (a) that can be a lot of code to write everywhere you hit the network, and (b) if the retries go on for long enough, you'll go past the redeployment graceful-shutdown window, get killed, and lose the state of what you were trying to do. Durable execution takes care of all of these failures.
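That retry-with-exponential-backoff code might look something like this minimal Python sketch (the function names and parameters are hypothetical, not from any particular library):

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Call fn, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of retries: the failure escapes, and in-flight state is lost
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter to avoid thundering herds

# Example: a flaky downstream call that succeeds on the third attempt.
attempts = {"n": 0}

def flaky_charge():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("temporarily unreachable")
    return "charged"

result = call_with_backoff(flaky_charge, base_delay=0.001)
```

And this is exactly the code you'd end up copying around to every network call site, which is the first half of the problem; the second half, losing the retry state when the process dies, is what no amount of in-process retry code can fix.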
It runs your code in a way that persists each step taken, so that in the event of a failure, it can read the execution history from disk and continue executing your function from the same place, with the call stack and local variables intact. So this is great for any application that needs a high level of reliability and correctness: applications that already are, or should be, manually checkpointing progress, saving state to the database after each step to ensure the ability to recover from execution getting interrupted. These applications can throw away their manual database updates and state-machine code and instead write a durable function with automatically persisted steps and recovery.

But maybe you're thinking: my application doesn't need that level of reliability. I never have long sleeps, I already have retry code, and it's okay if on rare occasion a restart interrupts a retry and it's lost. Sure, it's a bug, and we'll either lose data or end up with inconsistent data, but if it happens once in a blue moon, that's fine; support can handle it. My answer to that is: if you have a decent-sized load, even with an application with a normal amount of complexity in terms of downstream services and third-party APIs, you're going to have a significant amount of dropped work. So you're going to need to persist retry information and timers and have a pool of workers that pick up dropped work, and you need to do that at every service boundary and for every important third-party API. That's a lot of code to write, debug, and maintain. That's why people do choreography, using a message bus to coordinate between services, or orchestration, where there's a central orchestrator deciding which services to call. Choreography still involves a lot of code and gets really complex to reason about and debug when things go wrong.
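To make that manual-checkpointing pattern concrete, here's a minimal sketch of what durable execution replaces: saving progress to storage after each step so a recovering worker can resume where the last one left off. The step names and the storage dict are hypothetical stand-ins for a real pipeline and database.

```python
# Hypothetical order-processing pipeline, manually checkpointed.
STEPS = ["reserve_inventory", "charge_card", "send_confirmation"]

def run_order(order_id, storage, do_step):
    """Run each remaining step, persisting progress after every one.

    `storage` stands in for a database table of checkpoints; after a crash,
    a new worker calls run_order again and skips the completed steps.
    """
    done = storage.get(order_id, [])
    for step in STEPS:
        if step in done:
            continue  # already completed by a previous (crashed) worker
        do_step(step)
        done = done + [step]
        storage[order_id] = done  # checkpoint after each step
    return done

# Simulate a crash mid-pipeline, then a recovery run.
storage = {}
executed = []

def crashy(step):
    executed.append(step)
    if step == "charge_card" and len(executed) == 2:
        raise RuntimeError("process killed")  # stand-in for a crash

try:
    run_order("order-1", storage, crashy)
except RuntimeError:
    pass  # the process died after charging but before checkpointing

result = run_order("order-1", storage, executed.append)  # recovery run
```

Note the subtlety the sketch surfaces: the crash landed between doing the charge and checkpointing it, so recovery re-runs the charge. That at-least-once behavior, and the idempotency it demands, is exactly the kind of thing the state-machine code has to handle at every step.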
In the Microservices Patterns book, Chris Richardson recommends using orchestration for all but the simplest use cases, and you can think of durable execution as a developer-friendly version of orchestration. So if your work is important and you don't want to drop it, you can either write and maintain a lot of code to try to make sure it's not dropped, or you can use durable execution and code at a higher level of abstraction, where you don't have to care about crashes and deployments and retries, or even about saving state.

Not only does durable execution take care of things you used to have to do manually, it opens up new ways of programming. You can sleep for 30 days, and not only will the function reliably go to the next line of code after 30 days pass, you also won't be taking up resources during those 30 days, because the system will know the function isn't in use, kill it in order to free up resources for active functions, and, when the 30 days come around, recover the function to the correct state. So you could easily code a subscription that charges the customer every 30 days in a loop, and that will run reliably.

Since these functions are potentially long-running, you may want to query them for their state or send them instructions, so durable functions provide mechanisms to receive and respond to RPCs. For instance, on Amazon there's a 30-minute cancellation window on each order. If that were implemented with a durable function, you would start the function at order time; the function would reserve the item, sleep for 30 minutes, and, if it received an RPC that said cancel before then, it would free up that inventory, send a cancellation-successful email, and return. Durable functions can also be indefinitely long-running. For example, you could have a customer function that runs forever and implements your customer loyalty program.
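A plain-Python sketch of that subscription loop's logic follows; there's no real durability here, `sleep_30_days` is a stand-in for the system's durable timer, and all the names are made up for illustration.

```python
class Subscription:
    """Logic of a durable subscription function: charge every period until
    a cancel RPC arrives. In a durable-execution system, the loop position,
    the timer, and `self.canceled` would all survive process restarts."""

    def __init__(self, user, amount):
        self.user = user
        self.amount = amount
        self.canceled = False
        self.charges = []

    def cancel(self):
        """RPC handler: flip the flag; the loop exits on the next wake-up."""
        self.canceled = True

    def run(self, sleep_30_days):
        while not self.canceled:
            self.charges.append(self.amount)  # stand-in for calling charge()
            sleep_30_days()                   # a durable timer in the real system
        return len(self.charges)

sub = Subscription("alice", 999)

def fake_sleep():
    # Simulate the cancel RPC arriving during the third billing period.
    if len(sub.charges) == 3:
        sub.cancel()

total_charges = sub.run(fake_sleep)
```

The point of the abstraction is that `sleep_30_days()` and the loop variable cost nothing while idle and survive any number of crashes and deployments in between iterations.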
Whenever the customer makes a purchase, you send that customer's durable function an RPC with the purchase info, and that function adds to the customer's loyalty points and decides whether and when to send them an email, maybe with a coupon or an encouragement to hit the next loyalty level. Finally, if you have a function that can last forever, that can't fail, and that can respond to RPCs, you no longer need a database. In the last example, when the customer logs into their account to view their loyalty points, we don't need to look them up in a database; instead we send a get-points RPC to the function, and it responds with the number of points.

To show more concretely what I'm talking about, I'll look at a couple of examples. First, here's the subscription implemented as a durable function. It takes a user object and an amount to charge. It starts out with canceled = false and sets up a handler for an RPC called cancel, which sets canceled to true and sends a confirmation email. Then it loops: while not canceled, charge the user and sleep 30 days. So the function will wake up when the sleep timer goes off or when it receives a cancel RPC, and otherwise it won't take up resources. Also, both the send-email and charge functions will automatically retry on failures, like our email service being down or Stripe being unreachable.

Now here is the loyalty program example. When a user signs up, you call the function with the user object, and they start with zero points. Whenever the user makes a purchase, we send a notify-purchase RPC to the function, and it adds to the points total and decides whether to send the user a coupon. Whenever the user views their profile, we send a get-points RPC to the function, which returns the points total. Finally, the function needs to not return at the end, so that it stays running, so we await a promise that never resolves.
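Here's the logic of that loyalty function as a plain-Python sketch. In the real thing this would be a durable function whose methods are registered as RPC handlers with the SDK; the class name, threshold, and point scheme here are all made up for illustration.

```python
class LoyaltyProgram:
    """Plain-Python sketch of the loyalty function's logic.

    In a durable-execution system this would be an indefinitely-running
    durable function; the methods stand in for its RPC handlers, and
    `points` stands in for a durable local variable, with no database."""

    COUPON_THRESHOLD = 1000  # hypothetical: a coupon every 1,000 points

    def __init__(self, user):
        self.user = user
        self.points = 0  # durable local state in the real system

    def notify_purchase(self, amount_cents):
        """RPC handler: called whenever the user makes a purchase."""
        before = self.points
        self.points += amount_cents
        if before // self.COUPON_THRESHOLD < self.points // self.COUPON_THRESHOLD:
            return "coupon"  # stand-in for sending a coupon email
        return None

    def get_points(self):
        """RPC handler: called when the user views their profile."""
        return self.points

program = LoyaltyProgram("alice")
first = program.notify_purchase(600)    # 600 points, below the threshold
second = program.notify_purchase(600)   # crosses 1,000: coupon time
balance = program.get_points()
```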
So that's a whole loyalty program. It's pretty basic, but it's reliable, scalable, and durable; we don't need to save the points total to a database, and we can trust that the local variable will always be there and accurate. And that's durable execution: it's running functions in a way that is process-independent. They survive process death, and long-running functions are intentionally run by different processes over time. It automatically retries anything that might have transient failures. You can sleep for arbitrary periods of time. You can send messages to the function and receive responses. Your functions can last indefinitely long, and you can treat your local variables as durable state.

Here are some of the major systems that support durable execution. At the bottom is Temporal, where I work. Our co-founders, Maxim and Samar, launched AWS Simple Workflow Service in 2012, the first durable execution system. It was developed after Amazon switched to microservices, ran into the problems of communicating between them with messages (i.e., event-driven architecture and choreography), and wanted a better way of coordinating. Samar went on to Microsoft to build the Durable Task Framework, which became so popular that it was adopted by Azure Durable Functions. Samar and Maxim got back together at Uber, where they created Cadence in 2016, which is used by Uber Engineering to coordinate across over a thousand microservices and was adopted by a number of other companies as well. They left Uber in 2019 and forked Cadence to start the Temporal project and company, in order to improve the software and bring it to more developers, and we now have 180 employees working on it, including me. And here are some of the companies that use Temporal, including Stripe, Netflix, Box, and Datadog. Every Coinbase transaction, every Snap story, every Airbnb booking is a durable function.
Now that we have a better sense of what durable execution is, let's get into how Temporal implements the system that provides it. At a high level, there are three parts to the system. There's a client library that you use to start durable functions, stop them, or send RPCs to them; it talks via gRPC to the server, which saves the progress of the function to the database. There's a worker, which uses one of our language runtimes (we support Go, Java, Python, Node, PHP, and .NET), and the worker has your code. The workers are polling the server for tasks, and when they get a task, they run your code, like calling a durable function or calling a function's RPC handler. Then the worker sends the server the results of running the code. For application developers writing durable functions, you just write your functions, run them with our worker library, and start them with the client library; the gRPC messages sent to the server are an implementation detail that you don't need to think about. But since this is a talk about how Temporal uses Cassandra, we need to talk about how it works under the hood.

Let's look at a concrete example of the communications. We can start the subscription function we saw earlier by using the client to send the server a start-function message. The server saves the request in the database and replies that it has received and accepted the request. The server creates a start-function task for the worker. The worker polls, receives the task, and calls the function. The function initializes a variable, sets up an RPC handler, enters the loop, and hits the line await charge. That's the result of executing the start-function task, and at this point the worker sends that result to the server: the fact that await charge has been called. The worker is saying, the next step of running the subscription function is calling this charge function. Now let's look inside the server to see how it handles receiving the next step. Inside the server component are a few different
services and a couple of data stores. The front-end service receives the next step from the worker, along with the ID of this instance of the subscription function. Each instance has an ID that is unique among all running functions, and it's provided by the client at start time. We hash the ID to determine which host the function belongs to. In this case, function ID 5 belongs to host B, so that's where the front-end service forwards the request. Each host has a database partition, or shard. We support MySQL, Postgres, SQLite, and Cassandra, but it's a very write-heavy load, since each step of each function you run in production is written to the database, and you can scale that write load much higher with Cassandra, so we use Cassandra for our cloud service. Our cloud service is a hosted version of the server, but the server is open source, and many companies host it themselves. The cloud pitch is that you'll save money and time (which is also money), as well as gain peace of mind, if the experts host it and maintain it for you.

Now let's zoom in on just host B and partition B. When host B gets the message to call charge, it needs to do three things: update the state of the function, add a charge task to the queue so that a worker can pick it up and execute it, and add a timeout. There are many types of timeouts that might be set; in this case it might be the maximum length of time the charge function can be tried and retried before we consider it permanently failed, at which point that line of code in the subscription function would throw an error. It's important that all three things are done together atomically, because otherwise you'll run into various sorts of race conditions and inconsistencies. If you update state first and then fail to add the task to the queue, the system will think there's a task that's not there. Or if you add to the queue first and the update is slow, the task might be completed before the update goes through, which was the cause of a 28-day outage at Azure between Azure
Service Bus and Cosmos DB. So we use a single-partition batch statement to get atomicity and isolation. In this case we have four statements in the batch. First, we assert that this host still owns this partition: set partition ID to B if partition ID is B. Then update the function state, if we have the correct state version number. Then add the charge task and the timeout. If any of the statements fail, none of them will be committed.

An issue with having the tasks on each partition is query performance. With Cassandra, I think of your schema design as following your query patterns, versus relational databases, where to an extent you can just start with a schema and query arbitrarily. So for the query "give me the next task on the payments queue," we would need to check all of the partitions, because they're partitioned by function ID, not by queue name, and that's not scalable. So we have a separate table called tasks, on the right, partitioned by queue name, and a separate service for responding to queue queries, and we have a service that moves tasks from each function's table partition on the left to the tasks table on the right, and it takes care of retrying and de-duping. This is the transactional outbox pattern, and the queues on the left are also called transfer queues. Also, a single queue can have a load higher than a single host can handle, so in reality we shard the hosts on the right further, beyond just the queue name.

So now, when a worker asks the front-end service for new work on the payments task queue, it will get forwarded to host 1, which will take the charge task from partition 1. When the worker receives the charge task, it will call the charge function. If the function fails, the worker will report that back to the server. If it's something transient, like a network error, the server will schedule the same task for a future time. If it's something permanent, like a card-expired error, the server will record that result and put a task on the queue to activate the subscription function, which will throw
from the await charge line of code. If the charge succeeds, the server will record the result and activate the subscription function, passing in the result. And that's as far as I'll go into Temporal's architecture.

For the next section, I talked to one of our engineers who has the most experience with Cassandra, to get a few quick things we've learned from using Cassandra a lot under high load. The first thing is that we have a lot of queues, and as an old DataStax blog post says, queues are an anti-pattern in Cassandra. A simple example: say we have a task table with queue name, date, and task info. If we insert 10,000 tasks onto the payments queue and then delete all but the most recent, there will be 9,999 tombstones, which by default will stick around for their 10-day garbage collection grace period. If we do the query on the right, selecting a single task from the payments queue, Cassandra will have to go through all the tombstones before finding the most recently inserted task, and that could take 300 milliseconds. The solution is to avoid most or all of these tombstones by adding to the WHERE clause to tell Cassandra where to start scanning. Here's a link to the anti-pattern blog post and an example in our code of avoiding scanning tombstones.

Here are some miscellaneous things we've found working with Cassandra. Generally, Cassandra is a great fit for write-heavy workloads, and it's especially nice for append-only tables. We try to avoid lightweight transactions: beyond the 4x latency of Paxos's four-phase commit, there's also the potential for contention during two of the phases. We wound up implementing a client-side in-memory lock to avoid sending concurrent lightweight transactions. There's also the need, if you use lightweight transactions to insert rows into a table, to use only lightweight transactions for reading and writing on that table. We try to avoid secondary indexes and materialized views, and use Elasticsearch for searching. This decision came out of an outage several years ago with
high load on the secondary index; it's quite possible that particular issue has been fixed since then. We also try to avoid large partitions.

While Cassandra is great for write-heavy workloads, it's still the bottleneck for us in terms of supporting higher throughput, and we had a customer whose use case was going to exceed our capacity to scale with Cassandra. So we had to reduce the amount that we write to Cassandra, while still recording on disk somewhere every step that all the running durable functions take. So we added a write-ahead log (WAL) between our server and Cassandra. Instead of writing every update to Cassandra, we write everything to the WAL and periodically flush WAL updates to Cassandra. This results in significantly lower load on Cassandra, which means increased throughput. WALs are also cheaper to write to, which resulted in cost savings, and they're faster to write to, which means we're able to respond to user requests faster and support lower-latency use cases. It currently takes us 90 milliseconds P90 to start a durable function, which involves two Cassandra operations in series. We're planning on replacing that with a couple of WAL writes in parallel, which take an average of six milliseconds, and we're hoping to get to a 10-millisecond P90. WAL writes also have less variability in latency compared to Cassandra operations, so P90 and P99 are closer to the average, which makes sense: it's simpler and faster to append to a few disks than for a general-purpose database to execute operations.

We also found that the new system increased reliability: we've had fewer incidents and better time to recover. The big caveat to doing this is complexity. It's a lot to design and code and get correct and bug-free. Whether you can do it, and how you implement it, depends a lot on your use case, and there are a lot of things to build, like the recovery process: when the service that's writing to the WAL restarts, it can't just read from Cassandra; it needs to read everything from
the WAL that hasn't been flushed to the database, in order to rebuild the correct state in memory. We also have a backup WAL that we can swap to if the primary goes down, and there are many other details. You're effectively building a new database on top of Cassandra, so it's not something I'd recommend to most, but it's an interesting spot we've gotten to due to unusually high throughput requirements.

To recap: we learned what durable execution is, and how it's programming at a higher level of abstraction, where a number of different distributed-systems concerns are taken care of automatically for you. We also took a look at part of the internals of how that's implemented, and I shared a few learnings from our time with Cassandra. I'll end with something we like to say, which is: distributed systems should hold you up, not hold you back. Durable execution can support you in building more reliable systems with a better developer experience. If you'd like to learn more, our website is temporal.io. I'm Lauren, Lauren DSR on X, and happy to answer any questions you have. These slides are available at t.mp slash can't fail Cassandra. Thanks a lot. Ciao!