So, a word about us: we work at Datadog. If you don't know us, our company provides observability as a service at scale. Our flagship product historically has been collecting metrics and showing dashboards, but today we have many more things: alerts, APM, synthetics, security. My name is Christian. I work as a senior software engineer at Datadog, and my first encounter with Cassandra was in 2016, I think, a small cluster, three nodes; things have changed since. And when I'm not on computers, I run: went for a run yesterday, went for a run today, will go for a run tomorrow, you get the idea. And Steve, my co-worker, a senior software engineer as well, is the team lead of our Cassandra team, which is composed of roughly six engineers, six exactly at this time.

So, what are we going to talk about? First, we'll explain a bit what Temporal is and why it's a great fit for us. Then we'll dive into two very concrete examples and get our hands dirty on the health check and on migrations. Here we're talking about moving data from one place to another, not schema migrations. And then we'll give you an idea of what our future looks like.

What problem are we trying to solve? Of course, we run 24 hours a day. You possibly do the same. For us, it's really critical: people rely on us to just know if anything goes wrong. So we have to be more reliable than our customers, long story short. We run hundreds of clusters, thousands of nodes; the numbers keep changing, but that's fairly big. And one thing: we are fully on Kubernetes as a company, but Kubernetes came after Cassandra, and Cassandra's design makes assumptions that are not that Kubernetes-friendly, so you have to navigate through this and it gets a little complicated. Also, our infrastructure is always changing. When I joined the company, it was just virtual machines all over the place, and only one cloud provider. Now we have multiple cloud providers, fully on Kubernetes. And even within the Kubernetes world, we've been through different ways of using it, and you have to adapt and move things from one place to another; the computing world is always changing. Also, our team is not that big and we're spread across different time zones, so that has an impact on operations.

So, what is Temporal? I got this from the Temporal website, and it looks great: it's going to simplify your code, make your applications more reliable, and everything. You want that. But what is Temporal exactly? Let's dive back into ops. In the beginning, I guess everybody does that simple thing of running shell scripts, not even shell scripts, just bash commands: you're in this tutorial on how to do this or that, you shoot some commands and do things, and that's what you do initially. Then at some point, as you organize yourselves, you document this, because you don't want your coworkers to hit the same walls you've been hitting, so you document the process and you get a runbook that tells you to do this, then that, then that. And at some point you figure out, well, those engineers are just doing this sequentially; why not write a script that does it automatically? So you turn your runbook into some kind of script that does the thing and has hopefully been tested a little bit.
And then you can go one step further and have your scripts orchestrated and organized. I'm naming a few tools that allow you to do that: Chef, Ansible, whatever. Basically, you have this thing under control, and you know that whenever you deploy something, a known set of things is going to happen. And then Kubernetes takes it a step further: you just describe, hey, I want this state, I want these things, please do the magic behind the scenes and converge to the state I want. And it's just going to run all the commands that are needed.

So why would we need Temporal? Temporal is actually a workflow engine. Think of a workflow as a script on steroids. It's really a sequence of actions, a program, that is running, and Temporal is going to orchestrate that. Why is that important? We'll see it later in detail, but it typically allows you to know what's running and at which state it is. It's going to retry things. It provides you a single pane of glass for whatever is happening in your infrastructure, and it's a great addition to all the other things. All five of these, runbooks, bash commands, bash scripts, tools that bundle them together, and Kubernetes, coexist; it's not like one replaces the other.

A pattern that we used a lot on Kubernetes is the toolbox. This thing is kind of the old world of virtual machines making its way to Kubernetes. It's not really Kubernetes-ish, and you're not recommended to do it, but it's really efficient. It almost feels like you're SSHing onto your server, and you have this side thing where you can run commands. On our toolbox, typically, we have commands like this rolling cleanup, which does exactly what its name says: it rolls a cleanup across all the nodes. And there are a bunch of other commands. That's a useful pattern that we still have today.

And let me describe our Temporal setup a little. We have this orange zone with the Temporal server; that's just the headquarters of Temporal. You have this application running, and it delegates the execution of workflows to the Temporal workers. Here we have two different workers, and each color represents a Kubernetes cluster. Worker blue is going to operate on these Cassandra clusters, and worker green, or light blue, is going to operate on the other ones. Temporal does the work of dispatching things to the right place; it does the magic behind the scenes. And you'll notice one thing: there's a Cassandra cluster that is operated by Temporal and that powers Temporal. So when you operate on that cluster, you pay attention, because you could break things. But we've always been fine.

Here are a few random names of our workflows. They are self-explanatory: deployment, health, cleanup, list pods, replace nodes, rebuild nodes. We do pretty much everything with them.
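To make the "script on steroids" idea concrete, here is a minimal sketch of what a workflow like that rolling cleanup could look like with Temporal's Go SDK. The workflow and activity names are hypothetical, not Datadog's actual code; the point is the durable, retried loop.

```go
package ops

import (
	"context"
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// RollingCleanupWorkflow runs a cleanup on every node of a cluster, one
// node at a time, retrying each node a few times on transient failure.
func RollingCleanupWorkflow(ctx workflow.Context, nodes []string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 2 * time.Hour, // cleanup can be slow on dense nodes
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval: time.Minute,
			MaximumAttempts: 3,
		},
	})
	for _, node := range nodes {
		// Every activity result is durably recorded: if the worker dies
		// mid-run, Temporal resumes from the last completed node instead
		// of starting over, which a bash script can't do.
		if err := workflow.ExecuteActivity(ctx, CleanupNode, node).Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}

// CleanupNode is the activity; in practice it would shell out to
// "nodetool cleanup" on the given pod. Left as a stub here.
func CleanupNode(ctx context.Context, node string) error {
	return nil
}
```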
So now, one thing that has been a really important building block for us: the health check. It's about answering this simple question: is my cluster okay? And if you've been working with Cassandra, that is not something that is either totally okay or totally broken. From the Kubernetes point of view, it's okay if all my pods are ready and have their readiness checks passing. From the Cassandra point of view, nodetool status is going to tell you these nodes are up and working, and those nodes are not. And depending on which Cassandra node you ask, you'll get a possibly different answer. On a four-node cluster it's rare, but on a 200-node cluster, your mileage may vary. And then there are monitors; we're a monitoring company after all. Maybe your node is at 99% CPU, maybe it's at 98% disk usage. It's not okay, and you may not want to proceed. So it depends.

So yeah, this thing has been a very useful building block. We use it for pretty much everything: deployments, node replacements, migrations. And one thing that is really important is that building this block helped us enforce good practices, and things became much more reliable once we had something that could reliably tell us, hey, you can move forward.

And this is possibly the most important slide I'm going to talk about. I'm going to move a little bit, because I need to show things. In the Temporal world, this purple block is an activity. Purple blocks are activities, and they are retryable chunks, something Temporal is going to retry if anything fails. Inside are just function calls. And one block like this is a workflow. So let's take workflow one: you run your thing, you do a health check. The check is okay, so you proceed to A. A fails; oh, that's too bad. Then Temporal retries the whole activity: check okay, A okay, B okay. Proceed, go to activity two; now the check fails. And this is typically fatal: if the check fails, you just stop everything. That means the cluster is not healthy. I'm not moving forward, I'm not retrying anything. The other example on the right I'm not going to detail, but the semantics are different. There's only one check, so you could end up in the situation at the end where C is being retried with no check before it. We tend to prefer the first option, on the left, but it does have a cost, because the check itself can be lengthy. Imagine it runs, in some variants, the equivalent of nodetool status on all the nodes; on a big cluster, that's lots of calls. It could take up to 10 minutes or so. Do you want your workflow to wait 10 minutes each time? That's a design question. But I think we really made great progress by being forced to ask ourselves all these questions, and now most of our operations are more reliable. To sum up, a health check looks really simple, just "does it work, does it not work?", but you have to dive into the details.
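As a rough illustration of the left-hand pattern, here is a sketch in Temporal's Go SDK, with hypothetical helper names: the health check runs at the top of each retryable activity, so every retry re-verifies cluster health, and a failed check is marked non-retryable so the whole workflow stops.

```go
package ops

import (
	"context"

	"go.temporal.io/sdk/temporal"
)

// ActivityOne is one "purple block": a retryable chunk of plain function
// calls guarded by a health check.
func ActivityOne(ctx context.Context, cluster string) error {
	// The check runs first, so a retry of this activity always
	// re-checks cluster health before touching anything.
	if err := healthCheck(ctx, cluster); err != nil {
		// A failed check means "do not proceed": mark it non-retryable
		// so Temporal fails the workflow instead of retrying.
		return temporal.NewNonRetryableApplicationError(
			"cluster unhealthy, refusing to proceed", "HealthCheckFailed", err)
	}
	// If A or B fails, the error is retryable, and Temporal re-runs the
	// whole activity, health check included.
	if err := stepA(ctx, cluster); err != nil {
		return err
	}
	return stepB(ctx, cluster)
}

// Stubs standing in for the real checks and steps (pod readiness,
// nodetool status, monitors, and the actual operations).
func healthCheck(ctx context.Context, cluster string) error { return nil }
func stepA(ctx context.Context, cluster string) error       { return nil }
func stepB(ctx context.Context, cluster string) error       { return nil }
```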
I'm going to let Steve talk about another use case.

Thank you, Christian. Hey, everyone. I'm Steve. I'm going to be talking about another use case of ours for Temporal, which is migrations. So health checks are the building blocks, but this workflow is going to be a bit more complex, with a lot more moving parts. With the continuous evolution in technology, infrastructure migrations are inevitable, as teams constantly try to improve performance, scalability, and reliability within their systems. In the time I've been here, we've undergone a number of these migrations. We've transitioned our Cassandra infrastructure from a non-Kubernetes environment to a Kubernetes environment. We've migrated Cassandra clusters across Kubernetes clusters. And we've migrated our Cassandra clusters from our initial K8s config to an operator ecosystem.

All of these migrations involve a large number of engineering hours, with incredibly high stakes, as we're moving massive amounts of data that are critical to our product and our customers. And it's really difficult: there are many, many steps involved, health and safety checks are required, and actions need to be repeatable and reliable. As Christian mentioned earlier with the evolution of operations, Temporal solves a lot of these problems for us.

So let's dive a little deeper into the progression of our migration between Kubernetes clusters and how it evolved over time. We designed the workflow with three things in mind. It needed to be robust: it was going to run for all of our Cassandra clusters, varying in both node count and density, across multiple data centers and multiple cloud provider regions; it was going to move massive amounts of data; and we needed to have the utmost confidence in the automated operations that would be executed by Temporal. It needed to be pragmatic: it was okay for us if it wasn't a single button press. Although that would be great, having pausing points in between would give us and the application team a chance to assess progress and verify that there haven't been any lapses in performance. But we did want to automate the boring parts. Streaming operations between nodes take many hours, depending on cluster size and density, and we didn't want an engineer to have to sit there and constantly monitor for completion. Lastly, we didn't want to waste resources and double the cost of a cluster by standing up an entirely new cluster and doing a rebuild from there. We also wanted the design to be incremental. We knew the workflow wouldn't be a complete product from the start, but it would help us achieve our goal in the short term, and we would build it with reusable and adaptable components like the health check. So the core logic was converted from an initial migration that we'd used previously, but as it developed and we encountered more nuanced situations, features were added, such as being able to pause and resume at different points in the workflow.

So let's look at what the typical timeline would look like manually. In practice, what we wanted to do was a series of node replacements, where the old node was in the source Kubernetes cluster and the new node was in the destination Kubernetes cluster. Thinking of how this would work as manual operations, the timeline would look like this: repair node one, perform a cluster-wide health check, stop the previous node, prepare the new node for the replacement, start it up, let the replacement stream in all the data, and wait a long time. Then on to the next node, and the next, until all the nodes have been moved from the source cluster to the new cluster. So there's an obvious pattern here, and we knew that streaming would take the majority of the time. If we were to do this manually, there would be constant context switching as we check whether the streaming has completed and we can move on to the next node. This was the perfect situation for Temporal, which could start the next node replacement as soon as the previous node was finished and sufficient health checks were completed.
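A sketch of that replacement loop as a Temporal workflow, in Go, might look like the following. The activity names are hypothetical; the point is that the long "wait for streaming" step becomes just another activity with a generous timeout instead of an engineer polling dashboards.

```go
package ops

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// NodePair maps a node in the source Kubernetes cluster to its
// replacement in the destination cluster.
type NodePair struct {
	Old, New, Cluster string
}

func MigrateNodesWorkflow(ctx workflow.Context, nodes []NodePair) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 24 * time.Hour, // streaming a dense node takes hours
	})
	for _, n := range nodes {
		steps := []struct{ activity, arg string }{
			{"RepairNode", n.Old},         // repair before replacing
			{"HealthCheck", n.Cluster},    // cluster-wide: fatal if it fails
			{"StopNode", n.Old},           // stop the pod in the source cluster
			{"PrepareReplacement", n.New}, // replace-address flag, K8s resources
			{"StartNode", n.New},          // boot the replacement pod
			{"WaitForStreaming", n.New},   // poll until replicas have streamed in
		}
		for _, s := range steps {
			if err := workflow.ExecuteActivity(ctx, s.activity, s.arg).Get(ctx, nil); err != nil {
				// Stop the whole migration rather than move to the next node.
				return err
			}
		}
	}
	return nil
}
```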
So here's a simplified diagram of what this migration would look like in a K8s environment. This is a Cassandra cluster composed of a single StatefulSet striped across the AZs for high availability. That means nodes migrated to the new Kubernetes cluster are going to be housed in a separate StatefulSet, on the right. But we need to ensure that data is kept consistent across racks and availability zones, so we need to be a little more careful than simply scaling up the new StatefulSet and scaling down the old one. Instead, we need to work in batches of three and slowly decrease the size of the source StatefulSet to reduce the risk of being in an unhealthy state. On the left here, pods three, four, and five have already been migrated and correspond to ordinals zero, one, and two on the right. To add ordinal three, we needed to replace pod zero, since it's in the same AZ, and let the cluster stream in the replicated data. We can't just scale down the StatefulSet at that point, because that would delete pod two, which is still an active replica. All of these steps needed to be implemented in our workflow for it to be successful, which meant deciding which of these actions were retryable, interacting not only with Cassandra but also with the Kubernetes API, understanding what the different failure mechanisms are, and working out how to fail gracefully if things go wrong.

If we take a step back and look at this migration as a whole, we can start to visualize some of the building blocks that make up the workflow. We have our macro plan, which is what we wanted the general flow to look like: starting with a health check and querying the target clusters for configuration info, a confirmation prompt, muting monitors, and finally migrating the nodes. Then we also have the individual steps for migrating a single node, including another health check, updating replace-address flags, altering K8s resources, and monitoring for completion. Each of these bullets corresponds to either a retryable activity or an entirely separate workflow. And as I said earlier, we valued reusable, adaptable components that could be linked together. Workflows are the building blocks of Temporal, and by creating these sub-workflows, we were developing components that could be utilized in future workflows. The workflow on the right is something we actually still use today in our daily operations for cluster management.
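A sketch of that composition using Temporal child workflows, with illustrative workflow names, could look like this. Modeling the confirmation prompt as a signal the workflow waits on is one plausible mechanism, an assumption here rather than something the talk specifies.

```go
package ops

import "go.temporal.io/sdk/workflow"

// ClusterMigrationWorkflow is the macro plan, composed of sub-workflows.
func ClusterMigrationWorkflow(ctx workflow.Context, cluster string) error {
	// Each building block is its own workflow: visible and retryable on
	// its own in the Temporal UI, and reusable by other workflows (the
	// health check is reused everywhere).
	if err := workflow.ExecuteChildWorkflow(ctx, "HealthCheckWorkflow", cluster).Get(ctx, nil); err != nil {
		return err
	}
	// Pausing point: wait for an operator to confirm the plan by
	// sending a signal (for example via the Temporal CLI).
	var confirmed bool
	workflow.GetSignalChannel(ctx, "confirm-plan").Receive(ctx, &confirmed)
	if !confirmed {
		return nil // plan rejected: stop without touching anything
	}
	if err := workflow.ExecuteChildWorkflow(ctx, "MuteMonitorsWorkflow", cluster).Get(ctx, nil); err != nil {
		return err
	}
	return workflow.ExecuteChildWorkflow(ctx, "MigrateNodesWorkflow", cluster).Get(ctx, nil)
}
```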
These steps are a real program, and a very complicated one at that. Building it required numerous design reviews and many tests, with a goal to constantly improve. Although our initial implementation automated a number of steps, there was still a process, both for our team and the application team, that occurred in tandem alongside the workflow. Our teams would sync to develop a migration plan, we'd bootstrap the new cluster, launch the workflow, confirm the plan, let it migrate some nodes, reconvene with the application team to update some config, and launch the second half of the workflow, all while monitoring dashboards to ensure there were no lapses in performance. One thing to notice is that on our side, the Cassandra team, we had a lot more steps, but that's by design. We wanted migrations to be seamless, non-intrusive, and to require minimal engineering hours from application teams, to reduce the total burden of the migration and increase their confidence in future migrations. That worked for a while, but we continued to iterate, and we had the incentive to do so: reducing our own burden for the migrations. We dogfood our own tools, and we were constantly seeking to improve the pain points we were experiencing.

In the later state shown here, the total amount of coordination and effort was greatly reduced, to the point where it was basically a single button press. We could reach out to the team and say, hey, just a heads up, we're going to move a bunch of your nodes in the background; let us know if you see anything suspicious, but we don't need anything from you. And we could launch the workflow and let it migrate all the nodes behind the scenes. The improvements didn't just come from us: this was largely possible due to a service product from our traffic team that allowed nodes across different Kubernetes clusters to be discovered transparently. Improvements also came in the form of safety. We were launching a lot of migration workflows, as well as workflows for daily operations, but we needed a way to stay out of each other's way. We're six teammates spread across two countries and multiple time zones, and it requires plenty of coordination to avoid stepping on each other's toes. Enter distributed locks, a product created by a sister team, which we were able to integrate into our workflows to ensure that only a single workflow was operating on a Cassandra cluster at a given time.
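The lock service itself is internal and the talk doesn't describe its API, so the sketch below only illustrates the shape of such an integration, with stand-in AcquireLock/ReleaseLock activities wrapping the workflow body.

```go
package ops

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// WithClusterLock wraps a workflow body in a per-cluster mutex.
// AcquireLock and ReleaseLock are hypothetical activities standing in
// for the internal lock service's real API.
func WithClusterLock(ctx workflow.Context, cluster string, body func(workflow.Context) error) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})
	// If another workflow already holds the lock, this fails instead of
	// letting two workflows operate on the same Cassandra cluster.
	if err := workflow.ExecuteActivity(ctx, "AcquireLock", cluster).Get(ctx, nil); err != nil {
		return err
	}
	defer func() {
		// Release even if the body fails or the workflow is cancelled:
		// a disconnected context lets the release activity still run.
		dctx, _ := workflow.NewDisconnectedContext(ctx)
		_ = workflow.ExecuteActivity(dctx, "ReleaseLock", cluster).Get(dctx, nil)
	}()
	return body(ctx)
}
```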
In conclusion, moving data is hard, but tools like Temporal make the job much more manageable. Doing this entire migration manually would have taken much more time and would have been much more susceptible to human error, although it was still not immune from the occasional fat finger. This quote really resonated with us when we were designing workflows: making workflows deterministic and activities idempotent reshaped the way we thought about our current runbooks, and about future ones, as we wrote them specifically with automation in mind.

What does our future look like? Migrations to more cloud-friendly infra, upgrades to new versions of Cassandra, or possibly even moves to alternative storage solutions. I can't say for sure what our future will look like, but I can say with some level of confidence that Temporal will help us automate our cluster-level operations. That doesn't mean there won't be roadblocks along the way. As we develop these workflows for specific situations, they may eventually become unused and deprecated. We're proud to remove dead code and keep our repositories light, and it's okay to retire something, but there is a cost to cleaning up and untying dependencies. We can also envision a world where workflows get triggered automatically, maybe after a monitor fires. How can we be confident that the situation calls for that exact workflow and not something else? These are questions we have to ask ourselves. Temporal can solve a lot of problems, and it has greatly reduced our dedicated ops time. It's natural to want to automate all the things, but that doesn't mean it can or should solve every problem. Thank you. Any questions?

When you're looking at enhancing the operator versus putting something into Temporal, how do you make that decision?

That's a great question. One clear case: if it depends on something that is not in Kubernetes, typically some monitor or some external resource, we're not going to use the operator. There's also something related to the release process. Releasing a workflow is rather easy; it's just about deploying a single worker, and if you get it wrong, you can revert easily. If you enhance your operator, you're committing to something deeper. So that's how I would determine whether I want one or the other. As of today, most workflows are for things that are rather out of the norm, that is, things you do regularly but ideally would never do. Ideally, I would never replace a node; I mean, why would I? But I need to do it.

I don't know, pretty basic question, but what impact have your customers seen from this implementation? Have you actually seen anything in terms of customers being able to access their data and see their dashboards?

Honestly, the impact has been mostly positive, or maybe invisible. I think the real benefit has been on our side, the ops side. Take the migration examples Steve mentioned: we used to do migrations before, and I had been doing some primitive Cassandra operations before, and it was painful. You also have this mental pressure of, I need to remember tomorrow to clean up this thing and not forget this step or that step. So the progress was really for us as operators. And if we're not spending time on useless things, that means we're spending time on the right things that make the service better. I think for our customers the change is really that we're doing the right things and doing useful things, as opposed to wasting our time on repetitive tasks at which humans are not great. At least I am not great at them; workflows are better than me.

So I wanted to ask: when you switched to a simpler process for your team, it seemed like you skipped the step of informing the customer teams that you had new hosts, which made me wonder if you'd switched to a different form of discovery for your clusters.

Yeah, that was due to a service product by our traffic team, which basically provides a DNS address that can resolve to either of our clusters at a given time. So from their perspective, they were connecting to one address, but what was behind that address was completely transparent to them. We could control what was behind it and move the entire cluster without their DNS address changing.

Because I was going to ask, did you find switching to DNS caused you any loss in availability or resiliency?

No, I'd say we had very little loss of availability during these migrations. Occasionally, if we migrated nodes and somebody forgot to update the DNS address in the original versions, those were times where we would see it. Or if there was one node left and then we brought that one down, they would immediately lose access to it. So it was small configuration changes like that, which is why that service product ended up being a nice win for us: we didn't have to interact with the application team and could skip the whole cycle of, okay, let's make this change, let's roll this deployment out to all of our data centers and verify everything before we can finish the rest of the migration.

Great, thanks.

Just to add a little bit: something that was complicated is that Cassandra exposes the topology to clients, wherever you come from, so it was complicated for us to be 100% sure that they were using the right address and not a specific one, because the TCP traffic at some point is just coming from anywhere. We have ways to mitigate that, but that was a small issue, yes.