All right, thank you very much. Well, we will jump right into it, and we will start this presentation with a question that is almost as old as time itself: what is Kubernetes? Now, let's jump down this rabbit hole. We have two lovely tour guides: rabbit number one, that is Matt, senior engineer at Chronosphere, and rabbit number two, that is this guy, Dominic, principal engineer at Temporal.

So, to the question "what is Kubernetes?" we have at least as many answers as there are people in the room, and probably also online, but let me give it a shot. I would argue Kubernetes may be characterized as a container orchestration platform. That is a very agreeable definition. However, it should give us some pause, because Kubernetes does more than just orchestrate containers; there is at least networking and storage involved as well. So let's broaden this definition a little bit and say Kubernetes may be characterized as an infrastructure orchestration platform. But again, the defining characteristic of infrastructure orchestration is operation automation. So I would say we may safely arrive at the conclusion that Kubernetes can be characterized as an operation automation platform.

Kubernetes implements operation automation with the help of Kubernetes controllers and Kubernetes resources. There are two types of controllers and two types of resources. There are core controllers and core resources, provided by Kubernetes itself, and then there are custom controllers and custom resources, provided by basically anybody else. Custom controllers are creatively named operators, and custom resources are, self-explanatorily, named custom resources. We'll let the historians decide which naming scheme is better and focus on more important questions: what do controllers do, and how do controllers work?

Well, controllers perform reconciliation, the very core of operation automation. Controllers transition your system from its current state to its desired state. For that, a controller performs two distinct tasks. First, detection of state drift: the determination that the current state differs from the desired state. And second, mitigation of state drift: the determination and execution of a sequence of steps, a sequence of actions, that transitions the system from its current state to its desired state.

Now, a Kubernetes controller performs reconciliation continuously, in what we call a control loop. We determine the current state, which typically means reading lots of Kubernetes resources, core and custom. Based on that, we determine the next action, and then we perform the next action. Very important: conceptually, every single time we run the loop, we start in an initial state. That is, we have absolutely no memory of the previous runs. So in the control loop pattern, the control state, the knowledge of what we should do next, is derived from the operational state by reading in a bunch of resources, core and custom.

Let's look at an example, because this works very well for straightforward reconciliation. For example, the replica set controller. The replica set controller manages a set of pod objects. It performs two actions with instantaneous effects: create a pod if there are too few, and delete a pod if there are too many. So, reading the replica set object, the replica set controller makes sure that there is just the right number of pod objects.
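To make the control loop pattern concrete, here is a minimal sketch in Go of a replica-set-style reconcile step. It is deliberately simplified: the types and helpers (Cluster, DesiredReplicas, ListPods, CreatePod, DeletePod) are hypothetical stand-ins, not the real client-go API. The point is only that every run re-reads the current state and derives the next action from it, with no memory of previous runs.

```go
// A hypothetical, simplified view of what a replica-set-style controller
// does on every pass through its control loop.
package main

import "fmt"

// Cluster is a stand-in for the Kubernetes API: it lets us read the
// current state and perform single, instantaneous actions.
type Cluster interface {
	DesiredReplicas() int        // read from the replica set spec
	ListPods() []string          // read the current pod objects
	CreatePod() error            // create one pod
	DeletePod(name string) error // delete one pod
}

// reconcile runs one iteration of the control loop. It carries no state
// between runs: both the desired and the current state are read fresh.
func reconcile(c Cluster) error {
	desired := c.DesiredReplicas()
	current := c.ListPods()

	switch {
	case len(current) < desired: // drift detected: too few pods
		return c.CreatePod()
	case len(current) > desired: // drift detected: too many pods
		return c.DeletePod(current[len(current)-1])
	default:
		return nil // no drift, nothing to do
	}
}

func main() {
	fmt.Println("run reconcile(c) periodically against a Cluster implementation")
}
```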
And reading the pod objects, the kubelet itself makes sure that there is just the right number of containers. Having multiple controllers with multiple control loops involved in the reconciliation is called cascading reconciliation, with each control loop managing the part of the world that it knows about. There is no central, all-knowing controller. Very cool approach.

However, that approach gets progressively more complex. For example, let's take the deployment object and its rolling upgrade strategy, which is still pretty straightforward, yet already much more complicated. The deployment controller manages replica set objects. Yet during a rolling upgrade, the deployment controller guarantees availability. Therefore, the deployment controller must reach beyond the replica set object and also check on the pod objects, checking their status to see that each pod is ready and available to serve requests. Here we are at three control loops, three cascading reconciliations. And as you can imagine, life is only getting harder from here on out.

So imagine we are performing a database upgrade. Our business requirements require that we upgrade the database one shard at a time, without ever falling under the replication threshold per shard. How could you implement that? Well, you may very well attempt to express that sequence of interrelated actions as a number of control loops and a number of cascading reconciliations. But, I mean, just a thought, you may also express that sequence of interrelated actions as a sequence of interrelated actions: a workflow. Just for that statement, I feel a Turing Award coming on.

But now you get to combine the control loop and the workflow. The control loop is still responsible for detecting state drift. However, now the workflow is responsible for mitigating state drift. Well, if it is that easy, why have we not done this before? I'm glad you asked. The control loop is inherently reliable. Any failure, like a crash, is trivially tolerated, since the control loop has no memory of a previous run to begin with. After the controller restarts, it will just continue business as usual. However, a workflow is not inherently reliable. A failure, like a crash, is at least a minor, if not a major, catastrophe. If step one has been performed, but steps two and three have not, your system is in an inconsistent state, and the next run of the control loop may actually not be able to detect that. We all know the consequences: dangling resources, difficult garbage collection, and lots of messages on Slack asking if anybody knows what these machines are and who they belong to.

Well, meet Temporal. Temporal is an open-source platform for reliable workflow execution. You write your workflow in plain old Java, plain old Go, plain old JavaScript, plain old PHP, what have you. And Temporal guarantees that your workflow execution cannot fail. If a crash happens, Temporal resumes your workflow execution in the exact same state, at the exact same location it was in when it crashed. So from the point of view of the workflow execution, that crash never happened. Now, that's quite a promise, and Temporal as a technology can back it up. And to prove that Temporal can back it up, the hard part of this presentation, I will faithfully hand over to Matt for the rest of this presentation. Thank you, Dominic, for the great intro.
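As a sketch of what the "one shard at a time" upgrade could look like as a Temporal workflow, here is a minimal example using the Temporal Go SDK. The activity names (ListShards, CheckReplicationThreshold, UpgradeShard) are hypothetical placeholders, not part of any real operator; the point is that the sequence of interrelated actions is written as ordinary Go, and Temporal makes its execution durable across crashes.

```go
package upgrade

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// UpgradeDatabaseWorkflow upgrades a database one shard at a time,
// never proceeding while a shard is below its replication threshold.
// ListShards, CheckReplicationThreshold, and UpgradeShard are
// hypothetical activities implemented elsewhere.
func UpgradeDatabaseWorkflow(ctx workflow.Context, targetVersion string) error {
	ao := workflow.ActivityOptions{StartToCloseTimeout: 10 * time.Minute}
	ctx = workflow.WithActivityOptions(ctx, ao)

	var shards []string
	if err := workflow.ExecuteActivity(ctx, "ListShards").Get(ctx, &shards); err != nil {
		return err
	}

	for _, shard := range shards {
		// Wait until this shard is safely above its replication threshold.
		var safe bool
		for !safe {
			if err := workflow.ExecuteActivity(ctx, "CheckReplicationThreshold", shard).Get(ctx, &safe); err != nil {
				return err
			}
			if !safe {
				// Durable timer: survives worker crashes and restarts.
				if err := workflow.Sleep(ctx, 30*time.Second); err != nil {
					return err
				}
			}
		}

		// Upgrade this shard before touching the next one.
		if err := workflow.ExecuteActivity(ctx, "UpgradeShard", shard, targetVersion).Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}
```

If the worker crashes between two shards, Temporal replays the recorded history and resumes exactly where the workflow left off, so completed shard upgrades are not repeated and none are skipped.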
So now that Dominic has told us all about how operators, controllers, and Temporal each work, I'm going to talk about how we've combined the two at Chronosphere. At Chronosphere, we rely heavily on M3DB as part of our hosted observability platform. M3DB itself is an open-source time series database built particularly for large-scale, high-throughput use cases. And part of what makes it really well suited for this is that M3DB is highly fault tolerant. It accomplishes this by sharding data as you write it to the database, replicating those shards across different failure domains, which you can think of as zones in a cloud region, and ensuring that those shards remain available.

In order to make managing M3DB easier, we built a Kubernetes operator for it. The operator is also open-source and manages the lifecycle of M3DB. Some of the core parts of this lifecycle are creating new clusters, scaling existing ones, and upgrading existing ones. Upgrading clusters actually tends to be one of the trickier operations, and this is where we introduced Temporal to make it easier.

To give you an idea of why this can be tricky: as I said, M3DB provides fault tolerance, and it does so by doing all reads and writes based on a quorum response. So in this case, you can see that we have three M3DB replicas contained in three stateful sets, each pinned to a zone. And it's okay to have an entire one of these replicas be down. So you can think, if you lost an availability zone or had some other sort of catastrophe, so long as two out of the three instances for a given shard are up, users will not see any failing reads or writes. However, as soon as you lose instances from a second stateful set, you won't be able to uphold quorum, and users will begin to see impact.

In addition to needing to respect quorum when we do upgrades, there are a few other safety aspects to consider. One, we wanted to protect against a case where you go to roll out a new version of M3DB and maybe it's passing health checks, but misbehaving in some other way. And this is an intentional choice, mainly because we didn't want to have to embed a lot of complex logic into the health checks. For one, you can imagine that you would then have to deploy the database just to change the way it checks its own health. And also, each node would have to have a much higher-level view of the cluster than it typically has on a one-node basis.

And then, in terms of how practical deploys are, we also need them to be fast. When a database node comes up, it has to build up a lot of in-memory state from disk before it can serve requests, as part of a process that we call bootstrapping. And if you only deployed one pod at a time, which would be the default with a stateful set, it could potentially take days to deploy larger clusters.

Around the time that we were trying to improve the speed and safety of deploys, we had already started using Temporal in other parts of our stack at Chronosphere. Temporal drives our internal deployment platform, a bunch of end-to-end tests, as well as a variety of Kubernetes cluster-level tooling. And through introducing it, we have built up a large library of activities and other helpers for writing workflows.
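As an illustration of the quorum constraint just described, here is a small, hypothetical Go helper that decides whether it is safe to take down the replicas in one more failure domain. Nothing here is taken from the actual M3DB operator; the ShardStatus type and the replication factor of 3 are assumptions used only to show the "at most one replica down per shard" rule.

```go
package quorum

// ShardStatus records, for one shard, how many of its replicas are
// currently healthy and how many exist in total (the replication factor).
type ShardStatus struct {
	Healthy int
	Total   int // e.g. 3 replicas, one per zone
}

// safeToDisrupt reports whether restarting the replicas in one more
// failure domain keeps every shard at or above quorum. With a
// replication factor of 3, quorum is 2, so every shard must currently
// have all of its replicas healthy before we take another one down.
func safeToDisrupt(shards []ShardStatus) bool {
	for _, s := range shards {
		quorum := s.Total/2 + 1
		// After disrupting one more replica of this shard, we must
		// still have at least a quorum of healthy replicas left.
		if s.Healthy-1 < quorum {
			return false
		}
	}
	return true
}
```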
So these include things like checking internal Chronosphere APIs to see what alerts are firing for some given dimension of the cluster, checking metrics as we roll things out to make sure that, say, incoming writes didn't drop off a cliff, things like that. And then there's also a lot of support for driving workflows via Slack, which I'll talk about in a bit.

So, taking a look at what it takes to upgrade a cluster using Temporal in our operator: now, when a user goes to deploy a new manifest, rather than sending it directly to Kubernetes, it first gets sent to a service called Deployer. Deployer also runs inside the Kubernetes cluster, and it's a single binary that serves both a gRPC service for interacting with deploys and a Temporal worker. Upon receiving a new manifest to deploy, Deployer first performs some validation and then submits it to Kubernetes so that the operator is aware of it.

Instead of the operator immediately acting on each stateful set, it waits for the Temporal workflow to unblock processing of that given stateful set. This is all driven via annotations. When you first go to deploy a new manifest, the operator actually won't do anything on a given stateful set until it notices this annotation is present. Once the annotation has been set, the responsibility for the next part of the deploy is handed off to the operator. The stateful sets are all configured with an OnDelete update strategy, meaning that we control when pods get restarted and brought up on the new image. So the operator will patch the stateful set and then proceed to delete pods in batches until the entire stateful set is processed. Upon finishing, the workflow notices that that stateful set has been processed and blocks until the user unblocks it. So in this case, again, you've upgraded an entire stateful set, and the operator will just wait until the workflow unblocks it.

In terms of what this workflow looks like: the workflows that we have in our deployment system would have been really difficult to replicate without Temporal. We've leaned heavily on nested workflow execution and running multiple workflows in parallel. Every deployment workflow actually runs in parallel with a workflow that makes sure the state of the entire system remains healthy while the deploy is rolling out, and we'll talk a little bit about what that health check looks like in a bit. When we go to kick off a deploy, we first execute a Temporal activity to get all the stateful sets that are owned by a given cluster. And then, for each stateful set, we begin processing it. The processing steps are pretty much what I mentioned earlier: the workflow will set an annotation on the stateful set, wait until the operator has deleted all the batches of pods and the stateful set is healthy, and then also wait for that stateful set to remain healthy for a given settle period before moving on. And then finally, when it's time, it prompts the user via a Slack interaction to unblock the deploy.

What's nice, too, is that all of this is configurable as inputs to the workflow. So you can imagine the wait times can be configured, and whether or not you prompt users is configurable. For example, in our test clusters, as we roll out new images, we're not going to prompt a user on Slack constantly. But in production clusters, we would have more conservative defaults.
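As a rough sketch of how such a per-cluster deploy workflow could be structured with the Temporal Go SDK: the activity and signal names below (GetStatefulSets, AnnotateStatefulSet, StatefulSetHealthy, the "unblock" signal) are hypothetical, not Chronosphere's actual code, but they follow the shape described above: annotate, wait for the operator to finish, let things settle, then wait for a human (for example via Slack) to unblock the next stateful set.

```go
package deploy

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// DeployClusterWorkflow processes a cluster's stateful sets one at a time.
// Activity and signal names are hypothetical placeholders.
func DeployClusterWorkflow(ctx workflow.Context, cluster, image string, settle time.Duration) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Minute,
	})

	var statefulSets []string
	if err := workflow.ExecuteActivity(ctx, "GetStatefulSets", cluster).Get(ctx, &statefulSets); err != nil {
		return err
	}

	unblock := workflow.GetSignalChannel(ctx, "unblock") // e.g. sent from a Slack handler

	for _, sts := range statefulSets {
		// Annotate the stateful set so the operator starts rolling its pods.
		if err := workflow.ExecuteActivity(ctx, "AnnotateStatefulSet", sts, image).Get(ctx, nil); err != nil {
			return err
		}

		// Poll until the operator reports the stateful set fully rolled and healthy.
		for {
			var healthy bool
			if err := workflow.ExecuteActivity(ctx, "StatefulSetHealthy", sts).Get(ctx, &healthy); err != nil {
				return err
			}
			if healthy {
				break
			}
			if err := workflow.Sleep(ctx, 30*time.Second); err != nil {
				return err
			}
		}

		// Let the stateful set settle before moving on.
		if err := workflow.Sleep(ctx, settle); err != nil {
			return err
		}

		// Block until a human unblocks the deploy (e.g. via a Slack action).
		var ok bool
		unblock.Receive(ctx, &ok)
	}
	return nil
}
```

In practice the settle period and whether to prompt at all would be workflow inputs, which matches the configurability described above for test versus production clusters.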
What's great is that all of these workflows run while another workflow, in parallel, is checking for errors, and we'll actually stop the deploy if anything in the system becomes unhealthy. And again, this is based on more than just any one pod's health checks; this is based on a view of the entire system. So if any alerts begin firing for the cluster, then we will halt execution of the workflow. This notification goes out to the user, again, as a Slack message, and they can choose what to do. So for example, maybe it was a transient error, in which case you can continue the workflow. Maybe you want to stop it where it is, or maybe you want to roll back to a previous image, and rolling back is actually just beginning a new workflow to deploy whatever image was previously present.

The Slack prompts that I mentioned are provided as part of these generic helper activities. Workflow authors can configure which actions, at any point in a workflow, correspond to the next steps in the code. Users then get Slack notifications and interact with the deployment system this way. You can see in this example that a deployment failed: the workflow that runs health checks realized that there were some failing pods, in this case, and asked the user what to do. In this case, the only options were to either stop where it was or roll back. In case Slack is down, we also have CLI fallbacks for all of this, or if users just want to interact with the deployment system that way. And we're also working on adding a UI on top of it.

So, to wrap up some of the concepts we've talked about here: typically, when you're building these sorts of reconciliation or self-healing systems, as Dominic said, there are two things that you care about, detection and mitigation. Detection is figuring out if parts of the system have drifted or need some sort of action, and mitigation is converging on that desired state. When you're working with Kubernetes, which itself is modeled declaratively, you would typically be forced to model everything yourself declaratively as well. But the actual steps of mitigation, of converging on that next state, are often a series of imperative steps, and it can be much easier to write these as workflows that maintain the state for you than to try to model all of that declaratively. So rather than limiting yourself to a purely declarative model, by combining operators with Temporal workflows you get the best of both worlds: your reconciliation, or detection, loops are still declarative, and the actual mitigation steps can be performed imperatively as a workflow. That is what we've done with our operator, and we've been super happy with it so far.

So with all that said, that is all we have for you, and we're happy to take any questions you have.
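To round out the picture of the parallel health-check workflow described above, here is one way that pattern could be expressed with the Temporal Go SDK. Again, the workflow and activity names (DeployClusterWorkflow, ClusterHealthy) are hypothetical; the idea is simply that the deploy runs as a cancellable child workflow while a loop watches cluster-level health, and the rollout is halted if anything becomes unhealthy so a human can decide whether to continue, stop, or roll back.

```go
package deploy

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// MonitoredDeployWorkflow runs a deploy as a child workflow while, in
// parallel, checking cluster-level health (alerts, metrics, and so on),
// not just individual pod health checks.
func MonitoredDeployWorkflow(ctx workflow.Context, cluster, image string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	// Run the deploy itself as a cancellable child workflow.
	deployCtx, cancelDeploy := workflow.WithCancel(ctx)
	deploy := workflow.ExecuteChildWorkflow(deployCtx, "DeployClusterWorkflow", cluster, image)

	// Health monitor loop, running in parallel with the deploy.
	deployDone := false
	workflow.Go(ctx, func(gCtx workflow.Context) {
		for !deployDone {
			var healthy bool
			err := workflow.ExecuteActivity(gCtx, "ClusterHealthy", cluster).Get(gCtx, &healthy)
			if err != nil || !healthy {
				// Halt the rollout; a human decides via Slack or the CLI what to do next.
				cancelDeploy()
				return
			}
			if workflow.Sleep(gCtx, time.Minute) != nil {
				return
			}
		}
	})

	// Wait for the deploy to finish, or to be cancelled by the monitor.
	err := deploy.Get(ctx, nil)
	deployDone = true
	return err
}
```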