All right, hey everyone, welcome to our talk today. We're going to be talking about multi-cluster stateful set migrations and how they can solve some of your upgrade pains. My name is Matt. I'm a software engineer at Chronosphere, where I work on the infrastructure team dealing with all things storage, reliability, and scalability for Chronosphere. Before that, I was an SRE at Uber on the observability platform there.

And I'm Peter. I work at Google on stateful workloads on GKE, so I deal with the storage drivers that GKE uses for connecting to storage infrastructure.

Here's the general idea of what we're going to be talking about today. We'll give you a bit of context on how we use Kubernetes at Chronosphere and some of the use cases and challenges we have with cross-cluster stateful workloads. Peter is going to tell you about some of the work going on in open-source Kubernetes to make these use cases possible, and we're going to show you a demo of how it all ties together.

First, to give you a little bit of context about what Chronosphere does and how we use Kubernetes: Chronosphere is a software-as-a-service observability platform, built particularly for high-scale use cases in cloud-native environments. Given how mission-critical observability is, we have a really high SLA, and we take reliability incredibly seriously to meet it. Although we have three nines in our SLA, we aim for four nines, and we have achieved four nines in production. Part of the way we achieve these reliability and fault-tolerance needs is by running on top of Kubernetes.
Our Kubernetes footprint spans multiple regions, with thousands of Kubernetes nodes and over 40,000 pods in total. Each of these clusters runs a mix of stateless and stateful workloads, but the largest stateful workload is our metrics data store, which I'm going to tell you more about.

The architecture of our metrics data store heavily influences how we operate it, how we think about it and deploy it, and how we architect our clusters for it. Our metrics data store is based on M3, which is an open-source metrics engine. Metrics data store clusters are deployed as three separate StatefulSets, each in a separate zone and each holding a full copy of the data. Data is sharded across multiple nodes within a StatefulSet, and each StatefulSet owns an entire copy of the data.

All reads and writes are done through client-side quorum. When you go to write data, as long as two of the three nodes that own that shard acknowledge the write and persist it to disk, we consider the write successful, and a similar pattern applies to reads. But because all this quorum is done client-side, the clients have to be aware of each unique database node in the cluster, have to be able to open a connection to it, and have to be able to individually address each database node. You might be used to systems like Cassandra, where there's the concept of hinted handoff or of coordinating writes between database nodes. In our database there's none of that; it's all done on the client side. So we have a number of robust operational workflows and general day-to-day workflows that are uniquely aware of the database's architecture, its replication strategy, and its fault-tolerance and quorum requirements.
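The client-side quorum rule described above can be sketched roughly as follows. This is an illustrative toy in Python, not M3DB's actual client code; the replica names and the `write_fn` callback are made up.

```python
# Toy sketch of client-side quorum writes: the client fans out to every
# replica that owns the shard and declares success only once a quorum of
# replicas has acknowledged a durable write. Names are hypothetical.

def quorum_write(replicas, payload, write_fn, quorum=2):
    """Return True only if at least `quorum` replicas acknowledge
    persisting `payload` to disk."""
    acks = 0
    for replica in replicas:
        try:
            if write_fn(replica, payload):  # True once persisted to disk
                acks += 1
        except ConnectionError:
            pass  # an unreachable replica simply contributes no ack
    return acks >= quorum

# Example: one replica per zone, one of them currently down.
replicas = ["m3db-0.zone-a", "m3db-0.zone-b", "m3db-0.zone-c"]
up = {"m3db-0.zone-a", "m3db-0.zone-c"}

def fake_write(replica, payload):
    if replica not in up:
        raise ConnectionError(replica)
    return True

print(quorum_write(replicas, b"datapoint", fake_write))  # 2/3 acks -> True
```

The key property is that the client, not a coordinator node, decides success, which is why every client must be able to reach and individually address every replica.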
For example, these include moving a database cluster between node pools, coordinating resizing of the underlying storage, or upgrading the database cluster. But there have been times when we've wanted to move one of our metrics data stores between Kubernetes clusters, and the complexity that was going to be involved pretty much required us to find alternatives. Any design we came up with was going to involve deleting StatefulSets with the orphan propagation policy, then manually moving nodes between clusters and hoping there were no other disruptions in the meantime. Any of it was going to be a potential disaster: if you lost other pods during the migration, you could lose quorum. So we had to find alternatives.

In terms of why you might want to move a stateful workload between clusters, there can be a number of reasons. A lot of organizations start off with just one or two production Kubernetes clusters and over time have to split up those workloads. Maybe your Kubernetes cluster has grown too large and you're worried about control plane scalability, or maybe you just want to reduce the blast radius of any one Kubernetes control plane failure; either way, you might end up splitting your workloads across multiple clusters. You might also have to move workloads from one region to another: maybe you need some data to be in a specific region for compliance or data sovereignty purposes, or maybe you just want to move some data closer to your users, say a large user segment in a new geography that you want your data to be as close to as possible. There are also some cloud provider features that might only be available on new clusters and that you might want to take advantage of. Or maybe you're re-architecting your clusters, or swapping out low-level components, like changing your CNI implementation, and doing that kind of swap on a live
cluster can be like changing the engine of a plane mid-flight, which is terrifying.

If you find yourself wanting to move workloads between clusters, then for stateless workloads there are a number of solutions the community has come up with that make it easier. The most notable are multi-cluster Services and multi-cluster Ingresses, both of which are implemented by multiple open-source service meshes as well as cloud provider implementations. So if you're running a purely stateless workload, you can just bring up a new Kubernetes cluster, bring up your workloads in that cluster, and then shift traffic over using one of those tools. But for stateful workloads the story isn't as clear, and for our workload specifically, given the requirement that clients address each database node individually, we can't just put a load balancer in front of a StatefulSet, move it over, and call it a day.

So I'm going to talk about some of the challenges you may face when performing a cross-cluster migration. First, there's the complexity of managing multiple Kubernetes control planes. If you're running in a single Kubernetes cluster, you have a bunch of invariants that your API server will enforce: you have uniqueness guarantees per pod, and you have the at-most-one semantics of a StatefulSet, enforced through the API server. But if you're migrating a logical application across clusters, you're dealing with two API servers, so you're managing state across two API servers and two clusters, and you don't have the same invariants.
That's one of the main challenges here, and we're going to cover orchestration, networking, and storage. This is a simple illustration of a migration that's partly underway: you have some subset of replicas in the new cluster, and you're scaling down replicas in the old cluster, while both clusters still reference the same storage layer.

Let's talk about networking first. What are the challenges here? The first one is that clients need access to updated application endpoints. As Matt mentioned, clients of M3DB need to be able to uniquely address the replicas that are running, so you need a way for clients to discover replicas as they turn up in the new cluster. Usually clients communicate through a ClusterIP, a load balancer, or a headless Service within the cluster, but since we're moving across clusters, we need some other solution to handle that.

The other issue is peer discovery. Peers need a reliable, stable endpoint to connect to, or a discovery service to determine which replicas are currently available. In a single cluster, peers get consistent DNS names, even though an IP address might change across the pod lifecycle, and DNS updates are almost instant thanks to kube-proxy. But across clusters, replicas are a different story, even though they're uniquely addressable.
They may not have consistent DNS naming: depending on your service mesh of choice, the fully qualified domain name might not be consistent. So that's something we need to account for.

And finally, when we're actually migrating the application, we don't want to have to re-architect it to support multiple clusters. There may be some changes we need to make, such as using a different peer discovery endpoint, but we don't want to have to modify the application significantly before we can do this cross-cluster move, because that's just a bunch of extra engineering we don't want to do. So that's another constraint of the system.

The other challenge is storage. For ReadWriteMany volumes this is fairly simple, because the application's storage may be accessible from both clusters through a network endpoint that stays consistent. But if we're using ReadWriteOnce volumes, we need to make sure our storage references can be shared across clusters. Consider the case where an application is using a persistent disk in one cluster: how do we get the other cluster to recognize that persistent disk and use it once it's no longer being used in the original cluster, as we move a replica over? We need a way for this data, or at least the reference to it, to be accessible from both clusters. Within a single cluster, PVs are a global resource, but now that we're breaking out of the cluster boundary, we still need to enforce that invariant. So that's another challenge we have to handle: the API server isn't enforcing PV-to-PVC uniqueness anymore, so we need the orchestration layer, something outside the clusters, to make sure the application still obeys these semantics.
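To make the storage-reference idea concrete: one common pattern for handing the same underlying disk to a second cluster is to re-create the PV object there, pointed at the same volume handle, with the cluster-local bookkeeping stripped. This Python sketch shows that transformation on a PV manifest held as a plain dict; the field names follow the Kubernetes API, but treat the exact clean-up list as an illustrative assumption, not a complete recipe.

```python
# Hedged sketch of the "migrate the storage references" step: given a PV
# object read from the source cluster, produce a manifest the destination
# cluster can adopt for the *same* underlying disk.
import copy

def pv_for_destination(src_pv):
    pv = copy.deepcopy(src_pv)
    # Never let either cluster delete the shared disk by accident.
    pv["spec"]["persistentVolumeReclaimPolicy"] = "Retain"
    # Drop cluster-local bookkeeping so the object can be re-created.
    pv["metadata"].pop("uid", None)
    pv["metadata"].pop("resourceVersion", None)
    # Keep the claimRef name/namespace so the PV pre-binds to the matching
    # PVC in the destination, but drop the source cluster's PVC identity.
    claim = pv["spec"].get("claimRef", {})
    claim.pop("uid", None)
    claim.pop("resourceVersion", None)
    return pv

src = {
    "metadata": {"name": "pv-m3db-0", "uid": "abc", "resourceVersion": "42"},
    "spec": {
        "persistentVolumeReclaimPolicy": "Delete",
        "claimRef": {"name": "data-m3db-0", "namespace": "m3", "uid": "def"},
        "csi": {"volumeHandle": "projects/p/zones/z/disks/m3db-0"},
    },
}
dst = pv_for_destination(src)
print(dst["spec"]["persistentVolumeReclaimPolicy"])  # Retain
```

Setting the reclaim policy to Retain first matters: it keeps either cluster from deleting the shared disk while the other side still needs it.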
We want those semantics obeyed even though we don't have the invariants that the API server enforces.

And finally, the major challenge here is orchestration. There are a number of things to think about. Number one: replicas need to follow storage. If we're using dynamic provisioning and we simply scale down replicas in the source cluster and bring them up in the destination cluster, we may get new volumes provisioned that we don't want, when really we want our data layer to be migrated. So replicas need to follow storage, and we need to bring up the storage references in the new cluster so the application can access its data in the destination cluster.

The other challenge is pod disruption budgets. How do we make sure that during a migration, infrastructure changes, maybe a cluster update or maintenance events, don't affect the application and don't exhaust the budget of the global application that's logically running across both clusters? We need some way to enforce a pod disruption budget, or maintain a certain budget of availability, while we're migrating. And the other challenge is networking, which we touched on.
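Backing up to the disruption-budget point for a second: mechanically, a "global" budget means allowed disruptions are computed over the ready replicas in both clusters, not per cluster. A minimal sketch, with hypothetical numbers rather than our real controller logic:

```python
# Sketch of a global disruption budget: during a migration the application
# logically spans two clusters, so each cluster's local count alone would
# over- or under-estimate how many pods can safely be disrupted.

def allowed_disruptions(ready_by_cluster, min_available):
    """How many pods may be disrupted right now, given ready replicas
    per cluster and a minimum that must stay available overall."""
    total_ready = sum(ready_by_cluster.values())
    return max(0, total_ready - min_available)

# M3DB needs 2 of 3 zone-replicas healthy; mid-migration one replica has
# already moved to the destination cluster.
print(allowed_disruptions({"source": 2, "destination": 1}, min_available=2))  # 1
print(allowed_disruptions({"source": 1, "destination": 1}, min_available=2))  # 0
```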
We need some way for clients to connect, and for peers to connect to other peers. The mechanism we normally rely on, kube-proxy, is very quick: we can update a DNS endpoint and the new pod IP is reflected there almost immediately. But if we're using a service mesh that spans clusters, our DNS updates might not propagate as quickly as kube-proxy, so we need some way of signaling to our orchestration system that a DNS endpoint isn't ready to be served to clients yet and that we need to wait. So that's another thing we need to be aware of.

The other challenge is with operators. An operator typically reconciles toward a particular state, so if we have orchestration sitting outside a cluster that's moving a StatefulSet's replicas over, we need the operator to remain in sync with what we're doing. If we simply remove replicas from a StatefulSet, an operator will try to reconcile back to its specified state. So we need some way to either make the operator aware of the migration, or let it know that we're scaling down and that it shouldn't get in the way and fight over StatefulSet resources.

Now I'm going to talk about some of the ways we can tackle these problems in open source. The first building block I'm going to introduce is multi-cluster Services, which is KEP-1645.
It was introduced in 2020, and it provides a specification for cross-cluster domain naming. The benefit we get, looking at the peer discovery problem I talked about before, is that we can use a headless multi-cluster Service to make replicas across clusters uniquely identifiable. That solves the problem of having unique identity while also having a discovery service: if we query the discovery service, similar to a headless Service within a single cluster, we can discover all of the peers behind this multi-cluster headless Service. And for client networking, as Matt mentioned, M3DB has this constraint of clients needing to uniquely address replicas. Multi-cluster Services give us a way to uniquely address these endpoints, as long as they're qualified by the ID of the cluster the replicas are in. So in addition to the uniqueness provided by multi-cluster Services, we have a way to discover these endpoints.

The other building block I want to introduce is StatefulSet slices. This is a new KEP we're working on for Kubernetes 1.26, targeting alpha. At its core, it's a way of segmenting a StatefulSet into a subset of replica ordinals. It enables you to take a StatefulSet of, say, n replicas and split it in two, so you have a set you can scale down in a source cluster and scale up in a destination cluster in a complementary fashion.
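Before going further into slices, here's the naming half made concrete. Under the multi-cluster Services model, a replica's stable name is qualified by the ID of the cluster it lives in; the exact FQDN shape below follows my reading of the MCS DNS conventions and should be treated as illustrative.

```python
# Sketch of clusterset-scoped naming for pods behind a headless
# multi-cluster service: the cluster ID is part of the stable DNS name,
# so the same ordinal stays uniquely addressable in either cluster.

def mcs_pod_fqdn(pod, cluster_id, service, namespace):
    return f"{pod}.{cluster_id}.{service}.{namespace}.svc.clusterset.local"

print(mcs_pod_fqdn("m3db-0", "source", "m3db-headless", "m3"))
print(mcs_pod_fqdn("m3db-0", "destination", "m3db-headless", "m3"))
```

With names like these, peers and clients can keep addressing each ordinal individually on either side of the migration.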
One of the challenges with StatefulSets today is that, if you're using OrderedReady, they scale up from zero to n and scale down from n to zero. If we simply scale down in cluster A and then try to scale up in cluster B, cluster B will start from ordinal zero, so we'd have an overlap of replicas. With this KEP, we get more granular control over those replica ordinals, and we can scale in the destination cluster from replica n downward, giving us a slice complementary to the one we've just scaled down in cluster A.

So how do we tie these together to build a solution for cross-cluster migration? We need to coordinate the building blocks. We need to leverage multi-cluster endpoints, or incorporate a peer discovery protocol that allows communication across clusters. We need applications to provide signals that they're healthy, whether that's the application finding quorum from its peers or, if DNS propagation is an issue, notifying us when its client endpoints are ready. We have to make sure operators can coexist with application orchestration, and if you're using a CI/CD pipeline to manage your StatefulSets, you need some way to signal to it so you don't have contention between the migration and your GitOps workflow. And then we also need to migrate our dependencies over: the storage layer and the ConfigMaps we reference need to be moved to the other cluster so we can bring up our StatefulSet in cluster B.

So we're going to show a demo of an M3DB migration. Some context on this demo: we're running M3DB across three zones, so we have three StatefulSets, and we're going to migrate them one by one, one per zone. In this demo, we're using multi-cluster Services on GKE.
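One quick aside before the demo: the complementary-slice arithmetic described above can be sketched in a few lines. This is a toy model of the ordinal split, not the actual controller code.

```python
# Toy model of StatefulSet slices: split a logical StatefulSet of `total`
# replicas so the destination cluster owns the highest ordinals while the
# source keeps the lowest, with no ordinal owned by both sides.

def slice_ordinals(total, migrated):
    """Return (source ordinals, destination ordinals) once `migrated`
    replicas have moved; the destination starts at total - migrated."""
    start = total - migrated
    return list(range(0, start)), list(range(start, total))

print(slice_ordinals(5, 2))  # ([0, 1, 2], [3, 4])
```

The invariant is that no ordinal is ever owned by both clusters at once, which is exactly what a plain scale-down in A plus scale-up-from-zero in B can't give you.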
For the orchestration of these StatefulSets, we're using KEP-3335, so this is a modified Kubernetes cluster that has the enhancement. And to keep the migration simple, both of these clusters are running in the same region, so they have the same zonal footprint. As we migrate our storage references over, we're still referencing the same underlying persistent disks, and as long as the compute resources match up with the storage resources, our replicas are able to come up as the storage gets attached to the compute in the new cluster.

So we're going to roll the demo here, and I'll talk through it as we go. We kick things off with a CRD. On the left we have our source cluster, which is running M3DB; on the right we have a destination cluster that doesn't have anything in it. Orchestration has just kicked off: it starts scaling down the source StatefulSet and bringing up the destination StatefulSet. As I mentioned, replicas need to follow storage, so the first thing it did was migrate the storage references, taking those PV, PVC, and ConfigMap dependencies and moving them over to the destination cluster. Once that happened, it was able to create a new StatefulSet referencing the replica we wanted to migrate over. If this StatefulSet had more replicas, we would scale in the highest ordinal first, but we're moving StatefulSets of replica size one in order to fit everything on the screen. At the top we have the commit log traffic for our M3DB application, so as a replica is scaled down, you can see traffic to that particular replica drop off, and as DNS propagates, you can see that
We can see there Traffic search to scale up on the destination cluster And so I sped up the demo and cut out kind of the boring sections We're waiting for our DNS names to propagate propagate to our clients But in effect we're able to showcase how that scales up and scales down and clients are able to get updated endpoints As we're migrating over. This is a little glitch in cube proxy We just had to restart it here, but our applications back up and We'll just wait for the final instance to come healthy and then That should be the end of the demo there. So the other thing to mention is for for readiness what we're using here is pod readiness gates to Note that a pod is ready And we can tie in the network piece here as well So once that endpoint becomes ready with our logic cluster DNS name Then we can signal the pod health Propagate that up to staple set health and then our orchestrator is able to know it's safe to Move on to the next replica or the next zone depending how we configured migration Okay, so what's next? So safety that's one thing I didn't really touch on but I did mention as a constraint We need to make manage budget across our clusters. So making sure we've protected our application across clusters We're modifying pod reception budget in a distributed way so that both clusters have a Global view of budget and we maintain budget if there's any maintenance events Increasing the speed so aligning our unavailability budget with failure domains in this case Moving one zone at a time because that's kind of the constraint of m3db We need two out of three zones up in order to handle rights But we can speed this up based on failure domains Maybe we take one zone down at a time or maybe we can handle a subset of replicas with within a single zone And we want to have a max unavailable per zone that we're managing at a time Data flexibility so being able to move data across regions You know this demo just showed moving across a single zone. 
We're still referencing the same underlying disk, but you can imagine that, instead of using that disk reference directly, if we wanted to migrate across zones we could replicate the underlying storage to another zone and then bring up a new replica there. There are some additional challenges here in terms of network latency if the other cluster is in another region during the migration, so you'd probably want to do this only during a period of lower traffic, during a regular maintenance window.

And finally, operator compatibility, meaning supporting operators in general. For this demo, we made some changes to make the M3DB operator multi-cluster aware, but we do want to make some improvements in open source so that there's a specification generic operators can tie into to handle this operation. How do we signal, from orchestration systems that sit outside the clusters, to operators within the clusters, that orchestration should be deferred to another controller? That's the idea: making operators multi-cluster aware. And now, back to Matt to close things out.

All right, that's about all we have. If you want to hear more StatefulSet war stories, I'll be at the Chronosphere booth today, which is right near the GCP booth, and you can find Peter as well. We'll take some time for questions now; we have a hand up in row six over there.

Thank you. So, how do you handle a dependency between two applications?
As an example, say I have an application that depends on another application, and that application needs to be migrated first to the second cluster.

Yeah. So for dependencies, in this example M3DB has a dependency on etcd, and in this scenario we configured etcd to sit behind a multi-cluster endpoint, with M3DB communicating with it through its multi-cluster domain name. So as M3DB migrated, etcd remained in the source cluster. What you can do is migrate your dependency first, so we could have migrated etcd first and then M3DB, or vice versa, but you do need to think about which pieces, which building blocks, to migrate as you go. And the applications do need to be multi-cluster available in order to have this cross-cluster dependency resolved.

Hey, I have a question here. By the way, thanks for the great talk. One of the things I wanted to understand is whether there are provisions today to cut off traffic for stateful workloads, in the sense that if I wanted to ensure that in-flight writes are persisted to the backing store, are there hooks to ensure those are flushed before we move over?

Let me step up here in front of the monitor so I can hear a bit better. Okay.

So as we migrate workloads, one of the things we want to ensure is that data written to one of the stateful sets is fully written. Is there a way to ensure that writes to the final destination store are complete? Is there some mechanism to notify the orchestrator of that before it kicks off and moves things over to the other cluster? What mechanisms do we have on the Kubernetes control plane to ensure this is done?
Yeah, so this is about data consistency for a single replica. One thing we can do is make sure the application has the right primitives built in to flush that data to its ReadWriteOnce disk as it's being terminated. That means setting up appropriate graceful termination windows, so that we can terminate gracefully and flush our writes to the disk before we actually migrate the replica. As long as we have the API semantics of letting the Kubernetes API take down a pod and have it terminate gracefully, we can control orchestration safely and have a consistent set of data written to disk before we bring up the new replica in the new cluster. Effectively, if we're using the same disk reference, it's no different from a pod being restarted for maintenance, an application upgrade, or a termination event.

Also, in this example, the workload migration controller gives feedback to any operator observing it about where you are in the migration. So if you build your application to be aware of when it, or the thing your operator controls, is being migrated, then it doesn't have to be an opaque migration: you can inform your application and, as you said, shut it down to ensure all writes have been flushed before it moves.

Okay, thank you, Peter. One more question over there, if we have time.

So within a single cluster, the attach-detach controller is responsible for making sure ReadWriteOnce volumes are attached to at most one node. Going forward, how will the attach-detach controllers across both clusters work together?

Yes, that's one of the challenges I mentioned: these invariants that are enforced in a single cluster today are not enforced across clusters.
So whatever is orchestrating your migration outside of the clusters needs to make sure those invariants are logically kept, even though there's no way to enforce them through the API server. Yeah, we're kind of dealing with a split-brain scenario, because there are two attach-detach controllers running in two separate control planes, so it's really on the orchestrator to make sure we've safely shut down a pod and that the disk is no longer attached to that VM. We do need to make sure our signaling around termination state is correct when a pod is actually brought down, before we bring it up in the new cluster. There are also some safety checks: most storage implementations, whether from a cloud provider or something else, won't let you attach a ReadWriteOnce disk in multiple places. So worst case you get an error, and hopefully not totally corrupted application state.

Thank you. Just on the same note, I have another question on StatefulSets. If your application is closely tied to its disks, in this case a DB, and we're not doing a cluster migration but rather a blue-green deployment for a DB schema change for that specific DB, how would we do that? Because the existing DB is still serving, and we need to spin up and validate the new DB schema. Is that something you've thought about?

Yeah, so a blue-green kind of scenario. In this case we'd be both migrating to a new cluster and doing an application upgrade at the same time, right? I agree that definitely adds additional complexity. It's not something we showcased in this demo, and I think it can be done under the right safety protocols, but it certainly is a challenge, because usually when you migrate a database schema to an updated version, you can't roll back. Say for MySQL, if you upgrade the minor version,
you can't roll back to the previous minor version; it's just not supported. So I think that, to ensure safety and isolate some of the changes to your system, it's better to do these updates separately, so you're minimizing risk. The benefit of moving across clusters is that you can roll back, provided you have the same database schema in both clusters; doing the upgrade at the same time introduces risk and makes the rollback option no longer possible, which reduces the safety of the cross-cluster migration anyway.

Thanks. Thank you, everyone.