Hello everyone. Welcome to KubeCon, welcome to my office, and welcome to my talk on cluster reconciliation and how to manage multiple clusters. My name is Vallery Lancey. I've been involved in Kubernetes for a handful of years now. KubeCon 2017, if you recognize this, was my first time attending, and I have been tilting at the problem of how to handle multiple clusters for most of that time. Starting from very early in my experience with Kubernetes, aside from all the problems that come with trying to jam a stateful app that was always meant to run on one single VM forever into a cluster, I quickly encountered a lot of problems with how to run a redundant, geodistributed system in Kubernetes effectively. There's been some evolution that I'll talk about, but it's still something that's bespoke, and there's a long way to go.

So first off, I want to talk about why we want to use multi-cluster at all. There are roughly four reasons that I usually see identified. The first one is regional points of presence. Kubernetes clusters don't span multiple regions or data centers very well. There's an expectation of very low latency between the components, there's a huge amount of internal traffic between components, and they don't exactly partition well. So if you are running your app in multiple places, each place should be a cluster. Second, you will often want cluster-level redundancy. Clusters can fail in many, many interesting ways. You can deploy something cluster-wide that fails, you can take down the cluster control plane, you can take down some key automation. This especially goes if you are running your own control plane, if you're not using a managed service. If you're running your own control plane, please, I'm begging you, run redundant clusters. You do not want to be fixing that live. The next one, also a big plea, covers cases where you have sufficiently heterogeneous workloads to want cluster-level security isolation.
There are many, many ways, between past CVEs and just common misconfigurations and attack surfaces, that someone can take a fairly inconspicuous level of access to the cluster and wind up managing to exfiltrate substantial amounts of data from workloads, or take over the cluster altogether. In particular, Ian Coldwater and Tim Allclair have had some fantastic prior KubeCon talks about some of the ways that things can be misconfigured or exploited. And the last one is that it's common in large setups to have blue/green cluster upgrades, potentially having clusters that you specifically spin up and replace instead of trying to upgrade everything in place. And again, this goes well with having cluster redundancy versus "we'll do it live."

So what kind of things do we actually put in a cluster? We have things that we'd consider to be the core cluster setup: stuff like the admin RBAC, the basic monitoring stack, maybe data center or cloud provider integrations, things that we probably want to put everywhere. There's a lot of stuff that is more specific; it might change more often, or it's more specific to the workloads we're running. That might be various storage backends, specific ingress controllers that a particular application or team needs, or an operator or framework, like something to run Spark. And then most importantly, we have the applications, the things that actually make our business money and are the reason that we are here doing this. We also need to deploy those.

I can sometimes be a detractor of how the cloud native world turns into a lot of very fancy technology and checklists, and you'll hear me call back to that a few times. But I want to explicitly call out what problems we are considering in scope and trying to solve in terms of building a cluster reconciliation system. We want to deploy common resources to multiple clusters.
We want to loosely couple the clusters and those resource definitions. We want to clearly map the relationship between those resources and those clusters. And we want to be able to garbage-collect deployed resources that should no longer be deployed there.

So what does failure to build a good multi-cluster system look like? You can easily wind up with a dependency tangle between resources and versions. Anything that is consuming the Kubernetes API can become very tricky to version, and there's a compounding effect for components that consume other components. You also need to be aware of any dependencies that are required in your application, because you typically want a certain amount of locality between the actual services you're running: making sure that you don't have a golden-path workflow that hops between different data centers, or making sure you have data locality for services that need to consume that data. You could easily end up with orphan resources on clusters. This would be something like a decommissioned service or an experiment that is running in a place it shouldn't, because we don't intend to be deploying it anymore, but we're not cleaning it up. You might wind up with wonky workload-to-cluster mapping, where you have a cluster that's been around forever and everything's on it, and then you spin up a new cluster that's cheaper or in a better location, and there's just not much stuff on it because people haven't adopted it yet. And all these things can make it very hard to bootstrap new clusters: if you have that really nasty dependency chain and you have to figure out how to finagle everything onto the cluster, it can be difficult to just deploy somewhere new if you're a random application.
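To make that garbage-collection goal concrete, here is a minimal sketch (with made-up cluster and workload names, not any real tool's API) of detecting orphan resources: things still running in clusters where the desired mapping no longer places them.

```python
# Minimal sketch: given a desired workload -> cluster mapping and what is
# actually running, find orphans, i.e. workloads deployed to clusters where
# they are no longer intended. All names here are illustrative.

def find_orphans(desired, actual):
    """desired/actual: {cluster_name: set_of_workloads}.
    Returns {cluster_name: workloads that should be cleaned up}."""
    orphans = {}
    for cluster, running in actual.items():
        intended = desired.get(cluster, set())  # empty if cluster was removed
        extra = running - intended
        if extra:
            orphans[cluster] = extra
    return orphans

desired = {"us-east-1": {"web", "ingress"}, "us-west-2": {"web"}}
actual = {"us-east-1": {"web", "ingress", "old-experiment"},
          "eu-central-1": {"web"}}  # cluster no longer in the desired map
print(find_orphans(desired=desired, actual=actual))
# -> {'us-east-1': {'old-experiment'}, 'eu-central-1': {'web'}}
```

Without a step like this, the "decommissioned service still running somewhere" case simply accumulates forever.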
So in order to talk about why some of these challenges are surprisingly hard to do well, I want to talk about what deploying something actually means, and a lot of the logistics of the Kubernetes API itself. Forgive me if this is overly basic, but I don't want to assume anyone's level of knowledge walking into this talk.

I want to briefly cover how the Kubernetes API itself works. The Kubernetes API has an object model, a tiny bit analogous to Linux's everything-is-a-file model. Everything is represented with a group, a version, and a kind, which identifies it: a Deployment, for example, is a particular object in the apps/v1 API. We see that represented in the YAML representation of a thing. And every single object also has an identifier; that's how we tell one resource, like a deployment, apart from another. We know that one particular deployment is different from another because it has a unique name/namespace pair. Every resource has a name. Any resource that exists in a namespace has a namespace; most things exist in a namespace, like pods, deployments, et cetera. Some things, like nodes or cluster role bindings, don't exist in a namespace and have no namespace field. And Kubernetes itself is a REST-ish HTTP API. We often don't see that because we're busy dealing with a typed client or kubectl or something, but this all comes together to form a fairly long API path that identifies any individual object.

Kubernetes supports a bunch of different verbs; I'll point out the ones that are relevant to making changes to things. We have create and we have update. They work like you'd expect with a REST CRUD API: either create a new thing that doesn't already exist, or replace an object wholesale with a new state. Update has a version check, which is quite helpful. There's a version in the metadata of any existing object, so if you fetch an object, change it in memory, and then go to reapply it, the API can tell whether you're replacing the same version that you fetched.
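As a rough illustration of how that identity maps onto a URL, here is a sketch of building the API path for namespaced versus cluster-scoped objects. It is deliberately simplified: it skips the legacy core group (which lives under `/api/v1` rather than `/apis`) and assumes you already know the lowercase plural resource name for the kind.

```python
# Sketch of how group/version/resource plus name (and optional namespace)
# forms the REST path identifying a single Kubernetes object.
# Simplified: no core ("") group handling, no kind->resource pluralization.

def object_path(group, version, resource, name, namespace=None):
    if namespace is None:
        # Cluster-scoped objects (nodes, clusterrolebindings, ...) have no namespace.
        return f"/apis/{group}/{version}/{resource}/{name}"
    return f"/apis/{group}/{version}/namespaces/{namespace}/{resource}/{name}"

# A Deployment "web" in namespace "default" (group apps, version v1):
print(object_path("apps", "v1", "deployments", "web", namespace="default"))
# -> /apis/apps/v1/namespaces/default/deployments/web
```

This is the "fairly long API path" the typed clients and kubectl are constructing for you under the hood.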
The Kubernetes API will warn you if there's a version conflict, in other words, if the object has changed since the version you fetched. There's a patch operation, which lets you have custom semantics to say: I only want to update a specific field, and I don't want to worry about or touch anything else. And then there is delete, which removes the object from the cluster. It's worth knowing about something called finalizers on objects: a list of individual entries that must all be removed before the object is able to be deleted. So if you have some kind of cleanup operation that's required before an object can be deleted, say making sure that a pod is actually terminated and removed from the system, or making sure that some load balancer infrastructure is torn down for an ingress, you have a finalizer on the object, and whatever machinery is responsible for that teardown removes the finalizer once it's done. Once all finalizers are gone, the delete happens. This is also why you usually shouldn't force delete: if you're force deleting, it's because there's a finalizer trying to do some cleanup and it's not making progress. If you force delete, that cleanup never happens.

There's also the server-side apply API. This was rolled out very roughly a year ago. I am not as familiar with it, but it provides the kubectl-apply-type behavior as an actual API in Kubernetes, as opposed to forcing people to use kubectl apply, the CLI tool. It adds a bunch of metadata giving ownership to specific fields in an object. If you change a field, then you are considered an owner; individual controllers, or kubectl itself, are examples of owners. And when you want to specify a change, only the fields that you own are taken into account. So if you own a field and you don't specify it in your new desired version, then it will be removed.
If there are fields that you do not own and they're not in your desired version, then they're ignored; the existing version is kept. Understandably, this can lead to conflicts when you want to change something that you don't own. There are three different ways you can handle this. One is to reapply without the fields that are conflicting, in other words, give up on trying to make that specific part of the change. The next one is a no-op in terms of the actual state: you change the field value that you're trying to apply to the existing one. This doesn't change the state the data is in, but you become a shared owner of that field, which means that in future apply attempts you would be able to apply without any conflict. And the last one is doing a force apply: you become the sole owner of that field and you update it.

So there are many gotchas when it comes to the mechanics of trying to deploy or update something in Kubernetes. I'm going to go through a couple of specific cases, and then talk about how the number of intent issues can rise drastically when you don't know what kinds of objects and intent you're dealing with. For the sake of simplicity, when I talk about a service in Kubernetes, I mean a deployment, a Service object (we know it's a bad, confusing name), an ingress, and maybe config maps or secrets: a deployment, its way to get traffic, and the accoutrements that go with that. I'm going to show a couple of examples of state that we're trying to apply and state in the cluster. The state that we're trying to apply is on the left; the cluster state I'm going to show on the right. So here on the left, we want to deploy the archetypal nginx example. Say we've now deployed that, and we want to patch in a specific image change for our deployment.
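Before the examples, those field-ownership rules can be simulated on a toy flat object to make the three conflict resolutions concrete. This is only a sketch of the semantics just described, not the real implementation: actual server-side apply tracks managers in `managedFields` over structured paths, and the manager names here are made up.

```python
# Toy simulation of server-side-apply field ownership on a flat dict.
# state: {field: value}; owners: {field: set_of_managers}.

class Conflict(Exception):
    pass

def apply(state, owners, desired, manager, force=False):
    # Fields that differ from the live value and are owned by someone else.
    conflicts = [f for f, v in desired.items()
                 if f in state and state[f] != v and (owners.get(f, set()) - {manager})]
    if conflicts and not force:
        raise Conflict(f"fields owned by other managers: {sorted(conflicts)}")
    # Fields this manager owns but no longer specifies are removed.
    for field in list(owners):
        if manager in owners[field] and field not in desired:
            state.pop(field, None)
            del owners[field]
    for field, value in desired.items():
        if field in conflicts:
            owners[field] = {manager}                             # force: sole owner
        else:
            owners[field] = owners.get(field, set()) | {manager}  # (shared) owner
        state[field] = value
    return state

state, owners = {}, {}
apply(state, owners, {"image": "nginx:1.19", "replicas": 3}, manager="deployer")
try:
    apply(state, owners, {"replicas": 5}, manager="autoscaler")   # conflict
except Conflict as e:
    print("rejected:", e)
apply(state, owners, {"replicas": 5}, manager="autoscaler", force=True)
print(state)  # replicas is now 5, owned solely by the autoscaler
```

Applying the existing value (the no-op resolution) makes you a shared owner; dropping a field you own removes it; forcing takes sole ownership. Those are exactly the three outcomes described above.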
I think I have the version slightly mixed up here, but if we are only specifying a subset of the fields in the version that we're trying to apply, what we will likely accidentally do, say if we're just doing an update, is wipe out a bunch of fields that we wanted to keep. When we update the individual nginx container struct and only specify some fields in it, everything else in that struct just goes away. So suddenly we have no port on our nginx, and that will understandably cause some things to fail.

Suppose we want to change the name: instead of calling it nginx, for whatever reason, we want to call it web server. We change the name in our deploy system, we deploy, and now we have two objects. This happens because we've changed the primary key of our object, and there's no garbage collection with any kind of naive kubectl-apply approach. So now we have two. Maybe this just means running some extra resources that we shouldn't be running; there are also attack surface implications in leaving an old thing running forever. And it can cause really wacky behavior, say if this is a thing that's pulling jobs out of a queue, because then you have ancient software doing whatever and taking away real work. So there are lots of really weird ways that this can eat resources or fail, and it's a common thing I see when I'm looking at an under-managed cluster.

Another problem we can have is trying to figure out whether we should remove a field or not. On the left, we're trying to create a ClusterIP service for our demo, and we don't specify a clusterIP, because that's normally assigned automatically: something in the control plane (I don't think it's actually kube-proxy) grabs an IP out of our cluster IP space and puts it in the cluster state. However, that field is now different between the two.
It is empty in our desired state and present in our current state, and a naive reconcile tool might want to replace that offhand. I actually don't remember if the behavior is to allow that field to be removed and wind up generating a new IP, or if it's an immutable field, but either way, it's not a thing that we want our tool to try. It is, however, a thing that a naive approach would try. And the counterpoint to this, if we just left fields alone, is that we could easily wind up with drift: automation or people, either innocently or maliciously, inserting things into our production copies, and us not noticing, or at least never removing those fields. If we also don't have a tombstone or garbage collection method for specific fields, then when we take a field out of our definition and apply, we have a similar problem to the two-nginx-copies thing: nothing knows it's supposed to clean up our own field, so it just remains there forever, even though it's not in our definition.

To summarize, and to bring up another couple of cases: a lot of update methods will leave fields untouched. It's also easy to accidentally update fields that you don't want to update, such as fields that are managed by other automation or set by the cluster. There are a number of fields on resources that are defaulted on admission, which means that when you create a resource, some fields go from being unset to being explicitly set to some default. So you can wind up in a reconcile loop if you always push a change whenever you see a difference, because there will always be that set of missing fields in your intent that semantically translate onto what's in the cluster, but it's not a literal deep-copy match. And there are also a couple of weird gotchas with specific fields: some fields are immutable.
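One common mitigation for the defaulted-field problem is to compare only the fields your intent actually specifies, so server-populated values like clusterIP don't register as drift on every reconcile pass. Here's a rough sketch of that idea (with the caveat noted above: drift in fields you didn't specify goes unnoticed).

```python
# Sketch of a drift check that only compares fields present in the declared
# intent, so server-populated fields (clusterIP, admission defaults, status)
# don't make every reconcile pass look like a change. Nested dicts recurse.

def drift(desired, live, path=""):
    diffs = []
    for key, want in desired.items():
        have = live.get(key)
        where = f"{path}.{key}" if path else key
        if isinstance(want, dict) and isinstance(have, dict):
            diffs += drift(want, have, where)       # compare nested specs
        elif have != want:
            diffs.append((where, have, want))       # (field, live, desired)
    return diffs

live = {"spec": {"clusterIP": "10.0.12.7",          # set by the cluster, not by us
                 "ports": [{"port": 80}],
                 "type": "ClusterIP"}}
desired = {"spec": {"ports": [{"port": 8080}],      # clusterIP deliberately absent
                    "type": "ClusterIP"}}
print(drift(desired, live))
# -> [('spec.ports', [{'port': 80}], [{'port': 8080}])]
```

The clusterIP never shows up as a diff here, so the tool won't try to strip or regenerate it; the trade-off is exactly the drift-blindness described above.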
Some of the fields in Service, I believe, are immutable, and Secret is a common case: you can't switch between the data and binaryData formats in a secret. This exposes an unpleasant problem called break-before-make: when you have a singularly identified object that is immutable and you want to make a change to it, you basically have to delete it and recreate it. And there are many, many things that can go wrong with that. If the create fails in some way, or the delete hangs, you're stuck with no resource. If there are any cascading effects, that can be very bad. If it's just a slow process compared to whatever's running, that can be very bad. It's a problem that comes up a lot in infrastructure automation, and it usually comes from poor API designs and then trying to do generic actions on top of them.

So let's talk about the journey to having a multi-cluster system. I'm going to try to break the things I've seen into three general models. The general trend, as the number of services and the number of clusters at a company increases, is to go from tight coupling to looser coupling between the workloads and the infrastructure details, and from being very manually managed to being much more automatically managed, focusing more on the user's intent. And as a reminder, the goal of building this kind of stuff is to reduce uncertainty and toil. The goal is not to build a beautiful machine that works magically and is all cool; the beautiful system is one that works well enough. So be sure about the problems that you need to solve and how expensive ops work is, to avoid sinking weeks or months of development work into solving pretty marginal burdens.

Our first model, the starting place, is having hard-coded clusters. We explicitly have some kind of deploy config that says this workload goes to this cluster.
Maybe this is a Jenkins pipeline where each stage specifically uses one kubectl context to apply to something. We wind up with a fair bit of duplication as we have an increasing number of services, because we have to specify in each one how they map out. We have configuration lag as the cluster topology changes: as we add a new cluster, we have to opt services into it; if we shuffle things around, we have to change which workloads point where; if we remove a cluster, we probably have to update all those deploy pipelines or we break them. So it's super easy to start with, but it definitely doesn't scale with the organization well; you start to experience more and more friction.

Once you experience that friction, the logical way that a lot of people alleviate it is by creating cluster groups. Instead of explicitly targeting individual named clusters with a workload, you target some group, and a group is just a way of labeling a cluster: production, staging, production-US, or production-GPU, something that describes the capabilities or attributes of the cluster that we care about. But a human has to create that group and map it all out, and a human has to decide what group to target. We don't necessarily get any choice in this matter beyond that mapping. A common case is: I want to run on some clusters, and I don't care which ones, like if I have some fairly ephemeral batch workloads. The common way to do this is that you more or less designate batch clusters, or you explicitly pick an effectively random cluster. It doesn't realistically need to be us-west-1a; someone just hard-coded that in because they didn't want to deploy it to the whole fleet.
So this can work well when you have a moderate amount of stuff, but the number of arbitrary choices made, or just the sheer amount of mapping to describe different workloads with different requirements, starts to become unwieldy. The evolution from that, which to my understanding is not something many Kubernetes users are doing yet, is to start to treat workloads across clusters kind of like we treat pods within a cluster. In this model, we specify the constraints on a workload: we want to put this on three geo-distributed clusters in the US, or we want to run this on any one cluster that runs workload X that we depend on. We specify abstract constraints that describe the requirements we have, and we let something actually make that scheduling choice for us, as opposed to us making either an ad hoc or arbitrary choice about what that's going to be. To summarize, we really start to capture the intent of whoever is configuring this workload, without directly coupling things together.

So let's look more at a model of this: how we want to schedule workloads across clusters. This is a fairly vague architecture, because most of this does not exist in any open source form that I know of today. We have a bunch of individual clusters. We have some kind of cluster reconciliation tool that is able to fetch the desired state for each cluster, everything that should be in that cluster, from the admin RBAC to all the services in it and the definitions of all those services, from some API. And it's able to push any changes from that API, such as image updates or configuration updates, into the cluster, as well as reconcile and fix anything in the cluster that deviates from the intended state because something changed it.
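A hypothetical sketch of what that constraint-based choice could look like: a registry of clusters with attribute labels, and a scheduler that picks any matching online clusters instead of a hard-coded name. All names, labels, and fields here are invented for illustration; no real scheduler with this API exists.

```python
# Toy "treat clusters like nodes" scheduler: pick clusters whose attribute
# labels satisfy a workload's abstract constraints. Illustrative data only.

clusters = [
    {"name": "us-east-1",    "labels": {"region": "us", "gpu": False}, "online": True},
    {"name": "us-west-2",    "labels": {"region": "us", "gpu": True},  "online": True},
    {"name": "eu-central-1", "labels": {"region": "eu", "gpu": False}, "online": False},
]

def schedule(constraints, count, registry):
    """Return up to `count` online clusters whose labels match every constraint."""
    matches = [c["name"] for c in registry
               if c["online"]
               and all(c["labels"].get(k) == v for k, v in constraints.items())]
    return matches[:count]

print(schedule({"region": "us"}, count=1, registry=clusters))  # -> ['us-east-1']
print(schedule({"gpu": True}, count=3, registry=clusters))     # -> ['us-west-2']
```

The point is that "any one US cluster" or "any GPU cluster" is expressed as intent, and the arbitrary pick happens in one well-defined place instead of being hard-coded per workload.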
We have a top-level workload API that represents, say, application foo globally, which then gets materialized into those individual clusters. And what makes that decision is a cluster scheduler. I submit something to the workload API and say, hey, I just need one cluster that runs this database, because I need to run a batch job. And the cluster scheduler goes: okay, you need affinity with X, X is scheduled here, so I'm going to schedule you here. In order to have that knowledge, we also need a cluster registry: something that is aware of all the clusters in the fleet, is aware of the various tags and metadata that make up these attributes, and is aware of their current status, making sure a cluster is online, potentially factoring things like cost and capacity into these decisions, as well as making it easy to manage credentials so that we can authenticate back and forth with arbitrary cluster control planes.

If you want to learn a little more about this model, because I'm only describing it briefly here, I wrote a blog post that I think someone generously described as a white paper at one point, but it's a 10-minute read into the idea of this kind of model, more about what the API could look like, and why I think it makes sense. This is something that someone in SIG Multicluster saw, and because of that, we've started prototyping a little bit of this material. The cluster inventory API I mentioned is the concrete view: I am cluster us-east-1, and these are the things that should be in it. It's pretty much a precise definition, a million lines of YAML or something of the sort. There's no high-level ambiguity; it's just a dump of the state. Then there's the cluster reconciler, which is responsible for doing the syncing.
So, anything that is in that cluster inventory and is missing: create it. Anything that has a differing state: make the update. And anything that was previously managed by the cluster inventory API and is no longer in that API should get deleted; garbage collect anything that should no longer be deployed. That can be surprisingly effective for doing cleanup. One of the ambiguities currently in our prototype is how we should handle updates, because as I mentioned in the API logistics portion of this talk, there are many ways in which it's hard to capture the intention of the user in terms of what operations to do. Currently, one thing I've considered is adding some kind of metadata to indicate your intention for specific keys: if you see this key in the cluster, don't change it.

This is an area where building custom tooling for the stack that you're running, especially if you run more of a platform-as-a-service rather than Kubernetes-as-a-service internally, can be very valuable. Because in reality, for any set of objects, there's a finite number of gotchas around doing updates to that state, things like the clusterIP. If you're able to build tooling explicitly around those, you will probably have an easier time than trying to build a tool that solves every single use case. It's challenging to build an open source tool, because you basically have to accommodate, with some reasonable wiggle room, everyone out there doing everything. So for example, for stuff like CRDs, you either have to say you're on your own if you hit an edge case, or you have to give the user a way to specify that intention, you know, that something is supposed to be write-once or whatever. And as I said, we're prototyping some of this stuff now; it's fairly early stages of prototyping.
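That create/update/garbage-collect logic can be sketched as a pure diff. The key detail is tracking which objects the tool previously managed, so it only deletes its own leftovers and never touches objects someone else created; the keys and specs below are purely illustrative, not the prototype's actual API.

```python
# Sketch of the reconcile step: create anything missing, update anything that
# differs, and garbage-collect only objects this tool previously managed that
# have since left the desired inventory. Objects keyed by (kind, ns, name).

def reconcile(desired, cluster, managed):
    """desired/cluster: {key: spec}. managed: keys we deployed previously.
    Returns (creates, updates, deletes, new_managed_set)."""
    creates = [k for k in desired if k not in cluster]
    updates = [k for k in desired if k in cluster and cluster[k] != desired[k]]
    # Only delete what we managed; never touch objects someone else created.
    deletes = [k for k in managed if k not in desired]
    return creates, updates, deletes, set(desired)

desired = {("Deployment", "default", "web"): {"image": "nginx:1.21"}}
cluster = {("Deployment", "default", "web"): {"image": "nginx:1.19"},
           ("Deployment", "default", "old-batch"): {"image": "batch:1"}}
managed = {("Deployment", "default", "web"),
           ("Deployment", "default", "old-batch")}
print(reconcile(desired, cluster, managed))
```

Persisting that managed set (in practice, via labels or ownership metadata on the objects themselves) is what makes the "delete what should no longer be deployed" step safe.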
We're doing specifically a minimal version of the inventory API, just enough that we can reasonably do a reconciliation process, and figuring out the mechanics of how the reconciler should work. That's one of those things where it's easy to do something, but it's hard to do what you actually want. This is an area where the SIG, and I personally, would really, really appreciate feedback on the exact use cases and problems that you have.

There is also a cluster registry. That part is not vaporware; however, it has a somewhat uncertain future. I forget the exact fate that was decided upon, because it's not super well maintained, and as far as we're aware it's not widely adopted, but it's something that lets you register all your clusters so that you can list them out, get credentials for them, and put status for them in a central place. This is something we'd require for a cluster scheduling system, because, again, you need an inventory of what clusters are at your disposal and the metadata about them that you need to make a decision.

And this is the point where we get into full conjecture land, which the original blog post about this idea covers in more detail. For our workload API, there's extreme ambiguity as to how you define a workload exactly, and how you deal with the one-to-many mapping that happens when you materialize it onto multiple clusters. When I have service X, how do I gracefully roll a version out? How do I avoid deploying it everywhere at once, because no change should ever go everywhere all at once? How do you roll it between clusters? How manual is that, how automated is that, how does it work? How do you deal with templating, if you want or need specific fields in your configuration, in your YAML or whatever, to differ between the primary version and the version for any given cluster? What kind of scheduling options do you have?
And how sticky is that cluster scheduling? What should happen if you delete a cluster and no longer satisfy the scheduling constraints, if you go from having three cluster replicas of a thing to having two? Do you spin it up somewhere else? Do you make the state and the load balancer follow it around? Do you make a human intervene? There's a very, very long tail on how to do that, and it winds up being very domain specific.

There are a couple of people in the community who have definitely inspired a lot of the way I've thought about this. In particular, I want to call out Paul Morie from SIG Multicluster, who strongly advised that we start prototyping from the bottom up, from the reconciler up, rather than from the top down with the scheduler and workload API, because there are many, many ways to do this, and there are many tools that try to do vaguely similar things that become very opinionated. What we're trying to do on the open source side is figure out the building blocks that we can give you, so that you provide the very opinionated parts that are specific to how you run apps and how your infrastructure works, for less cost than it would take to build the one-size-doesn't-quite-fit-anybody component.

So thank you very much for attending. This is not specifically a talk about Apple systems, but I work at Apple, and I really, really like the people I work with. So if you want to work with me, and with people who I really look up to, please reach out or stop by the Apple booth to learn more. We're looking for a lot of Kubernetes people right now. Now I should be able to take questions; you will be speaking to a future version of me who is hopefully slightly wiser. Thank you for joining.