So good afternoon, everyone. I'm Seb. I work at Elastic. In this session, behind a very long title on the slides, we're going to talk about how we've been redesigning our cloud platform at Elastic recently and how we make heavy use of the controller pattern that Kubernetes provides. Just a little bit of context first, so everybody understands what I'm talking about. I work on Elastic Cloud. If you don't know us, at Elastic we're the company behind Elasticsearch, Kibana, and all those products. We run a cloud service where you can basically create an Elastic Stack as a service by clicking a few buttons. We run around the world across three different cloud providers and across more than 50 regions. And we've done that for quite some time now with our own sort of homebrew orchestration system, which is not Kubernetes because it predates Kubernetes. We now run more than 600,000 containers in production, so that's what I'm working on. At Elastic, we've been involved in the Kubernetes space for a few years now. Two things in particular I can highlight. One is our observability solution that allows monitoring Kubernetes and monitoring your services on Kubernetes, so metrics, APM traces, and things like that. So we've been heavy users of Kubernetes in that sense, for developing that solution. But we've also invested in the project that we call the ECK operator. That's an operator that allows you to deploy Elasticsearch very easily on your Kubernetes cluster. I don't know if people here have used ECK. Yeah, cool. So I've worked on that one for, I think, the past few years. And while we were working on that, we realized that we would really benefit from also relying on Kubernetes internally, slowly changing and redesigning our own platform to be based on Kubernetes, because we had gained that experience of working with Kubernetes, developing against Kubernetes, building custom controllers, and things like that. With that experience, we also did a few presentations, like this one from 2019 that is also about Kubernetes controllers and operators, if you want to have a look on YouTube. So we know how to use Kubernetes now; we're used to it. And that's why, two years ago, we made the big decision to entirely redesign our cloud platform and really embrace Kubernetes fully. That's where we are headed. We started to think about how we would redesign that cloud platform for a very large scale, like a million containers across many, many Kubernetes clusters, and doing it in such a way that everything is resilient to failure, highly available, that sort of consideration. Yeah, so that's what I started working on two years ago. No pressure, easy, right? So I'm going to share a little bit of what that architecture looks like now, some insights and design decisions behind it. It basically looks like this. We can split it into two halves. There's the top half, which we like to call the global control plane, made of services deployed in Kubernetes. And there's the bottom half, which we like to call the regional application clusters. If we start with the top, that global Kubernetes cluster is our control plane from a user perspective. Our users talk to an API and make a request to create an Elastic serverless project, for example.
We store that information somewhere, and we have a bunch of controllers at the top that are going to process that request to create a project. And then the bottom half is where we have really a shitload of Kubernetes clusters, in many, many regions, on many, many cloud providers. That's where we run the actual workloads, so Elasticsearch pods, Kibana, and things along those lines. As a user, when you access your Elasticsearch deployment, you go through this proxy service that we have, and we route the request to the right pods. And there as well, we have a bunch of controllers, that's the green box, that do a lot of stuff. We're going to talk about this. So globally, we have this split, right? A bunch of global services and a bunch of regional services.

One of the first things we asked ourselves when we started designing for Kubernetes is: how far can we really push Kubernetes? We have these hundreds of thousands of containers. How many can we run in a single Kubernetes cluster? How many clusters can we have? That sort of question. And there's no easy answer to that. There's an entire group working on this problem. There are many limits. It's not as simple as saying, as soon as you have 500 nodes, it doesn't work, or something like that. It has many dimensions. What we quickly decided is that we don't really want to be in the business of trying to push Kubernetes to its limits. It's probably super fun, but we don't really have time for that. Rather, we're going to scale the number of Kubernetes clusters horizontally. So we have many, many clusters of 500 to 1,000 nodes, rather than a small number of very big clusters. And we wanted to design things in such a way that our Kubernetes clusters are disposable. The big idea here is that something could happen and an entire cluster gets destroyed. There's a failure at the cloud provider, or we have a problem, and we basically lose the entire state in etcd. We want this to not be a big problem, in the sense that we can easily recreate that cluster and repopulate everything that was in there from somewhere else, like another source of truth. So in the Pets versus Cattle analogy you may know, we treat our Kubernetes clusters themselves as cattle. We don't want to store anything so important in there that we could not tolerate losing it. And it's kind of nice because, for us, it also means we don't really need a very solid story around etcd backups and restores. If we lose a Kubernetes cluster, you could say, well, maybe we can just restore an etcd backup from five seconds ago. But what do you do with the delta of data that is missing now? That's very hard to deal with. Maybe there are persistent volumes that have been created but are not in the backup, or resources that have been deleted but still exist in the backup. It's very, very complicated, so we don't want to deal with that. We'd rather have good ways to recreate a cluster and everything in it automatically.

So going back to the architecture, let's zoom in a little bit on this concept of controllers, because we have controllers all over the place here. Since we decided to embrace Kubernetes, we really embraced that pattern of controllers as well. We've now written many, many controllers ourselves.
We have a controller to manage Kubernetes clusters, another one to manage serverless projects, another one to manage Elasticsearch and to autoscale Elasticsearch. We also run a bunch of controllers that come from the Kubernetes ecosystem, like cert-manager and some ingress controllers. And of course, Kubernetes itself has a lot of controllers built in: the Deployment controller, the ReplicaSet controller, et cetera. So, controllers all over the place. It's not always easy to navigate, because you have chains of controllers, where the output of one can be the input of another, et cetera. But overall it's quite useful, because each little controller can have a very well-defined scope.

I guess a lot of you are familiar with this concept of a Kubernetes controller now, but let me re-explain a few things very quickly in case you're not. When we say controller, we mean more or less the same thing as when we say operator; it's mostly naming, right? Basically, it's that little white box here, a piece of code that we write, which generally does one thing. It's watching something that represents the desired state; in this example, let's say an Elasticsearch cluster with three nodes at the top. And whenever there's an event related to that, like a new resource or an updated one, the controller triggers what's called a reconciliation. That's basically a sequential piece of code that compares the desired state, which is what we want, with the actual state of things. Maybe currently in Kubernetes we have an Elasticsearch cluster with two nodes, for example. So it compares desired versus actual, and it does the right thing so we eventually reach that desired state. And it can run again and again and again, the same piece of code, on any event that may happen. And it retries if things fail, with some exponential backoff logic. So it's not that complicated. Now, we don't necessarily have to take everything from Kubernetes; it's not always the best design in the world. I guess the best thing in the world is a bottle of French red wine, probably. But there are very good things to take away from this controller pattern in Kubernetes, and it was useful for us to think about it when we were developing our services. One of the nice properties is that these controllers are generally resilient. If they crash, if they restart, they automatically self-repair the resources they are responsible for. This pattern is nice for us because we moved from a system that executes things as one-off operations to a system that is able to continuously repair everything we have. It's also nice from a development perspective, because if designed correctly, the code of a controller is really idempotent: it can run again and again and again, and we can reason about it in terms of what the desired state is. That's what we sometimes call a level-triggered pattern; the logic behind this naming is that the code just works based on what's supposed to exist, the desired state.
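To make that level-triggered, idempotent shape concrete, here is a minimal sketch of a reconcile function written against the controller-runtime library. The `ElasticsearchCluster` type and the `countNodes` and `addNode` helpers are hypothetical stand-ins for illustration, not our actual code:

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ElasticsearchReconciler converges actual state towards desired state.
// It is safe to run on any event, any number of times (idempotent).
type ElasticsearchReconciler struct {
	client.Client
}

func (r *ElasticsearchReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the desired state: a hypothetical ElasticsearchCluster custom
	// resource, e.g. one asking for 3 nodes.
	var es ElasticsearchCluster
	if err := r.Get(ctx, req.NamespacedName, &es); err != nil {
		// Resource deleted: nothing left to converge here.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Observe the actual state: maybe only 2 nodes exist right now.
	actual, err := r.countNodes(ctx, &es) // hypothetical helper
	if err != nil {
		// Returning an error makes controller-runtime requeue the request
		// with exponential backoff; no hand-rolled retry loops needed.
		return ctrl.Result{}, err
	}

	// Converge: desired 3 vs actual 2 means we create one more node.
	if actual < es.Spec.Nodes {
		if err := r.addNode(ctx, &es); err != nil { // hypothetical helper
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}
```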
The opposite is edge-triggered, where you would implement code that deals with one particular change of state. That's harder to reason about, because it leads you to build state machines, store state somewhere, and have small pieces of code that each deal with one particular case. So we really like this as a development pattern as well. And finally, the other nice thing is the controller-runtime library. I don't know if we have controller-runtime maintainers in the room. Yeah? Thank you. That's a really great library. We use it a lot; it's really awesome for building controllers, with a lot of batteries included. Internally, a controller has what we call a work queue. Remember, I said there's something that watches the desired state; those events go through an event handler, and the event handler pushes the object we want to reconcile into a queue internally. And then we have a bunch of workers that process the items in that queue in parallel. There as well, there are some very nice properties we can rely on. The items in the work queue are deduplicated. So in our case, maybe a user created a project and then immediately updated it, and then updated it another time. We don't have to process that thing three times in a row; we can rely on the work queue deduplicating the items for us. And similarly, we don't have to reason about concurrency in the processing of these objects, because the work queue guarantees that a given object reference is only processed by one worker at a time. So the code stays simple; we don't need to take care of parallelism and race conditions. It does a lot for us. It also makes sure all the objects we need to reconcile are always reconciled, because at startup it reprocesses the world automatically. So we know that everything converges to the desired state, basically.
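A toy example of those work queue guarantees, using the client-go work queue package that controller-runtime builds on (a sketch; newer client-go releases also offer typed variants of this queue):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The same kind of rate-limited work queue that controller-runtime
	// controllers use internally.
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer q.ShutDown()

	// A user creates a project, then updates it twice in quick succession:
	// three events for the same key collapse into a single queue entry.
	q.Add("project-123")
	q.Add("project-123")
	q.Add("project-123")
	fmt.Println(q.Len()) // prints 1

	// Between Get() and Done(), the queue never hands the same key to a
	// second worker, so one object is never reconciled concurrently.
	item, _ := q.Get()
	q.Done(item)
}
```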
So I said we use controllers a lot. But we have a bunch of services that we call the global controllers, which run in this global cluster here, and they are a bit special. They are not like the regular Kubernetes controllers you're used to. You see in this diagram, for example, that they push stuff to different Kubernetes clusters: they don't manipulate the actual state of the Kubernetes cluster they run in, they do that in other clusters. An example of a global controller for us is what we call the project controller. Remember, I said that when a user creates a project on Elastic Cloud, they talk to what we call the project API. That project API stores the project information somewhere, and then asynchronously we have a project controller that reads which projects should be created and makes sure each project is reconciled. For us, that means the project controller makes a scheduling decision about where those Elasticsearch pods are going to be located, on which Kubernetes cluster. And it creates some custom resources in that particular Kubernetes cluster so that other controllers can take care of orchestrating Elasticsearch there. Conceptually, it's very much like a normal Kubernetes controller, right? There's a desired state somewhere, there's the actual state somewhere else, and there's reconciliation logic in between. However, there was something bugging us a little bit: because that thing looks like a Kubernetes controller, we were very tempted to make it work with Kubernetes, meaning the desired state would be stored in the API server.

But that's not particularly great in this case, because remember, we want to treat Kubernetes clusters as disposable. So what if we lose that etcd state and we lose the API server? Where is the data? How do we know which projects exist on the platform? That's complicated. So we thought about it a little bit, and in the end we quickly decided that we don't want to store the data that is our source of truth in etcd; we want to store it somewhere else. There are a lot of nice properties that other data stores provide. If you look at the recent features in Google Spanner or Azure Cosmos DB, for example, they have very, very strong guarantees around multi-region replication: if an entire region dies, you immediately make use of the other region, and the replication is synchronous, so there are a lot of benefits there. Additionally, the data we are talking about here is the configuration of hundreds of thousands of projects. We don't really want to put that much data in etcd: maybe it would fit in RAM, maybe not, but it would be complicated. And it would be nice to also be able to do SQL queries on that data to serve the needs of other APIs, et cetera. So anyway, we decided not to store that data in etcd and to put it somewhere else. But then we quickly ran into problems, because we were used to developing controllers that work with the API server. It's one thing to put the data somewhere else, but we still want to have controllers. And the API server does provide quite a few things that we like. It provides the CRD mechanism, and we can deal with the versioning of those CRDs. It provides RBAC on those objects, so we don't need to care about that ourselves. It's easy to build controllers against it. So we tried to think about how we could keep some of those properties without using the API server directly. And we brainstormed about many solutions to this. Those brainstorms are the best thing: you sit in a room, there's a whiteboard, and you leave at the end of the day knowing you found the solution. The best thing after French wine, I guess. So we had several ideas. We thought maybe we should just propagate and duplicate those entities from our database into the Kubernetes API server where we run our controllers. But then we run into the trouble of how to duplicate that data. Is it asynchronous? Do we then have consistency issues, and things like that? And then we thought maybe we should still run the Kubernetes API server, but plug it into a different data store. That's what some projects achieve: if you look at Kine, for example, Kine is doing that; it basically allows you to plug Postgres in behind the API server. There's also the interesting KCP project that you may want to look at. But we were still not entirely sure that was the way to go, whether the API server was really what we needed here. In the end, we decided to do our own thing. We would do without the API server; we would do things differently. So we ended up with a pattern where most of those global controllers also have an API that comes along with them. That API sort of replaces the API server for one particular type of object. And we separate the database access, so different services don't access the same tables.
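As a purely illustrative sketch of that pattern, here is roughly what a watch endpoint on such a project API could look like, streaming change events as newline-delimited JSON. The names, fields, and wire format here are all hypothetical, not our actual API:

```go
package api

import (
	"encoding/json"
	"net/http"
)

// ProjectEvent is a hypothetical change event streamed by the project API's
// watch endpoint.
type ProjectEvent struct {
	Type                   string `json:"type"` // e.g. CREATED, UPDATED, DELETED
	ID                     string `json:"id"`
	Revision               int64  `json:"revision"`
	LastReconciledRevision int64  `json:"lastReconciledRevision"`
}

// watchProjects streams change events as newline-delimited JSON until the
// client goes away. The events channel would be fed from the data store
// (e.g. a change feed or a polling loop); that part is elided here.
func watchProjects(events <-chan ProjectEvent) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/x-ndjson")
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		enc := json.NewEncoder(w)
		for {
			select {
			case <-r.Context().Done():
				return // client disconnected
			case ev := <-events:
				if err := enc.Encode(ev); err != nil {
					return
				}
				flusher.Flush()
			}
		}
	}
}
```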
And with this, we actually replaced the things we were interested in from the API server with our own custom tooling. We developed a bunch of libraries to access the data store, to deal with schema migrations in the data store, backward compatibility, and things like that, plus some API utilities for those API services to provide a watch endpoint, streaming data from the data store. Access control between services we can actually do very easily with mutual TLS in Kubernetes; we're using Katsbin for authorization between services. So that was actually not too hard. And it brought us some flexibility in the API as well. The Kubernetes API model of custom resources is kind of nice, but whenever you need a slightly different type of API, maybe something slightly imperative rather than completely declarative, or maybe something where the payload for updating a resource should be different from the one for creating or getting a resource, then we have the flexibility we need as well. Yeah, so it was a little bit more work, but for us it was the right thing to do.

The last bit was making this work with the controller-runtime library, which is what we wanted to use for these controllers. Man, it turned out that was actually quite easy. Plugging our controllers into a different data source is just something you can natively do with controller-runtime. You create a regular Kubernetes controller, you create a Go channel, which is the source of the events that you want to feed the controller with, and you populate that Go channel from another source. In our case, we populate that Go channel with watches that come from our own custom API, which itself serves data from the watch endpoint of our data store. And then there's the regular reconcile function that takes a request as a parameter, and from that request we retrieve the ID of the object to reconcile. It basically works like a Kubernetes controller, but the state is not in the API server. Things worked fine; it was very easy to hack around and make it all work.
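Here is a minimal sketch of that wiring, using the struct-style channel source from older controller-runtime releases (newer versions expose a `source.Channel(...)` constructor and a slightly different `Watch` signature instead); the controller name and worker count are illustrative:

```go
package controllers

import (
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

// setupProjectController wires a regular controller-runtime controller to a
// non-Kubernetes event source: a Go channel that we feed from our own API's
// watch endpoint instead of from the Kubernetes API server.
func setupProjectController(mgr manager.Manager, r reconcile.Reconciler) (chan event.GenericEvent, error) {
	events := make(chan event.GenericEvent)

	c, err := controller.New("project-controller", mgr, controller.Options{
		Reconciler:              r,
		MaxConcurrentReconciles: 16, // illustrative worker count
	})
	if err != nil {
		return nil, err
	}

	// source.Channel adapts the Go channel into a controller-runtime Source;
	// EnqueueRequestForObject turns each event's object into a
	// reconcile.Request, exactly as it would for an API server watch.
	if err := c.Watch(&source.Channel{Source: events}, &handler.EnqueueRequestForObject{}); err != nil {
		return nil, err
	}
	return events, nil
}
```

Elsewhere, a goroutine consuming the custom API's watch stream simply sends `event.GenericEvent{Object: obj}` into that channel for each change it receives.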
But then we started doing scale tests. We wanted to scale the thing to 300,000 resources to reconcile, 300,000 projects in our case. Remember, one thing I said is that whenever the controller restarts, it reprocesses the entire world by default. And that's something we want, because maybe there's an event that happened right before the controller crashed or restarted, and we don't want to miss it, so we process everything at startup just in case. But we also want to make sure that everything has been reconciled at least once with the latest version of the controller's code, because maybe the reason we restarted the controller is that we changed its code. There's more logic now, so we need to run it again on everything. So how do you deal with 300,000 items to reconcile? Well, you can throw more workers at it and process items in parallel, by batches of 1,000, for example. That's one way, so we tried that. The controller looks like that at startup: there's this massive amount of stuff that needs to be reconciled, and the CPU of the machine is like a German car, right? It's going very high. One of the worst things for us was the database, because even though we have a database that can scale very well, we suddenly started throwing thousands of requests per second at it, and we started getting a lot of 429s back from the database.

But the worst part was the impact on the user. The user here is a real customer. What happens if that user creates a new project right in the middle of the controller restarting? When the controller restarts, it has 100,000 items in the work queue, so the new project the user just created goes to the end of the work queue. And if it takes 10 minutes to process the work queue, that new project is going to be created in 10 minutes, and that sucks. We don't want that. So again, we got together, took a whiteboard, and thought about it a little bit. What we ended up realizing is that we really have two different kinds of reconciliations that we want our controllers to do. There are what we call the high-priority reconciliations: that's when we know for sure there has actually been a change to a particular resource, maybe because it has been newly created or because there's been an update to it. These we want to process as soon as possible so our users are happy. And then there are the low-priority reconciliations. Those are for stuff we know we've successfully reconciled before; we just want to reconcile it again because maybe there's a new version of the controller code. Normally nothing should be broken, right? We still want to reconcile it again just in case, but that can happen slowly, over time. Figuring out how to divide things between low and high priority is actually easy. What we did was introduce a new field in our objects that we call revision, and another one that we call lastReconciledRevision. Whenever we are completely done with a reconciliation, we update lastReconciledRevision to the value the revision had at the time the resource was processed. And we know whether something is high priority by just comparing these two revisions. If the new revision, which has been incremented because of a resource update, is not the same as the last reconciled revision, we know for sure that this entity has pending changes, so we need to reconcile it as soon as possible. Naming is hard, right? But this is highly inspired by the concept of generation and observedGeneration in Kubernetes Deployments. And with controller-runtime, it's very easy to plug in our own event handler. Remember this watch stream of objects that comes from the database: we plug an event handler into it. What we had to change is that the events we receive now contain the entire object rather than just its ID. So we can inspect, from the object itself, the difference between the revisions, and there we can decide whether it's a high-priority thing, in which case we push it to the controller work queue, or whether it's a low-priority thing, in which case we push it somewhere else. In our case, we push it into what we call the low-priority work queue. What's the low-priority work queue? That's another little thing we tweaked in those controllers. The green box that you see, the controller work queue, is the regular Kubernetes work queue in the controller. What we added is another work queue, so we run two work queues now. It's just a standard work queue, except we consider it low priority. At startup, we immediately add pretty much all the events to that low-priority work queue, at least the ones that don't need to be immediately reconciled.
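A sketch of what that revision-comparing event handler can look like; the `Project` type and field names here are illustrative stand-ins, and note that only the object's ID ends up in the queues:

```go
package controllers

import (
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/workqueue"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// Project is a hypothetical projection of the event payload: the watch
// stream gives us the full object, so we can compare revisions before
// enqueueing, then discard it.
type Project struct {
	ID                     string
	Revision               int64
	LastReconciledRevision int64
}

// PriorityEnqueuer splits incoming events between the controller's regular
// work queue (high priority) and a second, low-priority queue.
type PriorityEnqueuer struct {
	highPriority workqueue.RateLimitingInterface // the controller's own queue
	lowPriority  workqueue.Interface             // drained slowly, e.g. at startup
}

func (e *PriorityEnqueuer) Handle(p Project) {
	req := reconcile.Request{NamespacedName: types.NamespacedName{Name: p.ID}}
	if p.Revision != p.LastReconciledRevision {
		// A change a user is waiting on: reconcile as soon as possible.
		e.highPriority.Add(req)
		return
	}
	// Already successfully reconciled at this revision: re-reconcile
	// eventually (e.g. after a controller upgrade), but without rushing.
	e.lowPriority.Add(req)
}
```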
But then we have the red box here, which consumes that extra work queue at a fixed rate. Maybe we have 100,000 events in the queue, but that red consumer consumes those events at a rate of, say, 10 per second, and adds them to the main work queue. And the logic here can be even a little bit smarter than that: before it adds them to the regular work queue, it can check the size of the regular work queue and only decide to add the low-priority items if the main work queue isn't already, I don't know, more than 50 items deep, for example. And because this yellow box is a regular controller work queue, we benefit from the same Prometheus metrics that we have for the regular work queue, so we can compare the depth of the queue for the low-priority stuff and the high-priority stuff. It's very convenient. The code itself is really not that complicated. I'm not going to go through it, but you can look at the slides on the website.
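For the curious, a rough sketch of what that consumer loop can look like; the rate and the queue-depth threshold are the illustrative values from above, not production settings:

```go
package controllers

import (
	"context"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// drainLowPriority is the "red box": it moves items from the low-priority
// queue into the controller's main work queue at a fixed rate, and only when
// the main queue isn't already busy with high-priority work.
func drainLowPriority(ctx context.Context, low workqueue.Interface, main workqueue.RateLimitingInterface) {
	ticker := time.NewTicker(100 * time.Millisecond) // ~10 items per second
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Skip this tick if there's nothing to promote, or if the main
			// queue still has plenty of high-priority items to chew through.
			if low.Len() == 0 || main.Len() > 50 {
				continue
			}
			item, shutdown := low.Get()
			if shutdown {
				return
			}
			// Re-add into the main queue; the regular workers pick it up
			// like any other reconcile request.
			main.Add(item)
			low.Done(item)
		}
	}
}
```

Because both queues are standard client-go work queues, the same deduplication guarantees and depth metrics apply to each of them.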
All right, I think we're already at the end, so I'm going to stop here on the controller stuff. There's a lot more I could talk about regarding how we rebuilt this cloud architecture on Kubernetes, like how we do continuous deployment, for example. We do that based on Argo CD plus some custom tooling on top; there was a presentation on Monday about it, from my team here on the front row, that you should check out on YouTube when it's available. Of course, we use the Elastic solution for observability. And we had to implement a lot of other stuff: we do custom autoscaling of the Elasticsearch pods, we provision our Kubernetes clusters with Crossplane. Well, there's a lot. So if you're interested in any of this, I'm happy to talk about it; I'll be in the hallway after the presentation. And I guess that's it. Thanks, everyone. I think I have about three minutes for questions, and then I'll be around if you want to ask more questions after we're done. For anybody who wants to ask questions, there are microphones at these two places I'm pointing at, yeah.

Hello, I'm George from Graph Analytics. I have a question regarding memory. You mentioned that you're adding the full objects to your work queues. So if you have work queues with hundreds of thousands of objects, this can quickly grow in terms of memory, especially if you have multiple pods. How do you manage that?

So we don't add the full object to the work queue. In the work queue, we only keep the ID of the object. We use the full object in the event handler before we push to the work queue. As we are streaming changes, we get the full object, but then we immediately get rid of it and we just put the ID in the work queue.

So there is no local cache on the controller?

Exactly, there's no local cache, yeah.

Yeah, thank you. Do you have any plan to contribute this low-priority queue back to the controller-runtime library?

Possibly, yeah, we need to think about it a little bit. It's actually not that easy to configure controller-runtime to swap the underlying work queue, so we hacked around a little bit to make this happen. So quite likely what's going to happen is we're going to open an issue on the controller-runtime repo to mention this. And if that makes sense, contribute the work queue, absolutely. It's not a lot of code.

Hi, yeah, so when your global controller launches a new Kubernetes cluster, I guess it may need to provision some machines or some cloud resources. Then it may also install some Kubernetes components, like cert-manager or an ingress controller. So what kind of tool do you use to automate these kinds of things?

So we keep that separate from the process of provisioning projects. We have an entire set of tooling to provision Kubernetes clusters. We use Crossplane to provision the clusters themselves. We have logic around thresholds to detect when we need to provision new clusters because the existing ones are full. Then we use Helm and Argo CD to deploy all the services automatically in those clusters. And each cluster is auto-scaled with the cluster autoscaler automatically.

I see. So Crossplane and Helm, this kind of thing. Yeah. Okay, cool, thank you.

Hi. When you were considering the issue about the global controller, did you consider the Kubernetes API aggregation layer?

Yes, although, remember, one of the important things for us here is that we don't want to use etcd as our data store.

Yeah, but with the API aggregation layer you could have taken all of the advantages of the Kubernetes API, but have your own API aggregator storing objects in a SQL database, not in etcd, and have controllers reconciling those objects, right?

Yeah, so we thought about this when we thought about using Kine, for example. I think we realized that the API server abstraction, those CRDs and how we interact with them, was actually not good enough for us. We wanted to do SQL queries on that data, for example; we wanted to have different kinds of transactions, and things like that. So the actual primitives of the Kubernetes API, these declarative objects, we realized were not the best fit for us.

But if your API aggregator was storing the state in a SQL database instead of etcd, you could have done all of that on the SQL database as well, right?

Yeah, but if you put the API server in front, you can do CRUD operations on these objects, basically, but you cannot do things that are more specific than that. Like the thing I mentioned where maybe your update endpoint is different from your create endpoint, and things like that. So for all those extra things, I think we benefited from not putting the API server in front.

Thanks. I think we're at the end; I can still take questions around here. One more, you said? Yeah? Okay.

It's quick, I was just wondering if you were going to share the slides.

They're already on the website. If you go to the Sched website where you build your schedule, they are shared there. Thank you very much.