My name is Connor Nolan, I'm a senior software engineer at Akamai, and I'm joined by my colleague Richárd, who's going to speak a little bit later. Hello everyone. We're here to talk about Provider Ceph, an open source Crossplane provider that we've been developing over the last year or so, which we use to manage S3 buckets in Ceph clusters from Kubernetes.

I'll go through the agenda, what we have in store today. I'm going to give a bit of background on how we ended up here: the challenge we faced, how we overcame it, some of the specific requirements, and how they led us to Crossplane. Then a little bit about Crossplane itself — I won't go into too much detail, but how it works for us and why it's such a good fit. Then I'll get into Provider Ceph: an overview, some of the features, how it works, that kind of thing. Richárd will then take over, give a demo, and get into a discussion on scalability and performance. We'll wrap up with a Q&A and head for the airport.

So, like I said, a little bit of background. Richárd and I both previously worked at a company called Ondat, a London-based Kubernetes startup. It was originally called StorageOS and later rebranded as Ondat — you might recognize the logos from previous KubeCons. What we did there was offer a Kubernetes-native block storage solution, and as part of delivering that we maintained a bunch of Kubernetes peripherals: a kubectl plugin, a bunch of Helm charts, obviously a CSI driver, and probably half a dozen Kubernetes operators, all of them Kubebuilder-based. So we were intimately familiar with that whole branch of the Kubernetes ecosystem — the operator pattern and all that — and that informed our decision-making going forward.

Then about a year ago — I think it was actually about a year ago this week, I'm not sure of the exact dates — we were acquired by Akamai, and that led us to a new challenge which combined our history and knowledge of Kubernetes and storage. The specifics of that challenge: we need to manage S3 buckets across multiple distributed Ceph clusters, and we need to do that from within a single Kubernetes cluster. So basically, Kubernetes acts as a control plane, with multiple distributed Ceph clusters external to it. We want each bucket represented as a custom resource, which becomes the designated source of truth for that bucket, and we then attempt to reconcile the desired state from it. The solution needed to fit into an event-driven architecture with an eventually consistent model — you can see where this is going; it ties in nicely with the idea of a custom resource representing the bucket and reconciling that desired state with the real world. We also need to handle these S3 operations asynchronously, because we're dealing with multiple backends and, for performance reasons, we can't do them serially. And we need a degree of visibility into what's happening on each Ceph cluster: where buckets are being created successfully, where there are failures, what's happening, why it's happening, when it's happening.
And then the final thing was: we're going to have a lot of traffic. We probably need to handle upwards of 100,000 buckets, which also means handling upwards of 100,000 CRs. Richárd is going to talk through scalability a little bit later.

So based on all of these specific requirements, you can see where I'm going with this: the obvious solution to jump to would be some kind of Kubernetes operator or controller. But before we jumped head first into writing one, we decided to do a bit of research, a bit of investigation, to see whether there was some solution out there that we hadn't used before that might make our lives a little easier as developers. And that's where we came across Crossplane.

So what is Crossplane? I'm not going to get into a big discussion on Crossplane because I don't have time, but this quote, which I found in an article on The New Stack about a week ago by Jared Watts, a co-creator of Crossplane, sums it up nicely: Crossplane is an extension to Kubernetes — it teaches Kubernetes all about external resources. And that's the key. It sums it up both from a general perspective and from our specific use case. From a general point of view, if you're a cloud provider, you would develop a Crossplane provider to make Kubernetes aware of your cloud resources. From our perspective, we wrote a Crossplane provider, Provider Ceph, to make Kubernetes aware of our Ceph clusters and their resources. If you really want to press me on what Crossplane is, the way I put it to anyone familiar with Kubebuilder and Kubernetes operators is: Kubebuilder is to a Kubernetes operator what Crossplane is to a Crossplane provider. Crossplane is the framework, the utility, for building or scaffolding out your Crossplane provider, and you insert your custom logic from there.

Then obviously the next question is: what's a provider? A provider is basically just another Kubernetes controller. If you peek under the hood and drill down far enough into the code, you will come across a conventional operator reconcile loop. But out of the box you're given a really clean abstraction on top of that reconcile loop, and that allows you to easily develop controllers for managing external resources. There's also a bunch of other utilities and functionality you get with Crossplane for this kind of use case; I won't go into too much detail on that — that's more of a deep dive into Crossplane itself.

So why would we use Crossplane for this use case? Well, basically for all the reasons I just gave. Crossplane is built on top of the whole operator pattern — it's an extension of it — so even though we weren't familiar with Crossplane, we were intimately familiar with operators and the operator pattern. The learning curve wasn't a steep one; it wasn't a difficult jump for us to make, and we were able to upskill pretty quickly. The second thing: if you're developing a Kubernetes operator that reconciles desired state in a custom resource with some external system or external API, you find that it can get really complicated really, really quickly — you can picture that single-loop shape roughly like the sketch below.
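As a hedged illustration of that point (not any real operator's code), this is roughly the shape a plain Kubebuilder scaffold leaves you with: one Reconcile function that has to handle creation, updates, deletion via finalizers, and every external S3 call, all idempotently.

```go
// Hedged sketch of the "everything in one Reconcile" shape you get from a
// plain Kubebuilder scaffold; illustrative only, not Provider Ceph's code.
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type BucketReconciler struct {
	client.Client
	// S3 clients for every Ceph backend would also live here.
}

func (r *BucketReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch the custom resource (it may already be gone).
	// 2. If it's being deleted, clean up buckets on every backend, then remove the finalizer.
	// 3. Otherwise, ensure the finalizer is present.
	// 4. Check which backends already have the bucket.
	// 5. Create it where it's missing, update it where it has drifted.
	// 6. Update status and conditions, and decide whether to requeue.
	// All of the above has to be idempotent and, unless you impose your own
	// structure, it all lives in this one function.
	return ctrl.Result{}, nil
}
```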
And from my experience anyway, I think that's because when you generate a Kubebuilder operator you're given a single reconcile function — a single reconcile loop — where you need to insert all of your custom logic. It needs to handle all of the CRUD operations on a custom resource and, off the back of that, whatever external API calls you need to make, and all of it needs to be done idempotently. So it can get really complicated really quickly. Obviously, you can structure your code so that you don't have to do it that way, but what Crossplane does is give you, like I said, that layer of abstraction — a logical separation of functionality — for you to then insert your business logic on top of. It gives you a cleaner starting point, reduces code complexity, and reduces software engineering overhead.

Then the last thing, down here: transparency of managed resources through conditions. Conditions are one of these extra utilities, pieces of functionality, that Crossplane gives you to help you write a controller for this type of use case, and we lean on them heavily in Provider Ceph. They give you visibility into your resources, and they're both machine-readable and human-readable. So like I said, we lean on them really heavily, and that's just one utility that we use.

So how do we do that? This would be really familiar output to anyone who has developed on Kubernetes, but specifically anybody who knows Crossplane: it's just the kubectl get output of a bucket resource, test-bucket. You can see the age — it's 10 seconds old — and then Ready and Synced, which are the conditions you get from Crossplane. What we found is that these conditions match really conveniently with the semantics we had in mind for our bucket resource, so I'll explain what I mean by that.

This is what it looks like in real life — a visual representation of that bucket. You can see there's a bucket; it's called an MR here, a managed resource, which is basically Crossplane-speak for a custom resource. The two are kind of interchangeable — someone from Crossplane will probably correct me on that — but for now: the bucket managed resource was created by a user, and Provider Ceph then attempts to reconcile it. It does so by creating a single S3 bucket on each of our Ceph clusters. You can see from the diagram that it was successful on one of those and failed on two. For us, what that means in terms of conditions is that the bucket managed resource is considered Ready once it's created on any backend. You might say, well, why is that, when it's obviously failed on two? The reason is that as a user you only need one instance of that bucket; you don't need it with ten other replicas. So as far as Provider Ceph is concerned, the bucket is ready for the user to use — but it's not Synced. What that means is that Provider Ceph can continue to reconcile in the background and attempt to overcome whatever transient errors are stopping the bucket from being created on the remaining Ceph clusters. Here's the same bucket again: you can see the age — now it's 40 seconds old — and the two conditions; now it's both Ready and Synced. And you can see what that looks like in real life: whatever errors Provider Ceph was encountering have been overcome, and the bucket is now created on all three Ceph clusters.
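To make that separation of concerns concrete, here's a minimal sketch in the shape of crossplane-runtime's external client pattern — simplified and hedged, not Provider Ceph's actual code, and exact method signatures differ between crossplane-runtime versions. The point is that observing, creating, updating, and deleting the external resource each get their own hook, and the observation result is what lets a bucket be Ready (it exists somewhere) without yet being Synced (it doesn't yet exist everywhere we want it).

```go
// Minimal sketch of the ExternalClient pattern from crossplane-runtime.
// Not the actual Provider Ceph code; signatures vary between runtime versions.
package bucket

import (
	"context"

	xpv1 "github.com/crossplane/crossplane-runtime/apis/common/v1"
	"github.com/crossplane/crossplane-runtime/pkg/reconciler/managed"
	"github.com/crossplane/crossplane-runtime/pkg/resource"
)

type external struct {
	// S3 clients for each Ceph backend would live here.
}

// Observe reports the state of the external resource. The managed reconciler
// calls Create when ResourceExists is false and Update when ResourceUpToDate
// is false, so the provider keeps working in the background until the bucket
// converges on every backend.
func (e *external) Observe(ctx context.Context, mg resource.Managed) (managed.ExternalObservation, error) {
	onAny, onAll := e.checkBackends(ctx, mg) // hypothetical helper querying each Ceph cluster

	if onAny {
		// Ready as soon as the bucket exists on at least one backend.
		mg.SetConditions(xpv1.Available())
	}

	return managed.ExternalObservation{
		ResourceExists:   onAny,
		ResourceUpToDate: onAll,
	}, nil
}

func (e *external) Create(ctx context.Context, mg resource.Managed) (managed.ExternalCreation, error) {
	// Fire off bucket creation against every configured backend, asynchronously.
	return managed.ExternalCreation{}, nil
}

func (e *external) Update(ctx context.Context, mg resource.Managed) (managed.ExternalUpdate, error) {
	// Retry the backends that are still missing the bucket.
	return managed.ExternalUpdate{}, nil
}

func (e *external) Delete(ctx context.Context, mg resource.Managed) error {
	// Remove the bucket from every backend.
	return nil
}

func (e *external) checkBackends(ctx context.Context, mg resource.Managed) (onAny, onAll bool) {
	// Placeholder: in reality this queries each Ceph cluster for the bucket.
	return true, false
}
```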
And what that means is that it's Synced, because it's been created on all backends. I should mention this slide is slightly out of date — we added a feature recently, minimum replicas, so you no longer need the bucket created on all backends for it to be Synced; it just needs to reach the minimum replicas quota.

I also mentioned at the start that we needed visibility into our backends because of the asynchronous operations. What we've looked at so far are the conditions of the overall bucket managed resource, which represents the bucket itself, but we also need visibility into each specific replica on each Ceph cluster. So we expanded on the idea of conditions by extending the bucket status with what we call individualized conditions. You can see at the top there — hopefully you can read it — under the status there's a list of backends, and those backends are the Ceph clusters that the bucket was supposed to be created on. For Ceph cluster A, the condition is Ready — it was created successfully; for Ceph cluster B, the same; and for Ceph cluster C, some kind of error has been encountered. Through those conditions we can see the last transition time, and the message gives us the S3 error we got back from the Ceph cluster. So that's the more granular level. Down below you can see the actual conditions of the bucket resource itself — those are what we looked at previously — and obviously it's Ready, but it's not Synced, because it's failed on one backend. This is useful for platform engineers to come along and see, OK, the bucket failed here, when it failed, why it failed, et cetera.

We also use conditions to monitor the health of our backends. Again, if you're familiar with Crossplane, you'll be familiar with the idea of a ProviderConfig. In Provider Ceph's world, each Ceph cluster is represented by a ProviderConfig object, and within Provider Ceph itself we have an additional controller — a reconcile loop — which does periodic health checks on each Ceph cluster and then updates the relevant ProviderConfig status with a health check condition. That's an example there of a failure. It's got the same machine-readable format as any condition, even though it's a custom one rather than something you get out of the box from Crossplane — it's our own health check condition. Again, you can see the last transition time and the reason it failed. So this is useful, A, for a platform engineer to come along and see why a cluster is unhealthy, and B, for Provider Ceph itself, because now it knows not to schedule any more buckets to this cluster while it's unhealthy.

So that's it from me — I'll hand over to Richárd.

I would like to say a few words about the benefits of this architecture. Kubernetes gives us almost everything we need out of the box: we just deploy the application and enjoy the result. Crossplane gives us the ability to manage external resources outside the Kubernetes cluster, which is exactly what we need. And the CNCF landscape helps us solve the other difficulties we have around operations and logging, and that landscape is always growing, so we have options when selecting our tools. But one of my main favourite points is that we have full control over what to reconcile and when. We are able to reconcile individual buckets and actively monitor them with Crossplane.
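As a rough illustration of those individualized conditions — with field names that are illustrative rather than Provider Ceph's actual API — the bucket status can be thought of as the standard Crossplane conditions plus a per-backend map of the same condition shape, so tooling can parse both layers the same way.

```go
// Illustrative sketch (not the actual Provider Ceph types) of how a bucket's
// status could carry both the overall Crossplane conditions and a per-backend
// set of "individualized" conditions.
package v1alpha1

import (
	xpv1 "github.com/crossplane/crossplane-runtime/apis/common/v1"
)

// BackendStatus holds the last known state of the bucket on one Ceph cluster.
type BackendStatus struct {
	// BucketCondition mirrors the standard condition shape: type, status,
	// reason, message and last transition time.
	BucketCondition xpv1.Condition `json:"bucketCondition,omitempty"`
}

// BucketStatus combines Crossplane's ResourceStatus (the Ready/Synced
// conditions shown earlier) with a map keyed by backend (ProviderConfig) name,
// e.g. "ceph-cluster-a" -> Ready, "ceph-cluster-c" -> the S3 error message.
type BucketStatus struct {
	xpv1.ResourceStatus `json:",inline"`

	AtProvider struct {
		Backends map[string]BackendStatus `json:"backends,omitempty"`
	} `json:"atProvider,omitempty"`
}
```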
We are also able to reconcile sets of buckets based on built-in or custom labels, and we are able to reconcile buckets based on the health check events we generate during the periodic health checks. And you can build your own business logic on top of cloud-native events: if you want to do anything with buckets, you just watch the buckets on the Kubernetes API server and react to any kind of event. But it's easy to talk about it, so let's do the demo.

First of all, we have three identical providers — three backends. For simplicity I'm using LocalStack for this demo, but as you can see we have three of them, the health check is active — it's on — and the interval is two seconds. So I apply this on the cluster, and all the providers are created. And we have a bucket here; it's a simple bucket with just some validation, and the autopause flag is true — I'll speak about this a bit later. Then I apply this bucket.

So let's see what happens on the cluster. I have a command that periodically watches the labels of this bucket — the labels are, let's say, the desired state: which backends have been targeted for this bucket. And I have another command — it's a very small screen — which fetches the status of this bucket; the status represents the last known state of the bucket. Back to the labels: you can see it's autopaused, and it's there for all three backends. Another nice command lets us monitor the bucket on the actual backends. This is the LocalStack B backend, and you can see the bucket exists — it's available. I do the same for the LocalStack C backend, and you see the bucket exists there too.

The next step I would like to show you — and don't do this in production — is destroying one of the providers, the LocalStack C one, by setting its replicas to zero. So I destroy the backend, and you can see in this box that there is no response from it. And it's time to talk about the autopause feature: once the bucket is available on all of the backends, or reaches the minimum replicas we set, the provider pauses reconciliation of this bucket, because we can't keep every bucket we would ever reconcile in memory. In this case we just put the bucket's reconciliation to sleep, and on any health check event we can wake the bucket up and do what we need again. So if I disable this autopause feature on the bucket, you can see that nothing happens, because the backend is not available and the provider has marked it as failed. But in the status field you can see that only A and B are there — the bucket is available on the A and B backends, and C is missing. Once I set the replicas back to one and the backend comes online again, after a few seconds we can see that the backend is online, but it does not contain the bucket yet — we have an error here. Hopefully after a while — it takes a few seconds... what is happening... sorry, there it is: we have the bucket on the missing backend. This is what should happen. In this case I disabled and enabled autopause manually, but this should happen automatically — I don't need to change the bucket itself, because the health check reconciler wakes up every bucket affected in this scenario. So the bucket is available again after I scaled the backend back up, and the bucket is on this backend.
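For anyone who wants to script what was just done by hand in the demo, here is a hedged sketch of flipping a per-backend label on a Bucket with a controller-runtime client; the GVK and the label key are assumptions for illustration, not necessarily the exact ones Provider Ceph uses. Setting the label back to "true" re-enables synchronization, as shown above.

```go
// Hypothetical sketch: disable sync to one backend by flipping a label on the
// Bucket managed resource. Label key and GVK are illustrative assumptions.
package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{})
	if err != nil {
		panic(err)
	}

	// Fetch the Bucket as unstructured so the sketch doesn't depend on the
	// provider's Go types being vendored.
	bucket := &unstructured.Unstructured{}
	bucket.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "provider-ceph.crossplane.io", Version: "v1alpha1", Kind: "Bucket", // assumed GVK
	})
	if err := c.Get(context.TODO(), client.ObjectKey{Name: "test-bucket"}, bucket); err != nil {
		panic(err)
	}

	// Disable sync to one backend by setting its label to "false".
	labels := bucket.GetLabels()
	if labels == nil {
		labels = map[string]string{}
	}
	labels["provider-ceph.backends.localstack-c"] = "false" // illustrative label key
	bucket.SetLabels(labels)

	if err := c.Update(context.TODO(), bucket); err != nil {
		panic(err)
	}
	fmt.Println("backend sync disabled for localstack-c")
}
```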
The next scenario is very interesting: I scale down the LocalStack C backend again, but this time I disable synchronization of this bucket via the label. So I set the localstack-c label to false directly, and you can see it has changed here. That means this bucket has to be available on these two backends, but it's not mandatory on the third one. So if I start the backend again, it won't create the bucket on this backend, because I've temporarily disabled it. We could wait a few seconds, but you can take my word for it. At any time I can enable synchronization on this backend again by setting the label back to true — as you see, I set the label to true, and it appears on the backend again. And that's the demo I wanted to show you.

It's time to talk about performance. But before the performance, a small summary: we absolutely love Kubernetes and also Crossplane, which is very easy to develop with, but a bit harder to scale. Where we are in the journey: we have feature-frozen the first version of the provider — you should know that S3 objects themselves are out of scope for the first release — and we are open to any kind of contribution. So if you would like to join an innovative open source project, or work on cutting-edge technology, feel free to reach out to us on GitHub.

The next topic is performance and scalability. Before I jump into it, please raise your hands if you believe you have more than 5,000 records in your etcd cluster. Around the 10,000s, things start to get tricky at that scale. The first slide is for those who are rushing to the plane — it's just a summary. I would like to talk about two dimensions of scaling. One dimension is vertical, where we increase the number of custom resource definitions, so we have lots of kinds of objects in the API server. The other dimension is horizontal scaling, where we scale the number of custom resources themselves, so we have lots of resources of one type.

When we increase the number of custom resource definitions, API discovery can be a big issue — it can be a system-killer issue. If you have misconfigured clients, they can take down the API server entirely. Also, because of API discovery and other background jobs, you can experience very poor API server responsiveness, and you have to expect very high peaks of CPU and memory load when the API server starts doing background work across that many object kinds. On the other side of the story, the horizontal one, unfiltered list operations can be an issue: they can kill the API server if you start a few of them fetching every custom resource. Storage is the most critical part of everything, and you also have to reckon with high memory and network utilization, because lots of data is moving around, and some of it moves redundantly, multiple times — so it's tricky to predict.

Just a few words about vertical scaling, because it's not an issue for us — we have only one resource. The problems tend to start around 500 custom resource definitions, which sounds like a lot, but keep in mind that some Crossplane providers ship more than 800 custom resource definitions, so a single provider is able to wreck your cluster's performance. Luckily, Crossplane has a new feature to deploy only a subset of a provider's custom resource definitions, so that's already solved. You also have to keep in mind that some clients are not designed to work with more than 50 or 100 custom resource definitions.
The kubectl command is one of them. My next piece of advice is probably not relevant anymore, but always use Kubernetes 1.26 or newer, because the Crossplane folks did awesome work to optimize Kubernetes and a lot of the components around custom resource definitions and API discovery.

The next topic is the horizontal side of the story. The first layer I would like to mention is the storage side, which is really not part of this presentation — it's a very deep topic on its own — but etcd performance is critical, and etcd's performance depends on the underlying storage, so storage speed, especially for random access, is critical. That's an important thing. What you can do is hire experts who know what they're doing, because depending on your storage type it's pretty easy to end up in the situation in the picture: you can see etcd sleeping, all threads waiting on I/O. Once etcd is dead, the next component to go is the API server; after a while it dies too, and when the API server is down, the Kubernetes controller manager goes down, so the entire cluster just falls over.

The other side I would like to mention is the controller side — the provider itself, the operator we develop. The main bottleneck here is memory, and the reason is that each controller instance has a local cache which holds all of the resources it is reconciling, so the last known state of everything is in memory in every instance. I don't suggest turning this off, because otherwise you'll have other troubles, but you have to find the best cache synchronization period for your system, and you have to filter the watches — filter the resources by label — because you can't watch every bucket when you have 100,000 of them. You have to account for longer startups, because with that many resources it takes minutes for the controller to fill up its cache. You also have to configure the timeouts, because it is very easy to find yourself in a situation where every client always times out because you have too many resources. Always take care of your rate limits, bursts, and timeouts, and actively monitor them, because misconfigured clients can otherwise kill your API server's performance. In our experience, it is much cheaper to retry a failed action inside the controller than to drop everything, requeue the object, and rebuild everything again — if you have a failure, it's better to retry it while you have everything in memory and in place. One of the trickiest parts is that you may have to forgo leader-elected operators, which can be a bit tricky. And always tune your reconciliation concurrency — how many items you reconcile at the same time — and use sleep/wake-up resource reconciliation; as I mentioned, the autopause feature does this, so we can put some buckets to sleep.

I would also like to talk about performance from the Kubernetes API server side. My first piece of advice is to use the latest kernel, because Google engineers changed the kernel's TCP stack and found a big performance improvement there. Design your network to be able to handle as many active connections as possible, because there are watchers everywhere. On the CPU and memory side, infinite is best, of course. On the server side, you also have to configure rate limits, bursts, and timeouts, because this is the first line of protection for your API server.
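To give a feel for where the client-side knobs live in code, here is a generic controller-runtime sketch — not Provider Ceph's actual configuration, and the option names have moved around between controller-runtime versions — covering client QPS/burst and timeouts, a label-filtered cache with an explicit resync period, and bounded reconcile concurrency.

```go
// Generic illustration of the tuning knobs discussed above (not Provider
// Ceph's setup); option names are from a recent controller-runtime release
// and differ across versions.
package main

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func main() {
	cfg := ctrl.GetConfigOrDie()

	// Client-side rate limits and timeouts: set them deliberately and monitor
	// them, rather than relying on defaults.
	cfg.QPS = 50
	cfg.Burst = 100
	cfg.Timeout = 60 * time.Second

	// Full cache resync period: a trade-off between memory/API churn and how
	// quickly drift is detected.
	syncPeriod := 10 * time.Hour

	mgr, err := ctrl.NewManager(cfg, ctrl.Options{
		Cache: cache.Options{
			SyncPeriod: &syncPeriod,
			// Only watch/cache the resources this instance is responsible
			// for, instead of every object. Label key is illustrative.
			DefaultLabelSelector: labels.SelectorFromSet(labels.Set{"example.org/shard": "a"}),
		},
	})
	if err != nil {
		panic(err)
	}

	noop := reconcile.Func(func(ctx context.Context, _ reconcile.Request) (reconcile.Result, error) {
		// Real logic would retry failures in place rather than requeue everything.
		return reconcile.Result{}, nil
	})

	// Bound how many objects are reconciled concurrently. ConfigMap is just a
	// stand-in for the Bucket managed resource type here.
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 8}).
		Complete(noop); err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```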
The pain point is that list chunking for custom resources is missing, so we cannot paginate the custom resources when we fetch them from the Kubernetes API server. And you can only scale the Kubernetes API server vertically for this, because of the leader-election architecture — there is only one leader at a time. It's best if you can configure a proxy service in front of your API server to avoid unnecessary and repetitive load on the server, and you can delegate load with the aggregation API, which also helps decrease the load on the API server. As for configuration options, there are only a few: disabling the watch cache, the event TTL, and the maximum mutating requests in flight are the most important, so there isn't too much configuration to set up.

So I think everything is prepared, we know everything, so it's time to start the cluster. So: start your engines. It was a short drive: after around 20,000 buckets, we experienced out-of-memory errors in the Kubernetes API server. After the API server was restarted, we saw a higher memory baseline compared to the initial one, but the tendency was the same — it fills all of the available memory, and after a few restarts the kernel just terminates the process and we don't have an API server anymore, so we lose the cluster itself. After a few hours of investigation, I found that the watch cache was the thing trying to keep everything in memory, so I disabled the watch cache, and hopefully you can see that the memory consumption went back to normal and we no longer have any memory issues with this cache.

But the question is: what is that cache good for? The main reason for this cache is to speed up list-type operations when we fetch lots of data from the Kubernetes API server. Unfortunately, that isn't the case at this scale, because when you have 50,000 buckets, building the cache takes almost as long as fetching the data and sending it over the network. Maybe you can see the numbers: it was 53 seconds without the cache and one minute 41 seconds with the cache. So the cache is there to speed things up, but not at this scale.

So, just a summary — the benefits are not included here, only the limitations. The Kubernetes API server is a single point of failure, so you have to deal with those limitations; in our case this is an internal system, so we can live with them, but these limitations drive the design choices you make in the system. The API server does not scale well, and it is pretty easy to kill with unfiltered list operations, so you have to take care of that part. Server-side and client-side rate limits and bursts are the first line of protection for your server. Leaderless operators are tricky and not so common — you just can't find much information about them. Object reconciliation pause logic is mandatory, and you have to reckon with the fact that some clients may not tolerate huge data sets or long timeouts.

But the question is: can we go further? Can we do 100,000 buckets, or more? And the answer is yes, we should be able to. Our future plan for this project is seamless integration with distributed control planes like kcp or Karmada, where we can distribute the load across as many regions and data centers as we want. What we would like to achieve is a single entry point, infinite buckets, and everything collaborating on top. Thank you so much, and I think it's time for questions now.