Good morning, good afternoon, good evening, wherever you're hailing from. Welcome to another episode of Red Hat Advanced Cluster Management Presents. I am joined by the team behind the product we call RHACM. It's one of my favorite products here at Red Hat because it does such an amazing job at multi-cluster management. So I'm going to hand it over to Scott Barron to tell us what we're talking about today.

Awesome. Thanks, Chris. Always a pleasure to be on your show. I think this is our eighth episode. Your show name is always the longest title, I've noticed, of all the topics that come through, so we're at least winning in that category. I'm a product manager and I love to solve problems in the multi-cluster management space. That's why we're here: to talk about what we're doing with RHACM and management at the edge. Very exciting topic. I'm also going to introduce my colleague, Brad Weidenbender. He's new to the team, but his focus is in this telco edge scale space. Brad, go ahead and introduce yourself.

Hello, everybody. Nice to be here, and thank you for your time. Like Scott said, I'm focusing on the delivery of multi-cluster networking at the edge, and definitely on performance and scale initiatives. Glad to be here.

Nice. He just got off the boat; I mean, literally, he was fishing during his lunch break. He's down there in South Florida, just soaking in the sunshine. And then, in terms of the actual technical horsepower and the real brains behind this, I'm going to turn it over to Hao, and Hao Liu can introduce himself, and we'll pass it around the team.

My name is Hao. I've been on the ACM team for a long time; I was there since the initial POC of the product. I've done a lot of random things on the team: I helped design the cluster lifecycle bits with the integration with Hive, and I also worked on the CI/CD team. Now I'm focused on getting ACM to scale and helping expand into the edge arena.

Now, you have a team of, I don't know, stalwarts here. They're all stellar. Who's going to go next? Who are you passing off to? Crystal, go for it.

Hi, everyone. My name is Crystal and I'm a developer on Hao's team. It has been fantastic so far getting to be part of the Far Edge work, and also working under Hao and learning from him. I'm very excited to be here. Can I pass it off to Han?

Hi, everyone. I'm Han and I'm also on Hao's team. I've been in ACM for about a year, and my job is basically writing Go code for some controllers. Before, I was on the cluster lifecycle team under Hao's lead, and now I'm on the Far Edge team under Hao's lead again. Thank you.

And Chris. Hi, everybody. My name is Chris Doan and I've been with ACM for quite a long time, but I'm actually from the SRE squad. Somehow Hao was able to wrangle me onto this Far Edge effort, and I try to contribute wherever I can. Glad to be here. I'll pass it on to Alex.

Hi, I'm Alex Cross. I'm the one member of the team that's actually on a different team: I'm on the Telco 5G performance and scale team, based in Raleigh. I'm actually on my second tour of duty with Red Hat, which makes me a boomerang employee. I did take a break in 2019 for a year as a cluster administrator, but found out that I really enjoy performance engineering way more, so I boomeranged back.

It's been great to see Alex's contributions, and everybody we brought together here, Chris, is dedicated to the mission of management at the edge.
And I guess the best way to frame this, from a multi-cluster management problem-statement perspective, is that we know clusters, right? You've been hanging out with clusters for six, ten years. Not a long time, right? And before that it was virtual machines. So here we are, and we understand that the notion of a cluster being this gigantic thing with hundreds of nodes, just a large-footprint multi-tenant cluster, still exists. We do see that, but less and less of it, and we start to see smaller clusters. We have new topologies coming out, like compact clusters, where there's a shared three-master, three-worker kind of scenario. And as that gets smaller, you see something like single-node OpenShift; we don't even really want to call that a cluster, maybe. That's a whole different debate over a pitcher of beer. But we're in this space where we need a smaller footprint, and you need to be able to manage it, so there has to be enough tooling, enough componentry in place to manage that thing out on the edge. That's what we call single-node OpenShift, which is being introduced as part of the 4.8 release that's coming out.

Our team has been working with that day and night. I see Chris shaking his head, because I think he had one of the first feature-complete builds deployed back in December. Maybe that's a little too soon, but the fact is we've been beating our heads against the same problem statement: how do we deliver single-node OpenShift out to the edge at large scale, and do it in a performant way? One of the big tasks we had to solve is how to do that in a certain amount of time with a certain number of deployments. Let's just say, arbitrarily, you have to finish in ten hours and you have to be able to deploy 1,000 of them. Ready, set, go. How would you solve that? So what we're here to talk about today is some of the growing pains, some of the learning, some of the stories we've gone through, and why we have the gray hair we do, to get to the point we're at, which is incredibly awesome. We can deploy 1,000 clusters, is that right, Hao? We've successfully deployed 1,000 bare metal clusters in under three hours, with configurations in place. I don't want to spike the football too soon, but those are some really strong results this team has been able to drive. Hao, it's your turn; tell me what we're doing in that space.

First of all, a little bit of background. Scott came to me with this at the end of last year, and I was like, you want what now? A little piece of history: at that point we had only tested ACM, or rather were only able to test ACM, up to 50 clusters, and that was us, like, stealing, begging, and borrowing clusters to be managed by ACM. With the resources we had, we were only able to test ACM managing 50 clusters, and only for a short period of time. So, understanding that 1,000 is more than an order of magnitude higher than 50, I was like, okay, that sounds fun, let's do it. There are a couple of early lessons we learned that I think are just fascinating, and they're generally applicable to any web app you build. The first thing I tried after Scott approached me was, okay, I'm going to go stand up an OpenShift cluster, and I'm going to create, well...
Let's see how many namespaces I can create. Because in ACM, every single cluster has its own namespace, which serves as an RBAC boundary to contain the resources that the managed cluster can access. So, let's see how OpenShift responds to 1,000 namespaces, or 2,000 namespaces. We started with a simple script looping through creating namespaces, and I found, like, what the heck: after about 2,000 namespaces, the control plane crashes. At that point I started to panic a little. Oh crap, am I signing myself up for something that's not doable? So we learned our first lesson, and let me show you a little bit of a graphic about this.

It's definitely not the news you want to hear on a Friday as you're heading off to your weekend: oh, we completely trashed etcd, and the API can't handle what we're pushing on it.

So I took the weekend, did some reading, and ran across this document. This is the comparison of the different storage types on AWS; at that time we were mostly testing on AWS because that's just the most highly available resource we have. gp2 is the default storage that we use, and one thing I found out is that it's got a burst budget, meaning that when you first provision it, it performs fantastically: high IOPS, comparable to io1. But as we exhaust the burst budget, the IOPS tank. By default I think we use a 300 gig volume, and that's only about 900 IOPS after we exhaust the burst budget. So that's what we saw. The first lesson here: when you build a cloud-native application that deploys on a cloud provider, there's a hidden wall. You wonder why your wonderfully built application doesn't scale; storage, I guess, should be the first thing you take a look at. Once we replaced our storage with io1 with reasonably high IOPS, like 3,000, I ended up being able to create, you know, tens of thousands of namespaces without a problem.

But at this point I started to realize that there are a lot of hurdles we can't foresee without actually testing this, literally, with 1,000 clusters. And I had no idea where I could get that resource. I was about to go ask Scott, hey, can you hand me a blank checkbook, please.

You basically did that. You did that. And so, magically, you made it happen. You just crafted it up in your garage or something.

The checkbook actually didn't help that much, because AWS throttles your API, so you can't actually create 1,000 clusters with a snap of your fingers; certain APIs have real limits, like on the instances you create. Those limits are real. So, unfortunately, a big checkbook didn't help. I grabbed my buddy Chris Doan here, and we started to throw our heads together, and at this time, Red Hat being the wonderful company that it is, for some reason Alex Cross just popped out of thin air: hi, I'm a seasoned performance engineer, I'm here to help. Wonderful.
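For readers who want to poke at that first wall themselves, here's a minimal sketch of that kind of namespace-churn test in Go, assuming client-go and a kubeconfig in the default location. The name prefix and loop ceiling are arbitrary stand-ins; the team's original test was just a simple script.

```go
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load credentials from ~/.kube/config.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Keep creating namespaces until the control plane pushes back,
	// to find out where the wall is (5000 is an arbitrary ceiling).
	for i := 0; i < 5000; i++ {
		ns := &corev1.Namespace{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("scale-test-%d", i)},
		}
		if _, err := client.CoreV1().Namespaces().Create(context.TODO(), ns, metav1.CreateOptions{}); err != nil {
			log.Fatalf("hit the wall at namespace %d: %v", i, err)
		}
		if i%500 == 0 {
			log.Printf("created %d namespaces", i)
		}
	}
}
```

With the default gp2 volume behind etcd, a loop like this exhausts the burst budget; on io1, the same loop keeps going.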
So, I'd like to pass it to Alex to talk a little bit about how we ended up approaching this problem.

Sure. I joined to help here with ACM probably around November. Having been seasoned with the scale lab and the hardware we have available at Red Hat for scale testing, I kind of knew what our capabilities were. The first test that was asked of me was: let's just see how many OpenShift clusters we can get off of a certain chunk of hardware. The first iteration of that actually involved OpenStack. We requested some hardware and got around 32 nodes, deployed OpenStack on top of that, deployed a hub cluster, and tried to deploy as many spoke clusters as I could after sizing everything down as small as I possibly could. We actually only got about one hub cluster plus 55 spoke clusters, so we really got not much further for testing than what Hao had done in the past with the begging and borrowing of everybody's clusters he could find out of AWS.

It was a fun exercise as we were cobbling together clusters from every different line of business we could find: oh, you've got three, you've got two, here's five over here. Anyway, that's a story for another day. But Alex, you were on the right path; you were the shining light that figured out how to actually get these resources in place.

About that time I was playing with ACM as well and seeing that when you manage clusters, you're creating a namespace for each one. Obviously, having heard the goal of 1,000 clusters managed by ACM, my first thought was, well, shoot, that's 1,000 namespaces right there, and previous performance testing of OpenShift had stressed namespaces in a certain dimension. But OpenShift and Kubernetes scalability is multi-dimensional: you could create a ton of namespaces, and if they don't have a ton of other resources in them, that might work fine. So you really have to test with a real environment, one that's closer to what a customer would actually deploy.

I reached out and worked with Hao on testing this cluster, and actually showed him one way to improve etcd: what we do in the scale lab is pass an NVMe device through into the hub cluster, or into whatever cluster we have under test, and we put etcd on the NVMe, so we give it the best possible disk performance. At that time I also shared with Hao how to use Grafana to look at some of the metrics that etcd exposes, just so we could see how it was performing.

So, remind me, that got us from like 50 to 300? What range? Were we within striking distance of the 1,000 target, at least?

Yeah, the next big jump was that we changed the test bed a little. We asked for more nodes once we got to the 55 clusters, and throughout that we also had to manage through various infrastructure issues, just this cluster is not working, or this build is not working. But anyway, we improved, we added more hardware, and then we got to the point where we could decrease the size of the OpenShift clusters themselves: rather than 55 full clusters with three masters and two worker nodes each, we shrunk them down into SNO clusters at that point.
However, because we were still using OpenStack, we ran into other scaling issues, issues that had nothing to do with how SNO, single-node OpenShift, was supposed to be represented as a far edge cluster. One of those was that OpenStack would still create a bootstrap node, so we had to plan the capacity of our cloud around that. But after we worked through all that, with about 64 nodes, carving off a few pieces of hardware that had failed, we actually got up to about 320 clusters. And that's when we started hitting the scaling limits of having that infrastructure-as-a-service layer, OpenStack, in the middle.

That's when we made our last pivot, which has been our most recent test: we removed the OpenStack layer entirely and went completely bare metal. We made our hub cluster completely bare metal. The NVMe is right there for the hub cluster, but you still have to allocate it, so we just use Ignition configuration to mount etcd on the NVMe. We also use the NVMe on our worker nodes, where it serves as local storage, so that solves the storage question for our bare metal cluster. And then for our spoke clusters, the ones under management, we just have pure RHEL with libvirt hypervisors. Depending on which piece of hardware we get from the lab, we can fit up to 17, or seven, SNO clusters per hypervisor, based on the capacity analysis we did previously with that hardware. So, a lot of gymnastics, a lot of head banging and head scratching to get to that point.

That was, what, a couple of months of just learning what we had access to and how we could maneuver, you know, basically more clusters in and out for a denser test? Yes. And one of the other big things about that pivot, when we started to use SNO on libvirt, is that we had new technologies integrated into OpenShift at the time: instead of ACM working with Hive to provision clusters, it's now ACM with the assisted installer together with Hive.

Boom. That's a big moment, right? That is amazing. Yeah. For people that don't know, the assisted installer is, well, that's actually what I used to set up my cluster here. Yeah, it's the next generation of technology that's coming out. It's already available at cloud.redhat.com as a SaaS, and it's a tech preview offering to start carving out bare metal in your data center with discovery ISOs and all this magic. So the pivot point is key, because now we're bringing that technology into the on-prem space. Alex, talk us through what that looked like and how the team responded.

Yeah, so the biggest savior there addressed one of the other scaling limits I neglected to mention earlier: when we were on top of OpenStack, we had to do much more planning for the hub cluster, not just etcd on NVMe. Hive would create an installation pod, and that pod required 800 megabytes of memory, almost a gig. So if we wanted high concurrency of installations, we had to create enough worker nodes to host all of those pods. In addition, each one would download an image file that it would then serve, and that consumed ephemeral disk space. So we had to plan around both memory and ephemeral disk space on those nodes.
In reality, though, all the installation work happens on the remote machine anyway. So why can't it just happen there? Well, thankfully, we had the assisted installer, and that's what got us there. We really shaved down the resources for our hub cluster. And once we moved to bare metal, we had originally planned for extra nodes, and we ended up being able to use them as extra hypervisors. That's what allowed us to scale up to greater than 1,000 clusters with roughly about 100 or so nodes in the lab.

So I hear 2,000 next, maybe 10,000 by end of the year. Wow. Who knows, 100,000? Maybe we should start calling these things endpoints. Let's not start there; we'll go back to 2,000. I get a little carried away sometimes.

Amazing guidance from Alex enabled us to just go ahead and test our system to see where the pain points are, and we found a lot of design choices and implementation choices we'd made that could be improved, as well as this new assisted installer technology that helped us address the spike of resource utilization during cluster provisioning time. So the customer doesn't have to plan for an excess of resources that only gets used during cluster provisioning and then afterwards just lies there doing nothing, which is wasteful. Alex basically set the stage for us to find a whole batch of new problems, right? Because if you don't have this hardware, and that had its own journey, then you can't actually flex the technology to figure out what's going to shake out. So where do we go next?

One of the interesting assumptions is that people sometimes think scalability is kind of linear. Not actually true: occasionally there's a number at which stuff just completely disintegrates, and these are the things we're really not able to see until we have the resources, until we have actual clusters to play with. Han is one of my favorite software developers, like, please follow him on GitHub, he's awesome, and there are a lot of lessons we learned during this journey that helped our operators break through these bottlenecks. Han, would you mind rolling into it?

And this is a big moment, because this is the mentor sharing kudos with the protégé. I remember seeing Han growing under your leadership, Hao, and now seeing him take off in this space. Take it away. He's a far better software engineer than me. There you go.

Yeah, cool. Thank you. Hao and I did prepare some slides this morning, and as I mentioned, this release we achieved the 1,000 goal and we learned a lot. My job is basically being a developer, and I work on controllers. I've been working on controllers for several releases, and this release I feel I learned the most, because we hit a lot of different difficulties. The difficulties came because, as Hao mentioned, we now have to support 1,000 clusters, which is super hard. The first difficulty I want to share is about our controllers: they just kept crashing when we had 1,000 clusters. The reason is basically out of memory. The easy fix is to just increase the memory limit, but that is not elegant, and that's not the solution we want, right?
So we actually did some investigation, and something interesting turned up that I want to share. The other thing is about performance: I mean, the speed was just too slow. It's always possible to refactor our logic, but again, there are some very easy solutions we can reach for first, and I want to share those too.

First, let's talk about the out-of-memory kills. With some investigation, it turns out it's because of the cache. Some background on the cache: if you're using a Go client to talk to Kubernetes, and here we are using controller-runtime, most of the Go clients actually maintain a cache in the background. When you use a client to do a watch, the way Kubernetes is designed, you watch some resource, and if it changes, you run some reconcile logic. While you're watching, there are background goroutines maintaining the cache, and they save every change they see into the cache. And if you do a list or a get with the client, it also goes through the cache: it copies everything into the cache and keeps it there. That's something we didn't fully appreciate before. Or rather, we knew about it, but we didn't realize how badly it can affect performance. Especially if you're just getting one resource, like one secret in the cluster: you just want one Get call, and in the background it caches every secret in the cluster. That's not something we wanted.

The solution is pretty easy, because the cache is the problem, and we don't actually need to cache everything. Sometimes we only need to cache the resources we care about. Like with secrets: in each namespace there's only one secret we care about; we don't need every secret in the cluster. So, first, I recommend that if you can use a namespace-scoped client, just use a namespace-scoped client. Second, if you can use labels to select the resources and reduce the cache, use labels. And third, don't ever cache all the secrets of the whole cluster; that's a lot of memory, and I'll show an example. Most of our controllers that crashed, crashed because they were caching the secrets, and the secrets are super large in those clusters.

Another exciting piece of news just happened this week: controller-runtime, which is a very popular library for controllers, released 0.9, and in this release there's a builder with options. With this configuration, users can easily configure the cache with label or field selectors, so they cache only what they need. This wasn't available before; we had to customize the cache ourselves, but now it's just a several-line code change.

This is awesome, because Chris, I know we get on your show and sometimes it's a bunch of smoke and mirrors, but this is real, bona fide development stuff, right? Sorry, Han, I didn't mean to throw you off your game. We're also really excited about this.
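For readers following along at home, here's a minimal sketch of that filtered-cache setup with controller-runtime v0.9's builder-with-options; the label key and value are hypothetical stand-ins for whatever label your agent actually puts on the secrets it owns.

```go
package main

import (
	"log"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Only Secrets matching this label selector get cached; every
		// other kind keeps the default cache-everything behavior.
		NewCache: cache.BuilderWithOptions(cache.Options{
			SelectorsByObject: cache.SelectorsByObject{
				&corev1.Secret{}: {
					// Hypothetical label; use whatever your agent sets.
					Label: labels.SelectorFromSet(labels.Set{"mycontroller/watched": "true"}),
				},
			},
		}),
	})
	if err != nil {
		log.Fatal(err)
	}
	_ = mgr // register reconcilers, then call mgr.Start(ctrl.SetupSignalHandler())
}
```

With that in place, the informer for Secrets only stores the labeled ones, which is what keeps the controller's memory flat.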
And here is the example from one of our controllers that was caching the secrets; because we didn't realize we should use labels or any of those techniques, we just cached everything. This is the before picture, caching everything: we have a thousand clusters, and for OpenShift, each managed cluster gets a namespace, and in that namespace there are three service accounts. Each service account has two secrets, one for the service account token and one for the Docker config. All of those add up to something like 6,000 secrets, which can be several hundred megabytes, let alone the other secrets created for each component or controller, and those are secrets we don't even need. Then we switched to caching only the secrets with our labels, and boom, we cut 500 megabytes of memory. Now it's basically nothing: before, 500 megabytes; after, nothing. Considering we had a couple of controllers caching secrets, we actually shaved off several gigs of memory. That's a lot for us, and we're super happy with that result.

The other topic is performance; things were simply too slow once we had a thousand clusters. We should know there's no one-size-fits-all solution for performance tuning; sometimes the only solution is refactoring. But sometimes it can be very easy, because there's often configuration available. The examples here are all from controller-runtime.

First, there's the client QPS and burst. When you're using the Kubernetes client, you're doing gets and lists, or mostly creates, updates, and patches, and the QPS setting rate-limits those. The default is just 20, and the burst is a buffer on top that lets you briefly go higher, 30 by default. If you fire off a lot of requests in a short window, you'll see a lot of client throttling messages in the logs, and at that point you can consider scaling up the QPS and see whether it solves the problem. Our example: we apply a thousand manifests to one cluster, the hub cluster, and we want to apply them in a single reconcile. Because of the default QPS, one reconcile took 30 or 40 seconds, which is super slow. After we raised the QPS to 200, it's just several seconds. Super fast now.

Another knob is the workqueue rate limiter. Every time a watched resource changes, it triggers a reconcile, and there's a rate limiter on that queue; the overall default is 10 per second. If you suspect it's throttling your controller, that's another one you can tune.

And then there's MaxConcurrentReconciles. The name explains everything: you can raise the concurrency; the default is one, so you're running a single worker. If your task is time-consuming and can be done in parallel, this configuration can be very helpful. Our example is importing clusters: importing is really just applying some manifests on the remote cluster, and after we apply the agent there, the agent comes up and gets the cluster imported. We apply those manifests to every one of the 1,000 clusters, and because we had only one worker, it all happened serially; and because a freshly installed remote cluster can be super busy, each apply takes a while, and it adds up.
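Both knobs are a few lines of code. Here's a minimal sketch of what that tuning looks like on a controller-runtime manager; the QPS and burst values, the watched type, and the ExampleReconciler are illustrative stand-ins, not the actual ACM controller code.

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// ExampleReconciler is a stand-in for a real reconciler.
type ExampleReconciler struct{}

func (r *ExampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Imagine applying a pile of manifests here.
	return ctrl.Result{}, nil
}

func main() {
	// Raise the client-side rate limit. controller-runtime defaults to
	// QPS 20 / burst 30, which throttles a reconcile that has to issue
	// hundreds of API requests.
	cfg := ctrl.GetConfigOrDie()
	cfg.QPS = 200
	cfg.Burst = 400

	mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
	if err != nil {
		log.Fatal(err)
	}

	// Run reconciles in parallel. The default is one worker, so one slow
	// remote cluster serializes work for the entire fleet.
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Namespace{}). // hypothetical watched type
		WithOptions(controller.Options{MaxConcurrentReconciles: 10}).
		Complete(&ExampleReconciler{}); err != nil {
		log.Fatal(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}
}
```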
We have a really fresh example here, from last month, when we ran the 1,000-cluster experiment. The orange line means a cluster has completely finished the SNO, single-node OpenShift, install. After every cluster finishes, we expect our controller to automatically import the cluster so we can manage it; the green line is managed. The import process is just applying manifests, so we expected it to be super fast; it shouldn't take long. But look: the cluster installs take only about three hours, while the import takes four and a half hours. That's an hour-and-a-half gap we didn't expect. After some investigation, we found it's because the concurrency is one, and also because the remote clusters had just finished installing and had a lot going on, so applying the manifests took a while, like 10 or 20 seconds each. With a single worker, that adds up to four and a half hours. It's a lot. Then we thought, maybe we can just configure the concurrency, and that's what we did. You can see here the lines are now perfectly aligned: every time a cluster completed installation, we imported it, with no delay. We're super happy with this because it was a one-line code change, just the concurrency. Yeah, so amazing. That's super cool.

Let me do some conclusions. Refactoring is always good, if you have time; we don't. So before you increase the memory limit, think about the cache. And before you refactor, think about the QPS and the concurrency. That's everything I wanted to share. Thank you.

Thank you. It turns out controllers give you a lot of tools. Yeah. Knowing what you're working with, knowing what's on the table to begin with in your container orchestration, is pretty important at this level of scale. Yeah. And clearly the community that uses controller-runtime sees the cache problem too; it had to be observable elsewhere, otherwise that change to implement the filtered cache wouldn't have come up. It's serendipity; it happened at exactly the time we needed it. Hao probably would have gone and contributed it if it weren't there. Yeah, I was too slow. He was too slow, he didn't get that PR in in time, but it's just wonderful.

Now, the last graph that Han pulled up shows how fast we were able to provision clusters. Holy crap, that was 1,000 clusters within three hours, right? That would not have been achievable without a significant amount of resources if we were using installer-provisioned infrastructure, IPI, which is essentially what you get when you run the OpenShift installer in conjunction with Hive, just because of the sheer amount of resources we'd need to pre-provision and prepare in order to achieve the concurrency we need. We mentioned the assisted installer already, and Crystal has a really well-written document that describes what the magic is here that makes it different, that reduces the resources we need to prepare, and that allowed us to provision 1,000-plus clusters in three hours. I'm on the edge of my seat.

Yeah, that was beautiful, watching it go from yellow to green on the graph; that was just a beautiful thing. I know there was a lot of hard work behind the scenes, lessons learned, optimizing the bits before they're out there. Great, phenomenal work.
Scott, would that stuff show up in cost management? You know, knowing it's a SaaS offering, will we be able to see it in cloud.redhat.com? Yeah, I don't know if we've connected those dots; that's a great idea. If it's on AWS, I think we'd probably already have that. You're talking about OpenShift 4.8, which isn't GA yet. Yes, definitely forward-looking, but you're talking about cost savings, because you're not sending as much over the wire and you're not spending as much on storage. Absolutely, all the resource consumption.

So, what was that picture you were talking about, Hao, the one Chris Doan has? It shows how we were able to provision a thousand clusters within three hours, and I wanted to spend some time and let Crystal show us what's going on, what the magic is that allows us to achieve that. But from the SRE perspective first, Chris, you've had your eye on metrics, data gathering, usage graphs, all that kind of stuff. Enlighten us: what are we missing on data gathering?

I mean, as we've been doing these tests, we've always been collecting the metrics, like the provisioning times in the graph that Han showed. One of the things we'll have to roll back into the releases is this: we're generating these metrics today, but it would be even better if they were captured and stored within our platform so that we can query them. I think we can query a bunch of metrics today already, but these particular metrics aren't that easily accessible, so that could be one set of metrics we roll into the product. And if we roll them into the product, then, in my mind, if customers want to replay the work we've done here in their own environments, they can re-qualify our results, and that can bolster their confidence in our platform, right? For example, the automation we constructed to get to this point could also be made public, and then customers can reuse it.

Yeah, so help me connect the pieces. We've gotten to this point: we've got the scale lab and hardware, we have improvements in the controllers, we've met criteria around 1,000 deployments within three hours, with a success rate of something like 98%, something really high. Yeah, it's really high. The roughly 3% of failures or issues may be attributed to the environment: we're using a scale lab environment, and these are still virtual bare metal machines, so there could be nuances in the environment that lead to some failure rate. And then you get to the point where you have the assisted installer, which is fantastic technology, working together with ACM. We intend to deliver that as a dev preview in version 2.3, coming in July.

So that gets us to the point of: I'm deploying clusters, and I'm going to come back and say, so what? Okay, you did some good work, but so what? I want to manage at the edge, and I need tools to do that. I need policy, I need compliance, I need to be able to configure something centrally. So, policy. I know this is part of the journey Crystal has been working on. Get me to the point where we're like, okay, now that we've deployed, what does day two look like on the single-node OpenShift? That's kind of what the slide Han was showing gets at as well.
The fact is that we can provision these SNO, single-node OpenShift, clusters using the assisted installer, but the next part is that we actually import those managed clusters into the hub. And once you have the managed clusters imported, that opens up the window to the rest of our RHACM capabilities, right? Policy management and application lifecycle management, ready to go. Focusing on policy, the day-two configuration you were mentioning: as long as the configurations are controlled by OpenShift operators, you can pretty much define any kind of policy to modify or constrain those behaviors. And by creating a policy that you distribute across those 1,000 managed clusters, you can, and I'm jumping the gun there, you can keep your fleet consistent, right? You have a policy for, for example, OAuth; you can deploy it to the 1,000 clusters and keep that OAuth configuration consistent across your fleet, a fleet running at the edge as well.

And you mentioned operators, but this would be, you know, Kubernetes resources, really anything you can describe within a piece of YAML. You can now start to define that as a desired-state model across your fleet, and in this case these could be dev clusters that operate differently from prod clusters, and those might operate differently from West Coast versus East Coast definitions, with the label constructs we have to articulate how you want those things configured. Limits and quotas, the OAuth you mentioned, all of that stuff comes into play: roles, role bindings, users. Yeah, and the other point is that if you define your policies in a GitHub repo, where your GitHub repo is your source of truth, you can use GitOps connected to your hub to maintain that source of truth.

Awesome. So how do we get to that point? I think this is Crystal territory. We were defining policy, you know, articulating that as part of this deployment graph that Han showed. What's the magic in that space?

The magic is that policy comes from what RHACM deploys, known as the GRC framework, for governance, risk, and compliance. It's had a lot of great work done to it, in that it's not only scalable across all these thousand managed clusters, but it's able to deploy all these policies very fast and very efficiently. In that sense, you get to manage your clusters and know whether they're compliant or not compliant very, very fast. In our initial testing, we started off with 100 policies deployed over these thousand SNO clusters, and it took about 90 minutes to propagate all the policies to all the managed clusters; that's about 100,000 objects. But after some tuning by the rest of our team and some efficiency work, we were able to get that down to about 10 minutes.

Yeah, this is a huge improvement. Shout out to Ian, who's on our team as well, who did the QPS tuning for that, the same tuning Han mentioned earlier. With that performance in mind, it's an incredible improvement over something that was already very scalable in the first place. And that's kind of the magic of it: it was built to be scalable in the first place, and from there it was moved toward something even more efficient. So, fine-tuning of the configuration to quickly ingest that policy definition and ensure compliance across the end-to-end fleet.
Like I think Han pointed out, it was a one-line change, and then we did the concurrency magic. So show me a picture, or do you have something that describes the journey you went through in that policy space?

Here we go. This is our initial findings document. As you can see, when we first tested this out we created 100 policies on our hub, which then propagate to all of the managed SNO clusters, the thousand or so, and that took about 1.5 hours. With this testing we wanted to see (a) how long it would take the policies to propagate, and (b) once we switch them from inform to enforce, how long it would take for all the managed clusters to show up as compliant. That's why you see two bullet points: the first is the initial propagation, and the second is switching these policies to enforce, which also took 1.5 hours.

The difference there is subtle, but let's sit on it for a second. One of the things our customers have told us is that they love the ability to check, to use an audit type of framework to see what is compliant and non-compliant. That's why we call it inform: there's an inform mode, which is a verb in the YAML, that says just inform me of what's going on in terms of compliance. But you're telling me you can actually enforce: change that verb to enforce, and now I can make changes, I can actually propagate that change across the fleet, and in literally the same amount of time. And that's the really crazy thing: when you change it to enforce, it not only propagates, it actually does the enforcement on all thousand managed clusters, and then each one tells the hub, hey, my managed cluster is compliant now, which is fantastic.

That was just the initial investigation. I have a graph right here that shows the testing we did: as you can see, 1.5 hours to propagate, and then 1.5 hours to switch from inform to enforce fully. But after the efficiency and QPS tuning, it dropped to 10 minutes for each of those: 10 minutes to propagate initially, and then 10 minutes when switching from inform to enforce. I don't have a picture of that right now, but take my word for it. It's there.

No, I trust you, because I've seen it; I've seen the team working. I just want to pause there for a second: making a change on 1,000 clusters, 10 minutes. That's it. Wait, you said 10 minutes? Yes. Yes, with the latest code change. Our initial finding was 1.5 hours; after the tuning that you and Hao did, it's 10 minutes to make changes to 1,000 clusters. So that's management at the edge: enforcing compliance across your fleet and getting those results reported back as either compliant or non-compliant, and you just step into one interface and see all of it. I mean, we hear these kinds of requests all the time: how many different clusters do I have to log into, where do I need to set my context? And I'm like, no, no, no, that's the problem we're solving. You don't have to jump into the context of all these different clusters; we provide one interface for you to set those controls from one spot.
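For the curious, that inform-to-enforce flip is a one-field change on the Policy resource (spec.remediationAction in the policy.open-cluster-management.io/v1 API). Here's a minimal sketch of doing it programmatically against the hub with a dynamic client; the policy name and namespace are hypothetical, and in practice you'd just edit the YAML in Git or in the console.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	dyn, err := dynamic.NewForConfig(ctrl.GetConfigOrDie())
	if err != nil {
		log.Fatal(err)
	}
	// The GRC policy API on the hub.
	policies := schema.GroupVersionResource{
		Group:    "policy.open-cluster-management.io",
		Version:  "v1",
		Resource: "policies",
	}
	// Flip the one field that switches "report compliance" (inform)
	// into "remediate the fleet" (enforce).
	patch := []byte(`{"spec":{"remediationAction":"enforce"}}`)
	_, err = dyn.Resource(policies).Namespace("policy-ns").Patch( // hypothetical namespace
		context.TODO(), "oauth-policy", types.MergePatchType, patch, metav1.PatchOptions{}) // hypothetical name
	if err != nil {
		log.Fatal(err)
	}
	log.Println("policy switched to enforce; the hub propagates it from here")
}
```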
I forget, I think it was Chris Doan who was mentioning the GitOps part of this, where these policies are actually stored in a repository. You know, being able to have a code source, a source of truth, for what that policy should look like, and then designating that policy as what you want to distribute to the fleet and what they should all be compliant with. That part of the story is the super powerful part: I don't have just one model or one way to introduce a policy. I have multiple ways: I can kubectl apply it, I can pull it in from source, or I can create it directly within the UI if that's how I want to do it.

What are we missing here? I think we're down to the last five or ten minutes, but are there any areas we haven't covered? I really wanted Crystal to spend some time showing us the magic of the assisted installer and what the difference is between it and IPI, but I don't know if we have enough time for that. I think we should spend a few minutes at least, because that's part of it: what was your use case for the assisted installer versus IPI? I think that would answer a lot of questions.

We were put on this planet to help create clusters, right? We want OpenShift to be everywhere. And IPI, installer-provisioned infrastructure, was the first model, the first tool we really started with. Chris, you know more about that than anybody, take it away, tell us that story. Crystal, not Chris, sorry, the two names get a little close to each other. Yeah, it was Crystal, right? Hao, did Crystal prepare something for that? Okay. Crystal.

Yeah, so the assisted installer is the service that, like Alex mentioned, came in at the right time, at the right moment, and helped funnel all these things we were doing. As Alex and Han mentioned before, they were using IPI with Hive in order to create all these clusters they wanted to scale, and, being at the far edge, they went with SNO clusters. But of course, that came with a lot of disadvantages, like they mentioned: the ephemeral storage that was needed, and the extra memory. The assisted installer came in and was able to take all of those installation procedures that need to run for SNO clusters and move them away from the hub, so you don't need the extra storage space, and you don't need to plan for the concurrency failures that would happen with IPI. The assisted installer just hands things over to the cluster you want to provision and runs everything there on its own, which increases the success rate of these clusters, because with IPI there were failures due to unexpected, you know, memory issues. With the assisted installer we got so many more clusters, and it helped us provision all these thousand SNO clusters.

Another thing that comes with the assisted installer, and Hao, I think this is what you wanted me to show: I'll give you a sneak preview, exclusively for this show. It will come out in a doc, probably a little different, and this is not official RHACM docs, but here's a sneak preview of something fantastic the assisted installer comes with, something called zero touch provisioning, ZTP for short.
There are just five simple steps, and they happen once you've put in everything you need to configure for the assisted installer; the configuration is fairly simple. This great feature, zero touch provisioning, is where the assisted installer takes over and provisions your cluster for you. For your managed cluster, you don't need to actually go into the managed cluster at all, or onto the actual machine, to do anything; it handles everything for you.

So here are the five steps, which hopefully are very simple. First, it generates the discovery ISO, which is an image used to boot the managed cluster, which you can see on the right side here. Once that's generated, it gets booted onto the target bare metal machine, the SNO cluster you want to provision; on the hardware itself, it boots this ISO for you. Then, once it's successfully booted, it reports hardware information back to your hub cluster. When your hub cluster is aware of all the hardware information, it proceeds to install OpenShift Container Platform on the bare metal machine. That gives you the SNO cluster, the single node running on that bare metal machine with OpenShift on top of it. And then, after OCP finishes installing, the hub, that is, Red Hat Advanced Cluster Management, takes on that new single-node OpenShift cluster as a managed cluster. That's where you get all the good stuff from RHACM: the deployment of the add-ons and all the management we previously talked about, policy, applications, et cetera. So that's the basic flow of ZTP. All you need to do is log into your hub cluster, kick off the provisioning, and let the assisted installer take it from there for you.

This really does abstract away a lot of the complexity of provisioning a cluster on a bare metal machine. Before, you had to set up a provisioning network and a provisioning server to host a bootstrap; the setup was complicated. This one bootstraps in place; you don't need anything external. That's it: we reach out, boot the machine, it forms a cluster, done. ACM manages it, you deliver whatever configuration you want to deliver, your compliance model, and from that point forward it's under management.

And you make it sound so easy, Crystal. Your team has worked beautifully. I know this has been development under the pressure of creation, and you've created a diamond out of the rough here. Seeing your team work together with the assisted installer, metal, ZTP, all the componentry that's come together into this package that ACM is delivering: it's awesome, it's just really awesome. The way this team has performed has been brilliant. Anyway, I should stop sharing kudos here in the last stretch. Chris, do we have any questions that have come in?

One question I haven't been able to answer is: what are the current pain points in the hub when managing a thousand clusters? Well, presenting that much information is kind of hard, and that's still under investigation, you know, the UX improvements. But we function pretty darn well with a thousand clusters at this moment. Now, we do scope it to a couple of components and a couple of features for now; we focused heavily on policy, just because Scott said so. We've also been doing monitoring and observability, right?
That's really true. Right: monitoring and alerting, so that no one has to actually stare at a thousand clusters to figure out what's going on; the monitoring is centralized. Unfortunately, we didn't have time to go into it, but we reduced the memory footprint of that component by a significant amount while still retaining all the capabilities it originally had: providing the long-term metric store and the ability to alert off the fleet and bring that to a central spot. That's amazing.

There are 28 or so, by my count, on the Far Edge squad. Anyone we want to name-drop and thank? I know even today we saw Emily demoing some new stuff, and Randy, George, and others. Crystal, anyone you want to name-drop on the squad? Or Hao? Here's the special moment, as we're debuting some of this great stuff. I can't name-drop everyone; every single person helped me out. I'm just proud to be part of the Far Edge effort, with the scale and performance work, and connecting with all the other components, as Scott was mentioning, observability and GRC. Thank you.

Chris, I think our time is up. I appreciate you having us. This has been amazing. The level of effort to get to 1,000 clusters in any way, shape, or form is enormous, but the fact that you found problems with caching and, you know, concurrency, and that those were easy fixes you could make, shows the power of the underpinnings of Kubernetes in general and the scale you can achieve with it. And that's just an amazing number: 1,000 clusters in that little amount of time. Yep, and it's just baked into hub management, right? That is awesome. The bits will be in there as a dev preview, and we'll be moving toward tech preview in the fall, and I think by that time, who knows, maybe that number is bigger. Maybe we'll be back here in the fall. The general improvements we made for this are applicable across ACM, so in 2.3 you can expect ACM to use less memory and fewer resources in general. Leaner and meaner.

I think we have the longest name, but I think we also brought the most people into the Zoom channel here. You were beaten by the last call. Oh, of course we were, I get it, I get it. Got to live up to big brother. Fantastic work, team, seriously, thank you so much. I can't wait to hear more about this journey, just pushing the edge further, if that makes sense. Thank you all so much, and thank you, audience, for tuning in.