Thanks for braving the early morning; I know computer people like to sleep in. My name is Matt Van Winkle. I work with the group at Rackspace that's responsible for fleet management for our public cloud, and I'm going to talk today about some of the interesting things we did to get where we are.

A little background on this particular presentation: it was a bit of a thought experiment. My teams have realized over the last three years that we do very serious work, and that requires us to never take ourselves too seriously, so I've tried to pepper some of that into this presentation. I will give you one disclaimer: like most of you, I spent most of last week dealing with VENOM. So from my own perspective there are too many slides and they're too wordy; I would have liked more pictures and fewer words, but the polish didn't quite happen while we were busy running around fixing those things.

Most of you know who Rackspace is. We've been around a while, we give customers lots of different hosting options, and now we run a very large public cloud based on OpenStack. A few specifics about that cloud: we currently have six regions around the world, with tens of thousands of hosts scattered across them. At last count we were well over 170,000 instances. Every region we run has multiple cells; I'll talk about that in a minute. The slide says production is up to the mid 30s in cells, although I looked this morning and forgot that we're actually converting a bunch of our legacy products over to OpenStack, so one of my regions actually has a number of cells up into the mid 50s now. I've got to go back and change that.

So we're going to talk today about: the decision to use cells; how we run our control plane on a cloud, an OpenStack cloud; the fact that we run our compute as a virtualized node on the hypervisor; how we made Neutron work for us; the fact that we manually wire up the vSwitch for all of our instances (I say manually, but it's scripts on each host that wire them up); and, ultimately, our decision to not necessarily find an OpenStack project for everything we're doing.

So first of all, cells. This is not nearly as crazy as it used to be, by most standards; we're not the only ones using it anymore. CERN and NeCTAR and I have done a couple of presentations on how we use cells, this summit and the last one, and GoDaddy did a great one this summit on how they moved to cells. It's becoming more and more common. And it is actually upstream, so let's just get that out there now, because I get asked every now and then: "It's not upstream, is it?" Yes, it's upstream.

So why do we use cells? There are a couple of main reasons. One is that it allows us to scale quickly; we can grow these regions very large at a constant, rapid pace. It also gives us some separation of failure domains: every cell has its own DB or DB cluster and its own RabbitMQ server, so ideally a failure of one of those things may cause problems in that subset of the fleet, but it doesn't necessarily bring down an entire region or cause you problems all over the place. We also offer our customers several different options for what they can spin up and the type of hardware that's underneath it, and so we group common hardware into cells.
So we have multiple cells of different flavor classes, and within those classes we'll offer different sizes; I used our general-purpose class as an example. We also have some supply-chain constraints. We leverage live migration on an increasing portion of our fleet, and what we've found is that it works best when the CPUs match exactly. Because some of our hardware types are multi-sourced from different vendors, we'll actually use cells to consolidate all the gear from one vendor in a single cell, versus a different cell for a different vendor, to keep those things clean.

How do we make cells work for us? We typically size them so that most now land at just a little over a hundred hosts; we have some that go up to 600 hosts on our older hardware profiles. What we've learned over time is that this is largely influenced by the layer-2 failure domain: we've seen cases in the really large cells where a couple of instances doing bad things, like causing broadcast storms, can spiral out of control, especially on the internal network behind the scenes that connects different Rackspace services. We also find we're influenced by the VM density of a particular flavor class and how that lets us efficiently carve up our internal IP space. A lot of those things go into it, but most of our products now end up around a hundred-plus hosts most of the time. It's also some multiple of cabinets, once you get to that rough math, and the number of cabinets that can be in a cell is also somewhat influenced by the ports on an aggregation router. All that said, these are not things we expose: a customer doesn't know what cell they're in, nor will they probably ever. It's just a way for us to manage the fleet as it grows.

It's not perfect. It's getting better. There are still some upstream issues: the gate job is currently working, I believe, but not voting yet; that's coming soon. There's a sense of duplicate data: the implementation of cells today takes everything in every cell database and copies it to the global database, the region-level top-level database, so you have this duplication of data. There is no easy upgrade path at the moment from what we run to what's coming down the road with cells v2; I'll mention that in a second. And not every patch that comes down the pipe necessarily thinks about cells. We find that sometimes different features or patches come in and we have to go figure out whether they can be made to work with cells, or an interesting conversation begins on the spec between, say, Tim Bell over at CERN and whoever submitted it, about "hey, why didn't you think about cells?" and where that goes.

One of my fundamental problems with it is that I can't turn a cell off. There are times, as you all know, when things break in portions of your infrastructure, and being able to not send builds there would be amazing. So the code on the right side of the slide is actually a really dirty hack that I shared in our other cells presentation: we apply a filter with a custom weight, and that lets the scheduler send no builds whatsoever to that cell. It's the closest thing we have to an off switch for a cell.
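The slide isn't reproduced here, but the shape of the hack is easy to sketch. This is a reconstruction of the idea rather than the actual Rackspace code — the cell name and weight value are invented — showing a custom cells-v1 weigher that drags a "disabled" cell to the bottom of every scheduling decision; pairing it with a filter, as the talk describes, is what turns "last resort" into a hard off switch. (I believe the stock weight_offset weigher can be bent to the same purpose by giving a cell record a hugely negative offset.)

```python
# A minimal sketch, not production code: a nova-cells scheduler weigher
# that makes a "disabled" cell lose every weighing decision.
from nova.cells import weights

# Hypothetical: in a real deployment this would come from configuration.
DISABLED_CELLS = {'region01!cell07'}


class CellOffSwitchWeigher(weights.BaseCellWeigher):
    """Weigh disabled cells so low the scheduler never picks them."""

    def _weigh_object(self, cell, weight_properties):
        # Any enabled cell (weight 0.0) beats a disabled one, so builds
        # only land in a disabled cell if no other cell exists at all.
        if cell.name in DISABLED_CELLS:
            return -1000000.0
        return 0.0
```

A weigher like this would be wired in via the cells scheduler's weight-classes configuration, alongside the stock weighers.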
Cells v2 is coming, though; we were in several design sessions yesterday about it. Here's the short version: in Liberty the devs are trying to get to the point where everyone has cells, even those who don't today, except you'll just have one; it'll essentially be a framework sitting underneath Nova. Then in the M release the target is the full thing: you take that framework and begin to expand it, if you so choose, to multiple cells, and those of us who've been running v1 until now have to figure out the upgrade path to the new version. Some of the things they're looking at are trimming the global database down to only the information needed to associate instances with their underlying cells, and allowing more direct communication from the APIs down to the child cells. Today there's a double layer of RPC calls going on, between multiple RabbitMQ servers and a service called nova-cells, which augments the scheduler.

So the next thing: we run all of our control plane as instances on a cloud. Again, this is less crazy than it originally was; several people do this in various forms, and TripleO is out there, which is the same idea. I found a picture from the San Diego summit in the fall of 2012 where Troy Toman was up on stage talking about several things. I thought he'd had a great diagram of it; he didn't, so this screenshot is actually from the point in the talk where he was going over this.

We built a small OpenStack deployment — small compared to our public cloud — in every one of our data centers, and in it we spawn instances that represent our APIs, our databases, our RabbitMQ nodes: pretty much everything we need to run OpenStack. One thing I will add is that we do have cells inside our small OpenStack deployments, for two reasons. One, we've started to introduce multiple hardware types there too: we gave our customers performance hardware, and we figured, hey, let's do the same thing for our control plane. Also, we have over time allowed other groups at Rackspace to use that same capacity for internal projects, and so we like to separate the tenants, so we know certain parts of that capacity are only for the instances that run the control plane for the cloud, and the other folks have their own space.

Here's a relatively accurate picture of how things are today, pulled off one of our wiki pages. We manually configure a small number of hosts and set VMs up on them, and those VMs become, if you will, the control plane for the small cloud. Typically we have one that represents the global API level, one that represents Glance for this deployment, and then some number of them that represent the various cells underneath. At that point it looks just like any other OpenStack installation with cells: we have a bunch of cabinets out there organized into cells, running their computes, and on those computes we spawn instances which in turn become API nodes, RabbitMQ nodes, etc., which run the many, many hypervisors out in the production fleet.

So it's good. It's easier to automate the distribution of nodes: tearing them down, spinning them up, moving them about, reacting to problems. One of the ways we've commonly benefited from this: if something breaks in a way that causes a RabbitMQ queue to spike out of control, we can just go spin up a whole bunch of global cells workers until they pull it down, and then tear them right back down. So it's an easy way to react to some of those problems.
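As a rough illustration of that burst pattern — the names, image and flavor IDs, and credentials below are hypothetical, not our tooling — here's what it can look like with python-novaclient pointed at the internal cloud:

```python
# A hedged sketch of "burst workers": boot a batch of extra cells-worker
# instances to drain a backed-up queue, then delete them once it recovers.
from novaclient import client

nova = client.Client('2', 'ops-user', 'secret', 'control-plane-tenant',
                     'http://identity.internal.example.com:5000/v2.0')

IMAGE_ID = 'cells-worker-image-uuid'   # image with the worker baked in (hypothetical)
FLAVOR_ID = '4'                        # whatever size the worker needs


def burst_workers(count):
    """Boot `count` extra global cells workers."""
    return [nova.servers.create(name='cells-worker-burst-%02d' % i,
                                image=IMAGE_ID, flavor=FLAVOR_ID)
            for i in range(count)]


def tear_down():
    """Delete every burst worker once the queue is drained."""
    for server in nova.servers.list(search_opts={'name': 'cells-worker-burst'}):
        server.delete()
```

The point is less the specific calls than that the control plane is now just tenant instances, so scaling it is an API call rather than a hardware request.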
It also gives us a little more insight into the customer experience, because we're essentially running our own software on ourselves, so if there's a problem hurting us that badly, there's a chance it's affecting our customers too.

There's a little bad to it. It exposes us to any bugs that come down the pipe as well, so our control plane is now susceptible to any problems in Nova that might cause issues. It's more clouds, if you will, to maintain: now I've got two clouds in every physical region, not one. And it keeps reminding us that we still think like systems folks when it comes to HA and those things. We still try to do the stuff you used to do when you had a couple of cabinets of gear and wanted to cross-connect them and make a pair, and it's stretching us to think more like web-app developers. I know this is infrastructure, but how do we deploy it like a web app? How do we treat it like a web app? We're still learning that.

There are a few really ugly parts. We do share this capacity with other groups, like I said, so there are some capacity wars; people want more and more, and we're trying to build more and more control plane. Unfortunately, unlike the public cloud, where people are buying it and paying us money and our finance department is happy to keep pouring gear into it, it's a little harder to go say, "now I need a quarter million dollars to expand this thing that's not going to make us any money." That's a challenge. And right now we do have some version drift on the software side: there are some features we're waiting to convert for the purposes of our internal cloud, and that has us a few revisions of our own software behind the public cloud. That's probably the biggest challenge we're facing right now.

Where are we going with it? This does open the door to containers: can we use containers more easily for some of our control plane? What are different ways we can think about clustering — going back to, if we treat this like a web app, what would we do differently? And where can we stop thinking about pushing code to a node and think more about fleet management? APIs, for example: instead of updating the code on the API nodes, why not just spin up new ones, add them to a load balancer, and drop out the old ones? Things like that are where we can spend more time digging in.

So that brings us to the next one, which is that we run the actual nova-compute as a VM. This comes back to our use of XenServer, which makes us a bit of an exception in the community; we're one of the very few people who use it, which adds its challenges, since most folks are developing for libvirt. We use XenAPI, though, and XenAPI handles being remotely managed much better than libvirt does. That essentially allowed us to take all the compute code, stick it in a VM that we manually create on our hypervisors, and have it externally manage XenAPI, if you will, for all the tenant creation, deletion, and management.
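To make the "remotely managed" point concrete: XenAPI is an RPC API served over HTTPS by the host itself, so a compute VM — or anything else with credentials — can drive the hypervisor from outside the host. A minimal sketch, with placeholder host and credentials:

```python
# List the guest VMs on a XenServer host from anywhere with network
# access to it, using the standard XenAPI Python bindings.
import XenAPI

session = XenAPI.Session('https://xenserver-host.example.com')
session.xenapi.login_with_password('root', 'password')
try:
    for ref, rec in session.xenapi.VM.get_all_records().items():
        # Skip templates and dom0; the rest are tenant instances.
        if not rec['is_a_template'] and not rec['is_control_domain']:
            print(rec['name_label'], rec['power_state'])
finally:
    session.xenapi.session.logout()
```

The same session-based API is what lets the compute VM start, stop, and migrate tenant instances without being part of the host itself.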
It's a little wacky: it effectively doubled my node count, if you think about fleet management. If I have 5,000 hosts in a region, I also have 5,000 computes, and they're distinctly different things. However, it gives me flexibility. If I need to reboot my compute node, that doesn't affect the underlying instances on the host. Worst case, if I need to blow that compute node away and start over, it may affect my ability to send instructions or commands or reboots to those instances, but it doesn't actually affect the instances themselves. Conversely, it gives me a little bit — it's not perfect, but a little bit — of isolation of the compute from the hypervisor, from a security perspective. The biggest place that helps me: we have a bunch of great support people on the other end of phones and tickets and chats helping customers, and some of them have privileges that allow them to log into our hosts — because, as you all know, sometimes a migration goes bad, or an instance gets into a state that only a human can unstick, and to do that they have to log in. This setup lets me grant people the ability to log into my hosts without necessarily getting into the compute node, which has credentials to my database and access to the other parts of my infrastructure.

It's also got us thinking about maybe crazier ways we can do this in the future. One of them: can we use containers for it? That'd be good. Could we extract the compute node from the host itself? That's a little wackier, but when you start thinking about pushing code, and you have to touch seven, eight, nine thousand things with that code, what if you could cut that by a factor of ten? What if I could pull my compute nodes out, run them as instances in my internal cloud, and have each one manage ten hosts? There's a lot of upstream work that would have to happen for that to work, but that's the kind of stuff we're thinking about: if we maintain this separation, can we go a step further? Ultimately, could we have a small number of compute nodes sharing a bunch of hosts, so you even have some redundancy built in? We're watching the work we're doing right now with our OnMetal offering, because Ironic basically does this already, in a simplified form. So what are we learning from that, and can we extrapolate it to what we're doing on this side?

All right, Neutron. This one always gets interesting. We do use Neutron, and we use Neutron with cells, and the way we did that was to build a plug-in. This actually allowed us to do a lot more than just use cells.
So why did we build a plug-in? When Neutron first started, we looked at all the things we needed the network — the network stuff, if you will — to do for us. We had to assign IPs to an instance, but we actually needed to assign them on multiple networks: out of the box, every instance in our cloud gets two networks, one for the internet and one for other services at Rackspace. We also assign the MAC addresses to instances as they're created, so we keep a pool of MACs that we allocate out. We also wanted to support — and since we launched this cloud, or shortly after, we have supported — overlay or software-defined networks between instances, so you could spin up several instances all over, say, our DFW cloud and build a private network between them. So whatever we did still had to interface with an upstream controller to make that work. And ultimately, this goes back to the choices we made before: all of this has to work with XenServer, and even with ML2 there are a few things today that don't work well with XenServer.

So we built Quark. Quark is readily available; it's a full-fledged plug-in for Neutron. The name came from the concept of it being Neutron-lite — you know, a neutron contains quarks. It essentially lets us run the Neutron API and all those things, but manage our needs through the plug-in. It also allows us to use vendor plug-ins; for example, we have Nicira controllers that help us run our software-defined networks and our overlay networks within the cloud. We're running both bridged and tunneled networks. When you get your public IP address on an instance in our cloud, you're not going through a NAT or anything like that; you have a public IP address that's routed out to the internet. We can talk later about whether that's good or bad, but that's what it is, along with the tunneled networks that support SDN and all those things. And it does support cells; we'll talk about that in a bit more detail in just a second.

How do we do IPAM with it? When we started, there wasn't a concrete upstream API for IPAM; it was very vendor-dependent. In some cases it still is, but not always. I will say, in fairness, that pluggable IPAM is being worked on upstream, so, full disclosure, they're fixing it. In Quark, the way we do this is that we create two ports per instance, we add an IP to each, and we allocate the MAC. There's a rough drawing of that network setup on the slide: the instance on the host is attached to both networks.

So how does it let us use cells? I probably should have changed this slide when my engineers called me out elsewhere: it shouldn't say "supports" — "awareness" is probably a better word. With Quark we've basically added a column to the database that represents a segment. We had a conversation yesterday with the large deployments team where we all kind of agreed that the real problem here with Neutron isn't so much about cells, but about not being aware of network topology. The way we've solved it is with this column that adds a segment: we define a tenant for each cell, each cell becomes a segment, the provider subnets (the public and the internal network) are scoped to that segment, and then when Nova requests its information, Quark just selects from the available IPs on the networks for that segment. It works pretty well.
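A toy sketch of that segment idea — this is not Quark's actual schema, just the allocation logic in miniature, with invented subnets, MAC pool, and cell names:

```python
# Segment-scoped IPAM in miniature: every instance gets two ports (public
# plus internal), and allocation only draws from provider subnets whose
# segment matches the cell the build landed in.
import ipaddress

SUBNETS = [
    {'cidr': '203.0.113.0/26',  'network': 'public',     'segment': 'cell01'},
    {'cidr': '10.176.0.0/24',   'network': 'servicenet', 'segment': 'cell01'},
    {'cidr': '203.0.113.64/26', 'network': 'public',     'segment': 'cell02'},
    {'cidr': '10.176.1.0/24',   'network': 'servicenet', 'segment': 'cell02'},
]
ALLOCATED = set()                                   # stand-in for the IP table
MAC_POOL = ('bc:76:4e:00:00:%02x' % i for i in range(1, 255))


def allocate_ports(segment):
    """Create the two ports every instance gets, scoped to its cell."""
    ports = []
    for net in ('public', 'servicenet'):
        subnet = next(s for s in SUBNETS
                      if s['network'] == net and s['segment'] == segment)
        ip = next(h for h in ipaddress.ip_network(subnet['cidr']).hosts()
                  if h not in ALLOCATED)
        ALLOCATED.add(ip)
        ports.append({'network': net, 'ip': str(ip),
                      'mac': next(MAC_POOL), 'segment': segment})
    return ports


print(allocate_ports('cell02'))   # two ports, both drawn from cell02's subnets
```

The real value is the invariant: a build scheduled into cell02 can never consume address space that belongs to cell01.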
I definitely invite you to go check out the GitHub repo and take a look; if it can help you solve some of your problems moving to Neutron, all the better.

All right. On top of our own plug-in for Neutron, we went through a lot of pain and struggle to end up at a point where we decided it was better for us to effectively wire our own vSwitches than to allow a controller to do it. Let's talk a bit about why. When we started the public cloud — or shortly thereafter, when we introduced software-defined networking — our controllers were doing everything. An instance would spin up, and the controller would give it a public IP address, give it its internal private IP address, allow for software-defined networking, and manage all the ports. It did all the things. It was great. And that worked pretty well up through the first few hundred hosts per region.

Then we started running into some weird problems. For one, we would actually grow faster than the product could support the number of ports we needed, so we reached these really interesting times where we were watching the capacity counter tick closer and closer to a hundred percent, and we physically could not add more servers until we could upgrade our controllers. So we were back and forth with the vendor — "hey, we need the upgrade"; "oh, we're still testing" — and there were a few of those rounds of tense moments. The upgrades themselves were impactful, meaning that the process of upgrading the controllers would actually introduce datapath disruption — hopefully for a brief period, but it's still this odd overlap where you're upgrading a control-plane item and introducing datapath interruption. And avoiding that is kind of the whole reason you run clouds, right? Then, unfortunately, several of those upgrades went bad, and what was going to be a long night anyway turned into a long morning.

And in general, any time the cluster had a failure — whatever the node count was, any time one node failed — the time it took to sync that much data between the others was on the order of two to three hours. It was slow, and on top of that we had no real insight into where it was in the sync process, or any idea who was actually affected at the time. We knew some number of instances were without flows at that moment — all their flows had been dropped — but I couldn't tell you "it's these over here and not those over there." The best gauge we had was that when all of our own IRC bouncers started coming back up, we knew we were making progress. That's just not where you want to be as a service provider.

So very clearly we had to do something different, and we actually went back to a model we'd used with our legacy product, where we did wire up the vSwitch ourselves. It wasn't as complicated back then — we didn't do as many things with the instances — but we did wire up our own vSwitches, so we said, okay, let's do it again. And that's what we did. Right now we run a set of scripts on every host that are involved in provisioning an instance. They get the information they need from Neutron, and the information they need from our controllers, but essentially they plug everything into the vSwitch as the instance comes up.

And it's scaled much more nicely. It's not perfect, but it's scaled much more nicely. The controllers now handle software-defined networking — the thing they're really good at.
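In miniature, the wiring job looks something like this. The bridge, VIF, MAC, and flow rules below are invented for illustration — the real scripts do considerably more — but it's all stock Open vSwitch tooling: plug the instance's virtual interface into the bridge and install the flows it needs.

```python
# A simplified sketch of host-side vSwitch wiring at instance provision
# time, shelling out to standard OVS commands.
import subprocess


def run(*cmd):
    print(' '.join(cmd))          # log what we're about to do
    subprocess.check_call(cmd)


def plug_vif(bridge, vif, mac, ofport):
    # Attach the instance's virtual interface to the bridge.
    run('ovs-vsctl', '--may-exist', 'add-port', bridge, vif)
    # Only allow traffic sourced from the MAC we allocated to this
    # instance; anything else arriving on that port is dropped.
    run('ovs-ofctl', 'add-flow', bridge,
        'priority=100,in_port=%d,dl_src=%s,actions=normal' % (ofport, mac))
    run('ovs-ofctl', 'add-flow', bridge,
        'priority=90,in_port=%d,actions=drop' % ofport)


plug_vif('xenbr0', 'vif5.0', 'bc:76:4e:00:00:01', 7)
```

Because the flows live only in the switch, they vanish if the vSwitch restarts — which is exactly the caveat discussed next.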
From time to time the controllers still have issues, and there is impact, but that impact is restricted to the people using the software-defined networks, and the recovery time is usually much faster, because the amount of data that has to get re-synced across the remaining nodes is much smaller.

There are a few caveats. Because, as I said, we have multiple flavor types, and multiple vendors within those flavor types — and in some cases slight variations from the same vendor in the same model of computer that we bought six months apart — we've had to figure out every one of those cases as we've gone along. One of our engineers, Andy, sitting over there, could probably tell you some of those gotchas in much more painful detail. This also adds an extra step: if for some reason you have to restart the vSwitch, you lose your flows, so there are extra things we have to do to make sure we effectively re-run that script for each instance to regenerate the flows. And from my perspective, if I think about fleet management generically and all the different bits I want to validate on a particular host — whether they're on the right version or not — this is one more thing. Now I've got the version of XenServer that's running, which patches are on it, what state the compute node is in, what version of the code is in the compute node, what version of OVS I'm running — and, oh, by the way, do I have the latest network scripts too? It's just one more thing to audit against on a host.

And it has definitely uncovered the edgiest of edge cases I've seen. I'll tell you a specific story. We had a case about a year and a half ago where, randomly, on a Friday night, a bunch of hosts in one of our small regions just started spontaneously rebooting. We were like, okay, not cool. Through the process of troubleshooting over the weekend, we isolated everything down to about three instances in the entire region that, as soon as they tried to access their CBS volume, would crash the host. We still weren't quite sure why. What we found out was that a bug in our network scripts had uncovered a bug in XenServer. Basically, the hosts in this particular cell had arrived in two different shipments, and there was a subtle difference in the way the vSwitch was wired on one versus the other, because of where we were in our evolution of things at the time, and we just hadn't caught it. With the wrong wiring on the new hosts, some traffic was effectively being put on the wrong virtual interface. That's not great, but it's not the end of the world. However, the bug in XenServer did not react well to that traffic being on the wrong interface, and that's what crashed the hosts.
At that point in time we were running roughly 2,000 to 2,500 instances in that region, so you're talking literally about a one-in-a-thousand problem that caused larger downstream effects. And that's just one of the edge cases we've found with this approach. So it's a great solution, but it has been tricky to get to. I will say that, in talking with a couple of the developers, we're working on a floating-IP solution, and their hope is that once that's done, we can spend a little time trying to push some of this network-script stuff up and out and let other people take a look at it. I can't promise a date, but I know people are thinking about it. So hopefully at some point in the future Quark will be out there, these scripts will be out there, and if anyone's crazy enough to try both of them, you can join us in the madness, right?

All right. Let's talk a little bit about some of the places where we don't use OpenStack projects, on purpose. We definitely love OpenStack for its core functions: building clouds — compute resources, storage resources, network resources, all those things. The reason we got involved is that we wanted help building really good software to do that. What we find is that for a lot of the other things — and there are plenty of projects submitted every day to Stackforge and other places, and lots of conversations on the mailing list along the lines of "I've got this thing I want to do in OpenStack that does this kind of monitoring," and those are great — there's also a lot of open-source software already out there that is well formed and good to go. So what we've decided to do, when it comes to a lot of the fleet-management components and the everyday stuff, is grab what's off the shelf and focus our OpenStack expertise on the core components that really matter. And in other areas we just went out and built something; someday we may push it up, or clean it up and hand it to other folks, but often the answer was, "no, we just want to go build a quick, dirty little service to do this one thing."

One concrete example where we've purposely made a choice — and it's no offense to the Ceilometer folks — is that we just found we didn't need to go down that road. The main thing we would use something like Ceilometer for is validating usage: we bill customers for the time they use their instance and the amount of bandwidth they use, and honestly, StackTach does that for us. It validates that we've collected all the usage events and sent them downstream, and when there's a problem it alerts us and we go fix it. It's also a great source of data if I'm trying to troubleshoot a specific instance and what it's gone through. I just went in randomly, grabbed one of the regions, and happened to catch it logging the start and stop of an instance image creation.
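The validation itself is conceptually simple. This isn't StackTach — just a toy version of the check it performs for us, with made-up instance IDs: every instance that ran during a billing window should have a matching compute.instance.exists event, and anything without one gets flagged.

```python
# Toy usage-audit: flag instances that ran during the window but never
# emitted the periodic "exists" event we bill from.
from datetime import date

running_instances = {'inst-a1', 'inst-b2', 'inst-c3'}        # from nova
exists_events = {('inst-a1', date(2015, 5, 18)),             # from the queue
                 ('inst-c3', date(2015, 5, 18))}


def missing_usage(day):
    return {i for i in running_instances if (i, day) not in exists_events}


print(missing_usage(date(2015, 5, 18)))   # {'inst-b2'} -> alert and go fix it
```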
There was a really good thread last week, I think, about CMDBs on the operators list, and I was debating heavily how I wanted to weigh in on it — I didn't get the chance because of the other shenanigans — but we realized early on that we needed solid inventory management, and it had to be really, really custom to us and how we do things. The reason why: on one hand, our infrastructure is treated as a very large managed-hosting customer of Rackspace. Rackspace started by taking physical boxes, sticking them in the data center, and leasing them to customers, so there are all these systems and tools inside the company designed to track and manage a physical host in a data center. All of our thousands of machines are in those systems, and we need a lot of that data to run the cloud. We also have two sets of Nova databases we need data out of. There are our production Nova databases — we want all the compute-node data: what's on, what's off, what's going on with them. And then there's the other database that goes along with that internal cloud I talked about, from which we want a lot of tenant information, because now all the control-plane stuff is tenants in that database. How do I know which ones I care about, what I want to know about them, and how do I pull that out? So those are the key data sources. And, just like everywhere else, some of the data we really cared about was in wiki pages, and some of it was in people's heads — meta-information that had nowhere else to live.

Out of all that was born an effort we call Galaxy. It's not even a data source so much as a data aggregator. On a regular basis it goes out and harvests the things we care about from all these other systems, puts them in one place as a bunch of key-value pairs, sticks an API on the front of it, and then lets us write all of our tooling to just query that: "tell me everything I need to know about this host," or "this cell." I took a screenshot and sanitized it a little, but you can get an idea of the data that's easily presentable with this. You're looking at a portion of our Performance 1 hardware, as we call it, in our Chicago data center. I had to hide the hostnames and all that, but you can see it shows you the hosts; it shows you the cab count — in that case there are five cabinets. If I clicked on that link, it would show me the list of cabinets; I could go down to a list of hosts; I could see which hosts not only had XenServer installed and were ready to go, but also which had a compute and which didn't, which were disabled and which weren't. Just a real quick way to see how things are going in the fleet.

We also started building some microservices to go along with this. We started with one called Resolver, and it does just that; it's a real simple service. We looked at our ops folks and the flood of alerts coming at them from tens of thousands of hosts every day, and we realized that 30% of them amount to "I just have to log in and restart this thing," or some other simple, repeatable task. So let's stop making a person worry about that; Resolver was born to do it. We've since expanded it to take any kind of message, so now we can drop things on a RabbitMQ queue and Resolver can react to them just like it can to a monitoring alert. After that we built another service called Auditor. It does what it says it's supposed to: given a set of rules, it looks at the fleet and says who doesn't follow the rules. What this shot shows is something we deployed about four or five weeks ago: a mechanism where Auditor detects a compute node that's not on the right code version — because, let's face it, if you have five or six thousand of these in one data center, there's a really good chance one of them doesn't have the right code on it; either it was down when you deployed, or whatever. Auditor is constantly crawling through the environment determining which ones are out of sync, and then it just drops a little message on the queue for Resolver, which goes out and runs a playbook that updates the code.
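Stitched together, that Auditor-to-Resolver handoff is essentially a queue consumer. A hedged sketch — the queue name, message shape, broker host, and playbook are all hypothetical:

```python
# A sketch of a Resolver-style worker: consume "compute out of sync"
# messages and run the remediation playbook against the named host.
import json
import subprocess

import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters('rabbit.internal.example.com'))
channel = connection.channel()
channel.queue_declare(queue='resolver.compute_out_of_sync', durable=True)


def on_message(ch, method, properties, body):
    task = json.loads(body)
    # e.g. {"host": "compute-0042", "expected_version": "2015.1.42"}
    subprocess.check_call(['ansible-playbook', 'update-compute.yml',
                           '--limit', task['host']])
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(queue='resolver.compute_out_of_sync',
                      on_message_callback=on_message)
channel.start_consuming()
```

The design point is that monitoring alerts and audit findings converge on the same queue, so one worker pattern covers both.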
So that's kind of the first one. This is all brand-new stuff — I know everyone's going to come ask, "when can I have it?", and we'll figure that out — but this is all stuff we've just really started diving into.

So where do we go from here? What are the next crazy things? These are the conversations we're having based on everything we just talked about. We're talking about things like: from the time a cabinet rolls into the data center and gets plugged in, can I get it from that state to taking customer instances without a person touching it? The reality is we have all the pieces already built; at this point it's literally just stitching them together, and that's one of the things we're talking about.

More self-healing. I mentioned earlier that when I think about a host on its own, there are six or seven items whose version or state I care about. Can I have more jobs out there auditing all of those — patch levels, those kinds of things? And if live migration is a thing, do I just shoot first and ask questions later? Should an operations person ever have to log into a host if it's not on all the current versions? Probably not. If it doesn't have the current versions of all the things: move everyone off, shoot it in the head, start over, test it, make sure it's good, pat it on the butt, and put it back into production. If it is on all the right things and it's still doing something weird — okay, let someone look at that, because that's a problem we don't understand. Let's go find out.

We want more live migration. We actually have orchestration built that will move everything around within a cell, so let's start thinking about future security patches: if we have the time, can we stage the patch and then play the shell game — move everyone off a host, patch and reboot it, move everyone else around? What we're finding, at least in our environment with live migration — this is without attached storage, by the way — is that there's a little bit of I/O blocking as it copies data between the two hosts.
The network scripts have been updated so that the packet loss on the actual switchover is something like three to seven packets. So in 99.99% of cases the instance freezes for a second and then comes back; a TCP session survives — you can be logged in, it'll move, and you're fine. Now, that does not yet work across my entire fleet, so don't think I'm out there like a mad scientist moving everyone around just yet; we have a lot of other upgrade work to do to support that. But this is the kind of stuff we're thinking about.

And then there's the one I like to joke about. I have all these tools that react to alerts, I'm building all this monitoring around capacity and utilization, and I have this system I have to log into every now and then to submit a request for more gear — gear that finance has already approved and has a plan for, but I still have to submit the request. So why can't the cloud order its own gear? That's just another level of the kind of weird thinking we have: if we're building all this stuff and it knows what's going on, and it's got an API to that system, let's just have DFW order its own cabinets. Whether to approve it is a finance decision, but the cloud can at least say, "hey, I need more." So that's just a few of the things we're starting to think about based on everything we've already done.

I think I have about four minutes left, so I'll take any questions — ideally from the microphone, but I'll try to repeat them if you don't go there.

Q: This may be a dumb question, but could you describe a bit what a cell is? It's a new concept.

A: Sure, happily. The way to think about it is that a cell is an internal construct that you, as an operator, use to organize your fleet. You might choose to do that based on hardware type. I don't think Sam's in here, but NeCTAR, for example, actually has a different cell in each geography, because of the way their particular cloud is funded: locations get their own funding, so they've grouped hosts in a region based on the source of funding, but they have one API over all of it. So there are different reasons like that. Like I said, it does offer you some separation of failure domains, and in a sense horizontal scaling of your database and your RabbitMQ servers and all that — because if you just kept piling RabbitMQ traffic for all your nodes into one RabbitMQ server, it would get a little overloaded at some point, right?
So that's really the core of it. What you do beyond that starts getting down into operator preference, and if you talk to the different groups that use cells, we all actually do things slightly differently, for different reasons. Some people group cells based on when they bought the hardware, but then have host aggregates that span multiple cells based on hardware type. Like I said, we'll not only put the same hardware profile but the same vendor into a cell, and then grow cells based on what the ports on our aggregation routers allow us to do: there are only so many ports on every router pair, so I know each cell will have X number of cabinets, a router pair will support so many cells, and I can plan my growth that way.

Q: Do you use local storage for your instances, and why or why not?

A: For most of them, yes. We do have boot-from-volume: both our general-purpose and I/O-optimized offerings support it alongside local storage, and we have compute- and memory-optimized flavors that are boot-from-volume only. Why local? That's just how we started, and the boot-from-volume options are only a few months old. We do have conversations — with respect to live migration and a lot of things, and the customer experience — about whether moving toward a volume-only model is a better way to go, but that implies a lot of cost and customer change. It was just the way we did it in our legacy product, and when we first jumped over to OpenStack we built the same thing. In a lot of cases that works great for people, and most of the hardware we're now ordering with local volume storage is at least SSD-backed, so it's still fast. Thank you.

Q: I think we have maybe one more. In the paradigm of the service-and-tenant cloud you've been moving toward — where all your controllers are VMs — could you talk about the management of the service cloud, the stability of it, whether you worry about the riskiness of that? And second, the HA of the service VMs within that context?

A: So, we worry a little bit about scheduling and all that. I think we've solved most of our scheduling worries by isolating capacity to ourselves: we know we have this capacity, it's for our control-plane items, and it's away from other internal users. And honestly, if the API for our internal cloud went down, it's not the end of the world, because we're not constantly scheduling new instances. From an HA perspective, the one place where we do HA right now is our databases, and it's not the best solution: it's a dual-master database pair using DRBD and Corosync. If one of them goes down, the other will basically fence it — it'll go into the API and shut it down until we know what's going on — and then we sync to some slaves for backups and those kinds of things. We are looking at Galera and all the other options; it's just one of those trade-offs where you only have so many fires and so many new things to do, so a better way to do database clustering is on our roadmap. That's probably the main place where we cluster. I'll be honest, our RabbitMQ version is old, and we need to look at a newer version and understand what we'll do from a clustering standpoint there — though we've been listening to a lot of people talk about how that's still not ideal, right?
So, yeah, I think for us it's going to be a matter of whether we can cluster the global RabbitMQ and live with the cell RabbitMQ servers being single nodes, because there's less to clean up if a cell rabbit falls over — but if the global rabbit falls over, it's not a fun day; it takes a while to sort that out.

Cool, I think that's time. If you have any questions, let me know afterwards. Thanks a lot.