Hello, everyone. Welcome to our talk. Today we're going to tell you how we moved our running OpenStack cloud into a new data center. This is a big talk and a big project, so some of the stuff we're going to talk about today is not super detailed. Craig and I would be happy to answer any questions you have about any part of it. Yeah, it's not advancing. There we go.

Just a brief intro. I'm Matt Fisher, a principal engineer at Time Warner Cable. With me today is Craig DeLat, a lead engineer at Time Warner Cable. Here's some contact info if you want to get ahold of us later. A really brief background on our OpenStack cloud: we run in two national data centers, with hundreds of nodes in each. It's a private cloud running business-critical applications internally for Time Warner Cable.

So when you saw this talk, your first question was probably: why would anyone do this crazy thing? The main reason is we were just out of space. We were out of compute capacity and out of storage capacity, especially for Ceph, and we had nowhere to plug new boxes into — the data center was full. We had a brand new data center literally across the street, and we wanted to keep everything co-located, so we decided to pack up the running cloud and move it across the street. The second good reason is that when we first stood up OpenStack, we had pieces of our environment shared between staging and production — specifically VLANs and hardware load balancers — so it was hard to make a change to one of those and know what was going to happen to production. The third reason is that in the new data center we were going to be allowed to own the whole environment, including the network switches. This will be important when we get to it later. Finally, we really needed to do a network redesign: we needed to improve resiliency if we lost a switch, for example, and we needed to get ready for some IPv6 work.

I want to set the difficulty level for our project; these parameters are all different at different companies. First, they don't let us in the data center. Any simple thing you might want to do, like moving a cable, we have to file a ticket for; it goes into a ticketing system, gets put in a list, and gets done when it gets done. Second, we were not allowed to touch the network switches in the current environment — this is going to be a recurring theme. Switch changes took days or weeks, required someone to be up at two o'clock in the morning to see them through, and they were painful. Next, we're a little special DevOps team at a large corporation, so we don't use some of the tools the rest of the company uses, and that causes friction sometimes when we want to move fast and their systems don't allow us to. Like any big company, we also don't get to set anyone else's priorities. We file a ticket with the data center team and say, we need you to check this cable, it's super high priority — but every other ticket they've got is also super high priority for the team that filed it. And lastly, probably most importantly, our customers consider their VMs pets. I mean seriously pets: they should be created, they should live for years and years and years, and you should never lose access to them. This was probably the primary reason we didn't just build a second cloud and tell people to move.

What you're seeing here is our first technical planning session. This is Vancouver, out on the balcony, if anyone was there.
Here we kind of finalized the network architecture and came up with a list of to-do items and things to investigate, and basically as soon as we got back we started on the project in the lab. The lab was finished in a few weeks. Then there were some delays afterwards: getting the production and staging hardware in and burned in, fighting through some vendor issues, and then some holidays. Our first staging move finally took place right before Thanksgiving, and production followed soon after. We all took a break for the holidays, then came back and did our final staging and production environments earlier this year. The final change for this project landed in February. So looking at May to February — a long project.

Thanks, Matt. So Matt covered why we were moving. After we got back from Vancouver, the mandate came down: hey guys, we're going to do this. Now knowing it was a reality, our first task was to figure out what we needed to accomplish and what we wanted to accomplish. These are two big lists, condensed down here. Our need-to-accomplish was changing the whole network layout, which allowed us to future-proof for IPv6 and some other additional features, and upgrading firmware — we had fallen way behind on firmware because, again, our customers' VMs are pets, not cattle. Then there was what we wanted to accomplish. We had a team member who really wanted burn-in testing, to get all of our hardware issues taken care of before anything was ever in production. And we wanted to fix the server hardware layout, which included changing NICs, changing the OSD-to-journal ratio on Ceph, and various other hardware components.

Okay, so the first part is the physical move. We worked out the whole order in which we were going to move our nodes. For each node, the first task is to evacuate it, of course. At that point we wipe the boot drive. We didn't want to risk anybody sitting there saying, why is this off? Let's turn it back on — and the next thing you know it's hosting VMs, and then it gets powered down by DCS for the actual move and takes that outage. The second step is to check that you really did wipe the boot drive, because accidents do happen. At that point we open our ticket with the instructions for our DCS team: they physically move the server, swap any hardware components, and recable it. They power it on; it can't boot anymore. We upgrade all the firmware and make sure everything is standard across the board. Then we do a Cobbler update to allow it to boot, join the cluster, and start hosting VMs.

Now, we did run into hurdles. Standing up infrastructure is very difficult in our environment — again, there's a lot of documentation, and you're blind, you can't go into the data center. One hurdle was firmware: we have a bunch of different vendors, a bunch of different cards, a bunch of different stuff, so we had to start standardizing on which firmware versions we should be on across the board. Then there's hardware config — again, standardization here makes everything easy: we can automate our builds and automate how things join the cluster. And then burn-in testing: we found a bunch of issues with cabling, we found bad DIMMs, we had other motherboard issues, and burn-in testing is what really sorted all of this out for us.
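A minimal sketch of the per-node wipe-and-rebuild flow Craig just walked through. The boot drive path and the Cobbler system name are hypothetical; the commands are the generic Cobbler/coreutils ones, not TWC's actual tooling:

```bash
# Before the move: evacuate the node, then clobber the boot drive so nobody
# can power it back on and have it start hosting VMs again.
dd if=/dev/zero of=/dev/sda bs=1M count=512 && sync

# Step two: verify the wipe actually happened -- accidents do happen.
dd if=/dev/sda bs=1M count=1 2>/dev/null | hexdump -C | head -n 3

# After the physical move and firmware upgrades: re-enable netboot in
# Cobbler so the node PXE-builds itself and rejoins the cluster.
cobbler system edit --name=compute-42 --netboot-enabled=true
cobbler sync
```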
Okay, so this was the comment I made to management when I got assigned this task. We have a really great story for automating servers and OpenStack — I'm sure everyone here has that. But what about the rest? Because you can't just run OpenStack on a server; you need the other parts of your environment to be automated too. So we designed for this project — I mean, we didn't originally design for this project, but our design allowed us to do this project. We have full server build automation with PXE, Cobbler, and Puppet: we just turn the box on, configure it in Cobbler, it comes up, joins the cluster, and everything's happy. We also invested a lot of time automating our hardware load balancer solution using Ansible. This eliminated some manual steps, and I'll cover that in a minute. Same with the network switches — I've already mentioned the pain of network switch configuration, so we automated that with Ansible too. We also built some API quiescing with xinetd. This gives us a friendly sort of maintenance mode for API services, so that we can power off a box after quiescing all its connections. And we spent a lot of time investigating, designing, and tooling for guest VM live migration and for virtual routers.

Okay, let's dive in a little more. First, load balancers. HAProxy for software load balancing — that's easy: just use Puppet and it works great. Hardware load balancers without automation are super, super painful. Say I want to move a server and it has a new IP address. I have to call our A10 guy, Kevin, and say, hey Kevin, I need you to log in to all the A10 boxes and change the IP addresses. And then the validation is something like: hey, the GUI shows a green dot now instead of a red dot. With the automation that Kevin and the rest of the team wrote, we now just do a deploy. The deploy updates the A10 and does post-deploy validation, so you know everything is good. The net result is not just speed — it means that when we did some of this at 3 o'clock in the morning, Kevin could sleep, which I think he appreciated.

Network switches were a huge win for us. Before automation, a change required approval from three separate teams: one team wrote the config and put it on the wiki, and another team took the config from the wiki and pasted it into the switch. You're looking at days or weeks here, depending on where you are on the priority list. With the system we have now, we can deploy updates with Ansible and Jenkins — but better yet, we're doing all the switch config work in Gerrit. So if we want to move a box from one VLAN to another, we propose it. The same original teams don't lose control; they still have +2 on it. It gets merged in, and then we deploy with Ansible. Getting the network engineering teams using Gerrit was probably one of the bigger wins of this project.

Caveats: you may buy a very expensive piece of hardware and find out that the API, and the documentation for the API, might be lacking. And if that's the case, you're for sure also lacking Ansible and Puppet support — I believe we wrote our own Ansible modules for this project. And if you go ask your vendor rep, "we know how many bits your box can push, but what's your automation story?" — you're probably the first person who's ever asked them that. The good news is vendor reps will usually buy you a beer if they can't answer your questions, so give it a shot.
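To make the Gerrit-plus-Ansible switch workflow concrete, here is a hypothetical sketch of what such a deploy playbook might look like. The inventory group, template layout, and push command are all assumptions — the team wrote their own vendor modules, so this only shows the general shape:

```yaml
# deploy-switch-config.yml -- illustrative only; all names are made up.
- hosts: tor_switches
  gather_facts: false
  connection: local
  tasks:
    - name: Render the Gerrit-reviewed config for each switch
      template:
        src: "switch-configs/{{ inventory_hostname }}.j2"
        dest: "/tmp/{{ inventory_hostname }}.cfg"

    - name: Push the rendered config to the switch
      # Stand-in for their custom Ansible module; newer Ansible ships
      # vendor network modules that fill this role.
      command: >
        /usr/local/bin/push-switch-config
        --host {{ inventory_hostname }}
        --config /tmp/{{ inventory_hostname }}.cfg
```

In the workflow described above, Jenkins would run a playbook like this only after the change had collected its +2s in Gerrit.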
Okay, let's talk about the actual move. Before we started, we had a goal: let's not disrupt customers with this. Let's pretend it's a normal weekly deploy, which might cause the APIs to bounce a little, and as far as customers are concerned, we just live-migrate their VMs and hopefully they notice nothing else. I think we mostly met this goal, with a couple of exceptions that Craig will go into later.

The general process for a node is pretty simple. Drop the DNS TTL — we're going to reuse the hostname, so we need that lowered. Evacuate the node: if it's hosting VMs or routers, you've got to move those off; if it's an API node, quiesce the traffic — I'll cover that on the next slide. Wipe the drive, as Craig mentioned. Power off the box. Physically move it across the street to the other data center. I have a star here because I was in charge of the control plane and I had a cheat code: I had new hardware in the new data center for the control plane, so I didn't have to move my stuff. Compute and storage definitely were physically moving things. Then update the DNS record to the new IP address. Boot the box with PXE — that's all automated. Test the new node — that's super important, and I'll go into why later. Then, once you know the node is good, update the load balancer config so it joins the cluster. If it's Nova compute, re-enable nova-compute so it can host VMs. And if it's Ceph, allow it to join the Ceph storage cluster.

Okay, traffic quiescing — I've probably mentioned it three times already. What is this? We added a special health check port for all API services, including MySQL and Rabbit, not just things like Cinder. Basically, you go to the box and drop a file on the file system. This tells the load balancers — both the A10s and HAProxy — don't allow new API connections to this box, it's in maintenance mode. You can then go to those load balancers and watch the connections fall off, and when the connections are clear, the box is safe to power off, move, reboot, whatever. We originally designed this for deployments and regular maintenance, but it was super, super useful for this project. (There's a sketch of the health-check mechanism below.)

Okay, actual nodes. This was the order we moved nodes in the first time: the first to go was the Puppet master and the rest of the control plane; storage went in parallel once most of the control plane was done; and we held compute to the end. When we did this, we ran into a couple of problems which Craig's going to cover, so our second move looked more like this: storage first, then the control plane, then compute. The point of this slide is not that this order applies to you — it's that your plans may change as you run through them and try new things. We were pretty flexible with this.

Okay, specifics. The first box to go was the Puppet master, which was also the build server — it built our boxes. So we had to install this one off an ISO, kind of bootstrap it, and then we did what I call a brain transplant. We wrote a bunch of Ansible playbooks to, put simply, tar up everything on the old Puppet master that was Puppet-related, push it across, and then switch the identities of the boxes. This meant we didn't have to redo any Puppet certs or deal with any of that confusion. The downside of this going first is that afterwards we couldn't automatically PXE-boot anything in the old data center. So this was a bridge we crossed and couldn't go back over. We did develop manual processes for doing that in an emergency, but we ended up not using them, which is good.
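Here's a minimal sketch of the quiescing health check described above. It follows the same well-known pattern as Percona's clustercheck script — xinetd serves a tiny HTTP response on a side port, and the load balancer health-checks that port — but the port number, flag file, and backend names are illustrative, not TWC's actual configuration:

```
# /etc/xinetd.d/api-healthcheck -- xinetd runs the script per connection
# and sends its stdout back to the caller.
service api-healthcheck
{
    type        = UNLISTED
    port        = 9200
    socket_type = stream
    protocol    = tcp
    wait        = no
    user        = nobody
    server      = /usr/local/bin/api-healthcheck.sh
    disable     = no
}
```

```bash
#!/bin/bash
# /usr/local/bin/api-healthcheck.sh -- dropping the flag file puts the box
# in "maintenance mode"; the load balancer stops sending it new traffic.
if [ -f /etc/api-maintenance ]; then
    printf "HTTP/1.1 503 Service Unavailable\r\nContent-Length: 0\r\n\r\n"
else
    printf "HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"
fi
```

```
# haproxy.cfg fragment -- health-check the side port, not the API port.
backend keystone_api
    option httpchk
    server api01 10.0.0.11:5000 check port 9200
    server api02 10.0.0.12:5000 check port 9200
```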
HAProxy load balancers were next. This is a simplified view of what it looked like: everything's in the old data center; API calls come in and hit a DNS record; the DNS record points to a VIP hosted on one of the HAProxy pairs, which proxies calls on to the API services. The first step is that we had a backup HAProxy node not doing anything, so that's what we moved. But the new data center has a new subnet, so we had to have a new VIP. Step two looks like this: customers are still making API calls to the old data center, but we have a new VIP running on the new HAProxy node, and we did extensive testing here. When we first did this, we found the new HAProxy node on the new VLAN couldn't talk to certain ports on the API services. We caught that in staging and got it fixed — and that's why at every step along the way we tested the new boxes before we integrated them back into our cloud. The next step is simple: just move the DNS record, and everything should work. That's true for short-lived API calls, but it's not true for long-lived ones — by which I mean RabbitMQ connections from API services. Those connections basically stay around forever. We waited an entire day and didn't see the connection count falling off that original VIP, and we had to move forward, so at that point we powered off the first VIP. If you're going to do this, you need to know how your cloud is going to react. We knew we had specific services that don't react well to their Rabbit connection going away, even in 2016 — Designate and Heat are the two that stick out in my mind. So we had Ansible scripting in place to bounce anything that didn't reconnect to Rabbit at this step. The final state is basically the original state, except the load balancers went first, so they were talking back across the cross-data-center link we had; the load balancers were done at this point.

Keystone was next. The Keystone boxes also host Horizon. This was my favorite box to move, because it was really easy and had no customer impact. Unlike other things like Rabbit, Keystone connections are very fast, so if you quiesce the traffic, within two or three minutes there are no more API calls in flight to the node. Once the traffic is clear off the box, you stop all the services, power it off, rebuild it across the street, and then you do testing. Another key thing here: before you add the new box back to your API cluster, you need to make sure it works. On the first box we did, all the tests passed — except it was on a new VLAN and couldn't talk to Active Directory. Go file a ticket, wait for that to get fixed, and make sure you haven't added it to the cluster first, or one third of your API calls are now going to fail.

Control nodes were next — my least favorite thing to move. Control nodes at the time hosted virtual routers, and virtual routers are the most customer-impactful part of this process. We have customers that have one router with 40 FIPs. They have one router for their entire project, and when that router is offline they lose every single VM, and they tend to get mad. So this had to be done at night, but even then I think we had more impact than we should have. Don't evacuate all the routers at once. We did this. It was a bad idea, and we no longer do it. If you evacuate a node full of 100 routers, it takes Neutron and OVS a long time to rebuild all the routers and all the flows — you're looking at 10-15 minutes in some cases. The better way is to do them one at a time, checking status in between, and that's what we do now.
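A sketch of that one-router-at-a-time evacuation, using the Kilo-era neutron CLI. The agent IDs are placeholders, and the status check is a simplification of whatever verification your environment actually needs:

```bash
#!/bin/bash
# Move routers off the old L3 agent one at a time, checking in between.
OLD_AGENT=$1   # UUID of the L3 agent being drained
NEW_AGENT=$2   # UUID of the L3 agent taking over

for ROUTER in $(neutron router-list-on-l3-agent "$OLD_AGENT" -f value -c id); do
    neutron l3-agent-router-remove "$OLD_AGENT" "$ROUTER"
    neutron l3-agent-router-add    "$NEW_AGENT" "$ROUTER"
    # Give Neutron/OVS time to rebuild the router and its flows before
    # touching the next one.
    until [ "$(neutron router-show "$ROUTER" -f value -c status)" = "ACTIVE" ]; do
        sleep 5
    done
done
```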
Also, after this adventure we no longer host routers on control nodes — we have colleagues giving a talk on that tomorrow, I think in the afternoon.

Okay, so we got all the routers off the node. Now we're ready to deal with the API services. API services are great — except RabbitMQ. Again, we can quiesce connections on this node. You do occasionally have, say, a Cinder or Glance connection that lasts a long time, so after 10 minutes we basically said: it's 2 o'clock in the morning, so I'm sorry, but I'm interrupting your API call. It was in the maintenance window. But RabbitMQ is a special snowflake, of course. The goal was to drop the RabbitMQ connections off this node and move them to the other two. The first thing we did: we'd already quiesced OpenStack on this node, so stop it. But we still had OpenStack services on other nodes talking to this node through HAProxy, so we had to bounce those. Then we had to bounce nova-compute, which is a fairly harmless operation. And even then you're still going to have connections inbound to Rabbit — other things, like OVS, talk to Rabbit, and you don't restart OVS. The one good thing is that OVS actually responds really well to Rabbit going away; it's one of the best-behaved services in that respect. So you've done what you can to minimize the impact. Stop Rabbit. Stop MySQL. See what blows up. You might have to restart things again — Heat, for some reason, likes to not reconnect to Rabbit. Power off the box, do all the same testing, make sure the Rabbit cluster is good, make sure the MySQL cluster is good, et cetera, then join it back to the cluster.

Compute nodes. The plan was pretty simple: get as much spare hardware as you can into the new data center, evacuate as many nodes as you can from the old data center to the new one, and repeat. In practice this took four days, just because of the amount of spare hardware we had and the number of nodes that could be physically moved and cabled at once. We invested a lot in our Ansible tooling for this. One of the things we do in our playbooks is use a canary VM: before moving any customer workloads to a new machine, you put the canary VM over there and do a connectivity check. We have found that tenant traffic on a new machine can break — maybe something's misconfigured in the bond or on the switch and you can't talk to it — and you don't want to find that out with a customer VM. You want to find it out with your VM. But live migration is not guaranteed to work, as anyone who's done it will tell you. The first thing we found is that if you do a whole lot at once, it's a lot more likely to fail — and if you have 20 in flight and the fifth one fails, everything after that goes to an error state. I don't know if that's been fixed; it was certainly true in Kilo. So we limit our parallelization with Ansible. We started with one VM at a time; that took way too long, so I think we're up to five, and it's been okay. But it might be something you want to tune based on your customers and your networks. Finally, bigger and busier VMs may never live-migrate. That means you've got to call the customer and tell them: we need you to shut down your box so we can do a cold migration. So that just adds a little bit of delay.
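Roughly, the drain playbook has this shape — one node at a time, a canary first, at most five live migrations in flight. The inventory names and the canary helper are hypothetical; only the nova commands are the real client CLI:

```yaml
# drain-computes.yml -- illustrative sketch, not the production playbook.
- hosts: old_computes
  serial: 1                     # drain one compute node at a time
  vars:
    max_in_flight: 5            # started at 1; 5 has been okay for us
  tasks:
    - name: Check the target side with a canary VM first
      # Hypothetical helper: boots a throwaway VM on the new node and
      # verifies tenant connectivity before any customer VM moves.
      command: /usr/local/bin/canary-check --target {{ migration_target }}
      delegate_to: localhost

    - name: Stop the scheduler placing new VMs on the source node
      command: nova service-disable {{ inventory_hostname }} nova-compute
      delegate_to: localhost

    - name: Live-migrate VMs off, capping parallelism
      # In Kilo, queuing everything at once meant one failure could push
      # the rest into ERROR, hence the -P cap.
      shell: >
        nova list --all-tenants --host {{ inventory_hostname }} --minimal
        | awk '$2 ~ /^[0-9a-f-]{36}$/ {print $2}'
        | xargs -r -P {{ max_in_flight }} -n 1 nova live-migration
      delegate_to: localhost
```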
Okay, I'll let Craig tell you about storage. So, what about all the storage? We're using Ceph and Swift, and we have all of our boot volumes on Ceph, and as you move everything across two different environments you're bound to run into issues; we're going to talk through some of these.

All right, Swift was the easiest part. We have a guy on our team doing puppet-swift work, and basically the whole plan here is: wipe the boot drive, power it off, and do not touch any other drives. We physically move it, it powers on, Swift gets installed, it does its Swifty things and realizes all these drives have object stores on them, and it just figures out what it needs to replicate. The bad part is that it does this all at once — if you bring up a node, it's going to try to move data back and forth for all the object stores at the same time. The other issue we ran into is that as we built out our new control plane, we had different network ranges, so we had to set up a lot of static routes, including static routes across the WAN to our other data center.

The most difficult piece was the Ceph mons. Ceph is reliant on its monitors — if anybody here isn't familiar with them, without the mons Ceph just does not work. The first thing is, we knew the IP addresses were changing, and we didn't want our instances to have issues connecting to their drives. So we tried to virtualize the mon IP address. Ceph does not like this — do not try that. It basically says this is a security breach, I'm not going to validate this request, and you're dead in the water again. So we came up with a plan and tested it through multiple scenarios. The first finding, for the instance's boot drive: to pick up the new mon addresses, you have to do a nova stop and start, or a nova resize, or the equivalent of a nova cold migration. Nova live migration does not update them — it keeps the same addresses — and nova reboot did not update them either. So you actually have to play with this process; I'm sure things keep getting fixed as Ceph grows and the Nova patches keep coming along. The other finding was about attached volumes: the nova stop/start and resize didn't touch those at all, so you have to do a live migration to update them — which is easy enough, because we had to migrate everything from the old environment to the new one anyway.

Okay, Ceph OSDs — this is the majority of our storage. The plan here is that you have to evacuate all the storage: weight out all the OSDs, and do it in an orderly fashion. We were lucky enough to have a handful of nodes on the other side, so we could bring up some new OSD nodes, slowly weight them in, weight the old ones out, and then keep following that process as we went — making sure you remove the old nodes from the CRUSH map and totally wipe them, because you want to be able to reuse the hardware, and when it comes back with new IP addresses you don't want to confuse Ceph. Then, when a node is done, we power it off, physically move it, bring it back up, and just keep repeating at a slow pace. If you try to move too fast, you're going to have some definite pain points.
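For the OSD drain, the Ceph CLI sequence looks roughly like this — OSD 42 is a placeholder, and the service command varies by release, so treat it as a sketch of the ordering rather than a recipe:

```bash
OSD=42   # placeholder OSD id

# Weight the OSD out of the CRUSH map and let backfill finish before
# going any further -- this is the "slow pace" part.
ceph osd crush reweight osd.$OSD 0.0
until ceph health | grep -q HEALTH_OK; do sleep 60; done

# Once it holds no data, remove every trace of it, so the rebuilt node
# coming back with new IP addresses can't confuse Ceph.
ceph osd out $OSD
service ceph stop osd.$OSD      # or: systemctl stop ceph-osd@$OSD
ceph osd crush remove osd.$OSD
ceph auth del osd.$OSD
ceph osd rm $OSD
```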
You can see the pain in this graph: we should have 20 gigabit of throughput available, and we're maxed out at 10 gigabit. We kept going back to our network team saying, hey, there's a bottleneck, and they kept saying, no, you have 40 gigabit on one side and 20 gigabit on the other. After three or four days of us insisting there was a bottleneck, they said: oh yeah, you're limited to 10 gig in between — so it kind of matches up. Then we had to figure out how to truly demonstrate this in a way everybody would understand, so we searched the internet and found this animation. All your services — your Ceph, your Cinder, your Swift, your Nova, and Rabbit — are all fighting through that one pipe. As you exhaust the pipe, all your services are just going to stop and start failing, so you really have to take a staged and careful approach to migrating your data.

Okay, so there will be problems whenever you try to do this; let's go down the list. Networking: ACLs and firewalls — as you add new networks and remove old ones, you've got to make sure everything is the same across the board, and this is definitely difficult when you're navigating multiple network teams and everything else. Incorrect cabling: again, two different qualities of service from different data center people — some cabling was a hundred percent, some was not a hundred percent. And bottlenecks, as demonstrated in the animation. With software, we ran into some VTEP issues, which I'm going to get into, and we had a keepalived problem, which I'll also get into a little deeper.

Okay, vendors. We had a lot of different bugs with vendors, whether in firmware or in software. Our deployment process is very strict: we deploy one region, then we deploy the next, one after the other. When we were doing this move and moved that brain across to the new side, we ended up with one region on one deployment and the other region still lagging behind. And customers: getting customer buy-in is definitely a hard thing to do. You have to have a reward for them for scheduling the downtime of their app, especially when it's a customer-facing or money-making application. With our vendors, our biggest problem was firmware. We found issues with PXE booting — we couldn't PXE-boot any of our boxes, which is a huge problem when you're trying to build out your new side. We had bonding and LACP issues, and VLAN issues, that really made life difficult. And space: not only did we have to build out new racks, we had to build out a new data hall in a brand new data center, which we started building out as soon as we got occupancy.

Okay, so there are actual versus perceived issues. Actual issues: the VTEP overlap. Basically, when we migrated Nova compute over, we had two different VTEPs in the same Nova compute zone across two different network segments, so we ended up with something like a duplicate IP address situation — an instance would all of a sudden, intermittently, just stop responding. We also accidentally upgraded OVS — this goes back to running two different deployments across our regions, and it gets kind of scary when you're running in that situation. Then there's the file descriptor limit for QEMU. On Ceph, QEMU needs a file descriptor for every OSD it's touching, and as you grow your OSD count you're going to hit the limit — to the point where you can't even attach a volume to a running instance. That's a big issue if you're running Ceph like we are.
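The fix has two halves: raising max_files in libvirt's qemu.conf for new instances, and — as comes up in the Q&A later — using prlimit to bump an already-running QEMU while the config change works its way out through Puppet. The value and the instance name below are illustrative:

```bash
# /etc/libvirt/qemu.conf -- raise the per-QEMU file descriptor cap
# (the default they mention later is around 1024):
#   max_files = 32768

# One-off bump for a running instance that's out of descriptors:
prlimit --pid "$(pgrep -f instance-000002ab)" --nofile=32768:32768
```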
On the perceived side: a customer calls in and says, hey, we have high latency — it's your storage, it's your networking. And we're like, well, it can't be, we're not showing any latency. It turns out they'd actually launched a new ad campaign, which ended up with six million customers hitting an instance that wasn't scaled out for that.

Okay, thanks. So if you're going to do something like this — move to a new data center or a new data hall — here's what I think you ought to consider. First, our cloud has a lot of interdependencies between services and whatnot. I was in charge of developing this plan, and I should have known better, but I forgot we were running caching DNS on the load balancers. So the first time we did this in our lab, I said, hey look, we moved the load balancers and everything's fine — and then every single Icinga check turned red, because nobody could talk to DNS anymore. The plan was adjusted for that. You may have little things like that you've forgotten about; you'll need to go remember what they are. Next, our plan relied heavily on reusing hostnames and swapping IP addresses, but there are things that still use raw IP addresses: the Galera MySQL config is one, the Ceph OSD and mon maps are another, and HAProxy is the third (examples of what I mean are sketched below). So any time we wanted to change those in the middle of this process, we had to do a deploy to push the new IP information out to the services. The other thing is, you need to ask yourself which resources in your company are protected by VLAN-specific ACLs. When we first stood up the new data center, nobody could talk to the data center DNS servers, so nothing worked — that means filing a ticket and a delay of several days. I already mentioned the Keystone nodes not being able to talk to Active Directory; that was also a delay of several days. If you have those enumerated and can test them beforehand, you'll save yourself some time. Another thing: have good maintenance plans in place for all your nodes, all in automation if possible — that really helps in developing something like this. Next: don't over-communicate with your customers. You need to tell them what you're doing, but they get really nervous if you tell them, say, your crazy plan for moving HAProxy. So you tell them something like: we're doing this, and it will improve resiliency and scalability — they like to hear those things, so that's what I recommend you tell them. Don't get aggressive with your timeline. If you're going to promise something to a senior VP and it's going to be on your annual goals, you might not hit it, and it might not be your fault, because of all the friction you're going to run into outside your team, even if you move fast. So be cautious. And practice, practice, practice — that's probably obvious. The first time we did production was our third run-through of this move, not our first, and at every step along the way we modified the plan. We improved it in terms of reducing customer downtime and in terms of making it faster, especially for the parts that took place at two and three o'clock in the morning.
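For the record, these are the kinds of IP-pinned settings meant above — Galera, Ceph, and HAProxy fragments with illustrative addresses; each one forced a deploy whenever a host's IP changed mid-move:

```ini
# my.cnf (Galera): the cluster address list is raw IPs.
wsrep_cluster_address = gcomm://10.10.1.11,10.10.1.12,10.10.1.13

# ceph.conf: the mon map is bootstrapped from these.
mon_host = 10.10.2.21,10.10.2.22,10.10.2.23

# haproxy.cfg: backends are pinned to server IPs.
#   server keystone01 10.10.3.31:5000 check port 9200
```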
Okay, this was a super long talk and we touched on a million things, so hopefully you took a little something away from it. My key takeaways: if your cloud is designed for maintenance, and maintenance is planned for, then something like this is possible — moving a data center across the street. If you are going to do this, plan well, and expect that different parts of your organization move at different speeds. Where possible, work ahead: we had a problem one night with control nodes not being able to move, but we went ahead and moved other parts of the cloud, even though that wasn't the original plan. Finally, as I just mentioned, you need to practice this. You need to work out all the bugs in your plan and see what you can do to reduce customer impact. And I do think we have a couple of minutes for questions. There are mics in the middle.

First question: can you describe your Rabbit configuration — do you run it in a cluster? Yes, we have a three-node Rabbit cluster per region, fronted by HAProxy. Are the Rabbit instances all independent, or connected in a cluster? They're clustered, by Rabbit's own means. And does it have persistent queues or not? Durable — I'm not the Rabbit guy, but we're running durable queues. The follow-up: that's usually where all the pain is with Rabbit, and you kind of skipped it. Yeah — Rabbit's interesting. There are a lot of opinions about not running it through HAProxy, but we actually found it worked better to go through HAProxy than to do what older versions of our code did, which was to specify every box in the config. And on one of the slides I kind of glossed over it: we had a keepalived issue, and that was the biggest pain point with Rabbit and all of that. Basically, the keepalived processes between our two HAProxy nodes got into a weird state and we went split-brain. That caused Rabbit clients to not know which node they were supposed to be using — and we run Cinder not in HA, but spread across all our control nodes, so different things like that got really confused. But we're not doing any persistent queues that I'm aware of; I think our whole setting is HA durable, and as a service stops or dies, its queue goes away and a new one gets built. Thank you.

Next, a two-part question. First: you mentioned there was a change in the network architecture — could you touch on that a little? Craig would love to answer that. Okay, so I'm actually a storage guy, not a network guy, but: we went from end-of-row to top-of-rack, so every two racks had a switch pair, basically redundant to each other, that we cabled up to, and above that it went to a leaf-spine design. Again, it's partly future-proofing for IPv6, and we should be able to take down a rack pair — or a rack — during switch upgrades and really not affect anything else, so we're building in a lot more resiliency. And the second part: you also mentioned that due to the delays in getting networking fixed, you took over the networking pieces. I know it's an organizational question, but to what level have you taken over managing the network infrastructure?
So we definitely do all the network config for the automation I told you about, using Gerrit and Jenkins and whatnot, but we politically cannot take over the other teams' jobs. The way it works is: if we want a change, it requires three +2s, one from each team that has a say in it. And even though the deploy is just automation — literally a button press in Jenkins — it's done by another group. That's way better than it was, and it's about as good as we could get it.

Next question: can you go a little deeper on the network architecture before and after the move, and whether it affected live-migrated instances — once instances were live-migrated, were they affected by the network architecture being different in any way? No, I don't think so, not to my knowledge. You want to answer? So, the team we have doing Neutron basically extended the old IP block across, so anything that already had a floating IP address was automatically still able to use it. It's really the physical layer that changed a lot, and how things were routed — for our end users, they didn't have to get a new FIP or anything like that. It also allowed us to add in additional resources, like more floating IPs and other things. Yeah, I think a super big part of this was being able to keep the old floating IP block when we moved — without that, it's really disruptive, and at that point you may as well tell people to go use a new cloud or something. Thank you.

Next question: hi, you mentioned hitting QEMU file descriptor limits for Ceph, so I'm assuming that's on your compute nodes — how many VMs do you typically run where you were seeing those? Hundreds? Fifty? So, it's definitely a lot — we have a lot of compute nodes, and I'm not going to get into specific numbers. I will tell you that it's based on the number of PGs allocated per volume and how many OSDs all those PGs are spread across. We didn't have this issue to begin with; we ran into it as soon as we were in the 800-to-900 OSD range, and we had a couple of people using nine-terabyte volumes and things like that. Whenever you attach more volumes, it opens up more sessions. A lot of our customers just use the instance disk — they don't necessarily attach storage — so it's not bad for them. It's the ones doing attached storage, with three to five attached volumes, and big ones, who saw the pain the worst. So it wasn't so much how many VMs you run on your compute hosts? Right — it's a libvirt setting; it's set to 1024, I believe, or something like that, in the config file, and you can just tweak that libvirt config and raise it. Yeah, and we used prlimit to troubleshoot those instances until we could get a change through Puppet. All right, cool, thanks guys.

Last question: could you quickly describe your practice environment? We have multiple. We have a virtual development environment — I won't quite say it's OpenStack on OpenStack, but it's literally our cloud, configured and deployed our way, running in virtual machines. We use that environment for things like: how do I move one Puppet server to the other and not run into cert issues and whatnot?
Once we can't do anything more in there, we have a hardware development environment in the lab — most everything in there is real hardware, six compute nodes and a control plane. Then we have a staging environment hosted in the actual data center — the kind we can't get into — and that's where we'd practice. It's a lot smaller than production in terms of scale, like one-hundredth scale, but it was a way for us to rehearse the plan, figure out how we were going to monitor the VMs, and see what potential customer downtime we would have. And then we go to production. At the time, we chose to do the lab first, then one staging, then one production, and the thinking there was that we still wanted to have one staging environment on the old architecture in case we needed to diagnose a problem there. Yeah, thanks.