Hello everyone and welcome back to OpenInfra Live, the Open Infrastructure Foundation's live show. A few weeks ago we hosted the OpenInfra Live Keynotes, a special edition with key speakers and announcements. We took a short break after that, but we are now back to airing every Thursday at 15 UTC. This show is made possible thanks to the OpenInfra Foundation member organizations, so thank you all again for making it possible. One recurring episode on the show has been the large scale OpenStack show, organized by the OpenStack Large Scale SIG. We invite operators of large scale deployments, get them to present how they solve a given operations challenge, and have them discuss the different approaches among themselves. For today's episode we decided to discuss operators' tips and tricks. Every operator has custom tools and tricks that they use to keep their OpenStack clusters ticking, and today they will share them. So our guests today are Adrien Pensart, Site Reliability Engineer at OVHcloud, assisted by Arnaud Morin; Gene Kuo, Infrastructure Software Engineer at LINE; Shatadru Bandyopadhyay, DevOps Engineer at Workday; Axel Jaquet and Thomas Goirand, Cloud Administrators at Infomaniak; and Belmiro Moreira, Cloud Architect at CERN, who will drive this discussion. As I mentioned, this is a live show, so feel free to drop comments and questions into the comments section throughout the show, and we'll try to answer as many of them as we can, live. So it's time to get started; take it away, Belmiro. So, hello everyone. I'm Belmiro, I'm a computing engineer at CERN, and today I will be your host. I'm really excited for today's episode. We will present and discuss different tools and tricks that OpenStack operators have developed to help in their day-to-day operations. And without further ado, let's start with Axel and Thomas from Infomaniak. They will give us a cool demo of the Neutron tooling that they are developing. Go ahead. Hello. Thanks for having us. Hello, everyone. So before we start, a quick introduction of Infomaniak itself. Okay, so Infomaniak has been using OpenStack since the Grizzly release, so that's 2013, if I'm not mistaken. We're a fully Swiss company, completely independent from big financial systems. We are nearly reaching 200 employees. Since we're in Switzerland, we also have 40% of our income from Europe. And Infomaniak has been providing hosting services for the last 25 years. We operate two data centers and we are currently building a third one. And we have a public cloud that has been open since last summer. We have implemented the basics of OpenStack: Keystone, Barbican, Gnocchi, CloudKitty, Aodh, Neutron, Heat, Glance, Swift, Octavia, Nova, Cinder. We will implement more things like Magnum and Designate and Manila, and everything we can, if possible. We are very happy to provide that on top-notch, very recent hardware, with NVMe for Ceph and the latest AMD CPUs. And we are also very happy to provide that at the cheapest price on the planet. So today we are going to present two tools that we use for everyday operations, two tools that Axel wrote. The first one is a tool to manage HA virtual router failover. For a little bit of context: if you want to empty a network node, for a maintenance for example, you don't want to break the active TCP connections of your customers. So the simplest way to do that is to run conntrackd, one instance per HA router, on the network nodes.
And this tool acts on the network node, in the namespace of the virtual router. So the solution is to run a conntrackd instance inside the HA router namespace. I can show a little demonstration of the tool, and I will explain all the steps afterwards. So what you see here: on the right, there are two HA routers. On the left, you can see iperf running on two VMs. And we just trigger the HA router failover. As you can see on the left side, the TCP throughput drops to zero bytes per second, but the TCP connection is actually not dropped. And it's restored. And as you can see, it has failed over. So the time to get the connection back is short: it takes about two seconds for keepalived to move the HA VIP, and five or six seconds for conntrack to let the connection between the two servers resume. So the implementation is basically simple. At the beginning, we need to find the network nodes on which the routers are located. After that, we need some information about each HA router: the HA interface name and the IP on the HA network. With all of this information about the virtual router, we can build the conntrackd configuration, create it, and transmit it to the network node. And after we create the conntrackd configuration file with the HA IP and the HA interface, we can write the keepalived notify script; the role of this script is to handle the switch between active and passive at the keepalived level on the network node. After that, we can start conntrackd for each HA router and we can trigger the failover. So we shut down the interface of the active router, and after that, as you saw in the demonstration, the failover is done and the connection is kept. So after all of this, we tear down everything on all servers. So it's basically really simple, and it was all written in Python.
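To make that a bit more concrete, here is a minimal Python sketch of the core idea only: generate a conntrackd configuration for one HA router and start the daemon inside its qrouter namespace on the network node. The names, paths, and addresses are hypothetical, and the real tool does much more (finding the agents, the keepalived notify script, triggering the failover, and tearing everything down):

```python
#!/usr/bin/env python3
"""Sketch: run conntrackd inside a Neutron HA router namespace.
Hypothetical simplification of the tool demoed above."""
import subprocess

def start_conntrackd(router_id: str, ha_iface: str, ha_ip: str, peer_ip: str):
    namespace = f"qrouter-{router_id}"  # Neutron's per-router namespace
    conf_path = f"/etc/conntrackd/conntrackd-{router_id}.conf"
    # Minimal conntrackd config: sync conntrack state over the HA network
    # to the peer network node, so the standby router knows the flows.
    with open(conf_path, "w") as conf:
        conf.write(f"""
Sync {{
    Mode FTFW {{ }}
    UDP {{
        IPv4_address {ha_ip}
        IPv4_Destination_Address {peer_ip}
        Port 3780
        Interface {ha_iface}
    }}
}}
General {{
    LockFile /var/lock/conntrackd-{router_id}.lock
    UNIX {{ Path /var/run/conntrackd-{router_id}.ctl }}
}}
""")
    # conntrackd must see the router's own conntrack table, hence netns exec.
    subprocess.run(["ip", "netns", "exec", namespace,
                    "conntrackd", "-d", "-C", conf_path], check=True)

# Example: one call per HA router found on this network node.
start_conntrackd("3fa85f64-5717-4562-b3fc-2c963f66afa6",
                 "ha-6bde13f8-7a", "169.254.192.4", "169.254.192.5")
```

The 169.254.192.0/18 addresses match Neutron's default L3 HA network; everything else (ports, lock paths) is illustrative.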
Okay, so the next one is the connectivity check script. The context for that: customers contact our support center with connectivity problems for their instances in the cloud, and we don't know if we have an issue or if the user has an issue with their configuration. Sometimes it's maybe security groups, or an internal firewall problem, or maybe a misconfiguration in the cloud, maybe DHCP or something else. So the solution is to check the connectivity to the customer's instance from outside. For that, we target the tap interface directly on the compute host and we insert some iptables rules; I will explain them after the demonstration. So what you see here is the script connecting to the host of the VM and then inserting some iptables rules for that VM. Then it does a ping from outside, from another server, and then checks whether that ping reached the tap interface of the virtual machine on the host. At no point does that ICMP packet reach the inside of the VM. Okay. So the implementation is simple too. First, we need to identify on which compute node the VM is located. After that, we identify the iptables incoming-traffic chain of the VM directly on the compute node; its name is built from the tap interface name of the VM. And we insert two rules at the top of that chain. The first one drops the ping packet coming from outside, so the ICMP echo request that we send toward the host will be dropped from the point of view of the VM. But the host sees it and logs it. Yes, that's it. And we have a second rule: it's just a log rule about the first one. In this log rule, we inject a random hash, and after the check, we search for it in our logs. If we find it, it means the packet correctly arrived in our infrastructure, so we don't have a connectivity problem and it may be a customer issue. As an analogy: it's as if, when the mailman delivers a letter, we verify that the letter was correctly delivered to the mailbox, without opening it. So that's a little analogy to explain this script. It is a bash script, and after the check we tear everything down, like with the first tool. So no trace, and it's totally transparent for the customer. And it works with IPv4 and IPv6, with a hostname or just an IP. Yeah, simply, if you give a hostname, then we just resolve it to the IP that is supposed to be in the cloud.
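Here is a rough Python sketch of that check, under stated assumptions: the chain naming of Neutron's iptables hybrid firewall, a hypothetical probe host reachable over SSH, and kernel logs readable via journalctl. The real tool is a shell script, but the mechanics are the same:

```python
#!/usr/bin/env python3
"""Sketch of the connectivity check: prove an ICMP echo request reaches
the VM's tap interface on the compute node without the VM ever seeing
it. Hypothetical names; the real tool also handles IPv6 and hostnames."""
import secrets
import subprocess

def probe(port_id, vm_ip, probe_source):
    # Assumption: Neutron's iptables hybrid firewall names the per-port
    # incoming chain "neutron-openvswi-i" + first 10 chars of the port UUID.
    chain = "neutron-openvswi-i" + port_id[:10]
    marker = secrets.token_hex(8)  # random hash injected into the log rule
    match = ["-p", "icmp", "--icmp-type", "echo-request", "-s", probe_source]
    log_spec = match + ["-j", "LOG", "--log-prefix", f"connchk-{marker} "]
    drop_spec = match + ["-j", "DROP"]  # the VM never sees the probe
    subprocess.run(["iptables", "-I", chain, "1"] + drop_spec, check=True)
    # Inserting the LOG rule at position 1 afterwards puts it above DROP,
    # so the packet is logged first, then dropped.
    subprocess.run(["iptables", "-I", chain, "1"] + log_spec, check=True)
    try:
        # Ping from an external machine (hypothetical helper host).
        subprocess.run(["ssh", "probe-host", "ping", "-c", "3", "-W", "2", vm_ip])
        # If the marker shows up in the kernel log, the packet made it
        # through the infrastructure down to the tap interface.
        kern = subprocess.run(["journalctl", "-k", "--since", "-5min"],
                              capture_output=True, text=True).stdout
        return f"connchk-{marker}" in kern
    finally:  # leave no trace, as in the demo
        subprocess.run(["iptables", "-D", chain] + log_spec)
        subprocess.run(["iptables", "-D", chain] + drop_spec)

print("reached tap:", probe("7a9f3c1e-8b2d-4f60-9e71-0c5a2d4b6e88",
                            "203.0.113.17", "198.51.100.9"))
```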
Well, okay, that's it. So both of these tools are currently waiting in the Debian NEW queue, so hopefully they're going to be approved by the FTP masters soon. And if you want to contribute, it's on the GitLab instance of Debian, where I always do the Debian packaging for OpenStack. In this Neutron tooling, conntrackd is just one thing. You have a lot of things like migrating all the DHCP agents, listing dead agents, deleting namespaces, migrating namespaces. A lot of actions about the network nodes are present in this script. And the principal goal of this script is to make the life of operators easier during an intervention, for example. Just one other thing which I'd like to say: we took the package from SUSE, and I desperately searched for their Git repository and didn't find any. So if there is one, we would be very happy to merge back what we've done. In the meantime, we also accept contributions over there. Okay, so thank you for everything. Thank you. Thank you both, Axel and Thomas, it was a very interesting demo. Thank you so much. I think this will be very helpful for many other operators using Neutron, and HA routers in particular. The other fellow operators here, do you have any question for them? All right. So thank you so much again. I'm sure that we're going to have questions from the audience at the end as well, so I suggest that we move on. Next up are Adrien and Arnaud from OVH, and they will present a tool that they have been developing, which they call Unity. The floor is yours. Yes, thank you. So, a little presentation of OVH first. We are a big hosting company, a big French hosting company. We have been deploying OpenStack for eight years, I think. We offer VPS, dedicated servers, web services, and our public cloud with OpenStack is composed of about 20 regions. So we have a lot of hosts, a lot of flavors and aggregates to manage, and we have to integrate all these entities with the OVHcloud world: billing, hardware, etc. So we have a lot of tools to manage all these parts, and I can present some of them, like our GitOps tool, which is named ProdClick. It's a simple Python tool to show the diff between what we have in our dev infrastructure and in our prod infrastructure. And when we are ready to prod a commit or a new package, we simply create a big diff between all of our repositories and prod that into our Puppet deployment. It's then logged in a changelog for other team members to read what changed in the day, or in the week, or in the past weeks. And how is it executed? It's executed with Ansible playbooks, like when we have to prod something, or to upgrade the DB when we are upgrading an OpenStack version. We did that when we were prodding Newton from Juno, or Stein from Newton. And we have a system composed of Mistral workflows, which are triggered by our self-healing and alerting system. Those kinds of workflows are very useful when we want to drain some hosts or to move a host from one region to another, and all of this is triggered automatically. These are very independent tools, but when we need to execute them manually, we use a tool that I developed called Unity, which is a kind of glue between all these components. It's very useful when we want to apply some operation on a certain region, on certain hosts, on certain aggregates, and communicate that to the team. So I integrated a lot of components, like Webex Teams, OpsGenie for alerting, email for our old mailing list, and our new status page, which is replacing our Travaux page. I prepared a little demo, which shows a host maintenance, when we want to upgrade a host or upgrade the BIOS version. It's a simple tool, which drains the host one instance at a time. There are a lot of options; I cannot show them all, but you can see that here I'm not trying to ping the VMs before migrating. Normally, we try to ping the VMs before the migration and after the migration, to see if we broke something. And this tool is usable multi-region, multi-host, or multi-rack; you can put in as many arguments as you want, and it will try to do as many things as possible. So you can see that it migrated the two instances to two different hosts, and the maintenance succeeded. Like I said, it can manage a lot of OpenStack entities in the OVH lifecycle of objects, like flavors, aggregates, multi-region, multi-host, and now we are trying to manage the Ironic nodes we are deploying, and specific OVH entities like our data centers, racks, switches, and Ceph clusters. So that's it, if you have any questions. Thank you so much, Adrien. Very interesting demo. Is there any question? Yeah, I have a question. In the demo you showed, was it a live migration, or something more like a host evacuation? It's a live migration of the instances hosted on the host. If the VM is active, it will be live migrated. If the VM is stopped, it's cold migrated. And I have options: if we have an incident, I can cold migrate all the instances on the host. So it will shut down the instance, cold migrate it, and restart it on another host. Okay, thank you, sounds good. I also have a few questions for you; it was a pretty cool demo. Actually, the tool that I will present later is very similar to what you presented. What I found curious is that you mentioned that you ping the instance before and after the live migration happens, but not during the live migration. Why not? Because we also do it during the live migration. Okay. Because by experience, we know that if the VM was pinging, it will only lose maybe one ping, or maybe two. But that's not critical for our customers. The main thing is to succeed in migrating the instance, not that it pings at every millisecond.
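For readers who want the gist of such a drain loop, here is a minimal sketch using openstacksdk. It is a hypothetical simplification of what Unity does, without the cold migration fallback, the progress timer, or the team notifications:

```python
#!/usr/bin/env python3
"""Sketch of a Unity-style host drain: live migrate every instance off a
host, one by one, pinging before and after. Hypothetical simplification."""
import subprocess
import time
import openstack

def pings(ip):
    return bool(ip) and subprocess.run(["ping", "-c", "1", "-W", "2", ip],
                                       capture_output=True).returncode == 0

def first_ip(server):
    # First address found on the server, whatever the network.
    for addrs in (server.addresses or {}).values():
        for addr in addrs:
            return addr["addr"]

def drain(hostname):
    conn = openstack.connect(cloud="envvars")  # credentials from OS_* env vars
    for server in list(conn.compute.servers(all_projects=True, host=hostname)):
        reachable_before = pings(first_ip(server))
        # Let the scheduler pick a destination; copy local disks if needed.
        conn.compute.live_migrate_server(server, host=None,
                                         block_migration="auto")
        while conn.compute.get_server(server.id).compute_host == hostname:
            time.sleep(5)  # a real tool watches progress and applies a timeout
        if reachable_before and not pings(first_ip(server)):
            print(f"{server.name}: unreachable after migration, check it!")

drain("compute-0042.example.com")  # hypothetical host name
```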
Right, yeah, it's true. When we do that, we are not trying to achieve exactly the same thing. What we want to achieve is this: if the live migration fails, because we had some cases where the machine was unresponsive not for one ping but for 30 minutes during the live migration, we want to be notified that it happened, and then try to figure out with the user whether the machine is still unavailable or not, and act manually on that. I don't know if you observed that on your side. Our main issue is not that it doesn't ping, it's that the migration itself fails. So we are alerted when that happens, and usually we reset the state, hard reboot the VM, and notify the customer that something went wrong. We have a question. Yes. I've seen that you show some kind of progress when there are live migrations. Do you use the Nova API to do that, like openstack server migration show or something? Yeah, we are searching for the destination host, because the scheduler gives us a host and we want to know which one it is. And yes, there is a progress timer, which is reset each time we see progress in the Nova migration progress, or depending on whether the disk migration is stalling or not. So we reset this timer each time we see progress, and there is a timeout, like one hour maximum. If this timeout occurs, we are alerted. There was also a quick question from the audience: is any of this tooling from OVH already open source? Not yet, because this instance migration part is a very small piece of all the glue in this tool. We have a lot of OVH-specific concepts inside it, so I don't think it makes sense to open source it, at least globally. But maybe some parts of it could be open sourced; we are already thinking about it. Thank you. And also, we have this repo for operators where we have some tools; maybe that could be an option for OVH to put their tools there, right? Yeah, we have a lot of generic tools that could be used by a lot of... I think we have a banner with the URL of this repo. Here we go. All right. I think we need to move on. Thank you so much, Adrien and Arnaud. Next up is Gene from LINE, who will give us a lot of tricks that he uses in the LINE clouds. Yeah. So my name is Gene and I come from LINE. LINE is a messaging app company located in Japan, and we're currently expanding to other fields like mobile payments. So today I'll introduce some tricks that we use in our OpenStack clusters. So the first trick we use is basically to do zero-downtime hypervisor upgrades. What we do is split our hypervisors into different groups and upgrade them in serial. The reason why we try to upgrade the hypervisors in serial is to prevent multiple Neutron agents or nova-compute services from restarting within a short time, because after a restart, those agents and computes will send a lot of RabbitMQ messages. There's an option called periodic_fuzzy_delay, in oslo, if I remember correctly. But in our case we have more than 4,000 hypervisors in our largest cluster, so even if it's set to a high number, the message rate is still very high, which will sometimes cause RabbitMQ outages. So basically this is how we reduce the load on the RabbitMQ cluster during upgrades. Yeah. So the second trick we have: we added a custom field in Keystone. This custom field, we call it Retired. In some of our production services we use EC2-compatible credentials generated with Keystone accounts. Well, some of the users associated with those Keystone accounts may retire or leave our company.
So basically, a simple way to disable the user is to set enabled=false on the Keystone side. But this will also make those credentials invalid, which would cause outages in the production services which use those credentials. So we added a custom field in Keystone called Retired to identify whether the user is still in the company or not, to prevent outages in our production services. Yeah. So the next trick is dynamic hypervisor disabling. Most operators have faced issues with noisy neighbors. If your OpenStack cluster is configured with overprovisioning, you may have VMs using more virtual CPUs than the physical CPUs the host actually has. And if a lot of VMs on the same hypervisor are having high load, it may cause noisy neighbor issues. So we basically have a Python script to scan the metrics of the hypervisors exposed by node exporter. When we find that a certain hypervisor is having high load, we disable it, to prevent any new VMs being spawned on it. And after a certain time, when the load has throttled down, we enable it again. So basically we wrote a script to scan the metrics and then temporarily disable hypervisors, to prevent new VMs being scheduled there and causing more severe noisy neighbor issues. Yes. So these are the three tricks that I would like to share today. So feel free to ask any questions.
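As a rough illustration of this last trick, here is a sketch that polls a Prometheus server scraping node exporter (an assumption; LINE's actual metric pipeline and thresholds are not specified) and flips the nova-compute service state with the standard OpenStack CLI:

```python
#!/usr/bin/env python3
"""Sketch of dynamic hypervisor disabling based on node exporter
metrics. Endpoint, query, and threshold are hypothetical."""
import subprocess
import requests

PROMETHEUS = "http://prometheus.example.com:9090/api/v1/query"
# Context switches turned out to be a good proxy for CPU steal in VMs.
QUERY = "rate(node_context_switches_total[5m])"
THRESHOLD = 2_000_000  # per second; tune for your hardware

def set_service(host, disable):
    action = (["--disable", "--disable-reason", "auto: noisy neighbor"]
              if disable else ["--enable"])
    subprocess.run(["openstack", "compute", "service", "set",
                    *action, host, "nova-compute"], check=True)

def scan():
    result = requests.get(PROMETHEUS, params={"query": QUERY}, timeout=10)
    for sample in result.json()["data"]["result"]:
        host = sample["metric"]["instance"].split(":")[0]
        rate = float(sample["value"][1])
        # Disable overloaded hosts so no new VMs land on them; re-enable
        # once the load has come down. (A real tool should remember which
        # hosts it disabled itself, so it never re-enables a host an
        # operator disabled by hand.)
        set_service(host, disable=rate > THRESHOLD)

if __name__ == "__main__":
    scan()  # run from cron or a loop
```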
Thank you, Gene. Very cool presentation. Do you have any questions for Gene? Well, actually, I will take the floor because I have some questions for you, Gene. So you have 4,000 Nova and Neutron nodes behind one RabbitMQ cluster? Yep. That is brave. Yeah. Yeah, we don't use cells in our cluster right now. Okay, pretty cool. So regarding the last thing that you presented, the dynamic hypervisor load check, that is pretty interesting. We suffer from a similar issue, CPU steal time in the VMs, because of course we overprovision CPUs. However, when that issue happens, we actually need to act, because if the node is under high load, maybe there is CPU steal, but maybe there is not; we cannot tell just by looking at the compute node. But when we detect, or the users tell us, that they have CPU steal, we need to intervene manually to migrate some of the VMs away. And we don't see an easy way to do this automatically, especially because at the hypervisor level we are not able to see whether the VMs the compute node is hosting are in CPU steal time or not, right? Yeah. So this is an interesting problem. Definitely what you did probably helps for your use case, but this is something where I would also like to hear others' experience: CPU steal time when you are overprovisioning CPUs. There's also a good question on the YouTube chat from Arne, asking Gene which metrics you use to actually measure that overload. If I remember correctly, as this project was done by another team member, we use the CPU load and the context switches right now. So it may be good to look into context switches also. Yeah. We found that context switches are the metric that gives us the best approximation that there is CPU steal happening in the virtual machines. All right. Any other question for Gene? Or can we move on? Right. Thank you so much, Gene. So let's continue with Shatadru from Workday, who will present to us the CloudMap tool and a different tool to clean up Nova DBs. So you have the floor. Hi. So can we have the presentation? Thanks. So hi, I'm Shatadru and I work at Workday as a DevOps engineer. So next slide, please. Yeah. So just as a little background, Workday provides a leading enterprise cloud for finance and HR. And I work on the Workday private cloud, which basically is OpenStack, and it hosts a large amount of Workday services. Next slide, please. Yeah. So we have 66 production clusters throughout the globe in five different data centers. We have about 16,000 compute nodes in total, and we have about 51,000 instances running, which serve about 72 different Workday services throughout our entire OpenStack footprint. So having such a large deployment comes with its own set of challenges. And one of the challenges we saw early on is that we were not able to rely solely on Horizon to visualize all our cluster resources, because we have such a high number of clusters. So we have created a tool called CloudMap, which basically provides centralized reporting and search for all Workday VMs across all the data centers. It is designed to provide simple visualization, and it can also answer user queries. And it also helps us with operations and troubleshooting, to see, for example, some kind of commonality between VMs which are probably having issues. A little background about the tool: it is written in Python. We are using the Django framework, and to be specific, it is using the Django admin interface. The way it works is: all our OpenStack clusters run a script, also a Python script, which calls the OpenStack API to aggregate all the data regarding hosts, hypervisors, and all the different OpenStack objects. It dumps the data in JSON format, uploads it to an S3 bucket, and the CloudMap tool imports the data. All the clusters send the data at a five minute interval, and likewise CloudMap imports the data at a five minute interval. We use MariaDB as the database for the tool. And we use different Django packages, for example one for LDAP authentication, django-tz-detect for detecting user time zones, and so on. Due to security reasons, I couldn't show the actual instance of CloudMap, but I have loaded a test instance of CloudMap with a sample data set, which I will demo. So as you can see, this is the main interface, which shows all the different things we have: clusters, flavors, hosts, control plane nodes. And on the right side, we have different reports, which are used by different stakeholders and users. So for example, if you look at the different hosts, you can see if the host is up, if it is enabled, which cluster it is in, which kind of hardware it is using, this kind of information. Some of the information we capture from OpenStack; we also use Chef, so some information is captured from Chef. We also provide an ad-hoc search kind of feature. So if you would like to search for, say, compute nodes in a specific cluster which are disabled, you can basically write a query to do that. Again, we are using a Django package called DjangoQL, a query language, to implement this. And all the data you select can be exported to CSV format, which you can use for reporting purposes in Excel and things like that. But we also have some integrated reporting built into this tool, which I will demo.
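The collector side could look roughly like the following sketch, assuming openstacksdk and boto3; the bucket name, fields, and cluster naming are hypothetical, and the real Workday script gathers many more object types:

```python
#!/usr/bin/env python3
"""Sketch of a per-cluster collector feeding a CloudMap-style dashboard:
dump hypervisors and servers to JSON and push the snapshot to S3."""
import json
import time
import boto3
import openstack

def collect(cluster_name):
    conn = openstack.connect(cloud="envvars")  # credentials from OS_* env vars
    snapshot = {
        "cluster": cluster_name,
        "timestamp": int(time.time()),
        "hypervisors": [
            {"name": h.name, "state": h.state, "status": h.status}
            for h in conn.compute.hypervisors()
        ],
        "servers": [
            {"id": s.id, "name": s.name, "status": s.status,
             "host": s.compute_host}
            for s in conn.compute.servers(all_projects=True)
        ],
    }
    boto3.client("s3").put_object(
        Bucket="cloudmap-snapshots",          # hypothetical bucket
        Key=f"{cluster_name}.json",
        Body=json.dumps(snapshot))

if __name__ == "__main__":
    collect("cluster-01")  # run from cron every five minutes
```

On the other side, the CloudMap importer would read these objects on the same five minute cadence and load them into MariaDB for the Django admin views.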
So, yeah, let's go back to the... yeah, so these are, for example, our control plane nodes; I'm just demoing how you can search and things like that. So, yeah, these are virtual machines. We have integration with our logging tool and our metrics tool, which provides links. We also have integration with JIRA: for example, if we have a disabled node, it points to a specific JIRA ticket. And it's quite intuitive; for example, if you look at a host, you can see all the virtual machines running on top of it. So, I will show one of the reports. This is capacity utilization. We are using C3.js for making beautiful charts. You can check utilization in a given environment or a specific cluster. You can see the utilization even at the host level. So, for example, if we have a NoValidHost issue, we can see at the host level which host has how much resource assigned, and so on. So, I will show you a couple more reports. This one is for checking how much capacity we have available: for example, if we want to deploy a certain flavor in our environment, how many VMs we would be able to deploy in the different clusters we have. So with this we can predict how many VMs we would be able to deploy in which data centers and clusters. Next is another report, which we use very frequently, to see how many compute nodes we have enabled and disabled per data center and per cluster. Similar to that, we have how many VMs we have per data center and per cluster. So this kind of reporting we can do using this tool. Yeah. And this information is not coming from OpenStack; we have a Chef integration. So this shows which operating system we have, which hardware model, which vendors, and you can obviously do an ad-hoc search for specific things, for example nodes which are running in this cluster and have this kind of hardware. So, yeah, that's pretty much it. For the first tool, if you have any question on this, then I will move to the next one. So, I'm sure it's a very opinionated tool that you developed for what is in use at Workday. But is it available somewhere, so that it can be an inspiration for others to use and adapt for their infrastructures? Not yet. This started as a pet project, and we never intended to use it even in our production. But then it eventually grew and grew, and it has a lot of hard-coded Workday-specific things, like cluster namings and such. But yeah, we do have plans to, for example, remove that stuff and probably make some part of it available by cleaning it up. We have a lot of cleanup to do, but once we do that, we can probably make it available. Cool, thank you. If there is any other question for this project... otherwise, I think you can go ahead with the other tool. Yeah, the next one is a pretty simple Python script. I will demo it first. So, as you can see, I'm passing a configuration file, and it is basically doing a purge: it's moving the deleted instances to the shadow tables.
So, the reason we use it is that we have different clusters which have completely different types of workloads, and the rate at which VMs are created and deleted is completely different. So we needed a tool which can fit these different workloads. I will show you one of the configurations. For example, you can do dry runs and things like that. You can select the number of days, the number of VMs to delete, and whether you want to clean up the shadow tables as well; I will explain those. So, we use this tool to clean up the Nova database. We have seen some performance issues when the database is left unchecked, because some of the environments have hundreds of deletes and creates per minute. And we have not used the nova-manage tool which the community provides, because we had some issues with it, including performance issues: when you delete the entries from the database, we have sometimes seen the database getting locked, and similar performance problems. So we are super careful when we delete. This tool has a hard-coded limit: it only deletes, say, 250 entries in one batch, so that it does not impact performance, because we cannot take regular maintenance windows in the clusters; we have certain times when we can run certain things. And you can tune it in a granular way: you can say you want to delete VMs which were deleted one month ago, or say one day ago. This kind of granular tuning is something we required. So it's a pretty simple Python script which uses SQLAlchemy to query the database entries and then just deletes them. So, yeah, let me know if you have any questions.
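A very reduced sketch of that batched purge idea follows, using Nova's instances/shadow_instances tables but with a placeholder DSN; treat it as illustrative, not as the actual Workday tool:

```python
#!/usr/bin/env python3
"""Sketch: archive soft-deleted Nova instances into the shadow table in
small batches, so the database never locks up for long."""
import argparse
from sqlalchemy import create_engine, text

BATCH = 250  # hard limit per batch, as described above

def archive_batch(engine, days):
    with engine.begin() as conn:  # one short transaction per batch
        # ORDER BY id makes the INSERT and the DELETE target the same rows.
        copied = conn.execute(text("""
            INSERT INTO shadow_instances
            SELECT * FROM instances
            WHERE deleted != 0 AND deleted_at < NOW() - INTERVAL :days DAY
            ORDER BY id LIMIT :batch"""), {"days": days, "batch": BATCH})
        conn.execute(text("""
            DELETE FROM instances
            WHERE deleted != 0 AND deleted_at < NOW() - INTERVAL :days DAY
            ORDER BY id LIMIT :batch"""), {"days": days, "batch": BATCH})
        return copied.rowcount

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--days", type=int, default=30)
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()
    engine = create_engine("mysql+pymysql://nova:secret@dbhost/nova")  # placeholder
    if args.dry_run:
        print("dry run: would archive deleted rows in batches of", BATCH)
        return
    # Keep archiving until less than a full batch remains.
    while archive_batch(engine, args.days) == BATCH:
        pass

if __name__ == "__main__":
    main()
```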
Yeah. First question: did you share that script? Is it open source somewhere? No, not yet. This is a pretty new thing we have done; we have seen some issues recently with the database size growing really fast, like in GBs. So this is a pretty new tool, but we probably have plans to make it available somewhere. Yeah. Actually, when I saw this, I was a little bit puzzled, because for this use case we are using the nova-manage tool that is provided by the community, and for us it's working all right. Maybe it doesn't have the granularity that you are talking about, but the way we overcome that is basically that every day, in all databases, we do this cleanup. And that helps us a lot to keep the size of the databases in check. And if I remember correctly, we also have a similar tool at OVH called OSArchiver, which is in the OSops repository now. So I was wondering how that compares to the tool that you are using. Yes, maybe we can make a unified tool for all these problems. Yeah, that sounds good. Probably we should put it in Nova directly. Well, maybe improve the nova-manage tool that is already there, right? Yeah. Yeah, I think that would be a great proposal. So, any other question? I think one difference with the OSArchiver tool, if I remember correctly, is that it handles more than just Nova. It also goes to the other services; it is more generic. That's the reason why it was kept separate from the Nova tool, even if Nova is where it's the most urgent, I would say. It works for Neutron and other components too. Yeah. Yeah. We even have similar tools for cleaning up images and things like that. But yeah. All right. Any other question for Shatadru? Okay. So, thank you so much. So now it's my turn to present two tools that we developed at CERN that help us in our day-to-day operations. I'm not brave enough to give you live demos, but okay, let's discuss these tools together. So, CERN cloud by numbers. I always find this slide very interesting because it gives you a glimpse of the CERN cloud size. And actually, it changed a lot since the last time I talked on OpenInfra Live and presented these slides. You can see the number of users and the number of projects that we have in our cloud. But what is really interesting, and what changed from last time, is the number of instances that we now have. If you remember the previous numbers, we had around 8,000 compute nodes, and that was reduced to around 1,700 compute nodes, which consequently reduced the number of VMs from 30,000 to the around 15,000 VMs that we have today. The reason for the change in these numbers is that we moved all the batch workloads that were previously running in virtual machines to bare metal nodes that are now provisioned by Ironic. Everything is still in the OpenStack cloud; we just moved some workloads from virtual machines to bare metal. All right, so let's start with the migration cycle tool. We developed this tool to migrate instances between different compute nodes. But what is really the difference between using this tool and running openstack server migrate? I'm not talking about the live migration of one or very few instances, but the live migration of thousands of virtual machines to empty compute nodes. That is the goal of this tool. For this, we also need integration with our infrastructure. It's not just live migrating the instances: we need all the integration with monitoring and alarming, and also a lot of orchestration, because the VMs cannot all be live migrated at the same time, and we need to monitor the migration state and the VM health during the live migration. That's why we felt the need to develop a tool like this one. Currently, we have three different use cases where we are using this tool. The first one is hardware repairs. Hardware does break. Fortunately, sometimes, even if we detect that the hardware has some kind of issue that needs to be replaced, the instances continue to run and it's possible to live migrate them. So what we did was design a workflow based on migration cycle that the repair team can use through Rundeck to automatically empty the compute nodes that are affected by the issue and make them ready for the repair. As you can imagine, this needs to be connected to the different pieces of the infrastructure, the monitoring, the ticket system, and so on, and that is where this tool actually helps. The advantage is that now most of the hardware interventions in our cloud, if the compute node is not completely dead, are fully transparent to our users. The other use case is hardware retirement. We always have hardware that needs to be retired and replaced by new hardware. What we do in these cases, because we use cells and the cell will be decommissioned entirely, is add the new hardware into the cell where the hardware will be replaced, and then we migrate all the instances from the old compute nodes to the new compute nodes.
And all these processes can take weeks, because most of our virtual machines have local disks, so all the data needs to be transferred. And migration cycle helps with all this orchestration and automation. And finally, we have the other use case, which is the kernel upgrade of the compute nodes. The retirement cycle of our compute nodes is between three and five years, and actually, some of the compute nodes have uptimes that long. Sure, this is very good for the availability of the user instances, but it also means that the kernel was not upgraded during all this time. We wanted to change this: we wanted to have frequent kernel upgrades on the compute nodes, but at the same time, we don't want to disrupt user instances. The solution for that is to do constant live migrations: empty a compute node, reboot that compute node, and repeat again with a different compute node. Again, you see that this requires automation and orchestration to achieve. So, these are some of the steps and features performed by migration cycle. It disables the compute nodes, to remove them from the scheduler. It disables the alarming, because if the node is rebooted, for example for the kernel upgrade, or stopped for a hardware intervention, it will generate an alarm and then we would need to act on it. During the live migration, it monitors the VM health; we do this because some VMs become unavailable during the live migration, which in our case is mostly related to the local disks, and for very large VMs this can happen, unfortunately. It notifies the operators if something goes wrong, if a migration fails. It reboots the compute nodes, of course. And then there is also parallel execution, because as you can imagine, if we did this sequentially, it could take a lot of time, especially for hardware decommissioning, where we need to live migrate an entire cell; so it also supports parallel execution, where we can live migrate multiple instances at the same time. And then it also has a nice CLI interface and an easy integration with Rundeck, which is the tool that we use for automation and orchestration. So, on your left, you have an example of the Rundeck interface that the repair team sees when they need to perform a hardware repair intervention. You can see that they need to insert the hosts that are affected, where they will perform the intervention, the ticket number, and the operation that needs to be performed, for example reboot or power off. And on your right, you have the CLI, which is used for more advanced usage, for example hardware decommissioning. This is done by the OpenStack operators, not by the repair team. And you can see that the CLI supports multiple options that we have been adding over time. So, the migration cycle code is available in this repo. Of course, it's very opinionated and made for the CERN infrastructure, but it may help you and give you some inspiration if you are planning a similar tool for your infrastructure. Also, we very recently wrote a blog post about this tool; it's available on our tech blog, and you can have a read if you are interested. So, the other tool that I would like to talk about is db_cycle. And as you can see by the name, we are not very creative when giving names to projects. db_cycle solves the problem of how to test the DB schema upgrades and data migrations of databases when you have multiple cells. This is basically for the Nova upgrade validation. So, there is one Nova DB per cell.
If you only have one cell, that is maybe easy to validate. However, if you have 40 cells like we have today, or 80 cells like we had a few months ago, it is extremely painful to validate the Nova upgrade. So, we wrote a very simple tool. What it basically does is take a dump of each Nova database, load that dump into a different MySQL instance, run the DB sync and the DB online data migrations, and at the end, confirm that everything works fine: that the DB is actually at the new and expected schema version, and that the online data migrations happened without any issue. This tool is still not available in our repos; we will try to make it available very soon. And as you can imagine, over the years we developed many other tools that help us in our day-to-day operations, for example for the project lifecycle or VM expiration. We already talked about these tools before. If you are interested in them, there is a talk that was given, I believe, at the summit in Berlin; the video is available at that URL. We also have some other blog posts about some of these tools, for example the expiry, the expiration of the virtual machines. So, that is it for me. I'm happy to continue this discussion. So, you see, Adrien, that we try to solve the same issue with different tools. You want to go ahead, Adrien? No, no. Just to say, we have some Mistral workflows doing the same operations. It could be interesting to share our knowledge about it. I wonder why you need to test DB migrations when there is the grenade job in the OpenStack CI? Well, we had a lot of issues over the years, because that is tested with sample data only. However, when you have your own databases, with maybe a different character set configured in your MySQL, and you have a lot of databases with maybe different versions, because they were set up at different times with maybe different character sets, you can sometimes hit an issue when doing the DB schema and online data migrations, especially the online data migrations, because those depend on your data. And over the years, we have been identifying some issues. I once experienced a problem like that, upgrading from Stretch to Buster; maybe something similar. Well, we always validate the database migrations, not only in Nova, but in all the other projects. But Nova is more tricky because we have a lot of cells, so it's very tedious going through all these different databases, downloading them, doing this validation manually. But I believe others should do the DB validation before an upgrade. Are you doing the same thing? We do something similar: we have clusters to try the upgrade, which we populate with the old DB. All right. I do that in a virtualized environment, where OpenStack runs on VMs. All right, great.
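A sketch of the db_cycle idea, assuming a throwaway MySQL instance and a nova.conf pointing at it; hostnames, paths, and the cell list are hypothetical:

```python
#!/usr/bin/env python3
"""Sketch: replay each cell's Nova database dump against a scratch MySQL
instance and run the schema/online data migrations there before
touching production."""
import subprocess

SCRATCH = "scratch-mysql.example.com"  # throwaway MySQL instance
CONF = "/etc/nova/nova-dbcycle.conf"   # nova.conf whose [database] points at SCRATCH

def validate_cell(cell_db_host, dbname="nova"):
    # 1. Dump the cell database and load it into the scratch instance.
    #    (Held in memory here for brevity; stream it for big databases.)
    dump = subprocess.run(["mysqldump", "-h", cell_db_host, dbname],
                          check=True, capture_output=True)
    subprocess.run(["mysql", "-h", SCRATCH, dbname],
                   input=dump.stdout, check=True)
    # 2. Apply the schema migrations of the target release.
    subprocess.run(["nova-manage", "--config-file", CONF, "db", "sync"],
                   check=True)
    # 3. Run the online data migrations, which depend on the actual data
    #    and are where the surprises usually hide.
    subprocess.run(["nova-manage", "--config-file", CONF,
                    "db", "online_data_migrations"], check=True)
    # 4. Confirm the schema ended up at the expected version.
    version = subprocess.run(["nova-manage", "--config-file", CONF,
                              "db", "version"],
                             check=True, capture_output=True, text=True)
    print(f"{cell_db_host}: schema version {version.stdout.strip()}")

for cell in ["cell01-db", "cell02-db"]:  # one Nova DB per cell
    validate_cell(cell)
```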
So, Thierry, we probably have some questions from the audience. Yes, we don't have much time, but we still wanted to discuss a few questions. The first one is from Mikael Salou: why not use OpenStack Watcher for this? It was in the context, if I remember correctly, of the evacuation tools and measuring workloads. Yes, I can give some input about that. We tried to deploy it some years ago, but honestly, it was not production ready. We would like to improve the tool, but we didn't have enough people on the team to work on it. So we preferred step-by-step improvement of our own tool. I can also comment on that, because we evaluated Watcher. Watcher is very interesting. However, it works very well for static environments: you have your workloads, you run Watcher, and it will make a plan of which live migrations to perform to improve the resource utilization, for example. There are several metrics that you can use. However, if you have a large infrastructure, you see that the plan can be really out of date before it finishes the live migrations; when it finishes, it's already not valid. So yeah, I think there are several things that could be improved in Watcher. We evaluated it, and we are not using it on our side either. And one last question, from Sun Tzu Cho, on live migration: for how many seconds would the network be disconnected? Like, if you ping every second, how many pings would you lose? Maybe I can answer that one. It really depends on your network topology. Let's say, if you don't use DVR, then there's a good chance that you will never lose more than, I don't know, three pings. If you use DVR, then it depends on how you configure your network equipment, because that's how the path gets learned, with ARP. And then, I guess at OVH you don't have that trouble, because you use BGP to the host, right? So it really depends on your network setup, and you can't say in general how much downtime you will get. If you tell me which kind of setup you have, then I can explain to you how it's going to be. If, let's say, you use DVR and a large L2 network, then you will probably get a lot of downtime, maybe 15 to 30 seconds. Yeah. So the answer is, like always in OpenStack: it depends. So I think we need to wrap up, because we're already over time. Thanks, Belmiro, and thanks to everyone on the show for joining us today. It's really time for us to close this episode. Those were great tools and pro tips, and I'm sure our audience learned a lot from this discussion. We have one more episode before the end of the year, so make sure to come back next week, as our very own Mark Collier will discuss highlights and best-of moments from our recent OpenInfra Live Keynotes. There were some very exciting moments there, special announcements, so make sure you don't miss it. One of the other announcements we made during that show was that our in-person OpenInfra Summit is back. We'll be going to Berlin, June 7 to 9 next year. So mark your calendars, and you can already buy your early bird ticket right now at openinfra.dev/summit, where you will also find sponsorship information if you're interested in sponsoring the event. And finally, don't forget that this show is for all of the open infrastructure community. So if you have an idea for a future episode, we really want to hear from you; you can submit your ideas at ideas.openinfra.live. See you again next week on Thursday at 15 UTC. Thanks again to all of our speakers who joined us today, and special thanks to Belmiro for leading the discussion. See you soon on OpenInfra Live. Bye!