Okay, let's go. So welcome, everybody, and thank you for attending this last session of the last day. We will present a field report on how we manage live migration of instances at OVH.

First, who are we? My name is Arnaud Morin. I work at OVH, where I am involved in deploying and managing the OpenStack infrastructure. — Hello, I'm Julien Kosmaou. I work in the same team as Arnaud, on the deployment and automation of the public cloud infrastructure.

So what do we do? At first we were selling dedicated bare-metal servers, since 1999. Then we evolved: we provided web hosting so our clients could deploy web pages. We also have a dedicated cloud product based on VMware vSphere. And of course we run a public cloud: we provide object storage, instances, and volumes with a Ceph backend, all of it powered by OpenStack.

Before starting the talk, a little bit of context so everybody is aware of what we are doing. We manage 25 OpenStack regions, running either OpenStack Juno or OpenStack Newton; we will talk a little later about how we upgraded from Juno to Newton. And we have more than 250K instances deployed across these regions worldwide.

So, when do we need live migration? As a public cloud provider, we don't know how our customers manage their infrastructure, and of course we can't reboot a host with all their instances running on it. So when we need to do hardware maintenance or reinstall a host, we need live migration to move all of its VMs to other compute nodes. For example, at the beginning of the year, with the Spectre and Meltdown CVEs, we had to update kernels and microcode everywhere, and of course live migrate instances and upgrade the hosts.
Another big project this year was upgrading our infrastructure from Juno to Newton. Juno was running on Ubuntu Trusty, Newton on Xenial. So to keep consistency between infrastructures, we also needed to live migrate all the QEMU processes running on Trusty hosts over to Xenial hosts; we'll explain that at the end of the presentation. And of course, with more live migrations you trigger more bugs, so we will explain the bugs we encountered and how we fixed them.

Let's talk about the Nova live migration mechanisms we have. First, what you must know is that all our OpenStack regions used to run OpenStack Juno on Ubuntu Trusty, with the regular QEMU and libvirt versions shipped for that release. We needed live migration to work very well, but as you may know, that's not the case on pure vanilla Juno. We also have multiple instance profiles, and this is why our migrations are painful: we have instances with local disks on SSD, instances that boot directly from a Ceph disk, Cinder volumes that can be attached to instances, the config drive option, GPUs, FPGAs, and of course a lot of different CPU architectures. All of this makes our live migrations very complex, and we had a lot of headaches managing the resulting live migration bugs.

To describe our workflow a little: when we find a live migration bug in production, first we try to reproduce it in our own dev environment. Then we try to figure out the root cause: is it a QEMU issue or an OpenStack issue? Because we are on Juno, a lot of issues have already been fixed upstream, so we check whether an upstream commit already exists and try to cherry-pick it into our OpenStack environment.
If it's not fixed upstream and it's impacting production, we try to find a workaround in the code, and we have a team working on our OpenStack code who try to work with the upstream developers to fix the issue.

The first bug we encountered: on VMs with a lot of memory activity, on a loaded instance, the migration never finished; it took far too long to converge to the destination host. To be able to finish the migration, we had to pause the instance with virsh suspend. To reproduce the case, we ran stress-ng inside the VM. Here is an example: on the source and destination hosts, the virsh command gives you metrics about what is happening on both sides. On the destination, after the VM was suspended, the migration could finally finish, and you can see that the amount of data processed is huge: it took a lot of iterations for the live migration to converge.

There is an option in QEMU for this: auto-converge. This feature fixed a lot of issues, but it required QEMU 2.5 and a newer libvirt version than we had on Juno (we had them on Newton). So the first thing to do was to upgrade QEMU and libvirt on our Juno infrastructure. What does auto-converge do? When you live migrate an instance with heavy memory usage, even with a big pipe between your source and destination hosts, even 10 Gbps, the VM can dirty memory pages faster than you can transfer them to the destination. Auto-converge throttles the guest CPU step by step to help the live migration finish. Here is an example of how to enable it on Juno and on Newton.

We also had another issue, related to the libvirt method that Nova uses to manage live migration. On the Juno release, Nova was using migrateToURI2, and this function was not able to live migrate an instance that had both a local disk and an attached Cinder volume.
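To illustrate the auto-converge behaviour described above, here is a toy Python model. This is not QEMU's actual algorithm, and the sizes and throttling step are made up; it only shows why a guest that dirties memory faster than the link can transfer it never converges without throttling:

```python
def migrate(ram_mb, dirty_mb_per_iter, xfer_mb_per_iter,
            auto_converge=False, throttle_step=0.2, max_iters=100):
    """Toy model of pre-copy live migration convergence.

    Each iteration transfers xfer_mb_per_iter of dirty pages while the
    guest re-dirties dirty_mb_per_iter (scaled down by CPU throttling
    when auto-converge is on). Returns the iteration count on success,
    or None if the migration never converges.
    """
    remaining = ram_mb   # dirty pages still to copy
    throttle = 0.0       # fraction of guest CPU taken away
    for i in range(1, max_iters + 1):
        remaining -= xfer_mb_per_iter
        if remaining <= 0:
            return i     # converged: the last dirty pages fit in one pass
        # The guest keeps dirtying memory, more slowly if throttled.
        remaining += dirty_mb_per_iter * (1.0 - throttle)
        if auto_converge:
            throttle = min(0.99, throttle + throttle_step)
    return None          # never converged


# A guest dirtying 1.2 GB per pass over a link moving 1 GB per pass
# never converges on its own, but does once the CPU is throttled.
print(migrate(4096, 1200, 1000))                      # never converges
print(migrate(4096, 1200, 1000, auto_converge=True))  # finishes
```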
Because of that, we needed to shut down the instance, detach the volume, migrate it (not live migrate, since the instance was shut off) and start it again. The solution was to use the next version of this function, migrateToURI3, and of course that again meant upgrading QEMU and libvirt.

Another issue we had was with config drive. Config drive is an option you can pass to the nova boot command or openstack server create; when you use it, a disk.config file is created on the compute host itself, and the instance reads its metadata from that file directly instead of making HTTP calls. But when this file is in a specific format (an ISO 9660 image), libvirt considers it read-only and is not able to migrate it. The solution is kind of dirty, but it works: use an scp command before the live migration to copy the file to the destination host. libvirt then sees that the file is already there and doesn't have to migrate it. It's just a simple piece of code that we cherry-picked back to Juno; it's available in the Nova code.

We didn't fix all of our issues, though. Recently we tried to use CPU pinning, mostly because of the L1TF CVE. We made some tests with CPU pinning in our infrastructure, and it works great with the scheduler and with QEMU under nova-compute, but live migration of pinned instances is not yet implemented; I think there are people working on that upstream.

Another issue we had was with the format of our disks. On some flavor types we now use raw disks instead of qcow2, and when we wanted to migrate a qcow2 VM to a compute node configured for raw, the live migration failed. We made a little patch to allow it, and I think there is something to work on upstream to allow live migration from qcow2 to raw compute nodes with conversion on the fly.
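The config drive workaround described above is essentially a pre-live-migration hook. Here is a rough sketch of the idea, not the actual Nova code: copy_file is an injected stand-in for the scp call, and the function names are hypothetical.

```python
import os

CONFIG_DRIVE = "disk.config"


def pre_live_migration(instance_dir, dest_host, copy_file):
    """Sketch of the disk.config workaround described above.

    libvirt treats an ISO 9660 config drive as read-only and refuses
    to migrate it, so we pre-copy it to the destination ourselves;
    libvirt then sees the file already in place and skips it.
    copy_file(src_path, dest_host, dest_path) stands in for scp.
    Returns True if a config drive was copied.
    """
    src = os.path.join(instance_dir, CONFIG_DRIVE)
    if os.path.exists(src):
        # Same path on the destination host, so libvirt finds it there.
        copy_file(src, dest_host, src)
        return True
    return False
```

The instance directory layout and hook name are assumptions for illustration; in Nova the equivalent logic lives in the libvirt driver's pre-migration step.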
About libvirt migration features, there is another option we didn't test because we don't have the required QEMU version: post-copy. It's a really interesting feature to help live migration converge on really loaded VMs. It's an option that can replace auto-converge, but without impacting the guest the way auto-converge does with CPU throttling. There is also compression of RAM pages. So these are features we need to test in the future.

Now that we've shown you all the live migration issues we had and how we fixed them, let's talk about the tooling we built around it to automate live migration, because nobody wants to run nova live-migration by hand for every instance. What we first did was develop an internal tool at OVH that we call RunCLI. RunCLI is great because you can use it, for example, to call Mistral workflows and simply drain a host. By draining, I mean emptying that host: you use RunCLI to tell OpenStack to move every instance off that host, because we need to do some upgrades, or because the host is not healthy, things like that.

Around this tool we built a lot of other features: we can upgrade multiple hosts in bulk instead of one at a time, we can manage aggregates, we can move hosts from one region to another, and a lot of other things. We also recently deployed Mistral to automate some actions using workflows. I don't know if you are familiar with Mistral workflows, but here is an example. On this web interface, called CloudFlow, we can see a Mistral workflow that we built, called LiveMigrateInstance. This workflow live migrates an instance in an automated way: it pings the instance beforehand to check that it responds, then first tries a live migration without block migration.
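The "drain" operation mentioned above boils down to live migrating every instance off a host. A minimal sketch of that loop, with list_instances and live_migrate as injected stand-ins for the Nova API calls (RunCLI itself is internal to OVH):

```python
def drain_host(host, list_instances, live_migrate):
    """Sketch of a host drain: evacuate every instance from a compute
    host via live migration, letting the scheduler pick destinations.

    list_instances(host) stands in for a Nova query returning instance
    IDs; live_migrate(uuid) stands in for the live-migration call and
    raises on failure. Returns (migrated, failed) instance ID lists.
    """
    migrated, failed = [], []
    for uuid in list_instances(host):
        try:
            live_migrate(uuid)
            migrated.append(uuid)
        except Exception:
            failed.append(uuid)  # leave it for an operator to inspect
    return migrated, failed
```

Collecting failures instead of aborting matters here: one stuck instance should not block emptying the rest of the host.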
If that fails, it tries again; if it fails again, it tries with block migration, and so on: everything an operator would do manually, but without any manual action by an operator.

So let's dig into this workflow. Mistral workflows are quite simple. First you have some inputs, which you can think of as variables. Then you have tasks, which can for example execute an action on a remote server through SSH, or anything you can imagine, and you can also call other workflows, so you can chain workflows together. At the end you have some outputs that you can use in other tools.

Here is an example workflow called LiveMigrate. It takes as input only a region and an instance ID, the VM. In this workflow, we try to ping the instance before doing the live migration. If the instance pings, we set a flag called pingable to true, otherwise false, and whatever the result, we continue with the LiveMigrate action. What is the LiveMigrate action? It's a workflow that uses the Nova server live-migrate function built into Mistral. It first tries with block migration set to false, as you can see here, and it will try multiple times. If that fails, it executes another action, LiveBlockMigrate, which is basically the same one, but with block migration set to true. This way we don't have to care about whether a given virtual machine needs block migration or not; we just migrate it.

So far we have talked about our workflows, which are executed from an external host that checks what is happening. Now let's look at what we added directly on the compute nodes. We developed a lot of probes for our alerting system and deployed them on our compute nodes, so we can detect if something goes wrong for the instances deployed there.
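The retry behaviour of the LiveMigrateInstance workflow just described can be sketched like this. It's a simplification of the Mistral logic: migrate stands in for the Nova live-migration action and raises on failure.

```python
def live_migrate_with_retries(uuid, migrate, attempts=3):
    """Sketch of the LiveMigrateInstance retry logic: try a plain live
    migration a few times, then fall back to block migration.

    migrate(uuid, block_migration) stands in for the Nova call and
    raises on failure. Returns the mode that finally worked.
    """
    for block_migration in (False, True):
        for _ in range(attempts):
            try:
                migrate(uuid, block_migration)
                return "block" if block_migration else "plain"
            except Exception:
                continue  # retry, then fall through to block migration
    raise RuntimeError("live migration failed in both modes")
```

The point of the two-level loop is exactly what the talk describes: the operator (or workflow) doesn't need to know up front whether the instance requires block migration.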
And we can detect post-live-migration issues: for example, whether everything has been cleaned up on the source node and whether everything is present on the destination. These probes report to our Shinken alerting infrastructure. For example, here we have instance routing: it simply checks the four IPs present in the compute node's zone, and we can see that one of them does not respond, because a route is missing in the corresponding namespace, so an operator intervention is needed to fix it.

Let me describe some of the probes we have on our compute nodes and what we check. It's mostly network stuff. On the Neutron side, at OVH we use BGP directly on our compute nodes to route public IPs to our instances, with a Neutron agent written at OVH. So first we check whether a BGP announcement is missing after the live migration, and whether everything has been cleaned up on the source. We also check OpenFlow routes, which we use for our private networks: we check that the Open vSwitch bridges and every namespace are present on the destination and that everything has been cleaned on the source. And we also check whether instances respond to arping. If, for example, a customer has set up security rules, the Mistral workflow cannot ping the VM and cannot say that everything is okay; but if an instance answered arping before a live migration and stops answering afterwards, we can trigger an alert.

On the Nova and QEMU side, there is less to check. There is a mechanism in Nova to auto-clean orphan disks. At the beginning, when we were using live migration without any patches, we had issues with failed live migrations leaving disks behind on the source, so we had disabled the automatic disk cleanup in nova-compute. We just had to re-enable it, because now it just works.

When we do a live migration, we also collect metrics and data on the compute node itself and send them to a time series backend.
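The ARP-baseline check described above can be sketched as follows. The helper names are hypothetical, and responds_to_arping stands in for an actual arping call on the compute node:

```python
def check_post_migration(ips, responds_to_arping, baseline):
    """Post-live-migration probe sketch: alert on any IP that answered
    arping before the migration (per the recorded baseline) but does
    not answer any more -- it probably lost a route or flow on the
    destination host.

    baseline maps IP -> whether it answered arping before migration;
    responds_to_arping(ip) stands in for the live check.
    Returns the list of IPs to alert on.
    """
    return [ip for ip in ips
            if baseline.get(ip) and not responds_to_arping(ip)]
```

Keeping a pre-migration baseline is what makes this robust against customer firewall rules: an IP that never answered arping is simply never alerted on.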
Then we can build graphs and monitoring to see how long migrations take and how often they fail or succeed. We collect all this thanks to open source projects that we publish on GitHub, Noderig and Beamium. You can plug in a custom Python script, or any kind of script; for example, we grab the virsh domjobinfo output just after the live migration, collect all those statistics, and send them to a Warp 10 time series backend. After that we can draw some graphs.

Here are some graphs, mostly to give you some numbers. We have migrated approximately 180K instances since the end of January. We started just after the Spectre and Meltdown issue, which is why we began making graphs and statistics on what was happening in our infrastructure. We have a graph of the number of live migrations per day at the beginning of the year, and another graph showing the average duration of a live migration in some regions: the average is about three or four minutes. It's long because we mostly have instances with local disks, so on top of the memory we also have the disk data to move from one compute node to another.

Now that we have seen that we have a good live migration system and tools to automate the workflow, let's talk a little about our upgrade from Juno to Newton. Upgrading is sometimes easy, sometimes not. The first easy part was the computes. Thanks to the work done for live migration, we had already upgraded QEMU and libvirt, so we have the same software versions of QEMU and libvirt on both Trusty and Xenial, and we can upgrade a compute basically with an apt upgrade command. It's a little more complicated because we use Puppet to do it, but it's quite easy to upgrade the computes. The second easy part is the stateless control plane.
Because when you want to upgrade Nova API, for example, you just have to upgrade the Python code of Nova, right? So basically it's the same story as for the computes, but even easier, because this time we were able to reinstall the server itself with a fresh Ubuntu Xenial and install OpenStack Newton on it.

The hard part of this upgrade is definitely the database, because if you have already done an upgrade, you know that upgrading OpenStack is basically upgrading the database. Our solution was a so-called fast-forward upgrade: we built Docker containers with the Python code inside, and in those containers we executed nova-manage or neutron-db-manage to perform the Alembic migrations.

Of course, if you want to upgrade, you also need very good documentation, because the person doing the upgrade needs to know what they will do and what they are currently doing, with all the cases they can encounter during the upgrade. You also need a working backup, of course, but everybody should have a backup of the database. And you need a way to roll back; it should be part of your documentation too: a backup and a rollback procedure.

As a conclusion: we had a very old Frankenstein OpenStack Juno code base, into which we backported a lot of Newton or even newer live migration commits. Thanks to this working live migration on Juno, we were able to automate live migration, and finally we were able to migrate from Juno to Newton in an easy way. This is the end of our talk; thank you for attending the session. Maybe you have some questions?

This is loosely related to what you've been talking about, but you mentioned L1TF and CPU pinning a couple of times, and I was wondering what your strategy actually is for dealing with it, because it's terrible for cloud providers, I guess.
It's a good question, and we don't have the solution yet, so we can discuss it; a lot of people are asking the same question. If you do CPU overallocation, it's complicated: you cannot really do CPU pinning. But in our case there are some flavors with dedicated resources where we don't do overallocation, so there we may have a solution with CPU pinning. It's not in production; we just made some tests. And if you have overallocation on CPU, with SMT and hyper-threading, I don't have a solution.

You said that in order to enable CPU throttling, the auto-converge feature, you needed to upgrade libvirt and QEMU. What impact did that have on user workloads?

Almost no impact. The biggest constraint was that to upgrade QEMU, we needed to live migrate instances so they would run on the new QEMU, and the first live migration was harder because the old QEMU was the one doing it; after that it was okay. We didn't have any big issue from just upgrading libvirt and QEMU; it just works. About CPU throttling from the guest's point of view, it's difficult to say. I think it really depends on the customer's workload. If there is really high memory usage and the migration takes a long time, it may have an impact on performance, for sure. But a live migration doesn't take that long, so I think it's still better than shutting down the guest and restarting it on the destination.

Did you have any VMs with PCI passthrough, and how did you handle them?

We do, and we don't live migrate those.

Did you see any issues with local storage block migration, where some QEMU version worked badly with it and failed disastrously? Because I saw everything working fine, and I was happy, and then I went to production, and some instances failed, so I just stopped doing that.

Using block migration? Yeah.
We had some bugs on that side. For example, stop me if I'm wrong, but if you had a Ceph volume attached to an instance and you used block migration, it would copy that volume onto itself. We had a bug like that, but we fixed it by cherry-picking something directly into the code; it was related to migrateToURI3. By using migrateToURI3 instead of version 2, we fixed it, and basically every block migration is working now. The issue was present in Juno, but on Newton we didn't have it, if I remember well.

So which version of OpenStack did you use?

We used Newton, but this was in production, on our oldest hardware of course, with the longest-running instances as well, so it might have been something like an old versioning issue. I'm not sure, because everything I tried myself in devel worked perfectly, but then there were some production instances that didn't, from local storage to another local storage.

You had some corruption?

Yeah, if I recall correctly. It failed; it basically crashed the instance.

I don't remember having that kind of issue, but maybe we can talk about it with the team afterwards, if you want. Do you know what percentage of the instances crashed during live migration? I guess some probably did.

We saw that especially with older kernels inside the guests; we've seen crashes sometimes.

By crashing, you mean instances not working?

The instance freezes; the Linux kernel inside the instance freezes.

Yeah, we had such issues, but I have no statistics about that, so I don't know.

Do you have an idea of the performance impact of block migration? I guess people are using local disks because they need high performance, and if you block migrate, that might kind of defeat the purpose.
You mean that during a block migration we could lose some performance?

Yes, during block migration. I guess people use local disks for databases that are latency-sensitive, and if you suddenly block migrate, the database might suffer.

Yeah, it will affect the instance, of course, because we may be throttling it, but it's still better than shutting it down or suspending it.

Are you sure? I mean, some customers might prefer two minutes of downtime to five minutes of erratic performance. But yeah, sure, I agree; on small instances it's quite painless. Did you have the problem that sometimes instances end up running somewhere else than Nova thinks?

Yeah, we had that, and that's why at the beginning we disabled the reaping of instances: instead of reaping, Nova was only logging, and we had probes on the computes to check whether an instance was supposed to be there or not, plus, on the Juno release, some tools to fix it. With our patched live migration, and now on Newton, we figured out that this isn't needed anymore, because Nova behaves better; it just works. So, as Julien said, we re-enabled the reaping of instances, and we are not supposed to have any duplicate instances now.

Okay. And did you have the issue where you issue a live migration and then nothing happens? We have this on Ocata; I think it's really rare, but you issue a live migration and nothing happens, because Nova only considers the host the instance is currently running on, and you have to clean up some weird database entry, and then it works.

We do have, sometimes, the case where if the first live migration fails, the second will also fail, because the destination already has some leftover part of it, and so it just fails again, and we try a third time.
We live migrate a third time, and it works after that; that's why in our tooling we retry the live migration multiple times, to help work around exactly that. We also had an issue with Ceph mon IPs, I don't know if you've hit that one: if your Ceph mon IPs change, then your live migration may fail, because the old IPs are not available anymore. And we don't have a fix for that yet.

No more questions? Thank you very much.