So, a quick bit of background. I work on the HP Public Cloud; I'm based in Bristol, England, and I've been with HP for about 18 months. The cloud itself has been running OpenStack since around the Diablo days, on libvirt and KVM, and the instances' ephemeral disks live locally on the compute nodes rather than on shared storage. That detail matters for everything that follows, because it's what forces us into block migration.

Right, so, live migration. When I first heard about live migration, I thought it pretty much sounds like magic. You have a compute node running a hypervisor running a VM, and then the VM moves over to another compute node, different physical hardware, with no interruption to the guest OS and very minimal interruption even to the external communication with the instance. That seemed pretty amazing to me, but unfortunately it's not actually magic. It's done with computers, and these are complicated machines with complicated failure modes, and you'll hear about those. So, I mentioned the ephemeral disks: that means we're doing what's called block migration, which means during a live migration we have to copy not just the contents of the memory, as you would if you were using shared storage, but the instance's disk as well. People say that's maybe not such a good idea, because having to do that makes it slow.
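Just to make the mechanics concrete: kicking one of these off from the API side looks roughly like this sketch using python-novaclient. The credentials, instance name and target host are placeholders, and the exact client call signature varies a bit between releases, so treat it as indicative rather than exact.

```python
# Minimal sketch: start a host-to-host block live migration with python-novaclient.
# Credentials, instance name and target host are placeholders; the client call
# signature differs slightly between novaclient releases.
from novaclient import client

nova = client.Client('2', 'admin', 'secret', 'admin', 'http://keystone:5000/v2.0')

server = nova.servers.find(name='my-instance')   # instance to move
target_host = 'compute-042'                      # identically specced node

# block_migration=True because the ephemeral disk lives on the compute node,
# so libvirt has to copy the disk as well as the memory.
nova.servers.live_migrate(server, target_host,
                          block_migration=True,
                          disk_over_commit=False)
```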
So, in a bit more detail, during a live migration there's a pre-live-migration step where OpenStack works out whether the live migration is possible and sets things up: it tells the hypervisor, libvirt, to set up a peer-to-peer connection between the compute nodes. Then a sort of snapshot of the disk and the memory is sent across to the target, and whilst that's happening, more activity is happening on the source node, because that's where the instance is still running. So when that first pass finishes there's some dirty memory and some dirty disk which needs to be synced across again, and then that might happen again and again and again. Eventually, if all goes well, the instance is briefly paused, like a sub-second pause, while things like networking are switched over and volumes are reattached on the target node, then everything gets unpaused, the instance is running on the target node, and your live migration has been successful.

Right, so what did we do? You've maybe heard about live migration. The typical reasons people want to do it are an impending hardware failure that you know about, where you want to evacuate a host, or OS, BIOS or firmware patches that you need to apply which would mean shutting down the node the instances are running on. Our case, which I'm going to talk about today, was the second of those, for security and compliance reasons: we want to enforce a maximum node age. We want to be updating all our compute nodes every three months to take advantage of security and OS improvements, but our customers' instances have uptimes greater than three months, right? So we use live migration to move the workload around and be able to update these nodes. So we decided that we were going to try a rolling live migration of all the instances, across thousands of nodes, across the whole data centre, every three months.

The way we decided to do this was to basically pick a host and migrate all its instances to an identically specced host. So rather than doing the kind of live migration where you fire the request off to the scheduler and have the instance placed somewhere else, we're doing host-to-host migration. That will become relevant later as well.

So, first step: work out if it's even possible, do some testing, right? And because we're doing block migration, the biggest factor is disk size. Specifically, we're using copy-on-write images from Glance: you download the image and then you've got your overlay on top, and that overlay is the bit libvirt wants to migrate over. It doesn't copy the read-only underlay. In fact, it doesn't copy read-only disks at all; I'll come back to that later. And of course, as I said, the repeated dirtying of pages will obviously cause things to slow down. So how long do you think a live migration takes? Any guess? They range from sub-second to over 18 hours in our public cloud. The reason is that we have an extremely large flavor, an 8XL: 120 gig of RAM and nearly two terabytes of disk. And it's not even 18 hours occasionally; it's quite reliably over 18 hours, because people don't boot these things up unless they've got some serious work to do which involves a lot of disk and memory. So we found 18 hours was a pretty reliable figure that we had to work with.
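To make that copy-and-dirty cycle concrete, here's a purely illustrative sketch of the pre-copy loop. It is not libvirt's real code and the numbers are invented, but the shape of the loop is what I just described.

```python
# Illustrative pre-copy loop (not libvirt's actual implementation).
# Each pass copies what was dirtied during the previous pass; the guest is only
# paused once the remaining data can be copied within the allowed downtime.
def pre_copy_migration(total_bytes, bandwidth, dirty_rate, max_downtime_s, max_passes=30):
    remaining = total_bytes
    for passno in range(max_passes):
        copy_time = remaining / bandwidth      # seconds to send this pass
        if copy_time <= max_downtime_s:
            return passno, copy_time           # pause the guest, send the rest, switch over
        remaining = dirty_rate * copy_time     # what got dirtied while we were copying
    return None                                # never converged: the migration "never ends"

# e.g. 120 GiB of RAM plus ~2 TiB of disk over a shared ~1 Gb/s network
print(pre_copy_migration(2.1e12, 120e6, 20e6, 0.5))
```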
So yeah, this is why people say don't do block migration: because it's really slow. But you know, it's 18 hours. Go to bed, wake up, it's finished.

All right, and we did some functional testing as well. We found around 30 different failure modes, where we defined a failure mode as simply: we tried it and it didn't work. Some of these are cases where, if you actually read the docs, it's not supposed to work. But we really wanted to use the same tooling across the whole cloud, just migrate everything, without having to do too much fiddling around depending on what kind of instance we're talking about. So I've got some examples. Asynchronous instance operations which were started before the migration and complete during the migration: failure mode. Volumes occasionally trying to be attached to the source and the target at the same time — a race condition somewhere, perhaps. Instances in unusual states: for live migration, the instance has to be live, but there are different types of live. In terms of libvirt and QEMU/KVM there's only one type of live, but OpenStack subdivides that into paused and rescued, which are also kinds of live. We saw migrations never finishing, right? It's possible that the page dirtying will simply never be caught up by the network transfer. You can do some back-of-the-envelope calculations and work out how much memory IO and disk IO you can use before that becomes a problem, and if you do those calculations, you'll be unpleasantly surprised at how inaccurate they are, because it seems that a very small amount of memory or disk IO can cause an endless migration. So you query libvirt through virsh and find out how much is left to go, and that number might just not be going down. It might even be going up. So, yes, migrations never finishing, sometimes. And failures don't always leave the instance back in the proper state it should have been in. So it goes.

So these are some of the restrictions which we got round by patching the code. The instance has to be active, not paused; we've now had a patch land upstream for migration of paused instances, and that landed in Kilo. Rescued is a little bit different, because when you rescue an instance in OpenStack, it boots into a different flavour and attaches what used to be the boot disk as an extra drive. And that flavour might not be the same size, so if you do a migration that goes through the scheduler, the instance looks to be a different size when it's unrescued, and you have the problem of how to account for there being enough disk space on the node it lands on. We didn't have this problem because, remember, we were doing whole hosts with the same size disks, so we knew it was going to be fine because it was fine before it started. Stopped or suspended: that's not live, right? So you can't live migrate it. But we didn't really want a very complicated tool chain, so what we do in this case is bring the instance up to paused, live migrate it, and shut it back down again, so no CPU cycles ever run on the guest OS. We tried to send this patch up to upstream Nova and they didn't really like it. Fair enough. But it worked for us. Error is just: who knows what's going on. There could be all sorts of things wrong when it's in an error state.
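For what it's worth, the back-of-the-envelope version of that calculation looks something like this. The bandwidth and dirty-rate figures are invented, and as I said, real migrations behave worse than this naive model suggests.

```python
# Naive convergence estimate for a pre-copy migration (illustrative only).
# It converges only if the guest dirties data more slowly than we can send it.
def naive_migration_time(total_bytes, bandwidth_bps, dirty_rate_bps):
    if dirty_rate_bps >= bandwidth_bps:
        return None                                  # never converges
    # geometric series: size/bw * (1 + r + r^2 + ...) with r = dirty_rate / bandwidth
    return total_bytes / (bandwidth_bps - dirty_rate_bps)   # seconds

size = 120 * 2**30 + 2 * 2**40                       # 120 GiB RAM + ~2 TiB ephemeral disk
print(naive_migration_time(size, 125e6, 0))          # idle guest: roughly 5 hours
print(naive_migration_time(size, 125e6, 100e6))      # modest write load: roughly 26 hours
print(naive_migration_time(size, 125e6, 130e6))      # None: never finishes
```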
Some of them are migratable, some of them aren't. Config drive, right? So that's the thing where, if you're doing block migration, libvirt won't migrate read-only disks, and config drives are read-only disks — they appear to the guest as CD-ROM drives or something like that. So you'll find that after the migration completes, the instance is on the target, but the backing file for the config drive isn't there. So we patched that. And attached volumes are the same problem with libvirt, but the other way around: this time the attached volume is read-write, so libvirt will try to migrate it, because you're doing a block migration. It will migrate it from where it's attached on the source to where it's attached on the target, which is the exact same place. So all it does is read it out of Cinder and write it back into the same place in Cinder, which makes it unsparse, and if anything goes wrong during that process you've got a very high risk of data corruption. Recently, some patches have landed in libvirt which allow a lot more granularity in specifying which drives you want to migrate when you're doing a block migration. That's really good to see; it would have been great to see 18 months ago, but it's still good to see now.

Right, so, restrictions that we worked around operationally. The instance can't be too busy — I told you about the never-ending migrations. Lorraine, who's just down here, did some good work on interrogating the proc filesystem to work out in advance whether the instance was too busy. And what can you do in that situation? You have to wait for it to calm down, phone the customer... I don't know; you can't migrate it, so you've got a problem. The Keystone token must not expire during the migration. That's correct. It must not. And if you've got 12-hour Keystone expiry and 18-hour migrations, then it's going to happen. The effect is that the post-live-migration step, where it tries to reattach the network and the volumes on the target, fails, and then everything's rolled back. The very last step in your 18-hour process just says: nope. Brilliant. And, yeah, I told you how block migration only copies the overlay of the copy-on-write file. That means the image — the underlay — has to come from somewhere else. And where's it going to come from? It might be cached on the target node already, because Nova does a bit of caching of images and there's only a finite number of them. Or, if not, you can get it from Glance. Or, if not, someone has to notice in advance that this is going to be a problem and scp the image across. Which is totally doable, given that the migration itself is going to take the best part of a day anyway.

Right. So, let's talk in detail about the process we used to do this rolling migration across the data centre. First of all, at the per-instance level: I mentioned how async operations can mess you up, so the first thing we do is lock the VM. This is an admin lock on the VM, so no new operations can be started on it, and then we just wait so that anything that was in flight when we locked it finishes — how long, 60 seconds? Then start the live migration, and once that's started — which you can find out through the OpenStack APIs, or by querying libvirt directly — you can unlock the instance again.
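That "is it too busy" check can be approximated by sampling the qemu process's I/O counters from /proc. This is my reconstruction rather than the actual tooling, and the PID and threshold are placeholders.

```python
# Rough "too busy to migrate?" pre-check (a reconstruction, not the real tool).
# Samples /proc/<pid>/io for the instance's qemu process and estimates how fast
# the guest is writing to its disk; heavy writers are likely to never converge.
import time

def disk_write_rate(qemu_pid, interval=10):
    def write_bytes():
        with open('/proc/%d/io' % qemu_pid) as f:
            for line in f:
                if line.startswith('write_bytes:'):
                    return int(line.split()[1])
    before = write_bytes()
    time.sleep(interval)
    return (write_bytes() - before) / float(interval)   # bytes per second

# Placeholder numbers: refuse to migrate if the guest writes faster than ~30 MB/s.
if disk_write_rate(12345) > 30e6:
    print("instance too busy, skip it for now")
```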
And then you wait, and the live migration takes place. There's no way in the OpenStack APIs of querying its current progress, right? You have to go to libvirt, or to the hypervisor some other way. So we were obliged to develop some tooling around this because, like I mentioned, the amount remaining can just stop going down, so you have to watch it. OpenStack doesn't let you do that yet; there's work in flight right now to allow it. Right, but anyway, assuming everything's gone well, the live migration finishes and that instance is done.

Then per compute node: you get an empty target node which is up to date with your latest OS, security patches, firmware, BIOS, the stuff you need. You then take a source node — the one you're going to upgrade later — and disable it so that no new instances get scheduled to it, and then, five at a time, we start migrating its instances over. I saw someone from Time Warner saying never to do more than one at a time, but we found five at a time worked; perhaps it's because we have good network locality between the source and target or something like that. It was good for us. When they're all done, you can re-enable the target node as a scheduling target for Nova, and the old source node is now empty, so you upgrade everything on it and reboot it. And across the AZ we have multiple of these source-target pairs of nodes with this going on and on and on. So we've got dozens, possibly hundreds, of migrations taking place simultaneously in these pairs, and the effect is like a beautiful field of corn with a nice breeze going through it. Everything just rolls through nicely, and what could go wrong?

Well, maybe you've guessed. What goes wrong is any single failure of a migration where you end up with an instance stuck on the source that won't come off, which makes that node unusable as the next target, because you can't do anything with it. You can't upgrade it. So what do you do? You get a red pen, you write "this one does not play nicely with others", you put it in the corner, you tell someone in Ops to have a look at it, and maybe you get someone in support to reach out to the customer and ask if they wouldn't mind — remember we're only doing this every three months — sometime in the next three months arranging to reboot that server, and we'll work with them to make sure it comes up on a new node. That's fine as long as it doesn't happen too often, right? So, does anyone want to have a crack at guessing the success rate for an individual live migration? 85% of the time — very good. I did some maths earlier: 0.85 multiplied by itself five times is about 0.44, so if you start five off at once, you have a less than 50% chance of them all completing successfully. So this thing where you write "this one's no good" and put it in the corner — this is happening a lot. More than we wanted it to, certainly. 85 is a long way from 100. And so we had this ongoing, increasing operations problem and, in some cases, a lot of work for the support teams.
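The watching we ended up doing against libvirt looks roughly like this sketch with the libvirt-python bindings. The connection URI, domain name and thresholds are all placeholders.

```python
# Sketch: watch a running migration via libvirt and notice when it stops making progress.
# Connection URI, domain name and thresholds are placeholders.
import time
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-0000abcd')

stalled, last_remaining = 0, None
while True:
    # jobInfo(): [type, timeElapsed, timeRemaining, dataTotal, dataProcessed,
    #             dataRemaining, memTotal, memProcessed, memRemaining, ...]
    job = dom.jobInfo()
    if job[0] == libvirt.VIR_DOMAIN_JOB_NONE:
        break                          # migration finished (or never started)
    remaining = job[5]
    if last_remaining is not None and remaining >= last_remaining:
        stalled += 1                   # not going down: possibly a never-ending migration
    else:
        stalled = 0
    last_remaining = remaining
    if stalled > 60:
        print("migration appears stuck, flag this one for a human")
        break
    time.sleep(10)
```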
People were dealing with dozens of new failures every hour, and the worst that can happen is that it just slows and eventually stops the whole process until you figure out what to do with these nodes, and then you've got some new targets and you can restart it. But it was a lot more labour-intensive than we were really expecting.

So, 85% of the time it works. What about the other 15% of the time? Why wasn't it working? We'd done a lot of testing already, we had some good operational procedures around it, so what could go wrong? I've grouped these failures into three classes.

Firstly, it's possible the migration never starts. There could be a problem establishing the peer-to-peer connection between the two hypervisors, or a problem in the pre-live-migration function in OpenStack. And the logging in that function is pretty good; it usually comes out as "starting live migration failed". OK. That has improved recently — remember I'm not talking about the latest bleeding-edge code — and I'm glad to see it. We use the same network for live migration as we use for all control plane traffic, which is a different network from the one we use for instance-to-outside-world traffic. But it's still the same network that all the RabbitMQs talk to each other on, and so on. RabbitMQ is a rich source of comedy if you've got thousands of nodes, and hammering that network with live migration traffic as well doesn't help either party. So, failure one: it never starts.

Failure two, which I mentioned earlier: it never stops. You've got past the first hurdle and then... libvirt offers you no guarantee, when you start a live migration, that it will ever finish. It offers you the opportunity to set a timeout, where you say: if it takes this long, just give up and cut out. It offers you the opportunity to increase the amount of time you're willing to totally suspend the instance for, to copy the last bit of dirty pages. I think the default for that is less than half a second, and I suppose you can increase it to whatever you're comfortable with; with the default, though, convergence can very easily be impossible to achieve. So, as I've mentioned a few times now: it's possible it never ends.

And the last failure mode is that it does end — just not how you wanted it to end. It's migrating, you look away, you look back, and the instance is dead on the floor. What happened to you? We never replicated this outside of production. We tried. But inside production this is a very bad thing to happen, because from the customer's point of view the instance has just stopped dead. There's no data loss: you can start it up again, it comes back up and it's fine. But we're not going to just try the migration again, because if the same thing happens again we're going to make people very cross. We never got to the bottom of this. It just looks like a hypervisor crash; there could be any number of problems. We never replicated it outside of production, and we've only got a finite amount of customer goodwill to keep hammering away at this problem in production. So, in the end, this happened, and that's it. It's a bit worrying, and we don't want it to happen any more. The upshot, anyway: the instance ends up powered off on the source, so the migration hasn't happened, you can't recommission the node, and you don't really want to try it again just in case.
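Those two knobs — give up after a while, and allow a longer final pause — map onto libvirt calls roughly like this. Again, just a sketch, with the domain name and numbers made up.

```python
# Sketch: the two libvirt knobs mentioned above, with placeholder values.
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-0000abcd')

# Allow the final "stop the guest and copy the last dirty pages" phase to take
# up to 2 seconds instead of the sub-second default; this helps convergence at
# the cost of a longer pause.
dom.migrateSetMaxDowntime(2000, 0)   # milliseconds, flags

# And the "just give up" option: abort a migration job that has run too long;
# the instance keeps running on the source.
dom.abortJob()
```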
So, oh yeah, sad face — to do: replace that with the actual picture. Right, so in the end we stopped trying to use live migration like this. It just became impractical. I'll summarise what I've just said as three main problems. Operations and support pain caused by the high failure rate, meaning we've got this big pile of "doesn't play nicely" nodes and instances to deal with. Instances shutting down unexpectedly — you can imagine this is not very popular with the people who are paying for those instances. And there is a well-known performance impact on instances, because you're doing more work on the box. Now, we use cgroups for tenant isolation, but we only use cgroups for managing the CPU and network bandwidth allocated to each VM, and those two things are actually completely not a problem for live migration. It's memory IO and disk IO that totally kill it, and they're not cgrouped, so as soon as it starts getting bad for one instance on a node, it gets bad for all the instances on that node, even the ones that aren't being migrated at the moment. So whilst this is ongoing there's an unknown-duration period of bad disk and memory performance. We actually found that high CPU activity sort of helped, because the instances were too busy thinking to have any need to write anything down, so it cut down on the memory and disk IO they generated.

All right, so, like any good plan, there was a plan B. There were two options, but they both involved basically shutting down all the instances, doing what you need to the node, and then bringing the node back up with the same instances in the same place. So we don't need to migrate anything, we don't need the rolling effect; we can do it to the whole data centre at once if we want, although perhaps that's a little bit large. So the choices are suspend-resume or stop-start. You might think NTP can just recover the system clock — that's not really true, right? Well, I suppose it can in theory, but a lot of customer workloads are completely not okay with clock drift of the order you get from doing suspend-resume. It was really unpleasant. And suspend takes a lot longer than shutdown, for some reason. So, in the end: stop-start. We chose three weekends — we've had one of them already, and there are two more coming up in June — and we're doing a third of a data centre at a time: shut down all the nodes, upgrade them all, restart them. That has proven to be a lot more popular with customers than random failures, poor performance, and so on.

What's up next? Oh yeah, we're coming towards the end; I've got a bit about our future plans. We're still pushing some of the changes we made for live migration upstream, and we're still finding that libvirt is increasing its capability to do the kinds of things we struggled with, so some of the things that weren't appropriate when we first tried them now work quite well. It's really great to see that testing of live migration has made it into Nova's gate, so that if you put up a patch that has broken live migration, you'll get a failed job out of it, which means, hopefully, that regressions will become a bit less common. And we're really going to work hard to make this more robust.
We're totally treating it as first-class functionality for the HP Helion OpenStack private cloud distribution, which is what I'm working on now. And this week, I think — which is why I don't know very many details about it — there's been an announcement of a partnership between HP and Intel, which I'm pretty excited about. If you saw the Intel guys talking about live migration yesterday, you'll know they know what they're talking about, so I'm pretty pleased about that. And that is actually my last slide, apart from the one that says thank you. But because I'm slightly ahead of schedule, I can give you an invisible slide, which is just some details about a paper that Paul, who's just sat there, showed me about moving workloads around the data centre when I was doing research for this. It was pretty interesting. It's well known that the major cost in running a data centre is power, and especially air conditioning is expensive. There was some work done by HP Labs which tried to quantify different ways that physically moving workloads around a data centre could help. So one thing you can do is stack all your work in one corner, chill that corner down and turn off the air conditioning elsewhere; that works quite nicely. Another, surprising one was to spread the load out as evenly as you can, open the windows and turn off the air conditioning, and if you live in the right climate, that can keep the data centre cool enough. So, you know, it's still an exciting thing for a data centre operator to do. And it was a little disappointing that our beautiful plans didn't come to fruition, but we're still working on it, and it would be great if you would join in and help us. What's up? Question straight away.

You do this every 90 days? Yes. And presumably that's because of HP's security policy on applying patches? Correct. But are you hitting things every 90 days that require a restart of the machine? That's a good question. I mean, there's an awful lot of stuff that can be upgraded in place — though obviously not a new kernel, and not the BIOS. I guess that within those 90 days we were planning for the rolling migration to last about a month, and I suppose if we didn't need to do it, then why would we put ourselves through that trouble? But, like you say, it's the policy that dictates that it might be necessary. So, yeah.

We're in exactly the same boat and we have the same problems. And I wondered, have you considered just biting the bullet and having some shared storage for all the nodes, like Ceph-backed nodes, and just doing proper live migration? Yeah, so the reason we keep the ephemeral disks local on the hypervisor is that for every other use case, apart from live migration, it seems to offer better performance. That, and cost, right? This live migration plan came after the data centre design, the architectural design, and the original design was done for speed and cost.

Thank you, it was a great presentation. We're in exactly the same position. Yeah — where are you from? CERN. So shared storage is too costly; we're actually doing plan B, but we'd like to do plan A. We have to do plan A, because we've got 3,000 hypervisors to reinstall with a new version of the operating system. So I'm very interested in your experiences, and you mentioned some tooling. Are you willing to share that?
Nick, are you willing to share that? Lorraine? For your plan A, which you're getting rid of anyway? Yeah, I mean, a lot of it could go out. A lot of it is calling directly into libvirt for stuff that's now available through the OpenStack APIs, and it would be out of date, but I don't see why not. In fact, that might be a decent idea for an OpenStack Ops kind of project. That'd be great. Thank you.

So it sounds like a private cloud thing, isn't it? Well — I don't know how much control you have over what kind of workloads you get. People say live migration is not very cloudy because it's all about keeping your pets alive, but on a public cloud we have no idea what kind of pets are on there. It's similar — I have no idea what's running on my hypervisors. All right, cool. More questions?

A question about the hypervisors: are they just one kind of hypervisor, or do you have a mix and match? Sorry? What kind of hypervisor, right. So we use libvirt/KVM throughout. When you want to do a live migration, you want to make sure the CPU instruction set on the source and target is the same. For the majority of our nodes we use a libvirt CPU model, but for a small number we use host pass-through, and you can't migrate between those two types because the CPU instruction set is different, so it won't work. I guess libvirt will probably kick you out if you try to do it that way. Thanks.

Hey, so I guess I can echo CERN's response there. I work on the Nectar Cloud, and we have the same issues, so it's a theme. You mentioned that you had a problem with attached volumes — what sort of volumes were they? So these would be not boot volumes, but just data volumes. Right, but are they iSCSI volumes or Ceph volumes? Nick? iSCSI, right. I've seen this problem with Ceph and patched it a while back, because it was just a simple issue. So the problem can be that, if you've got two volumes, when the instance arrives on the target they can have their device IDs swapped over, because they're just numbered sequentially in what seems like arbitrary order. I guess possibly the same thing happens with Ceph, I don't know. All right. Did you all catch that? There's an HP Public Cloud knowledge base article about how to get around that by referring to the volumes by UUID.

You said something about cgroups — were you considering trying to throttle the source as you started the migration? No, we used cgroups not for migration, but for general tenant isolation, because we have some pretty noisy tenants. Have you tried, or considered, actually changing the CPU quota on the source? It's possible, but it's not the CPU activity that's the problem; it's the shared memory IO and disk IO. You can do it for the disk as well. What value do you choose? Mostly it's bursty activity and you want everyone to get what they can. And dynamically changing cgroups is quite difficult, I believe — I've never tried it personally. All right?

Hi, I'm from Cloudwatt, a public cloud; we have the same problem and the same OpenStack. Have you experienced that if your ephemeral storage has higher IO it gets worse? Like if you have SSD ephemeral or whatever? Yeah, well, during the migration anything that gets written to the disk on the source also has to be sent across the network.
So I guess if you've got a faster disk, then you need more network bandwidth to keep up, because the pages will dirty more quickly. I don't believe we have any SSDs — I look down at my colleagues — no, we use RAIDed spinning rust.

You mentioned that you were doing one-to-one migration. By doing that, if you have a hypervisor which is nearly empty, when you migrate one-to-one you go to a hypervisor which is still nearly empty. Have you considered the possibility of combining nodes, like taking two and putting them onto one? No, for this use case we wanted to keep it simple. But that actually is a good idea, because we pay a licensing fee per physical CPU, so if you can squish all those onto one host, that would save money. But no, we just did exactly matching hardware and matching instances, because we didn't want any chance of stuff not fitting. We've got other problems too: customer affinity and anti-affinity rules, which aren't recorded with the instance — or maybe have only recently started being recorded. Thank you.

Any more? Anyone else from a different cloud with the same problem? No? Okay, well then I'll call it.
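One footnote on the cgroup throttling question above: we never did this in production, but dynamically capping an instance's disk writes with cgroup v1 blkio would look roughly like this sketch. The cgroup path, device major:minor and the limit are placeholders, and it assumes the instance's qemu process already lives in that cgroup.

```python
# Sketch of dynamic disk-write throttling via cgroup v1 blkio (we did not do this).
# The cgroup path and device numbers are placeholders.
CGROUP = '/sys/fs/cgroup/blkio/machine/instance-0000abcd.libvirt-qemu'

def throttle_writes(bytes_per_sec, device='253:0'):
    # Format is "<major>:<minor> <bytes per second>"; writing 0 removes the limit.
    with open(CGROUP + '/blkio.throttle.write_bps_device', 'w') as f:
        f.write('%s %d\n' % (device, bytes_per_sec))

throttle_writes(20 * 1024 * 1024)   # cap the source instance at ~20 MB/s during migration
throttle_writes(0)                  # lift the cap again afterwards
```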