I think that is now our cue, yes. Good afternoon, everybody, and thank you for joining us. I know it's after lunch, and we're going to try and keep you awake as far as possible. Today we're going to be talking about migrations and workload migration within the OpenStack community, and it hopefully will be a fun topic to get through. We've got a big group with us today — more of a play than a presentation. So I'm Sean O'Mara, a senior systems architect with Mirantis and the EMEA architecture lead for the services group. I have quite a bit of experience with migrations, both as a customer and from within Mirantis. Ernest De Leon is a cloud architect with Mirantis and senior manager for services engineering. We've got Roman Verchikov, software lead engineer, who looks after the product we use for migration. And Irat Karetenov — Irat works for one of our partners, CloudOps. Not with us today is Samir, who's the project manager for workload migration.

Right. So we're going to cover an introduction to workload migration. When to migrate: is it a good idea, why and why not? Assessing workload migration viability. What happens during a migration — what is the process behind it? And how long does a migration take?

So a little bit of background. Is a migration a good idea? Well, all your apps are cloud native, aren't they? Show of hands: who here works at a company that only has apps that are cloud native? Yeah, kind of the expected answer. Love the participation. In a perfect world, yes, we don't want to migrate workload. We don't want to have to worry about the pain of moving workloads from legacy systems or even other clouds; it should all just be recreated. But we don't live in a perfect world.

So why migrate? Well, we've always got legacy applications that we have to deal with, and those applications need to be moved onto new infrastructure as quickly as possible.
Quite often we don't know these workloads, or understand what's part of them, for a huge number of reasons: everything from being written 20 years ago, to the people who wrote them having left, to just sheer complexity. Time to market: we want to move quickly. When we build a cloud service, a big part of that is cost saving — if we can reduce the cost of managing that infrastructure, we can move applications onto it, and it helps us with our time to market. And the last one, which is the common one we've seen, is upgrades. We all know that upgrading OpenStack in place has been a big question for a long time now — whether or not we can upgrade OpenStack in place. What we've been doing up till now, largely, is workload migration from other OpenStack clouds: moving all the workload, moving all the data, moving all the information about your cloud. The other reason is customer impact: if I want to move information, if I want to upgrade, I can reduce my customer impact, because customers don't have to go and recreate applications. And finally, a big one is architectural change. I want to implement a new SDN solution; I can build a new cloud. And we've got a guest in the room. Hello. Oh, that's not working. That one's not working. Hand them over.

So the Mirantis bear has joined us. Now, Mirantis has a number of bears — we've got bears and teddy bears. Hello. Hello. This is quite tasty. How's everyone doing? Good, good. I'll come around and maul each of you individually if I don't get a better response. How's everyone doing? All right, I'll take it. It's Tuesday. Yeah, I love you too. More noise and maybe there'll be vodka afterwards. You guys are going to see something at StackCity you've never seen before: the drunk bear wandering around. It'll be great. A drunk bear and a guy in a kilt, wandering around.
Get both of us in the same picture and you get a special prize. All right. Thank you. You're very welcome. Enjoy the rest of the presentation. Sorry for the interruption. Thank you, bear. So we had a surprise visit from the Mirantis bear. Okay, carrying on.

There are a number of reasons why you would not want to migrate your cloud. At the end of the day, you're carrying bad practice over from your legacy applications; traditional design practices are usually not appropriate for moving workload into cloud. Infrastructure availability: in the traditional world we make the infrastructure more available rather than the applications. Cloud design criteria, pretty much the same reason: cloud-native apps work better in cloud. They allow us to take advantage of the capabilities of cloud — workload migration, scaling, all of those things. Infrastructure restrictions: part of the reason clouds can be built more cheaply is that we can use commodity hardware, but of course that means failure rates are higher, so those are the sorts of things we have to take into account. Supportability of legacy: legacy applications are a nightmare to support. People leave, and we lose flexibility within our clouds, because now of course we can't just shut down components and hope the application recovers. And cost: the parallel support costs of maintaining your clouds when you have legacy applications are huge. And finally, lots and lots of extra infrastructure.

So now we get a little bit into the meat and potatoes of what happens during a migration. Sean covered whether you should migrate or not, which is kind of the most important question, and why you should or shouldn't. I think it's implicit in there that if you have a native app designed for cloud, migration mostly takes care of itself. So the next part is: we've decided we want to migrate, right?
Usually it's from one cloud to another, but it could be from legacy equipment to a cloud. So what do you do? The first thing is a technical deep dive into each application architecture. You identify all the applications, and depending on how your tenants have designed them, they can fall into single projects, or they may be spread across multiple projects or even multiple clouds. Either way, you want to assess that and understand in deep detail: what is the application architecture? What is the functionality of the application — what is it delivering at the end of the day? And are there any SLA expectations, both from the application to end users and from your infrastructure to the application owner?

In that process, you have four major areas to focus on. The first is identifying all the application stakeholders. This is usually a diverse group of people: the business owner who's signing the check for everything being written and migrated, application architects (both software and hardware), lead developers, project managers who oversee the application and its operation, and technical account managers. If it's a customer-facing application, there's a whole lot of other people too. You have to find out who the key stakeholders are to understand their requirements for a migration. The second thing is you set up the technical deep dive call with all of the tenants to understand all of this in detail. One of the things you request for those calls is logical application architecture diagrams, so that you can dig into the nuts and bolts: how the application is distributed, if it is; what components make it up; what network segments it sits on; what type of storage it uses; all those kinds of things.
Usually those diagrams give you enough of a view to ask the right questions about what you need to know about that application. Then the last thing you do is review the architecture in detail. Once you have all that data, you go through a series of questions for each one of the applications.

The first thing you need to understand: is it highly available, is it fault tolerant, and is it distributed? These questions are about the ways you can take parts of the application offline without affecting its overall functionality. If not — if these are legacy applications that scaled up instead of out — you have to take that into consideration, and it means there probably will be downtime.

Is the application wholly contained within the OpenStack environment? This is one we've run across quite a few times, where someone did a lift and shift of an application and dropped it into the cloud, but a portion of the application couldn't fit. For example, the data store was an Oracle database far too large to fit inside the cloud, so they put all the other tiers of the application into the cloud and ran a direct pipe back to the Oracle database. That's a problem you need to be able to address. We've had others where people connected physical routers via layer 2 into the side of the cloud, for things like VoIP applications — that also has to be taken into consideration. So you want to understand: is the application wholly contained in OpenStack and therefore easy to move, or do you have hooks into other infrastructure outside of the cloud that make it more difficult?
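Viability questions like these — including the ephemeral-instance and pets-versus-cattle checks that come up in this part of the talk — can be captured as a rough checklist. This is an illustrative sketch only; the names and scoring are hypothetical, not Mirantis tooling.

```python
# Hypothetical viability checklist for one application. A "no" answer to
# any question adds migration risk; a "no" on high availability usually
# means downtime, as discussed in the talk.

VIABILITY_QUESTIONS = [
    ("ha", "Is it highly available / fault tolerant / distributed?"),
    ("contained", "Is it wholly contained within the OpenStack environment?"),
    ("ephemeral_ok", "Can ephemeral instance data be lost or recreated safely?"),
    ("cattle", "Are the components cattle rather than pets?"),
]

def assess(answers: dict) -> dict:
    """Return a rough verdict; every unanswered or 'no' question is a risk."""
    risks = [q for key, q in VIABILITY_QUESTIONS if not answers.get(key, False)]
    return {
        "viable": len(risks) < len(VIABILITY_QUESTIONS),  # at least one favourable answer
        "expect_downtime": not answers.get("ha", False),  # scale-up apps mean downtime
        "risks": risks,
    }

# A legacy, pets-style app that is at least fully inside OpenStack:
verdict = assess({"contained": True})
```

The verdict here is deliberately crude; in practice the deep-dive calls and architecture diagrams drive the real decision.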
Are ephemeral instances critical? This matters in OpenStack because, unlike AWS — where there's an implicit understanding that if an ephemeral instance goes away for any reason it's gone, and the application designer knows that — OpenStack doesn't treat ephemeral instances the same way. The use cases in OpenStack are a little different: usually people are chasing disk-speed advantages, so you can end up with, for example, large distributed databases with semi-persistent data on the nodes themselves, and you may or may not have to migrate that data. You need to understand that.

The last question: is the application comprised of pets or cattle? This goes back to what Sean was saying. Legacy applications are probably mostly comprised of pets, and pets are things you care about — things you can't just destroy and not worry about, because you need that application to run and you need every component of it. Cattle is the exact opposite: you can move them around, shift them around, and one moving doesn't affect the others, so you don't have to give specific care to a cattle VM.

So that's the thinking behind assessing viability. Once you've done that and you've deemed a workload viable for migration, the next step is to proceed into the planning stages of migrating.

Hey everyone, I'm Roman. I'm going to give a brief overview of how a migration actually goes: what we do, what kinds of problems we run into, and how we deal with them. This is actually where we started, and in reality it never worked — as you can see, this schema is quite complex and not really readable.
So we had to simplify everything, and what we decided is that we will migrate each tenant separately, with all the workload that tenant has. And instead of moving hardware from one cloud to the other, we decided we need the destination cloud set up and ready before we do the migration.

So what happens? Most of the objects in OpenStack are migrated using the OpenStack APIs, which means that the unique identifiers for those objects are not kept in the destination. That's how most of the objects are done: we read objects from the source cloud, then we call the OpenStack APIs to recreate the same objects in the destination. This sounds kind of simple, but some objects need magic — and at Mirantis, magic means a glass of vodka, and the next morning you wake up and everything's done magically. No, unfortunately not. In reality, life is more complex. There are several objects which require special treatment: floating IPs, quota usage, volumes, and ephemeral storage. I'll go over each of those.

With floating IPs we have two challenges. First, if we want to keep floating IPs, the Neutron API unfortunately does not let us specify the same floating IP address when creating one. For that case, we need to enter god mode and do direct database manipulation, pretending to be Neutron — which obviously means this solution is difficult to maintain and needs to be tested really carefully before we start doing anything. The second problem is IP conflicts: if you keep the same floating IPs in the source and destination clouds, you can end up with the same floating IP live in two places. To resolve that, you move the VMs first, then detach the floating IPs from the source cloud and associate them in the destination, and after that shut down the VM in the source cloud. The other thing is quota usage.
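The floating-IP cutover order just described — boot in the destination, detach in the source, associate in the destination, shut the source copy down last — can be sketched as a sequence. `CloudStub` is a hypothetical stand-in for each cloud's Nova/Neutron clients, not a real API.

```python
# Sketch of the IP-conflict-safe cutover order. The key constraint is that
# the floating IP must never be answered by both clouds at once, so the
# detach in the source happens before the associate in the destination.

class CloudStub:
    """Records which cloud performed which operation; a real client would
    call the Nova/Neutron APIs here."""
    def __init__(self, name):
        self.name = name

    def __getattr__(self, op):
        return lambda *args: f"{self.name}:{op}"

def cutover(vm, fip, source, dest):
    return [
        dest.boot_vm(vm),             # 1. VM recreated in the destination
        source.detach_fip(vm, fip),   # 2. FIP released in the source cloud
        dest.associate_fip(vm, fip),  # 3. same FIP associated in the destination
        source.shutdown_vm(vm),       # 4. source copy shut down last
    ]

steps = cutover("web-1", "203.0.113.10", CloudStub("src"), CloudStub("dst"))
```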
Nova keeps track of resource usage, but the problem is that the migration is usually done with some superuser admin account, so all the objects in the destination are recreated from one user, and thus the usage data in the destination will be different from what we have in the source cloud. Again, we need to enter god mode and modify the database directly, and again, the same problems: it's dangerous and difficult to maintain, but so far that's the only way.

The next one is volume migration. Obviously, simply recreating volumes in the destination doesn't make much sense; you need to transfer the data as well. For that, you need to figure out which Cinder backends you're moving data between, and Cinder currently supports more than 50 different backends — NFS, iSCSI, Fibre Channel, different vendors — which means the data needs to be transferred in a different manner for each. For NFS, for example, it's just a copy of a simple file; for iSCSI, we need to copy from block device to block device; and there are all the combinations between those. You need to design your migration tool to handle that. The other problem is network bandwidth, which comes up all the time: sometimes we saw data transfer speeds under 500 kilobytes per second, which is quite slow, so you should expect this kind of problem. To deal with networking and data transfer problems, you split huge volumes into smaller chunks, transfer each small chunk separately, and retry on error.

Yes — god mode is a solution. Come to the Q&A afterwards and we can give you a bit more detail. With ephemeral storage we have similar problems: again, different types of backends — NFS, just local storage — and networking problems.
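The chunk-and-retry approach described for volumes (and ephemerals) can be sketched as follows. The chunk size and the in-memory buffers standing in for block devices are illustrative; a real tool would read from and write to actual block devices or files over the network.

```python
# Split a large block copy into chunks so a transient network failure
# only forces a small retransfer, not the whole multi-terabyte volume.
import io
import os

CHUNK = 4 * 1024 * 1024  # 4 MiB; real chunk sizes are tuned per link

def copy_chunk(src, dst, offset, length, retries=3):
    """Copy one chunk, retrying on transient I/O errors."""
    for attempt in range(retries):
        try:
            src.seek(offset)
            data = src.read(length)
            dst.seek(offset)
            dst.write(data)
            return len(data)
        except OSError:
            if attempt == retries - 1:
                raise

def copy_volume(src, dst, total):
    copied = 0
    for offset in range(0, total, CHUNK):
        copied += copy_chunk(src, dst, offset, min(CHUNK, total - offset))
    return copied

src = io.BytesIO(os.urandom(10 * 1024 * 1024))  # stand-in for a block device
dst = io.BytesIO()
n = copy_volume(src, dst, 10 * 1024 * 1024)
```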
So again, you want to know how the files are stored on the destination storage in order to copy them, and then you retry on networking problems and split huge files into smaller chunks. With that, I hand over to Irat.

So now the final step: how long does a migration take? Usually, once people know how long it takes, they don't want to do the migration. But before getting to that — why do we want to know how long the migration takes? First, risk planning: every migration brings a potential risk of downtime, so you want to be prepared for that and know when each tenant will be migrated. Once you know how long it takes, you can also come up with a proper cost and a proper schedule for the tenants — and basically everybody wants to know when the migration is going to end.

Before finding out how long it takes, we want to know what we're actually migrating. The first approach is to do the migration in one shot, a kind of big-bang approach. As we already discussed, those migrations usually don't make sense. One case comes to mind: if you have a public cloud, you cannot really select — you have to migrate everything, and you have to meet the SLAs with your tenants. But in the normal case it's hard to do even that, because a one-shot migration needs the destination cloud to be the same size as the source cloud, which is very expensive. So to tackle that, you want to migrate progressively, tenant by tenant. The first way of doing it is to identify the business-critical projects that you want to move. That way you avoid having to fill up the destination cloud, and you're focusing on exactly what you need to migrate.
Another case could be that you just want to migrate some VMs between tenants or to a different cloud. So now we know what to migrate; I'll quickly explain the way we automated the time-estimation discovery. First, we collect data from the source and destination clouds, create test volumes, images, and VMs with ephemerals, do a test migration, and measure the speeds. Once you know the data you want to migrate and the speeds, you can do the time estimation. Then you can plan your migration and set the schedules. You might find you have very slow speeds, so you might need to address your links, go through one more iteration of collecting data and estimating speeds, and then make a final schedule and start your migration.

Now I'll go quickly over each of these points and we can wrap up. As I said, when we collect migration data, we discover both clouds, store everything in a database, and then present size views per tenant. The most important things to consider are images, volumes, and VM ephemerals, because that is the data that actually needs to be migrated. We can also provide a list of unused resources — for example, volumes that are not attached, or images that are not used for building VMs — so you can do a cleanup on them. The OpenStack resource migration itself is pretty fast, because it's automated and it's only metadata to migrate, so usually it's not a significant factor. And we can also provide total sizes per cloud.

This is just an example from our last project, with the numbers fresh in mind. You can see the Glance transfer we performed through the APIs: we got speeds of around 40 megabits per second. For ephemerals and Cinder volumes on NFS, you can use rsync and scp — very slow transfer speeds.
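The estimation step itself — data sizes divided by measured per-backend speeds, plus a flat overhead for the OpenStack metadata migration — is simple arithmetic. A sketch, with made-up numbers (not the talk's real figures):

```python
# Illustrative migration-time estimate: sizes in GB, measured speeds in
# MB/s per storage type, plus a flat overhead for the metadata migration
# ("a couple of hours" in the talk). All figures below are invented.

def estimate_hours(sizes_gb, speeds_mb_s, overhead_hours=2.0):
    seconds = sum(gb * 1024 / speeds_mb_s[kind] for kind, gb in sizes_gb.items())
    return seconds / 3600 + overhead_hours

hours = estimate_hours(
    sizes_gb={"images": 1024, "ephemeral": 3072, "volumes": 20480},
    speeds_mb_s={"images": 40, "ephemeral": 80, "volumes": 200},
)
```

Because the data-transfer term dominates, the total scales roughly linearly with data size, which matches the "no magic there" point in the talk.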
So we found there are other transport protocols, one of them bbcp, and with bbcp we got pretty good speeds given our link. Finally, for iSCSI there is no option other than a block-by-block copy, and the speeds are very slow. So now we have everything: we have data sizes and we have speeds. This is an example of one tenant that had one terabyte of images, three terabytes of ephemerals, and 20 terabytes of volumes. The migration of the volumes was done by transferring from an NFS backend to an NFS backend, so that's the speed. We also have to consider the OpenStack resource migration, which involves migrating identity, compute resources, and networking. The total time for this tenant was 35 hours. As we said, the migration of the OpenStack resources is negligible — a couple of hours for the whole thing — but the migration duration grows with data: if you have more data to migrate, the time increases basically linearly. There's no magic there.

Finally, once we have all of this data, we can wrap up the planning: provide information for cloud owners, address link-speed issues where possible, and assess which tenants are good candidates for migration — basically what the guys covered — and then go for the migration. We also tried to use best practices where possible. For Cinder volumes and ephemerals, we tried to use better transport protocols, and we tried to migrate in parallel; if you automate, you can always automate in a way that lets you run transfers in parallel. And finally, yesterday we had a talk where we discussed everything about data migration — you can find the link on YouTube — covering other options, like not migrating the data at all: you can reattach your storage or you can replicate your data. So you can look into those as well. And with that, over to the guy in the kilt. Thanks, Irat.
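The parallel migration mentioned above can be sketched with a capped worker pool. The worker function here is a stand-in; a real one would shell out to rsync/bbcp or do a block copy.

```python
# Run independent transfers concurrently, with parallelism capped so too
# many streams don't saturate the link. All names are illustrative.
from concurrent.futures import ThreadPoolExecutor

def migrate_resource(name: str) -> str:
    # Stand-in for one volume/ephemeral transfer.
    return f"{name}: done"

def migrate_parallel(resources, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order even though transfers run concurrently
        return list(pool.map(migrate_resource, resources))

results = migrate_parallel(["vol-a", "vol-b", "vol-c"])
```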
So all of this is done through a set of tools that Mirantis built, which we use through our services team, because of the sheer complexity of clouds and the fact that we're manipulating everything in the background — and this requires hacking a little bit at your clouds to make everything work. We do this as a services product; that answers the question you were asking. If you want to have a look at the tools, they are fairly complex, and they can be found on the Mirantis Git. But because of the sheer complexity of doing this, I'd suggest you speak to your Mirantis services rep about it. So, any questions? And thank you for listening to us.

So you provided that number of hours for the migration. Was that all downtime? Was everything off during that whole thing?

Effectively, yes. Sorry — it can be done in smaller chunks; you could do it an instance at a time, so the software does allow us to do that. But you also have to take dependencies into account. One of the levels of complexity we handle is the linkage of the two clouds and the two environments, and that is very dependent on the architecture. And I'll add to that: for these particular customers, the vast majority of the tenants we moved had multiple zones active at any given time, so this was indicative of one zone being down. The other zone was still up, so the application was not down, but a zone was offline during this time. So that might help you.

I also noticed on the chart showing the various copy protocols for ephemeral data: under the API column you just have a little X as not an option. There's a block migration option in the API, right?

It's almost impractically slow, and it tends to fail randomly.

It fails for me all the time, and I was wondering if it fails for everyone or if I'm doing something wrong.

No — I think everybody has experienced it failing.
Has anybody attempted to fix it or address it, or is it just a thing that nobody cares about and has been sidelined? I'm fairly sure there are reviews for it; I haven't looked at it recently. I don't think anybody here has.

In your example, were these two clouds actually talking to each other? Yes, they were. And you've got the same networks with the same external connectivity — how can you have the IPs be present at the same time? I'll let Irat handle that, or Roman, or I can explain it. What happens is, you're exactly right: they're two disparate clouds, even to the extent that these are totally different versions of OpenStack. When they set up the destination cloud, they extend the network between the two clouds and plumb everything the same on the destination. In other words, all the IPs are available, but they're not in use. I think Roman pointed that out in his presentation. Well, he was talking about floating IPs, but in lots of cases we are actually extending our tenant networks onto our bigger backbones, which are public addresses, not private addresses. Floating IPs are the challenge because of the way OpenStack handles them: you can, through OpenStack, assign a specific IP. What we've typically done in the past is turn off DHCP in one of the clouds — the destination cloud will turn off the ability to assign addresses until the migration is completed for that segment.

And were you actually taking the database from one cloud and just populating it in the other? No — as the guys explained, we suck all the information out of the source cloud through the APIs, via quite a large number of API calls, then reform the data so that we can do transforms on it, and then push it back into the new cloud. Hence we lose the UUIDs.
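The "reform the data, lose the UUIDs" step can be sketched as a transform that strips the source cloud's identifiers before the recreate call in the destination. The field names and the commented-out destination call are illustrative, not the real tooling.

```python
# Hypothetical transform: a resource read from the source cloud's API
# carries read-only and source-specific fields that must be dropped so
# the destination API can assign fresh identifiers.

READ_ONLY_KEYS = {"id", "tenant_id", "created_at", "updated_at", "status"}

def strip_source_ids(resource: dict) -> dict:
    """Keep only the fields that are valid in a create request."""
    return {k: v for k, v in resource.items() if k not in READ_ONLY_KEYS}

source_network = {
    "id": "9f3c-source-uuid",        # lost on migration
    "name": "tenant-net",
    "admin_state_up": True,
    "status": "ACTIVE",
    "tenant_id": "abc1-source-tenant",
}
payload = strip_source_ids(source_network)
# dest.network.create_network(**payload)  # destination assigns a new UUID
```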
I'm still curious about the IPs, when we are connecting both clouds to an external network and they both have to be routed for the same IPs — I wasn't talking about floating IPs. Shared gateways can be handled fairly easily at the L3 layer. There is the challenge of preventing both clouds from using the same IP blocks, so quite often, if you're forced to have both clouds active at the same time, you end up cutting the blocks in half or restricting the number of addresses available.

But you managed to move this entire cloud within that window? This example is one portion of a far larger cloud. Almost two years ago now, with the same tool set, we moved a cloud that had just on 400 terabytes of data, in small chunks, over a period of about eight weeks — somewhere around 500 gigs a day.

During those eight weeks you had both networks? Both clouds running, both clouds available, customers able to work in both clouds.

And what were you doing with the subnets that were shared between the two? Every day we reran the API transfer and the transform: check the old cloud, compare it to the new cloud, and move the changes across.

We'll look at the tools then. Thanks. That's probably more detail than you need — if you'd like to come up afterwards we can have a quick chat and I can tell you a bit more. Thanks. We did say we wanted to finish quicker. Right, you've all got time for a 10-minute post-prandial nap before your next session. Thank you very much.