Hello everyone, nice to meet you here, and thank you for attending this presentation. My name is Oleg Gelbuch, and I work in Mirantis Labs. This is a small team dedicated to innovation and to developing prototypes of solutions that will eventually become part of the Mirantis product line. Today I would like to tell you about Pumphouse, a solution we have been working on for the last few months. It is aimed at solving the problem of rolling upgrades of an OpenStack cloud. First, let me explain why we tackled this problem, if the upgrade problem even needs an explanation. Mirantis works with OpenStack deployments, and OpenStack deployments are everywhere: we see different types of deployments and different releases of OpenStack installed by our customers, and we also deploy clouds with our own tools. But eventually every customer wants the latest release of OpenStack, with new features, new drivers, and so on. Usually customers have applications running in their clouds, and naturally they want to keep those applications running in the new version of OpenStack. Another important thing all customers want is to minimize their hardware requirements and to reuse the physical servers from their previous installations. From here we took the goals and requirements for our prototype product. The first requirement we wanted to satisfy is minimal new hardware. The second is minimal impact on the end-user applications: applications should work after the upgrade just as they worked before, and the downtime when applications are not accessible due to the upgrade should be minimal. And we want to upgrade step by step, because we need to verify at each step that everything is okay, and if for some reason the new version of OpenStack is not working well, we need to be able to roll back the upgrade, to stop it, to pause it.
And of course we want this process automated. There are really brilliant operator teams out there with a solid track record of upgrading their custom OpenStack environments, but what we wanted to do is productize the upgrade process, and that requires automation at every step and a unified approach. So we developed a rather simple forklift strategy for the upgrade. First of all, we need several hosts. The number depends on the desired architecture of the upgrade target cloud, but it should include at least one physical server for a controller and at least one physical server for a compute node. This is the initial seed of the upgrade target cloud. Next, once we have those servers, we deploy the upgrade target cloud with the new OpenStack release. Then we start to move workloads (I will tell you a little later what a workload is in our understanding), moving them one by one to the target cloud, and when we release some capacity in the source cloud, we can take it, upgrade it, and assign it to the upgrade target cloud. And of course, as I said before, at every step we want to be able to verify the migration and the upgrade and that they work together. The next couple of slides are dedicated to a description of the upgrade flow. This is the first step: we use the Pumphouse application, which talks to the Compute API of the source cloud on the left side, to live-migrate virtual servers off the selected physical hosts and then disable those hosts, putting them into maintenance mode (I will talk about maintenance mode in a couple of minutes). The next step is to deploy the target cloud. We use Fuel for deployment of the target environment because it is a fully automated, open source deployment solution with an API that satisfies our needs. The next step is to start migrating workloads and start releasing hosts in the source cloud.
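The first forklift step just described, live-migrating servers off selected hosts and then disabling them, can be sketched in a few lines. This is a minimal illustration with a stubbed compute client, not the real Nova API or Pumphouse's actual code; the class and method names are assumptions for the sketch.

```python
# Sketch of the first forklift step: live-migrate every server off a
# chosen host, then disable it (maintenance mode). Compute is a stand-in
# for a real Nova client; placements maps server -> current host.

class Compute:
    def __init__(self, placements):
        self.placements = dict(placements)
        self.disabled_hosts = set()

    def servers_on(self, host):
        return [s for s, h in self.placements.items() if h == host]

    def live_migrate(self, server, target_host):
        self.placements[server] = target_host

    def disable_host(self, host):
        self.disabled_hosts.add(host)

def evacuate_host(compute, host, spare_hosts):
    """Move every server off `host`, round-robin over spare hosts."""
    for i, server in enumerate(compute.servers_on(host)):
        compute.live_migrate(server, spare_hosts[i % len(spare_hosts)])
    compute.disable_host(host)  # host is now free to be decommissioned

compute = Compute({"vm-1": "node-1", "vm-2": "node-1", "vm-3": "node-2"})
evacuate_host(compute, "node-1", ["node-2", "node-3"])
print(compute.servers_on("node-1"))    # []
print(sorted(compute.disabled_hosts))  # ['node-1']
```

Once `evacuate_host` returns, the freed node can be handed over to Fuel for redeployment into the target cloud.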
We use the OpenStack APIs to select the resources that are included in the workload and to recreate, or replicate, those resources in the destination cloud, the upgrade target cloud. This process is basically repeated until the capacity of the target environment reaches a set threshold. Next we need to add capacity to the target environment, so we repeat the first step: we use live migration to move the remaining servers off one of the compute nodes in the source environment, decommission it from the source environment, and add it to the destination environment. We repeat this process until all compute nodes from the source are upgraded and added to the target environment. So this is the so-called forklift upgrade, and let me explain why this approach was chosen, because the eventual goal when you talk about upgrades of OpenStack is the in-place upgrade, where you don't do this forklift of resources and don't install new controllers and so on. But is that even possible? What Pumphouse does is combine workload mobility orchestration with bare metal and configuration management. Those two functions are rather contradictory. To provide workload mobility and orchestration, you need to be architecture independent: if we want to productize the upgrade process, if we want it to be repeatable and reproducible and to work on the wide range of configurations that OpenStack supports, we need this process to be as architecture independent as possible. The second feature of this first function is that workloads are moved in units, so you need to move a workload as a whole. You can't move part of a workload's resources and leave the other resources behind, because you have two different clouds that share nothing, and in the most general case you can't be sure that the components of the workload can access each other and that the applications run properly.
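The replication step described above, reading a resource from the source cloud and recreating it in the destination, also has to translate cloud-specific references, since IDs from a shared-nothing source cloud mean nothing in the target. Here is a minimal sketch of that translation; the dicts and `id_map` are stand-ins for real OpenStack API objects, not Pumphouse's actual interface.

```python
# Sketch of replicating one resource between shared-nothing clouds:
# read it from the source, translate IDs that only make sense in the
# source cloud, create it in the destination.

def replicate(resource, id_map, create_in_destination):
    """Translate cloud-specific references, then recreate the resource."""
    translated = dict(resource)
    for key in ("image_id", "flavor_id", "network_id"):
        if key in translated:
            # a source-cloud ID must be mapped to its destination equivalent
            translated[key] = id_map[(key, translated[key])]
    return create_in_destination(translated)

destination = []  # stands in for the target cloud's API
id_map = {("image_id", "img-src-1"): "img-dst-9"}

server = {"name": "web-1", "image_id": "img-src-1"}
replicate(server, id_map, destination.append)
print(destination[0])  # {'name': 'web-1', 'image_id': 'img-dst-9'}
```

The `id_map` grows as dependencies are replicated first, which is why the dependency ordering discussed later matters.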
On the other hand, bare metal and configuration management is focused on architecture, because you need to define exactly what you want to deploy, or actually what you want to upgrade to. The second feature of bare metal and configuration management is that it needs to be staged: you need to be able to stop at some point and just work with the remaining nodes. I've been talking about workloads a lot, so a couple of words about what workloads are in Pumphouse. It's pretty simple: what we are working with in OpenStack is user applications. User applications run on top of some resources in OpenStack, and the combination of all the resources used by certain applications to produce some service, some result, is a workload. In our prototype version of Pumphouse, we support the most simple, most basic type of workload: a virtual server with all the resources it depends on and all the applications that run in that server. You can see that a server workload is composed of a virtual server instance that depends on several types of resources provided by different OpenStack services. We can select multiple servers by grouping them, for example by tenant ID. Our approach to the migration of workloads assumes shared-nothing clouds: the target environment, or target cloud, and the source environment, or source cloud, are absolutely independent. In our assumption they don't have a common Keystone back end, for example. To rebuild a server from the source cloud in the destination, the target environment, we have two options. The first is to move the image and just rebuild the server from the image. The second is to create a snapshot during the migration process and instantiate the server in the target environment from that snapshot. In the future we plan to address other types of migration. Our first target is cross-cloud block migration.
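The server-workload unit, servers grouped by tenant ID together with everything they depend on, can be illustrated as follows. The data here is hand-made and the function name is an assumption; Pumphouse itself collects this information from the OpenStack APIs.

```python
# Sketch of the "server workload" unit: servers selected by tenant ID,
# plus the set of resources they depend on (moved together as a whole).

servers = [
    {"id": "vm-1", "tenant_id": "t1", "image": "img-a", "flavor": "small"},
    {"id": "vm-2", "tenant_id": "t1", "image": "img-b", "flavor": "small"},
    {"id": "vm-3", "tenant_id": "t2", "image": "img-a", "flavor": "large"},
]

def tenant_workload(servers, tenant_id):
    """A workload = the selected servers plus all their dependencies."""
    selected = [s for s in servers if s["tenant_id"] == tenant_id]
    deps = {("image", s["image"]) for s in selected}
    deps |= {("flavor", s["flavor"]) for s in selected}
    return selected, sorted(deps)

selected, deps = tenant_workload(servers, "t1")
print([s["id"] for s in selected])  # ['vm-1', 'vm-2']
print(deps)  # [('flavor', 'small'), ('image', 'img-a'), ('image', 'img-b')]
```

Because the clouds share nothing, every item in `deps` has to be replicated in the target cloud before the servers can be rebuilt there.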
This assumes that hypervisors in the source and destination can somehow exchange network traffic, so they can copy data directly from hypervisor to hypervisor; we would also need to extend OpenStack to provide the ability to adopt migrated instances. And the ultimate goal of the upgrade is a live migration upgrade, because eventually we want to be able to seamlessly move resources from one cloud to another, and that is basically what live migration provides. It requires elaborate management of shared storage between clouds, because we need to have the same storage layer in the source and destination clouds. So this is a simple scheme of the image-based rebuild. Obviously, it doesn't allow you to retain data stored in the ephemeral storage of a virtual server. It suits cases such as stateless applications running inside servers, or infrastructure virtual servers like routers or load balancers with an external source of configuration. The snapshot-based rebuild is a little more elaborate. Two steps are added to the main flow, and this increases the downtime for the server, because it takes time to make a snapshot and to transfer the snapshot from the source cloud to the destination. This option suits basically any type of application that stores data in the ephemeral storage of the instance. As I said, we use the OpenStack APIs for the migration of data, and let me explain why. We need a unified way to migrate resources; we needed it to work on basically any architecture supported by OpenStack. There are more effective, more performant ways to migrate the data: you can copy the disk file directly from one hypervisor to another, for example, if you use KVM with qcow2 and run on Linux. But what if you use Ceph, for example? Then you can't just copy over the disk image as simply as that. The OpenStack APIs abstract the back ends as much as possible, and that is our goal. So, working on the resource orchestration, we faced a challenge.
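The difference between the two rebuild options can be made concrete by listing their steps side by side. The step names below are a simplification of the flows on the slides, not Pumphouse's real task names; each call only logs the step, so the point is the ordering and the source of the extra downtime.

```python
# Sketch comparing the two rebuild flows. Snapshot-based rebuild adds
# two steps (snapshot + transfer), which is exactly where the extra
# downtime for the server comes from.

def image_rebuild(server_id, log):
    # ephemeral disk contents are lost; suits stateless applications
    log += [f"suspend {server_id}",
            f"boot {server_id} in target from original image",
            f"reassign floating IP of {server_id}"]

def snapshot_rebuild(server_id, log):
    # two extra steps preserve ephemeral data but add downtime
    log += [f"suspend {server_id}",
            f"snapshot {server_id}",               # extra step 1
            "transfer snapshot to target cloud",   # extra step 2
            f"boot {server_id} in target from snapshot",
            f"reassign floating IP of {server_id}"]

a, b = [], []
image_rebuild("vm-1", a)
snapshot_rebuild("vm-1", b)
print(len(b) - len(a))  # 2: the snapshot and transfer steps
```

Either way the server is suspended first, which is the unit-level maintenance mode described later in the talk.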
The main challenge of this migration is that servers have dependencies, and those dependencies in turn have their own dependencies. This slide shows the dependency tree for the migration of a single server. We need to do all those operations: retrieve data, store data, and translate cloud-specific parameters so that the parameters of resources from the source cloud match the parameters of their dependencies in the destination cloud. We tried to solve this using the Taskflow library, and thanks go to Joshua Harlow and the team who worked on it, because they really created a great tool that allowed us to simply build those dependency trees. What you saw on the previous slide is actually a dump of flows created using the Taskflow library, so it's a very useful tool for us. Taskflow also allowed us to solve the parallel execution of tasks that don't depend on each other: for example, the migration of several servers that don't depend on each other can be done in parallel using built-in Taskflow mechanisms. Another important thing Taskflow provided is that the migration algorithm is deterministic, so we can know in advance, before we start the migration, that at some point we will run out of capacity in the target environment, and we can schedule in advance the migration of a physical host from source to destination to provide additional capacity for the resources. Now to bare metal management. Our approach was pretty simple for the prototype. We used remote power management for decommissioning the source cloud, and we used an automated deployment framework: we took Fuel as the most available and, from our point of view, the most effective framework out there. And we wanted one-by-one upgrades of nodes controlled from a user interface, be it a script or a graphical user interface, because sometimes, for example, you need to rewire the network before you can reassign nodes from source to destination. This can basically be automated.
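Taskflow builds and runs these dependency graphs for real; as a library-free illustration of the idea, here is a topological "levels" sort of a single-server migration tree. Tasks in the same level have no mutual dependencies and are the ones Taskflow can run in parallel; the task names are illustrative, not Pumphouse's actual flow names.

```python
# Dependency tree for migrating one server: each task lists the tasks
# it depends on. A deterministic levels sort shows execution order and
# which tasks could run in parallel (same level = independent).
deps = {
    "retrieve-server": [],
    "retrieve-image": [],
    "retrieve-flavor": [],
    "upload-image": ["retrieve-image"],
    "create-flavor": ["retrieve-flavor"],
    "boot-server": ["retrieve-server", "upload-image", "create-flavor"],
}

def levels(deps):
    done, order = set(), []
    while len(done) < len(deps):
        # every task whose dependencies are all satisfied is "ready"
        ready = sorted(t for t, ds in deps.items()
                       if t not in done and all(d in done for d in ds))
        order.append(ready)
        done.update(ready)
    return order

for level in levels(deps):
    print(level)
```

Because the ordering is deterministic, a scheduler can inspect the whole plan up front, which is how Pumphouse can predict in advance when the target cloud will run out of capacity.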
It depends on your network configuration. In the source environment, we need to prepare a node for decommissioning, and this process has three steps. First of all, we need to move all virtual servers that have not yet been migrated to the destination cloud; we need to move them within the source cloud because, as I said before, we select virtual servers for migration based on our workload definition, and that can exclude some servers running on the specific host. So we need to preserve them: we migrate them to other hosts in the source environment. Next, we set parameters for the node in the source environment, because Fuel basically requires that the node boots from the network, and then it can perform the upgrade. And the last step is to remove the node from the environment. We do this with a simple OpenStack call, service delete: we just remove it from the list of services recognized by OpenStack. We wanted the decommission management to be pluggable, so we can support source environments based on Fuel as well as environments deployed by other solutions. For Fuel, we have the Fuel API, and it's pretty simple, just one HTTP call and that's all. For all other types of deployments, it's IPMI: we just issue a couple of commands and decommission the node by power reset. In the destination environment, Fuel provides us with automated upgrades. We use it to assign a role to the node in the Fuel environment, we then call the automated deployment, and optionally we can test the success of the deployment with the tools Fuel provides. Fuel also ships the latest OpenStack release, the Mirantis OpenStack. What's also important is that Fuel is an open source framework: we can implement functions that are not currently supported by Fuel but that we need, and basically commit them back.
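The pluggable decommission step can be sketched as one small driver per source-cloud type. The driver interface, the Fuel endpoint path, and the returned command strings below are illustrative assumptions for the sketch, not Pumphouse's real code; the drivers just record the action they would take.

```python
# Sketch of pluggable decommissioning: a Fuel-based source needs a
# single API call; anything else falls back to out-of-band IPMI power
# control. Each driver returns the action it would perform.

class FuelDriver:
    def decommission(self, node):
        # one HTTP call to the Fuel API removes the node from the cluster
        return f"DELETE /api/nodes/{node}"

class IPMIDriver:
    def decommission(self, node):
        # no deployment framework: power the node down out-of-band
        return f"ipmitool -H {node} chassis power off"

def decommission(node, driver):
    return driver.decommission(node)

print(decommission("node-3", FuelDriver()))
print(decommission("node-3", IPMIDriver()))
```

Selecting the driver per source environment keeps the rest of the upgrade flow identical regardless of how the source cloud was originally deployed.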
For example, when we create a target environment, we basically need a call that replicates the configuration of the source environment as closely as possible, and in the future it's possible that we will commit this change to Fuel upstream. On the bare metal side, we had a challenge connected to maintenance mode: for servers, for OpenStack as a whole, and for workloads. At the host level, the level of a physical node in the source environment, we implemented maintenance mode as evacuation followed by disabling the compute service on the host. From this moment on, the host can be manipulated and decommissioned without affecting other resources in the source cloud. At the level of the unit of migration, maintenance mode in the prototype version is simple, because our unit of migration is a virtual server: we just suspend it before we start the migration, so we ensure that its state doesn't change before we move it to the destination. Our next step, our future plan, is to implement maintenance mode at the level of a Nova project. Okay, so here is a short recording that shows how the Pumphouse UI looks. As this is a prototype, the UI is pretty simple; it doesn't have many functions, many knobs and levers, but you can see that it includes two environments, the source and the destination environment. For each environment it presents a list of tenants, or projects, that group servers, and a list of hosts included in the environment. Below those two panes is the log pane, which tells you what is going on in your environments and how the upgrade is going. So let me skip ahead a little bit. Oh, I'm sorry, for some reason... okay. You can see that we can select a tenant, see the servers included in that tenant, and see the resources assigned to each server in the source environment.
The resources we display include the image and the floating IP, because those are basically the most important things we can control. You have seen that, in addition to those resources, there are many more dependencies for each server that we also need to migrate before we can spawn the server in the destination environment. When the environment is ready for migration, we can click the migration button to the right of the tenant name and see that the migration has started. The tenant appears in the destination environment, all servers are put into the migrating state, some of them are put into the suspended state, and the migration of those servers starts. And you see that, thanks to Taskflow, this happens in parallel: they all get suspended at the same time and appear in batches in the destination environment. So, a couple of words about the underlying technologies we used here. We implemented two modes for the solution. The first is the service mode: an API server that allows you to create migration flows using the Taskflow library, with Taskflow tasks calling the OpenStack clients to perform actions on the source and destination environments. The other option for using Pumphouse is the CLI mode: a simple script that directly calls the Taskflow flows, where you can specify the ID of the tenant from which servers will be selected for migration. Now an overview of our next steps. As this is a prototype, we haven't yet released a first version; we will release once we have support for Cinder volumes. Our next steps are to support more elaborate resources, like Heat stacks, for example, and Ceilometer metrics later. We plan to implement the upgrade scheduler, which automates the migration of resources in conjunction with the migration of physical nodes, as I described before. And we plan to implement some advanced workload types.
First of all, the project workload: workloads that consist not only of servers and their dependent resources but include all the resources that belong to a certain project. Another type of workload we aim to implement is the stack workload, a set of resources defined by a Heat stack. And last but not least is integration with Fuel deployment automation: through this integration, we want to be able to provide upgrade functionality to Fuel. You can find the source code and documentation on GitHub; this project is developed in the open. We have the API documented on an API documentation service, which is useful for documenting APIs. And I have started a series of blog posts in the Mirantis blog dedicated to Pumphouse: there is an initial post with an introduction, and shortly there will appear deep-dive posts describing the internals of the service. Thank you for your attention. We have about five minutes for questions if you have one.

The timeline is as follows: we are preparing the MVP for upgrades, with only servers and volumes supported and only compute nodes upgraded, by release 6.1 of Fuel. That will probably be the end of the year. And we hope to support the more elaborate scenarios I mentioned, for example the cross-cloud storage, by the next design summit. Again, that's a preliminary timeline. Thank you.

I didn't quite catch, when you start the migration, how are VMs actually moved between the clouds if there's no shared storage? How are you actually doing it right now? Hello? How are VMs migrated between clouds right now if there's no shared storage? So once the migration starts, once the VM is being moved, how is it actually moved, given there's no live migration between the two clouds? Is it shut down, written to a file, the file migrated across, and then re-imported, or re-hydrated if you like, on the far side? Is that correct?

I'm afraid I still can't hear the question. I'm sorry.
All right then, if there are no more questions, thank you for your time and your interest. I hope to see some early feedback on Pumphouse if you have tried it. Thank you, and have a nice day.