OK, let's get started. Before we start, I have some questions. How many people here are running an OpenStack cloud right now? OK. How many of you plan to upgrade your OpenStack cloud in the next one or two years? OK, great. And how many of you are contributors to the community, either to the core projects or as developers of drivers for Neutron, Cinder, or Nova? OK, great. I think we have a lot of the right people in this room. Thank you. So let's get started. In this session, we will introduce OpenStack upgrade without downtime, that is, how to rolling-upgrade your OpenStack environment and see how it works.

Here is today's agenda. We will start with a general introduction to the live upgrade. Then I will hand over to Takashi, and he will introduce how NTT designed the testing of the live upgrade and how the upgrade procedures were built, and he will also share the test results from the two rounds of testing. At the end, we will have a summary of the issues we hit during the testing, and we also want to call for help from the community on some improvements.

First of all, we want to introduce who we are. Takashi, would you like to introduce yourself? Hello, everyone. My name is Takashi Natsume. I work for NTT Corporation. Thank you for coming today. As for myself, I'm Yan Kai. I'm from Canonical, and I worked with Takashi's team to provide the consulting service for this project.

So, for the OpenStack upgrade: why do you want to upgrade without downtime? There are several reasons, especially when you deploy OpenStack in a production environment, or when your deployment provides a public cloud service. First of all, OpenStack has a release every half year. How do you catch up? How do you make sure you can get the latest features from OpenStack if you never upgrade? Also, for example, if you have an environment that was deployed on Folsom or Grizzly, that code is already out of maintenance now, so you have to maintain it by yourself, and there is a lot of maintenance cost to consider. In addition, there are a couple of ways you can do the upgrade. You can do an offline upgrade: build another environment and migrate all your resources over eventually. But then you need to invest in new hardware, which introduces additional cost. So in this session, we will introduce how we upgraded an OpenStack environment from Havana to Icehouse within the same testing environment.

Before the project started, the NTT team set up their goals for the live upgrade. The first goal is that there is no impact on existing resources. Existing resources include the instances customers have already deployed and are using, including any connections they have open to those instances right now, and also the volume storage. And we don't want to introduce any performance problem during the upgrading process. The second goal of NTT's live upgrade is that there is no impact on new requests. For example, if a user requests a new instance or wants to create a new volume while the upgrade is in progress, we don't want that to impact the end users. But how do we do that? That's why the NTT team started a test project. They deployed a test environment that followed exactly the same architecture as their production environment, and they conducted about two to three months of testing in this environment.
They also defined which components need to be considered for the upgrade, including the OpenStack core components: Nova, Cinder, Glance, Neutron, Keystone, and Heat. What I want to highlight here is that the other components, the middleware underneath, for example MySQL, the message queue, and the load balancer service, and also the operating system itself, are not covered in this project. The test was conducted from stable Havana to Icehouse 1, because at the time the testing was started, Icehouse 1 was the only released version. Later on, we found some bugs during the upgrade which were fixed in Icehouse 3, so for the Nova rolling upgrade, we upgraded to Icehouse 3 in the second round of testing.

A bit of background information about the system architecture. If you take a look at this system architecture, we separate the services that can support active-active high availability, the services that can only support active-standby high availability, and the services that have no HA solution right now. The reason for this separation is that we need to take a different upgrade strategy for each of these groups of components. Another thing I want to mention here is that for the network solution, NTT uses VMware NSX in this testing. It's not the standard Neutron setup; it's Neutron plus the NSX plugin. And for the block storage, in this architecture we use an EMC storage driver during the testing.

So now I would like to hand over to Takashi to introduce some details. From now on, I will explain our live upgrade testing. We, NTT, performed a feasibility study for live upgrade in our test environment. A step-by-step upgrade, a rolling upgrade, was needed in order to satisfy the goals that Yan Kai mentioned, because upgrading the entire system at one time affects users' resource utilization and users' API calls. On that premise, I will explain our trial from the viewpoint of these five items.

At first, I would like to talk about the things to be investigated before creating the upgrade procedure. We investigated these three items by checking the OpenStack source code. As you know, there is a possibility that the database schema is changed in the new version. The consistency of APIs between components is also important; components means Nova, Cinder, and so on. For example, we check the consistency of API versions in calls from Nova to Cinder. And the consistency of the RPC APIs and REST APIs within each component should be checked, for example the consistency of RPC API versions in calls from cinder-api to cinder-volume. When it comes to the RPC API, the major version is very important. I will explain the details of this later.

Then I will explain the considerations in creating the upgrade procedures. User resources must be transferred from the upgrade target host to another host. In our case, user resources were VM instances, block storage, virtual routers, and virtual networks. These resource migration steps must be added into the procedure. For example, VM instances have to be transferred by live migration, and so on. But Cinder's volume migration was not utilized in our trial. As a factor that determines the upgrade order within each component, there are the RPC API versions. Errors occur if the major version the callee supports and the RPC message's major version are different. When the major versions are the same, there is no problem as long as the minor version the callee supports is equal to or greater than the RPC message's minor version. But if the former is less than the latter, errors occur.
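To make that rule concrete, here is a minimal, hypothetical sketch of the version check. It is not the real oslo.messaging or project code, just the compatibility logic as described in the talk:

```python
# Illustration only: the RPC API compatibility rule described above.
# A call succeeds only when the major versions match and the callee's
# supported minor version is >= the minor version of the incoming message.

def can_handle(callee_supported: str, message_version: str) -> bool:
    """Return True if a callee supporting `callee_supported` can process
    an RPC message sent with version `message_version` ("major.minor")."""
    callee_major, callee_minor = (int(x) for x in callee_supported.split("."))
    msg_major, msg_minor = (int(x) for x in message_version.split("."))
    if callee_major != msg_major:
        return False                     # different major version -> error
    return callee_minor >= msg_minor     # callee must be at least as new

# Examples matching the rule in the talk:
assert can_handle("3.5", "3.0")          # same major, callee newer: OK
assert not can_handle("3.0", "3.2")      # same major, callee too old: error
assert not can_handle("3.0", "2.48")     # different major version: error
```

This is why a callee such as cinder-volume has to be upgraded before, or be at least as new as, the processes that send it RPC messages.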
In this figure, if the RPC major versions are the same, a caller has to be upgraded after its callee is upgraded. In this case, the upgrade is performed in the order of process A, process B, process C. In other words, the order of the upgrade must be decided so that there is no contradiction among the RPC API versions. We also have to decide the upgrade order of the OpenStack components on the basis of the consistency of APIs between components. And you should consider how to block requests, how to check processing still in progress, and how to utilize the graceful shutdown function.

About changing the database schema: in the Nova upgrade procedure designed in the community, the database schema is upgraded at once, and the nova-conductor processes are also upgraded at once. This procedure didn't meet our requirements. So in our procedure, in order to avoid API call interruption, the controllers are rolling-upgraded. Therefore, OpenStack processes of different versions coexist. In order to enable that, at the beginning we add the new version tables, new version columns, and new version indexes. Then we upgrade the OpenStack processes. At the end, we delete the old version tables, old version columns, and old version indexes. In our trial, it was not necessary to convert data formats, but that must be considered in some cases. Also, the database lock time can become long if a large number of records exist in the database tables. We did not try our procedure with a large number of records; in that case, online schema change tools may be able to solve the issue.

About the high availability configuration: from the perspective of live upgrade, an active-active configuration is desirable. But there are some cases where active-active cannot be configured, so active-standby is forced, for example when the backend or a Neutron plugin does not support an active-active configuration. In our test environment, nova-consoleauth and heat-engine were configured active-standby. The Neutron L3 agent and Neutron DHCP agent were configured active-active in our test environment, meaning those hosts were configured active-active; but essentially, these processes should have been configured active-standby, because they have their own state. In the latest version, Heat has a multi-engine function, so heat-engine processes can be configured active-active, and Neutron now has the L3 agent HA (VRRP) function.

In the active-active configuration case, requests to the target host are blocked at the load balancer while the target host is upgraded. When switching active-standby, there is a service downtime for the component, as expected. Depending on the HA configuration, the live upgrade is performed with these procedures; we had three patterns. In the active-active configuration, the steps are: block request connections to the upgrade target host, migrate user resources, upgrade the host, and unblock. This series of steps is repeated on each host. In the active-standby configuration, the steps are: upgrade the standby host, block requests to the active host if possible, switch active and standby, and unblock. This series of steps is also repeated on each host. In the no-HA configuration, for example the nova-compute hosts, the steps are: block requests to the target host, migrate user resources, upgrade the host, and unblock. These are also repeated on each host.

This figure is the one shown earlier; this was our test environment. Based on that, and especially by considering the upgrade steps for each HA configuration and the RPC API versions, we created the overall upgrade procedure. This is the overall upgrade procedure. At first, back up the database and change the database schema.
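As a small illustration of the additive schema-change approach just described, an "add first, delete later" change might look roughly like the sketch below. This is not the real Nova migration code; the table, column, and index names and the database URL are made-up placeholders.

```python
# Sketch of the two-phase ("add first, delete later") schema change that lets
# old and new OpenStack processes share one database while the controllers
# are rolling-upgraded. Names and URL are illustrative only.
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://nova:secret@db-host/nova")  # placeholder

def expand():
    """Run BEFORE upgrading any process: only add, never remove."""
    with engine.begin() as conn:
        conn.execute(text(
            "ALTER TABLE instances ADD COLUMN new_attr VARCHAR(255) NULL"))
        conn.execute(text(
            "CREATE INDEX instances_new_attr_idx ON instances (new_attr)"))

def contract():
    """Run AFTER every process has been upgraded: drop what only the
    old version still needed."""
    with engine.begin() as conn:
        conn.execute(text("DROP INDEX instances_old_attr_idx ON instances"))
        conn.execute(text("ALTER TABLE instances DROP COLUMN old_attr"))
```

The key property is that the Havana processes keep working against the expanded schema, and the old tables, columns, and indexes are only removed once no old-version process is left running.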
Then, the rolling upgrade is performed on each host, in an order determined by the RPC API versions. For example, cinder-volume is upgraded before cinder-api and cinder-scheduler on the controllers are upgraded, and heat-engine is upgraded before heat-api is upgraded. For Nova, the controllers, that is nova-api, nova-scheduler, and nova-conductor, are upgraded before nova-compute. As you know, the compute RPC API major version was bumped in the Icehouse release, but the stable Havana nova-compute can understand Icehouse-level 3.0 messages, so it can be upgraded in that order. When a live migration is performed from a Havana node to an Icehouse node, the Icehouse nova-compute cannot understand Havana-level messages because of the major version difference. So the Havana nova-compute has to be set to an Icehouse-compatible level in the [upgrade_levels] section of nova.conf; this enables the Havana nova-compute to send Icehouse-level messages. Then there are the active-active hosts, the active-standby hosts, and the single no-HA hosts; they are upgraded by following the steps explained on the previous slide. At last, the database is backed up and the database schema is changed again, deleting the old tables, columns, and indexes.

About the test tools: we placed a load on our test environment during the live upgrade procedure test. The load consisted of API requests covering the patterns of calls between components and between processes within each component in our use case, north-south and east-west network communication, and VNC console connections. About the test environment: it was mandatory that the environment have the same configuration, especially the same high availability configuration, as the production environment to which the upgrade procedure would be applied. And we tested our procedure in an environment that could be rebuilt easily by using Chef. We tested our procedure and identified issues using these criteria: whether our procedure satisfies our goals or not, whether inconsistency between database records and the actual resources occurred or not, and which steps take a long time.

As a result of trying our procedure, we found these issues. The bug in the Heat graceful shutdown function has been fixed by the NTT team. As for the remaining issues, these are items we could not solve. The first one is errors due to active-standby failover; they were resource creation failures. The second one is errors due to a mismatch of RPC API major versions; the error was "unsupported RPC version" (in the latest code it would be "unsupported RPC envelope version"). This error occurred in calls from nova-compute to nova-consoleauth and from nova-novncproxy to nova-consoleauth. It was unavoidable because the RPC API major versions could not be changed. The third one is communication interruption on the Neutron L3 agent.

About the communication interruption on the Neutron L3 agent: at first, we had an issue that user requests to create a new router were assigned to the upgrade target host. In order to solve that issue, we set admin_state_up of the Neutron L3 agent to False (a small API sketch of this step follows below). That solved the scheduling issue, but then another issue occurred: communication interruption. This issue should be solved in the Juno release by using the L3 agent HA function, but we have not tried it yet. And when a VM live migration or a nova-novncproxy upgrade was performed, interruption of the console connection also occurred. About fallback: it was impossible to fall back after changing the DB schema at the beginning. And as resulting knowledge: a clean installation was needed when installing the OpenStack components, because there is a possibility of errors if an overwrite installation is performed.
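Regarding the L3 agent workaround mentioned above (setting admin_state_up to False so that new routers are no longer scheduled onto the upgrade target host), this comes down to a single call against the Neutron agents API. The sketch below uses plain HTTP; the endpoint, token, and agent UUID are placeholders, not values from our environment.

```python
# Sketch of the scheduling workaround described above: disable the Neutron L3
# agent on the upgrade target host via PUT /v2.0/agents/{id}, then re-enable
# it after the host has been upgraded. All identifiers are placeholders.
import requests

NEUTRON_URL = "http://neutron-api.example.com:9696"       # placeholder
TOKEN = "<keystone-token>"                                 # placeholder
L3_AGENT_ID = "11111111-2222-3333-4444-555555555555"       # placeholder

def set_l3_agent_admin_state(agent_id: str, up: bool) -> None:
    """Enable or disable an agent so the L3 scheduler (de)selects its host."""
    resp = requests.put(
        f"{NEUTRON_URL}/v2.0/agents/{agent_id}",
        headers={"X-Auth-Token": TOKEN, "Content-Type": "application/json"},
        json={"agent": {"admin_state_up": up}},
    )
    resp.raise_for_status()

set_l3_agent_admin_state(L3_AGENT_ID, up=False)   # before upgrading the host
# ... migrate routers off the host and upgrade it ...
set_l3_agent_admin_state(L3_AGENT_ID, up=True)    # after the upgrade
```

As noted above, this solves the scheduling problem but not the data-plane interruption; the L3 agent HA (VRRP) function is the longer-term answer.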
Then, Yan Kai will wrap up our presentation. Thank you, Takashi. So I will do a quick summary. For the live upgrade, our goal is to achieve a one hundred percent live upgrade without service downtime, but it's still difficult to achieve that when you do the upgrade from Havana to Icehouse. The first issue we found is about network downtime. Because there is no active-active high availability provided for the Neutron L3 agent, you have to make a trade-off: either you stop new requests to create networks and migrate the routers from one network node to another one by one, or you block one network node so that the scheduling works well, but then there will be an interruption for the VMs whose routers are on that network node. Another issue we found, a general issue, is that some API requests will fail if there is a failover between the active-standby components. And another one is about the VNC console: during the upgrade, an open VNC console will be disconnected. I know there are discussions about how to keep the console session seamless, but I don't think that is implemented yet.

So basically, here are our suggestions for the community. If you think about the whole upgrade procedure or strategy, there are two key things you need to consider. The first one is active-active HA for all the components; this is very important for being able to do the live upgrade. The other one is how you make sure components of two different versions can coexist and provide service while you're doing the upgrade, so in this case the RPC versioning is very important to support the rolling upgrade. Also, I know there are some plugins or drivers provided by vendor companies, either for Neutron or Cinder. For example, if a vendor provides an SDN controller, the Neutron server API itself supports active-active HA, but because it needs to use that SDN controller, it may break the active-active HA. That's another issue we found during this testing.

Also, in this testing we didn't cover Ceilometer. And I know there are some design sessions for Cinder rolling upgrade, Heat rolling upgrade, Nova rolling upgrade, and so on, from today through tomorrow, so I expect there will be more improvements to benefit all the users. Also, in this talk we didn't cover how you migrate to Neutron if you have nova-network, or, if you have the Icehouse Neutron with the legacy router, how you migrate to DVR, the distributed virtual routers. So there is a lot of future work to do. For the NTT public cloud, we expect to deploy the production environment on Juno, and we plan to have another round of testing from Juno to Kilo. In that case, we want to achieve a much better result than this round of testing. Thank you.

So, any questions? No, no, one more page. One more page? This is the reference documents. Oh, OK. Sorry. Yeah. So we will have some time for questions. OK. Good. No questions. It's lunchtime. OK, thank you all. Oh, sorry.

So right now, you're running on Icehouse. Have you already started looking at moving to Juno, and what things do you see right now that you're going to be looking at as issues? Or have you thought about that yet? The issues? Yeah, of trying to go forward. A general issue will be the RPC API consistency; you need to double-check that. Another one is Neutron. If you want to take advantage of the L3 agent HA or DVR, that's a different design, a different implementation.
How you migrate from the legacy-router Neutron to the new Neutron design, that's another big problem you need to consider. Yeah, thank you. Thank you.