OK, good morning, everyone. It's about time, so let's get started. In this session we are going to talk about rolling upgrades. We will briefly go through what a rolling upgrade is and the current upstream development status in each project, and then we will show the test results of our OpenStack deployments, both in VMs and in containers. My name is Lu Jingying; I'm currently working as a software engineer for Fujitsu. And this is my colleague, Hugh. Yeah, I'm a software engineer from Fujitsu Vietnam. If you have any questions or suggestions regarding our presentation, you are very welcome to reach us either by email or by pinging us on our IRC channels.

So let's get started. Before we go through our test results, we will briefly explain what a rolling upgrade is. Under the definition from the OpenStack Technical Committee (TC): suppose we have a project with several services inside it. If we can upgrade one of the services inside this project to a newer version, and then do this to all of the services one by one, we call that a rolling upgrade. The TC's definition does not strictly require that we shut down and restart strictly one service at a time; it allows some of the services to be shut down and brought back up together. In this case, if we have an HA setup of the project, then while we are upgrading it we can fail over from one copy to another, so from the user's perspective we can minimize the service downtime during the upgrade.

So why do we want rolling upgrades? Without them, we have two other options. The first is to just shut down all the services and bring them up together, which we call a cold upgrade.
In the cold upgrade case, we experience a large amount of service downtime, because while we are shutting down and restarting everything, the project has no ability to respond to any API requests. Alternatively, we can build another identical OpenStack cloud running the new release and simply migrate all the OpenStack resources from the old-release cloud to the new-release cloud; we call this the blue-green, or side-by-side, upgrade model. But in this case we suffer a really high hardware cost, because we are building an identical OpenStack cloud only for upgrade purposes.

We also want to emphasize what rolling upgrades are not. This is from the OpenStack TC's definition. As I briefly mentioned earlier, rolling upgrades allow some services to be taken down and brought up together at the same time, which means we cannot guarantee zero-downtime upgrades while we are doing rolling upgrades. And even under the definition of a zero-downtime upgrade, we cannot guarantee that users are unaffected by the upgrade happening in the backend, because API requests may experience some delay in response. If we want to ensure that users are not affected by the upgrade at all, we call that a zero-impact upgrade. So bear in mind that in the OpenStack world we have three tiers of upgrades: rolling upgrades, zero-downtime upgrades, and zero-impact upgrades.

So why don't we have rolling upgrades by nature? What blocks us from having them? The reason is that a rolling upgrade ends up in a state where we have both old code and new code running at the same time. OpenStack is a distributed system, and the communication between services inside a project is done over RPC. So while we have new code running, how can we guarantee that it still has working RPC communication with the old code?
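One common upstream answer to this RPC question is versioned RPC APIs: during the upgrade, new clients are pinned so they only send messages the oldest running server still understands, and servers tolerate old message versions. The sketch below is purely illustrative (the class names, version numbers, and message fields are ours; real projects implement this with oslo.messaging RPC API version caps):

```python
# Illustrative sketch of versioned RPC compatibility; real OpenStack
# projects implement this idea with oslo.messaging RPC API version caps.

class ComputeRpcClient:
    """New-release client whose outgoing messages are capped at a pinned version."""

    def __init__(self, version_cap=(1, 2)):
        # During a rolling upgrade the operator pins the cap to the oldest
        # version still running anywhere in the cluster, e.g. (1, 1).
        self.version_cap = version_cap

    def build_message(self, instance, flavor=None):
        msg = {"version": self.version_cap, "instance": instance}
        if self.version_cap >= (1, 2):
            # The 'flavor' field only exists since 1.2, so a capped client
            # must not send it to a 1.1 server.
            msg["flavor"] = flavor
        return msg


class ComputeRpcServer:
    """Server that tolerates both old- and new-version messages."""

    def handle(self, msg):
        flavor = msg.get("flavor", "default")  # a 1.1 message has no flavor
        return (msg["instance"], flavor)
```

Once every node runs the new release, the pin is lifted and clients send full new-version messages; Oslo versioned objects apply the same idea to the payloads carried inside those messages.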
And normally we upgrade the database first, so how can we guarantee that the old code can still access the new database schema without any problems? To address these two kinds of incompatibility, different projects came up with different mechanisms. For the RPC layer, we have RPC versioning, so that a new version of the RPC code can still communicate with an old version, and any payloads transferred over the RPC layer can be wrapped in Oslo versioned objects. For the database side, we need a mechanism that lets both old code and new code access the database; some projects use database triggers to guarantee this. And while we are doing the database schema or data migration, we do not want the migration to affect the running code, so the schema and data migrations should be done in an online manner.

This is the current upstream status of each project with respect to rolling upgrade support. As you can see, most of the core projects in OpenStack already support rolling upgrades. The only one left is Neutron, and the community is still actively implementing it; hopefully it will land in the Pike cycle. For other projects like Ironic and Heat, which are also broadly deployed in public clouds, the community is working hard on it as well. And for some other projects, such as Barbican and Designate, the community does not yet have any plans to support it, but we are trying to push for it upstream.

Starting from here, we will explain how we tested rolling upgrades in our two deployments. In our VM-deployed OpenStack, we have a very typical setup of three controller nodes running in active-active mode.
We then have three compute nodes, two network nodes, and three storage nodes, and the active-active controllers run behind a load balancer with a virtual IP. In our container-deployed OpenStack, the architecture of OpenStack itself is basically the same, but we use an official OpenStack project called Kolla to do the container deployment, so we have two extra nodes specifically for Kolla's purposes: a provisioning node and a Docker private registry node.

With respect to our tests, we only tested these five projects: Keystone, Glance, Nova, Neutron, and Cinder. For the upgrade order, we followed the recommendation from the official OpenStack upgrade guide, and we upgraded from the Newton release to the stable Ocata release.

Then, how did we try to imitate what actually happens in a production cloud? We continuously sent asynchronous CRUD API requests, and for the create requests that take a long time (for example, booting a VM), we created a number of VMs beforehand purely for deletion purposes. Also, while we are upgrading one of the controllers behind the load balancer, there may be a period when that controller cannot respond to any API requests, which is expected during an upgrade. To avoid the load balancer continuing to send API requests to it, we drain the upgrading node from the load balancer while it is being upgraded, and once it is fully upgraded we bring it back into the pool.

As for our estimation of service downtime: with respect to performance, at this moment we are mainly focusing on API service downtime. There may be multiple windows of API downtime, so we record all of the responses we receive from the API requests we send to the cluster, and simply add up all the downtime windows we encounter across all these API requests.
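That bookkeeping can be sketched in a few lines, assuming a time-ordered log of request results (the function and field names here are ours, not from the actual test scripts): consecutive failed requests form one downtime window, closed by the first following success.

```python
# Sum up API downtime from a timestamped request log. Each run of failed
# requests is one downtime window; it is closed by the next success.
# (Assumes the service always recovers, i.e. every window eventually closes.)

def total_downtime(samples):
    """samples: time-ordered list of (timestamp_seconds, succeeded_bool).
    Returns the summed length of all failure windows."""
    downtime = 0.0
    window_start = None
    for ts, ok in samples:
        if not ok and window_start is None:
            window_start = ts               # a downtime window opens
        elif ok and window_start is not None:
            downtime += ts - window_start   # closed at the first success
            window_start = None
    return downtime


log = [(0, True), (1, False), (2, False), (4, True),   # one 3-second window
       (5, True), (6, False), (7, True)]               # one 1-second window
```

With this log, `total_downtime(log)` adds the two windows together, which is how we combine the multiple downtime slots into a single number per project.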
The general steps to upgrade each project are basically the same. The very first step is to expand the database schema. Then, for projects that use trigger-based database access, we set up the database triggers before any new-release code is running. In the control plane, we then upgrade the nodes one by one in a rolling manner. For projects that need it, contracting the database schema has to wait until we have finished upgrading all of the controllers. Then, if the project has any data plane services, we roll-upgrade those after the schema has been contracted. All of these steps can be automated through Ansible tools or batch scripts.

Doing this rolling upgrade in VMs is pretty straightforward. We installed the Newton-release distribution of the five projects, and prepared an Ocata-release local package repository beforehand, because we do not want to be downloading all the new code while we are upgrading; then we just deploy the Ocata release. A special trick here: for the projects where we need to do database migrations, we deployed a separate Ocata-release DevStack only for the purpose of running the database migrations against this OpenStack VM cluster. Each project has very good documentation online showing how to do its rolling upgrade. Now I will hand over to my colleague, who will explain our container tests.

For rolling upgrades in the container environment, we used the Kolla project, which is an official Big Tent project. It packages each OpenStack service and its dependencies into container images, and then uses Ansible or Kubernetes to deploy the full OpenStack cluster.
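The general per-project order described a moment ago (expand the schema, set up triggers, roll through the control plane, contract, then the data plane) can be sketched as a small orchestration outline. The step strings are placeholders for the real operations, not actual project commands:

```python
# Sketch of the per-project rolling-upgrade order. Each step string stands
# in for a real operation (e.g. a project's db expand/contract command, a
# load balancer drain, a service restart); none are real CLI invocations.

def rolling_upgrade(project, controllers, data_plane_nodes, uses_triggers):
    plan = []
    plan.append(f"{project}: expand database schema")
    if uses_triggers:
        # Triggers must exist before any new-release code runs.
        plan.append(f"{project}: install compatibility triggers")
    for node in controllers:
        plan.append(f"{node}: drain from load balancer")
        plan.append(f"{node}: upgrade and restart {project} services")
        plan.append(f"{node}: re-enable in load balancer")
    # Contract only after every controller runs the new release.
    plan.append(f"{project}: contract database schema")
    for node in data_plane_nodes:
        plan.append(f"{node}: rolling-upgrade data plane service")
    return plan
```

The ordering constraints are the whole point: expand before any new code, contract only after the last controller, data plane last.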
While testing rolling upgrades in the container environment, we ran into some trouble doing rolling upgrades with kolla-ansible. Our first option was to use the native kolla-ansible upgrade command. But that command removes all containers with a given name and then brings up new containers with the same name, so there can be a little downtime, and it is not the same as a native rolling upgrade. For example, in Keystone you need to take down the first node, do the upgrade at the DB layer or RPC layer, then restart the services on the first node and repeat the same steps on the next node. We cannot do this with the kolla-ansible upgrade command, so we eliminated this first option.

For the second option, we tried deploying the Newton release and then going into each container and running something like apt update and apt upgrade on the command line. But the Kolla container images restrict root permissions, so we cannot do it that way either.

For the last option, we tried deploying the Newton release, pulling the Ocata-release images alongside it, then manually stopping the Newton containers and manually starting the Ocata ones. This comes with two problems. The first is that if we start a container manually, we may lose environment settings or configuration that come from the Kolla automation tooling. The second is that from Newton to Ocata, Kolla deprecated the old log container, moving from Heka to Fluentd. So this option also cannot achieve a rolling upgrade in containers.

Finally, we found a workaround for doing the rolling upgrade in the container environment. The first step is to deploy the Ocata release with Kolla, and then stop and rename the Ocata-release containers of the five projects.
For example, we rename the nova-api container to a nova-api-ocata container. So the first step is deploying Ocata, the second step is renaming, and in the last step we deploy the Newton-release OpenStack with Kolla. With this, we have both container versions of the OpenStack services present on the same node, and we can start them manually without losing the configuration or environment settings.

So the general step for the rolling upgrade of each project is: for each service on each node, we stop the Newton container and then start the Ocata container immediately. As a special trick for the database migrations, we deploy another DevStack environment and reconfigure its database endpoint to point at the Kolla cluster, to make sure the online data migrations and triggers can run.

Some notes here on docker run versus docker start. docker run is the normal container bring-up mechanism: it creates the union read-write layer, mounts the volumes, sets the configuration, and then starts the container. docker start only starts the container. The first three steps of docker run are very fast compared with the container's own start time, so they can essentially be ignored; with docker start, the wait depends only on the specific service running in the container. Either way, after we start the OpenStack service in Docker, we need to wait for the service to come up. Now my colleague will present the results of the upgrades in both environments.

OK, so these are our test results with respect to API service downtime in each project. As you can see, for projects such as Keystone, Glance, Cinder, and Neutron, we can almost reach zero-downtime upgrades. But in Keystone's case, even across dozens of test runs, we still occasionally encountered some DB deadlock failures, so we cannot guarantee zero-downtime upgrades; we still see one or two API failures.
With respect to Cinder's case: currently cinder-volume still has to run in active-passive HA mode, so while we bring up the passive node and shut down the active node, there may be a small window in which cinder-volume cannot answer API requests. That is why we saw some failures in the container case while upgrading Cinder.

Some explanation of why we saw so much downtime in Nova: for now, Nova's rolling upgrade still requires a full shutdown and restart of the Nova control plane. For example, if we are running an old release of Nova and want to upgrade it, we need to bring down all of the control plane services, upgrade the database, rewrite the configuration files, and then bring up the new release. The good news is that while we are doing this in the control plane, the running Nova instances are not affected by the procedure; during the control plane upgrade we are only prevented from launching new instances or from resizing or deleting existing ones. The rolling upgrade in Nova only happens when upgrading nova-compute: for the running compute nodes, you can live-migrate the running instances from one compute node to another and then upgrade that node's nova-compute. Remember the RPC incompatibilities we mentioned earlier? Here they occur between the new version of the control plane and the old version of the compute nodes, and, because we are live-migrating between compute nodes, also between the old and new nova-compute nodes.

As for the Neutron server's case, since the Ocata release we can have multiple versions of neutron-server running at the same time.
But the Oslo versioned objects work there has not actually been finished yet; the community is still trying very hard to get it done in the Pike cycle. The reason we can support this starting from the Ocata cycle is that all the contract scripts were banned. Previously, all of the neutron-server downtime happened during the contract database migrations, and now that these contract scripts are banned upstream, we no longer encounter any contract-caused neutron-server downtime.

Also, as I briefly mentioned earlier, cinder-volume currently only supports active-passive HA mode; the community is working hard toward a truly active-active HA mode, but we are not there yet. There are some options to minimize the downtime while bringing up the passive node and shutting down the active node. The first is to configure both the active and passive cinder-volume to listen on the same message queue, so that when we shut down the active one, the passive one, once it is ready, can read from the same queue seamlessly. Or we can increase the timeout between cinder-api and cinder-volume, so that while cinder-volume is being upgraded in the backend, cinder-api holds the API requests until the timeout is hit; if we can finish the Cinder upgrade before the timeout, we will not see any API failures either.

The downtime difference between VM-deployed and container-deployed OpenStack is this: in the VM case, when upgrading a project we need to fetch and unpack all the packages and then restart the services, while in the container case the only time we need to wait for is the restart time of each service.
And since a container restart is generally much quicker than a VM-style package upgrade plus service restart, it is reasonable that we see less service downtime when OpenStack is deployed in containers. Also, for these four projects, even though we did not see many failed API requests, it is reasonable to assume that in a heavily loaded environment your API requests will experience some delay in response: while one copy of the running services is down, only two-thirds of the services are answering API requests.

So, some conclusions. For these four projects we can now upgrade with almost zero downtime, which is good news; and in Nova's case, if you deploy in containers you can expect less service downtime than in VMs.

Some remaining tasks we want to achieve in rolling upgrades: we definitely want to see the rolling upgrade feature in all the projects we want to use, so we are actively pushing the community to implement it in Barbican and Designate, and in the future maybe we will turn to Trove, Manila, or Sahara. Also, because we want to use Kolla to deploy our container environments, we would like Kolla to support the native rolling upgrade mechanisms these projects already provide. We have some ongoing blueprints we are taking care of, and once they are fully merged, we will be able to just use kolla-ansible upgrade to do rolling upgrades in containers.

As for future work: rolling upgrades are cool, but of course we would like to see truly zero-downtime upgrades. To get there, we have two options. The first, just like in the rolling upgrade case, is to implement this feature in each project.
That would directly give us zero-downtime upgrades, no matter what your deployment is. The second option is to do zero downtime in containers: we have another session explaining our current approach, which can give us a zero-downtime OpenStack cluster upgrade. The room is right next to this one, right after this session, so if you are interested, you are very welcome to join us. Also, a zero-downtime upgrade still gives us some delayed API responses, so we would also like to see how we can have no impact at all on our users while upgrading in the backend; this is our second piece of future work.

Also, all three kinds of upgrades we have discussed here are adjacent-release upgrades: if you want to be on the Pike release of OpenStack, you have to be on at least the Ocata cycle first; you cannot jump from the Newton cycle to the Pike cycle directly. This can be painful for some public cloud providers, because they have to go through this whole upgrade exercise every six months. So if possible, we would also like to see skip-release upgrades, but that is hard, because under the current deprecation rules in the OpenStack community most features are deprecated within only two cycles; if we want skip-release upgrades, we need to address that issue first.

We have made our upgrade scripts and our testbed information public, so if you are interested in following our procedure for testing rolling upgrades in these two deployments, you are welcome to pull our GitHub repo. We would also like to thank the communities for helping us while we were doing these upgrades; if you have any questions, I think they will be very happy to answer them on IRC. And a special thanks to our colleagues who helped us a lot while implementing these tests. OK, so now we are ready to take some questions; hopefully we can answer them.
If anyone has a question... OK, the gentleman over there. Thank you for sharing a very useful direction with us; we have needed it for six years. Everyone wants to discuss this kind of upgrade, but in my mind, practically, the first thing we would like to make sure of when upgrading is that, even with non-zero downtime (let's say I'm not even expecting zero downtime), a day or even a week after the upgrade it is really working as before. That is the first goal, but even that goal has not been easy, mainly because with each next release they change the configuration parameters. So how did you deal with that, when there is a significant number of configuration parameters that are deprecated or newly introduced? It's not only upgrading the source code itself; you have to find the matching configuration file per project, right? Yes, yes. In our tests, before we upgraded to the new code, we prepared the next release's configuration files in place, so that after the upgrade the new code can use the new configuration files. But yes, I understand your concern: eventually we need to restart the new version of the code in order to bring the new configuration into practical use. So eventually we still need to restart the new code even though we have upgraded it. So are you planning, with each new release, for every project to provide guidance, like "if you have this kind of parameter, then because of this change you have to use this different one", that people can follow? Yes; before you upgrade that part of the code, you need to bring the new configuration files into place first. There is no unified way to manage the new set of configuration yet. At this summit there are two sessions about how to implement configuration management in an OpenStack project, and there have been some attempts with etcd version 3, but there is no final consensus.
So, for now, OpenStack recommends upgrading only from version X to X plus one, not skipping further, because of deprecated or removed configuration. Yeah, hi. Thank you very much for what you're doing here; this is huge for a newcomer to this space. Our fear has always been that if we stand up something like this and we can't do a zero-downtime upgrade, that alone would prevent us from even entering this space, so seeing what you're doing here is really outstanding. I also have a question. Obviously, looking at it from the VM perspective and the container perspective, you're saying that with containers there's less downtime. So for someone new like myself, starting out on this path, would you recommend going the container path over the VM path just for that purpose? Well, I think if you are deciding which deployment you want, you cannot look at it only from the upgrade case, because that is only one of the factors that will influence your decision. You also need to look at what kind of workloads you will be running in your OpenStack cloud, and in your case containers may not be the best choice. But with respect to the rolling upgrade perspective alone, we would prefer containers; that is only for the rolling upgrade case, though. Absolutely, that's what I'm trying to figure out, because if you're new to this space, you're not going to get fancy right away. Thank you. No problem. So if no one has any more questions, we can close here. Oh, sorry, one more. Hi, good morning. Thank you for sharing this interesting experience. I'm just wondering about one more piece: you talked about rolling upgrades, and maybe you have some experience with rolling back if something goes wrong?
Yes, so this is really, how can I put it, painful. You're asking: if something goes wrong while we're doing the rolling upgrade, how can we roll back to the previous version? From the database perspective, the OpenStack community has banned database downgrades, so if you want to bring back the old database you need to have a backup, and if something bad happens you restore the database from that backup. But for the code part, we don't have a reasonable or satisfying approach for rolling back that kind of downgrade. This is also something we need to address in the future; for now, we don't have a satisfying solution. Sorry. Thank you. No problem. Thank you very much for your time. I hope you enjoyed the session. Thank you. Thank you.