Hello, good afternoon, everybody. Welcome to this session. How was lunch today? I think we can get started, since our time is limited. Welcome to this session about the rolling upgrade of the Intel private cloud. My name is Xu Quan Huan, and I'm based in Shanghai. I lead a team in Shanghai that builds the Intel private cloud, providing cloud services within Intel. Standing beside me is Ling Tan.

Yes, my name is Ling Tan. I'm also based in Shanghai, China. I work for Intel OTC, and I'm an active developer on the Ironic project.

So today we will present our experience with a rolling upgrade from Havana to Icehouse. Before the presentation, I want to do a quick survey. Can you raise your hands if you run OpenStack in production? Great. And who is planning an upgrade for that environment? Oh. And who has finished a successful upgrade in production? Nice. OK.

So today, first of all, I will introduce the Intel private cloud: what it is, what kind of services we provide, and which OpenStack modules we use. Then I will talk about the strategy and the considerations behind this upgrade. Then I will walk you through the rolling upgrade road, which covers the rollback plan, the validation plan, and the automation tools behind the upgrade. At last, I will show you the results of our upgrade and our feedback to the community.

A long time ago, before OpenStack came out, we had already developed a home-grown cloud framework to manage ESXi as the hypervisor. So we had an existing cloud environment with many users. We integrated that environment with OpenStack, leveraging OpenStack to help us switch the hypervisor from ESXi to KVM, reduce vendor lock-in, and manage the KVM hypervisor. We started investigating in 2012, and finally we integrated with OpenStack: we developed a driver inside the VM provisioning module of our cloud framework, and that driver talks to the OpenStack cluster. As I said, we do not use Horizon; we still use our existing portal. So we only use Nova, Glance, Keystone, Cinder, and Neutron, and these are the projects we are going to upgrade from Havana to Icehouse.

And what is the architecture and topology of our OpenStack cloud? We implemented high availability for the OpenStack control APIs. As you can see, we have two controllers and two load-balancer nodes. All requests come through the public VIP and are load-balanced across the controllers. Within this cluster we have a MySQL cluster and a RabbitMQ cluster. On the storage side, we use multiple tiers of storage: a Ceph cluster provides block storage, and at the same time we use local SSD storage to provide high-performance storage to our users. The compute nodes are divided into different host aggregates, so we can schedule different VMs onto different host aggregates to fulfill our customers' requirements. From the network perspective, because of the limitations of the Neutron L3 agent in Havana, we do not rely heavily on it. We use VLANs in Neutron, and the physical switches route the traffic out to the public network. We do have the Neutron L3 agent running on our network node, but only for testing and some small-scale usage.
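To make the provider-VLAN setup concrete, here is a minimal sketch of creating the kind of network this topology relies on; the network name, physical network label, VLAN ID, and subnet are illustrative values, not our actual configuration:

```bash
# Hypothetical example (Havana-era CLI): a provider VLAN network whose
# gateway lives on the physical switch, not on the Neutron L3 agent.
neutron net-create prod-vlan-200 \
    --provider:network_type vlan \
    --provider:physical_network physnet1 \
    --provider:segmentation_id 200

# Illustrative subnet; the physical switch routes 10.0.200.1.
neutron subnet-create prod-vlan-200 10.0.200.0/24 \
    --name prod-subnet-200 --gateway 10.0.200.1
```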
In the diagram on the left, you can see that we used Puppet to deploy this cluster from scratch. We have a local code repository and local pip packages in a local repository; Puppet fetches these packages, pushes them to the specific nodes, and deploys them. You can also see that we use Logstash to aggregate all the logs from the nodes into a centralized place, and Elasticsearch to search across them. On top of the Elasticsearch API, we also developed an automatic error-log detector, a system that detects error logs across the whole cluster and sends notifications to the operators. For the last two items, we use Shinken to generate alerts and send mail, and Ganglia to collect metrics. So this is our production environment, and it has already been running for about a year.

So why did we decide to upgrade the existing cluster? I think there were two major reasons. With the business growth, we found the old version could no longer fulfill our requirements: it has some bugs that blocked our customers, and we also wanted some new features that this cluster could not provide. We weighed different choices; for example, we could have backported some patches onto this branch and rolled them out to the cluster, but that costs a lot of effort. Another thing is that Havana is no longer supported by the community. So we decided to upgrade the cluster directly to the next version.

Having made that decision, how do we perform the upgrade? We did not want to skip versions, because that carries huge risk, I have seen few such practices in the community, and we cannot afford a failure or a disaster. So we decided to go version by version. And we wanted it to be a rolling upgrade, meaning piece by piece: we upgrade one service at a time to see whether any problem appears, and if we hit a disaster, we can roll back quickly. To further reduce the risk, we only upgrade the OpenStack services; we do not upgrade the database, the OS, or anything else. We did not want to do the whole thing at one time.

So what is the right timing for the upgrade? In theory, we should find a proper time based on our monitoring data. But when we actually did the upgrade, we did not use the monitoring data; we relied on our practice runs. We knew the VMs would see no downtime, only a brief downtime of the OpenStack APIs. So we just scheduled a maintenance window for the APIs and did not send a notice to the customers using the VMs, because after proper testing we had found the VMs would have no downtime. As for the long-term strategy: after this first upgrade of our production environment, we decided to follow the community's steps and upgrade version by version with each release.
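As a side note on the automatic error-log detector mentioned above, this is roughly the kind of query such a system can run against Elasticsearch; the endpoint, index pattern, and field names here are assumptions for illustration, not our actual setup:

```bash
# Hypothetical sketch: look for ERROR-level entries across all Logstash
# indices from the last five minutes, then notify operators on any hits.
curl -s 'http://localhost:9200/logstash-*/_search' -d '{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-5m" } } }
      ]
    }
  },
  "size": 10
}'
```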
The road to rolling upgrade covers a lot of things, and it is really a road to building your confidence. When I first proposed the rolling upgrade, my teammates were quite afraid of it, because they did not have confidence yet. So I set up two stages before rolling-upgrading the production environment: first a VM environment, then a beta environment. I asked the teammates working in this area to first finish the rolling upgrade in the VM environment, to build confidence and do some research and practice. The main purpose of the VM environment is to produce our new local repository, work out the configuration file changes, and verify the basic upgrade procedure. But the VM environment has a drawback: as I mentioned, we provide VMs in our production environment, but in a VM environment we cannot create nested VMs. So the next step is the beta environment, where we verify the local repository we built in the VM environment and refine the upgrade procedure on a topology and architecture similar to production. However, the beta environment does not have a database the same size as production's, nor as many tiers of storage, so some things we actually had to do for the first time in production. It took us about four weeks to go through all these steps, from the VM environment to a successful upgrade of the production environment.

Along this road we also had to come up with some plans. The first is the rollback plan. I think every upgrade should have a rollback plan, although we would rather not roll back in production, because we do not want to run into unexpected issues there. But we do need to exercise the rollback plan in the beta or VM environment, to make sure we have the ability to roll back to the pre-upgrade state if something fails that we cannot figure out; we have to bring the services back up as soon as possible and not overrun the maintenance window. So how do we roll back? Back up the database, the configuration files, and the repository beforehand, and then simply restore them to the environment. After several tests of this rollback plan, it became very easy.

Next, the validation plan. The question is: how do you verify that your cluster is working fine after the upgrade? At the beginning we wanted to try Congress. Congress is a framework that captures the cloud status; it is actually an open policy framework for the cloud, and to fulfill that task it collects the cloud state, a capability we wanted to leverage. But eventually we did not use Congress, because the project did not meet our requirements: for example, it did not gather all the tables of every project (it missed some tables in Keystone), and it only stores the data in memory, changing as the cloud status changes, so we could not easily capture the result and diff it after the upgrade. So we chose another option: we manually wrote scripts that collect data from the cloud through the OpenStack APIs, and we diff the results after the upgrade. We run these validation scripts after we upgrade each service.

With all the plans and scripts in hand, the next thing is how to automate the upgrade of the nodes in sequence. In an upgrade, the sequence matters. Another important thing is that the config files matter. Between Havana and Icehouse there are option changes in some config files, so we have to update the config items according to the version change, and we have to clearly understand which service has which changes, which we then reflect in the templates in our Puppet code. In our beta and VM environment testing, we hit a lot of issues around config file changes. So we came up with an idea: it would be best to have a config file converter that takes the previous config file as input and outputs the new version with the correct values set. We made this proposal and did a little work on it, but unfortunately we have not finished it as of today, so we still change the config files manually. I think this is something the community could work on in the future: a config file converter from one version to another.
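Until such a converter exists, the edits can at least be scripted rather than hand-typed. Here is a minimal sketch using crudini; the two options shown are real Icehouse-era changes discussed in this talk, but the full set of edits depends on the deployment:

```bash
# Sketch: make the Havana -> Icehouse config changes repeatable.

# Let Icehouse control services talk to not-yet-upgraded Havana computes.
crudini --set /etc/nova/nova.conf upgrade_levels compute icehouse-compat

# Switch Neutron from the deprecated openvswitch plugin to ML2.
crudini --set /etc/neutron/neutron.conf DEFAULT core_plugin \
    neutron.plugins.ml2.plugin.Ml2Plugin

# Diff against the pre-upgrade backup before restarting anything.
diff -u /backup/etc/nova/nova.conf /etc/nova/nova.conf
```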
As for how to automatically upgrade multiple nodes, we leveraged our local CI system, just as the diagram shows. We keep everything local: code, Puppet manifests, and pip packages. We update these, replacing the old versions, and the Puppet master picks them up and deploys them, for example to a controller, installing the new code and updating the config files. That is how we upgrade the cluster. But Puppet has a drawback: it is not easy to control the sequence of the nodes. So we looked at Ansible. We think Ansible is a nice tool to control this kind of sequencing, so we combined Puppet and Ansible to do the automation.

After all this preparation, we could actually go to the production environment and do the rolling upgrade. The preparation is: back up everything, including the config files and the database; build the Icehouse local repository on our server so Puppet can fetch the new packages; and, the last and very important thing, remember to stop the Puppet agents in the cluster. We ran into this trouble in our beta environment: we successfully upgraded some nodes but forgot to turn off the Puppet agent, and it overwrote our upgrade back to the Havana version. If you use other tools to manage the configuration of your cluster, remember to turn them off too.

Then the sequence: we chose controller 1 as the node to upgrade first and kept controller 2 serving, because the load balancer sits in front of controller 1 and controller 2. We can just stop the services on controller 1, and the API requests still go through controller 2; controller 2 acts as the backup node in the meantime. Then we upgrade the services on controller 1, bring them up, and shut down the corresponding services on controller 2. After finishing the OpenStack controller API upgrade, we upgrade the compute nodes with their Neutron agent and nova-compute services.

What is the general upgrade process for each project? It is quite simple, I think. First, update the config file to the new version, stop the old services, and uninstall the old code; make sure you uninstall the old code. Then install the new code and do the DB sync. For Neutron the DB sync is a little more complicated; it is different from the other projects, and you have to do the stamp before you do the DB sync. Then start the new services and verify that everything works well. This process works for each project.
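Here is that general process as a shell sketch, using Keystone as the example; the package and service names are illustrative (a pip-based install is assumed, matching our local repository setup) and differ per project and distro:

```bash
# Hypothetical per-project upgrade loop, shown for Keystone.

# 1. Update the config file for the new release (already backed up).
# 2. Stop the old service.
service openstack-keystone stop

# 3. Uninstall the old code -- really do it, or stale .pyc files from
#    the old release will break the new one.
pip uninstall -y keystone python-keystoneclient

# 4. Install the new Icehouse packages from the local repository.
pip install keystone python-keystoneclient

# 5. Migrate the database schema (the only window of API downtime).
keystone-manage db_sync

# 6. Start the new service, then run the validation scripts.
service openstack-keystone start
```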
I have a diagram to show you the sequence of the API status changes. First of all, as I said, shut down all the services on controller 1. Then we upgrade Keystone: bring the Keystone service up on controller 1, and shut it down on controller 2. So the only downtime for the API services is while you sync the database and change the database schema, and during the whole upgrade there is no impact on the VMs. Then Glance, Nova, Neutron, Cinder, and it is done: you have an Icehouse controller. Next, Lin will present the detailed steps of each upgrade, and he will also share the surprises we ran into.

OK, thanks, Su-chuan. I will talk in more detail about what we did in our upgrade, basically step by step; it may be a little boring for anyone who has already done this work. OK, let's do it.

First of all, Keystone. As Su-chuan said, what we need to do is modify the configuration, stop the service, and do the cleanup, that is, uninstall the old code and the python-keystoneclient. Then install the new Icehouse packages and do the DB sync, and after that you can start the new service. Keystone is pretty simple, and Glance is much the same. But before upgrading Glance, one thing you have to do is convert the database character set to UTF-8. Then it is the same: modify the configuration, stop the service, and do the cleanup. Actually, there is one mistake we made, because sometimes we are lazy: we forgot to do the cleanup for Glance. When we did the validation, such as a glance image-list, we got an error about duplicate options, and we found out it was because the old code, meaning the .pyc files, still existed on the node. So one thing I want to emphasize: you must do the cleanup; don't be a lazybones like us. Then you do the DB sync, and you get a new Glance service.

The Nova service is similar; you just need to modify the configuration. One thing I do want to point out is the upgrade_levels option: you can set compute = icehouse-compat, which lets the new Icehouse Nova APIs communicate with the old compute hosts. It is very convenient and useful; thanks to the community for providing this option. It pins the compute RPC API to version 3.0, which is compatible with both Havana and Icehouse, so you can start and delete VMs as usual. The rest is similar: stop the service, do the DB sync, do the cleanup, and then you can start the new Icehouse Nova API services.

OK. Neutron is a bit more difficult and complex because of the ML2 stuff. First of all, you have to convert the character set to UTF-8. Then keep the configuration untouched for now, clean up the old code, and install the new packages. Then upgrade the database: do the stamp and run the ML2 conversion script. Actually, we made a low-level mistake here as well: when stamping, we used a capital-H "Havana" instead of the lowercase "havana", which the Neutron DB manager does not recognize, so it tried to upgrade the database all the way from Grizzly instead of from Havana. That was something weird; maybe we should complain to the community. After the DB schema upgrade, you can modify the configuration. One important thing here is the default plugin setting: you should change the core plugin from openvswitch to ML2. That is very important. Then you can start the Neutron server. That is all for the Neutron server on the control node.
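The Neutron database steps just described look roughly like this; the config paths and database URL are illustrative, and migrate_to_ml2 is the conversion script shipped with Icehouse Neutron:

```bash
# Mark the current schema as Havana -- lowercase, or the DB manager will
# try to migrate from the very beginning instead of from Havana.
neutron-db-manage --config-file /etc/neutron/neutron.conf \
    --config-file /etc/neutron/plugins/ml2/ml2_conf.ini \
    stamp havana

# Upgrade the schema to the latest (Icehouse) revision.
neutron-db-manage --config-file /etc/neutron/neutron.conf \
    --config-file /etc/neutron/plugins/ml2/ml2_conf.ini \
    upgrade head

# Convert openvswitch plugin data to ML2 (credentials illustrative).
python -m neutron.db.migration.migrate_to_ml2 openvswitch \
    mysql://neutron:secret@localhost/neutron
```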
Then we finished our work on the control node and started to work on the network node. For the Neutron agents, that is, the L3 agent, the metadata agent, and the Open vSwitch agent, it went pretty smoothly: uninstall the packages, do the cleanup, and then modify the configurations. You should also clean up the OVS configuration if you use the L3 agent. Then you can start the new network services. We had no network problems here.

OK, nova-compute. The cat on this slide is just for fun. Unlike Neutron, we did not have any problems here; it is very simple. You just need to modify the configurations, and you do not need to migrate VMs if you are not also trying to upgrade your OVS. Just stop the service, put the new configurations in place, install the new code, and start the service. One thing you should remember: if you enabled the upgrade_levels option on the control node, you should comment it out once the compute hosts are upgraded. OK, that is it for the Nova compute hosts.

And the last one, Cinder. Cinder is much easier, I think: you do not even need to change the configurations. Just do the cleanup, install the new Cinder packages, and do the DB sync. But in our case it was a little different, because we use Ceph, I mean RBD, as our backend. We had a problem here, because there is a big change in the RBD driver from Havana to Icehouse: they changed the image naming, from the old internal ID to the volume's unique ID. So after we finished the upgrade, it gave us errors that it could not find the block device, and you have to manually rename the volume images in the Ceph volumes pool. That is all for my part; back to Su-chuan. Thanks.

So we had a rather smooth upgrade: no downtime for the VMs, and we finished all the API service upgrades within our one-hour maintenance window. The downtime actually only happened while we upgraded the database schemas. We also learned a lot during this upgrade. For example, when you upgrade Neutron, the sequence matters, and you should take care with every word in the documentation. There were some problems we hit in the beta environment that fortunately did not happen in production. One was the DHCP tap port getting tagged with VLAN 4095 by the DHCP agent; we worked around it by disabling DHCP on the subnet and then enabling it again. Another was VM ports losing their VLAN tag, and we had to use the OVS commands to set the tag back. Another thing: because the environment had been running for a long time, some flavors had already been deleted, and the way deleted flavors are shown misled us, so we made some mistakes before tracking down the deleted flavors. And you should read the release notes very carefully, as in the case Tan Lin just mentioned: the RBD driver changed its naming, so we had to rename the images.
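The two manual fixes mentioned above look roughly like this; the port name, VLAN tag, pool name, and image names are all illustrative:

```bash
# Re-tag a VM's OVS port that lost its VLAN tag (names illustrative).
ovs-vsctl set Port tap1234abcd tag=200

# Rename a Ceph RBD image to the name the Icehouse driver expects
# (pool and UUID illustrative).
rbd -p volumes rename my-old-volume-name \
    volume-9f3b2a10-5e6d-4c7b-8a9f-0d1e2f3a4b5c
```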
And for the community, we also think there are directions the community should go. Tan Lin, can you cover this part?

Oh, OK. From our experience, we can see that an upgrade tool is badly needed; you should not rely on human beings doing it all manually. We made many mistakes here, and we were lucky enough not to break our production environment, with no blame from customers. We also think that even for the Puppet template updates, we should use Ansible to do the modifications. And we definitely need to generate the proper configurations automatically and to verify them; I think that is very worthwhile work, and we have planned to do it. The third thing is that we really hope the community will enable live upgrades completely in every project, rather than only in Nova as it is now. We also want to contribute more to the community to make upgrades easier and more convenient for all of us. Yes? OK.

And last, I want to thank the teammates who worked on this upgrade. These are their names: Mengchen, Zhang Xin, Tan Lin, Zhou Zhenzhang, and Jian Tao. And we have a nice picture here. So, any questions? Thank you. Can you please go to the mic?

So, have you thought about how much of this could be applied to moving from, let's say, Icehouse to Juno, or Juno to Kilo? Have you tried to generalize your approach?

I think that should be much easier than what we did for Havana to Icehouse, because I have seen the community do a lot of work on live upgrades, and actually the remaining upgrade work does not carry as much risk.

And are you also implying that moving from one version to another is a custom process? So basically you have to adapt it for moving from, let's say, Juno to Kilo; you can't really create a general upgrade service?

Yeah, something like that.

Hi, I got here a few minutes late, so I apologize if you already addressed this, but what kind of workloads are you running in your environment? Is it more enterprise workloads, or research and engineering applications?

Actually, it is mostly dev and test workloads. Because of Intel's business (Intel is an IC company), the customers use the VMs together with their hardware to do testing against silicon boards or CPUs, that kind of thing. And they also put their databases into our environment.

So do you have a common code base for upgrades for both VM and bare-metal provisioning? And if you do, did you face any issues specific to bare metal?

Can you repeat your question?

Yeah, is the same code base used for provisioning the VMs and the bare metal?

OK, we use two different ways. For provisioning VMs we use OpenStack. For bare metal we have done some POCs, for example with Ironic, but that is still in progress.

OK, thank you.

So when you did the upgrade and flipped the load balancer over to the one control node, did you do that across all the services, then go through and upgrade the services that were basically offline during that period, and then flip back?

Yes.

OK, so you did the whole stack?

Yes.

And then you flipped the load balancer over and went through the other side?

Yes. And why did we do it that way? Because we wanted to shorten the maintenance window.

Great. Any other questions? No? Thank you, guys. Thank you for joining this session.