Good afternoon, everyone. My name is Wei Tian, and I work for PayPal. Thank you for coming to this presentation. Today I'm going to share our experience upgrading PayPal's cloud: we have upgraded all the way from Folsom to Kilo. So let's get started.

The agenda: first, a brief introduction to the PayPal cloud. Second, how we did upgrades before Kilo. Third, the Kilo upgrade itself, where we changed a lot of code and changed the upgrade process. And then what's next: how we can make the upgrade process smoother after Kilo. I saw people taking photos, so here's the slide; that's the only photo you need to take.

About the PayPal cloud: it started in June 2012 with a single engineer and 16 commissioned servers. Today we have one of the world's largest OpenStack private clouds: 8,064 hypervisors, 84 racks, 386,000 cores, and 2 PB of storage, with about 82,000 VMs in total. 100% of PayPal's production traffic runs in this cloud, along with hundreds of web pools and hundreds of pools for PaaS, dev, QA, and other departments. We put our first production workload on SDN in 2013. That's a summary of the cloud.

Here's a brief history of our upgrades. In early 2014 we still had multiple versions of OpenStack running across ten availability zones, Grizzly and Folsom mixed together, and it took us about a year to get all the AZs upgraded to Havana because there were so many variations. Early this year we planned to upgrade from Havana to Juno, skipping Icehouse, and we had already spent two to three months preparing for that. But then we realized we had been playing this catch-up game for a very long time; we were almost a year behind upstream. So the final decision was to skip Icehouse and Juno altogether and go directly to Kilo. That's where we are right now: one AZ is already upgraded to Kilo, and all the others are in flight.

The upgrade is very difficult. We could not follow the process recommended by upstream, where you update all the services version by version, because this is a very large deployment and it carries all of PayPal's production load. We had to take very special steps: we cannot impact existing, running VMs, and there can be no interruption to production services. We also had a mixed Folsom and Grizzly environment. And as I said, we spent two to three months before the upgrade just on preparation. One big reason is that we have a lot of custom code, so when the upstream code moves forward, what happens to our customizations? We had to prepare the code and merge it with cherry-picks and manual merges, all of that.
We also had mixed networking, nova-network and Neutron, so as part of the upgrade we would migrate the nova-network deployments to Neutron.

This is our code base, and it's pretty complicated. We forked stable/havana into our own repository, so we have our own fork of the Havana branch, and we also have a separate branch for our extensions. The difficulty is in the fork. We changed some code there, say in Nova, and we also backported some fixes from the stable branch, and sometimes from master. The problem is that this branch ends up a mess, because it mixes our own changes with backports from different branches. Those backports are not straightforward; they are not clean merges, because you have skipped a lot of commits in between, so you get conflicts. And even after you resolve the conflicts, the feature may still not work, because other code it depends on is not there. All of that merging is time-consuming. Finally, for each of our builds, we merge those two branches together and build the result.

That was the situation for Havana. Folsom was even worse: we did not have the extensions branch at all, so all the changes were mixed up in a single branch. Think about it: with custom changes, upstream changes, backports, and assorted bug fixes all tangled together, upgrading to the next version is almost impossible. That's why preparing the code for an upgrade takes so long.

This is what we planned for Juno. We abandoned it in the end, but the process was: starting from our Havana branch, fork stable/juno from upstream, then backport everything we changed in Havana, one by one, into the Juno branch. For our extensions, branch out a new Juno branch; and since those commits may not work with the Juno code, we would have to patch that branch manually. All that code merging and patching takes a long time, and then come all the testing cycles.

Another point: we had actually created our own database tables instead of using the standard upstream tables. You can see it in the chart: the top row is Folsom, where we created custom tables, and the chart shows how those map to the Grizzly tables and how the Grizzly tables map to the Havana tables. Because of this we could not just run db sync to upgrade our database; we had to write a lot of custom scripts. (I'll show a quick sketch of what such a script looks like in a moment.)

The other challenge is migrating nova-network to Neutron. In the Folsom data centers we still had nova-network, and during the migration we have to keep the data plane working: no downtime for the running VMs. So we had to do a lot of hacks to migrate the VMs from Linux bridge to Open vSwitch without interrupting them.
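Before going on with the network migration, here is the sketch I promised of those custom schema-migration scripts. This is a minimal, hypothetical example: the custom table and its columns are invented for illustration, and the real mappings were specific to our Folsom- and Grizzly-era tables.

```python
# Hypothetical sketch: fold data from a custom table back into the
# standard upstream schema before running the normal db sync.
from sqlalchemy import create_engine, MetaData

def migrate_custom_instance_zones(db_url):
    engine = create_engine(db_url)
    meta = MetaData()
    meta.reflect(bind=engine)

    # 'custom_instance_zones' stands in for one of our non-standard tables;
    # 'instances' is the upstream table it has to be reconciled with.
    custom = meta.tables["custom_instance_zones"]
    instances = meta.tables["instances"]

    with engine.begin() as conn:  # one transaction, so a failure rolls back
        for row in conn.execute(custom.select()):
            conn.execute(
                instances.update()
                .where(instances.c.uuid == row.instance_uuid)
                .values(availability_zone=row.zone_name)
            )
```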
For the nova-network migration we implemented a lot of custom code, for example a fake device driver for the networking, and we also had to clone the networking settings from the Linux bridge setup into the SDN controllers; we use NSX as the SDN controller. My coworkers have a more detailed presentation on this migration, so I just want to summarize one key point. We said we cannot interrupt the data plane, and for a running VM the data plane means CPU, memory, storage, and networking. Of those, networking is actually the one component where you can tolerate a brief outage. Why? Because networking is designed for failure: if there is a very short interruption, the OS will simply retry the connection. So if we do the switchover quickly enough, the VMs are not interrupted.

Next: how we package and deploy our code base. As the previous slide showed, our two branches are merged into one build, and we build everything into a virtual environment. For each service we build a tarball that includes everything the virtualenv needs, all the dependencies. What's the benefit of deploying with virtualenvs? First, it's fast: when you deploy the tarball to a server, the only thing you do is unzip it, and it includes everything it needs. Second, it's predictable, because all the dependencies and packages are packed together. You can even run multiple versions of Nova on the same control node if you want. Third, it's easy to roll back: in our deployment folder, each version has its own directory containing everything, so a rollback is just changing the startup script to point to a different directory. And it simplified our Puppet code: for the OpenStack services, the only things Puppet has to manage now are the startup scripts and the config files; it no longer installs packages or manages dependencies. (There's a small sketch of this packaging idea below.)

For Kilo we tried to change the process, because we don't want an upgrade to take three to six months; then we would still be a year behind upstream. The principles stay the same. First, no downtime on the data plane: any running VM has to keep running. We do accept a few hours of downtime on the control plane, which means that during the maintenance window you cannot create new VMs or shut VMs down. That's why we always do upgrades on Saturday: activity is relatively low on the weekend, and we have Sunday in case anything goes wrong. To shorten the maintenance window, we focused on the biggest time sink, which is migrating the hypervisors.
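Here is that minimal sketch of the per-service virtualenv packaging, before we get back to the hypervisors. The paths, names, and requirements handling are illustrative; the real build also pinned every transitive dependency.

```python
# Build one service's deployable artifact: a tarball of a virtualenv
# that contains the service and all of its dependencies.
import os
import subprocess
import tarfile

def build_service_tarball(service, version, requirements_file):
    venv_dir = "/opt/openstack/{0}-{1}".format(service, version)
    subprocess.check_call(["virtualenv", venv_dir])
    pip = os.path.join(venv_dir, "bin", "pip")
    subprocess.check_call([pip, "install", "-r", requirements_file])

    tarball = "{0}-{1}.tar.gz".format(service, version)
    with tarfile.open(tarball, "w:gz") as tar:
        # Pack the whole environment: interpreter, service code, dependencies.
        tar.add(venv_dir, arcname=os.path.basename(venv_dir))
    return tarball
```

Deploying is then just unpacking the tarball, and rolling back is repointing the startup script at the previous version's directory, for example from /opt/openstack/nova-2015.1.1 back to /opt/openstack/nova-2015.1.0.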
As I was saying, each AZ has around a thousand hypervisors, so migrating them takes a long time. We therefore prepare the hypervisors as much as we can before the upgrade. For example, if we need new libraries, we install them beforehand, and we pre-copy the new tarballs into the local cache on all the hypervisors, so that on upgrade day, when Puppet runs, it doesn't need to download anything from the master; it just runs.

The process is: first, get our code ready for Kilo, meaning do the code merge so that our customizations work with the Kilo code. Second, build a shadow control plane, which is a clone of our real production. We don't want to do anything fancy; we want to make sure the upgrade succeeds, because this is production. We prepare runbooks for the whole process and for each component, literally step one, two, three, so that you can hand a runbook to anyone and they don't have to think about it; they just do one, two, three, four, five. And we do two rounds of dry runs on the shadow control plane, exercising the upgrade process on the cloned environment. After each run we have a QA cycle with all the functional tests plus performance and load testing. Then comes the second dry run, and then we prepare all the hypervisors for the final upgrade.

For Kilo we also wanted to get rid of all the mess in this branch hierarchy, the mix of commits from our own code, from upstream, from different places. We wanted to get rid of the fork entirely. So now we use the upstream stable/kilo branch as-is, and we keep our extensions in a single branch. That means every one of our old changes either had to be dropped or merged into that extensions branch. We found it was practically impossible to cherry-pick or fetch all the old commits into one branch, so what we actually did was completely refactor our code. We took no commits from the old branch. We wrote down the feature list, say ten features we need, and implemented each one fresh, as if it were new; it's brand-new code. The benefit is that the branches are now very clean, so in the future, when Liberty comes, we can quickly take stable/liberty from upstream, merge, and just test with our own extensions.

The new guideline is: no changes to upstream code. No ad hoc changes at all. We do not directly backport or merge any of our old Havana changes; instead we completely rewrote all the PayPal extensions. This time we wanted to do everything the right way, by applying the standard OpenStack extension methods.
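As a concrete example of what I mean by a standard extension method, here is a minimal custom scheduler filter in roughly the Kilo-era style. The BaseHostFilter interface is upstream; the filter itself, the scheduler hint, and the zone lookup are hypothetical stand-ins rather than our real logic.

```python
# A custom Nova scheduler filter, written against the upstream filter
# interface instead of patching upstream code.
from nova.scheduler import filters

def _hosts_in_zone(zone):
    # Placeholder: a real filter would consult our own zone mapping here.
    return set()

class PayPalZoneFilter(filters.BaseHostFilter):
    """Pass only hosts that belong to the requested internal zone."""

    def host_passes(self, host_state, filter_properties):
        hints = filter_properties.get("scheduler_hints") or {}
        wanted = hints.get("paypal_zone")  # hypothetical hint name
        if not wanted:
            return True  # no hint given: do not restrict placement
        return host_state.host in _hosts_in_zone(wanted)
```

A filter like this is enabled purely through configuration, by adding it to the scheduler filter list in nova.conf (the Kilo-era scheduler_default_filters option), with no change to the upstream tree.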
Another benefit of staying close to upstream is that we can participate actively upstream: we can actually report bugs. We found a lot of bugs in our Folsom, Grizzly, and Havana clouds, and they were not ordinary bugs; if you don't run a cloud at that scale, you will never see them. But when we hit those bugs, there was no way for us to contribute the fixes back to the community, because we were so far behind.

So here are the standard ways to customize OpenStack. You can write WSGI middleware, that is, filters for the API. All the OpenStack components have API extensions: you can write resource extensions and controller extensions, and you can add child resources; we do all of those. You can also extend the manager classes: if you look at nova.conf, there are options like scheduler_manager whose value is a class name, and those managers, like the filters, are classes you can extend. We wrote custom filters and weighers for the scheduler, and we added a few custom RPC methods between components. If you really want to do something fancy, you can use the Nova hooks to inject your code into the existing upstream code base, but that's not recommended. And finally, if we have this hard rule of no changes to upstream code, what if I really need to change something, one or two lines? Monkey patching, as a last resort. I'll come back to that.

Here is our new project structure. All the extensions mimic the upstream layout: where we extended Cells we created a cells package, and we have extensions for cells, compute, and conductor, plus API extensions. For the scheduler, the manager classes and the filter scheduler can all be overridden. The interesting part is the patch directory. Sometimes we really do need to change one line of upstream code because something is broken or there's a bug. In the Kilo upgrade we actually found a small bug in Keystone; we reported it upstream and it was fixed in two days. That's the benefit of being close to upstream. For anything that goes into the patch directory, the process is: you must file a bug or a blueprint upstream, and the change cannot stay longer than three months; you have to clean it up within three months.

Now, why did we need to build a shadow control plane? Look at production: all the services running, a database behind those services, the hypervisors, and then the third-party pieces, meaning the Puppet master, Foreman, the Salt master, RabbitMQ, and the NSX clusters. We want to make sure everything succeeds, so we clone everything: the controllers, the database, and all the third-party services. We make the shadow control plane a clone of production, and then we can exercise our upgrade process on the shadow.
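Since I mentioned monkey patching as a last resort, here's roughly the shape one of those patches takes, before we get back to the shadow control plane. The workaround itself is a placeholder; ComputeManager.init_host is a real Kilo-era entry point, but using it here is purely for illustration.

```python
# Last-resort monkey patch. House rules: file the bug or blueprint
# upstream first, and remove this within three months.
from nova.compute import manager as compute_manager

def _workaround():
    # Placeholder for the actual one- or two-line fix.
    pass

_orig_init_host = compute_manager.ComputeManager.init_host

def _patched_init_host(self, *args, **kwargs):
    _workaround()  # apply the fix before normal startup
    return _orig_init_host(self, *args, **kwargs)

compute_manager.ComputeManager.init_host = _patched_init_host
```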
Back to the shadow control plane: the only difference from production is scale. Production has a thousand hypervisors; the shadow has about ten.

Here's how we do a dry run. All our controllers are VMs, so it's easy to create, say, ten controller nodes; you just clone them from a template. After we create the cloned environment, the first step is to set up a Kilo environment on the Puppet master, so the Puppet master has the Kilo Puppet code. Then we set up the Kilo parameters in Foreman; Foreman is the ENC for Puppet, and it holds all the parameters and their values. Then we deploy all the Kilo services to the shadow controllers. Now comes the interesting part: we clone the production databases into the shadow controllers, because we want the shadow to have exactly the same data as production, every database. After that, of course, you run db sync to bring the schema up to the latest version. Then we exercise the hypervisor migration process: for the dry run we migrate ten hypervisors from production over to the clone, and then we do the testing. The last step, if everything is fine, is to move those ten hypervisors back.

What's the benefit of using the latest release from upstream? We get the latest features and bug fixes; we no longer have to cherry-pick fixes from master into our Havana branch; and we can give back to the community, because now we're on the latest version. No more cherry-picking, as I just said. One of the reasons we could not contribute before: when we fixed something in Havana or Folsom, that branch no longer existed upstream, because it had been removed, so there was no way to give back.

Here are some examples where we built things that upstream later duplicated. In Folsom we implemented something called compute zones, which is similar to the host aggregates that arrived in Grizzly. In Grizzly we implemented all those aggregate filters, which are in Havana. They're similar, but not exactly the same. In Havana we implemented a lot of extensions that are in Juno. All of those problems go away once we're on the latest branch: if we find a bug or want a feature, we can just do it upstream.

So after the Kilo upgrade, the Liberty upgrade should be very smooth. We can take the stable/liberty branch and branch our extensions out into a Liberty branch; there may be a few patches needed for our own changes to work with Liberty. That might take, say, two weeks instead of two months to get the code ready. And for Liberty we can do a live upgrade, with no downtime on either the control plane or the data plane. So that's my presentation on the upgrade.
If you want more technical details about how we operate the cloud at large scale, I have another talk tomorrow. Any questions? You can speak from there, and I'll repeat your question.

So the question is about the upgrades: from Folsom and Grizzly to Havana, and from Havana to Kilo. The Havana upgrade was a milestone for us, because at least all the data centers ended up on the same version. From Havana to Kilo we skipped Icehouse and Juno because we wanted to get to the latest version; for a jump like that we could not do a live upgrade, and we could not mix versions, because Havana is too old. But after Kilo, everything becomes easier. Kilo has just been released, right? So in three months, Liberty; in three months we can be on Liberty. One AZ is already upgraded to Kilo, and we're working on the others. Once those are done, and actually in parallel, we'll prepare the Liberty upgrade, so we'll go to Liberty early next year.

[On what gets cloned for the shadow:] Almost all the control nodes, including the Puppet master and RabbitMQ, get cloned. The only thing not cloned is the NSX controller; instead, we create a separate transport zone in it for the shadow AZ. With a separate transport zone in the NSX controller you get an essentially isolated environment. It adds a little load on the production controller, but it doesn't interfere with production.

"Hi, can you mention the order in which you upgraded the different services?" The order, okay, yes. As I said, we have a maintenance window. When we clone the databases, we actually shut down all the services. Then people work in parallel: the Keystone folks clone the Keystone database, the Nova folks clone the Nova database, and during that time the shadow control plane is shut down. Everyone runs db sync in parallel, and after everybody is done, we turn the services back on. Yes, please.

Okay, so the question is how we prepare the new configuration parameters for Puppet. As I said, we use Puppet for deployment and Foreman as the database for the parameters. From Havana to Kilo, a lot of parameters are no longer applicable, and Kilo also added new ones. So we have Foreman generate a YAML file of all the parameters for all the services, then we install a DevStack Kilo and diff against its YAML. That tells us which parameters were removed and which are new. That's how it works. (I'll show a quick sketch of this diff shortly.) Yes, please.

Yes, I can show you some of the code. Let me see. Here we override the managers; for the scheduler, for example, we override the manager here.
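To illustrate the parameter-diff answer from a moment ago, here is a minimal sketch: dump the Foreman-managed parameters to YAML, dump the equivalent from a DevStack Kilo install, and diff the keys. The file names are illustrative.

```python
# Compare two flat YAML parameter dumps and report what changed
# between releases.
import yaml  # PyYAML

def diff_params(old_path, new_path):
    with open(old_path) as f:
        old = yaml.safe_load(f)
    with open(new_path) as f:
        new = yaml.safe_load(f)
    removed = sorted(set(old) - set(new))  # parameters gone in the new release
    added = sorted(set(new) - set(old))    # parameters new in the new release
    return removed, added

removed, added = diff_params("havana_params.yaml", "kilo_params.yaml")
print("dropped in Kilo:", removed)
print("new in Kilo:", added)
```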
In that scheduler manager we added some periodic tasks to calculate capacity per VPC. VPC is one of the features we implemented in our cloud; it's not upstream yet. And here's another piece of code showing how we calculate the capacity: we expose an API where you can query the capacity, and it tells you, for each flavor, how many instances you can provision in each VPC. Code like that.

As I said, most of the remaining changes live in the patch directory. This is one of the patches: our hypervisors don't have NBD installed, so we skip that import. That's one of the patches we had to make, but the principle still applies; we'll remove it once we install NBD later.

We have two branches, right? Combining them is actually easy: it's Python source code, so mostly you just copy the source into the same folders and they're merged. It's easy. Yes, please.

Excuse me, the virtual environment? Okay, yes, the packaging. Before we split from eBay, our coworkers at eBay had already tried that: they put the controllers in containers. I think it's a very interesting approach. The thing is, for our production, as of right now, the Python virtualenv is enough. It does exactly the same job; it gives you the same isolation between the different processes. For example, we run Keystone and Neutron on the same controller node, and for some libraries Keystone may need one version and Neutron another. There's no conflict, because every service runs in its own virtual environment. So it serves the same purpose as containers.

Okay, so thanks for coming.