Okay, hello everyone. My name is Wei Yitian and I work for PayPal. Welcome to my presentation today. The topic I'm going to talk about is fixes for OpenStack. Basically, as the operator of an OpenStack cloud, besides running the OpenStack code you have to do something else to make your cloud work for your customers. So what I'm going to focus on today is how to detect resource leaks and misconfigurations, and how to do that housecleaning work to keep your cloud consistent.

The agenda for today: first I'll give you some numbers about our cloud. Second, I'll share some of the scenarios where your cloud gets into an inconsistent state. And then I'll describe our solutions for keeping the cloud consistent. I think there was a panel session this morning about duct tape and all those kinds of hacks for operating your cloud. Actually, we are doing the same thing. The difference is that we try to take all those scripts and tools, whatever they are, and bundle them into, well, kind of a product. We want a tool set that we can use across the board. Here are the slides if you want to download them, so you don't have to take pictures of every page.

Okay, let's get started. About PayPal's cloud: we started our OpenStack cloud in July 2012. Today we have one of the world's largest OpenStack private clouds. We have 82,000 VMs running right now, and it's growing. We have 8,064 hypervisors, the number of racks is 84, we have 386,000 CPU cores, and we have two PB of block storage. One of our AZs is actually one of the largest AZs in any OpenStack cloud: in that one AZ we have 2,500 hypervisors. The PayPal cloud hosts 100% of our production services, and it also powers 100% of the PaaS and dev/QA services. And we put our first production workload on SDN in 2013.

So what makes a cloud inconsistent? There are mostly two categories. One is misconfigurations. I'm not talking about the config files for the services, like nova.conf; those are managed by Puppet, and Puppet makes sure that when you deploy a service its configuration is correct. I'm talking about the configuration you do after you deploy the service. If you bring up a DevStack, you cannot use it immediately, because you need to create some flavors, you need to create some networks and subnets, and you need to upload some images. In our production cloud we have to do a lot of preparation to create all of those resources for our customers. Most of that is done through the OpenStack API, creating flavors and images and so on, but a lot of these things need to stay consistent, and it is all a pretty manual process: the administrator just goes through the CLI to add those resources.

For the misconfigurations: we have implemented a VPC feature. It's close to Amazon's VPC, but it's meant for enterprise cloud use, so it's a little bit different. Unfortunately the VPC feature is not upstream yet, so we have to do a lot of configuration for it. Second are the flavors: mostly you have to set some special properties on the flavors.
That way the scheduler knows how to find the right hypervisors. The same goes for images: some images you can use on one kind of hypervisor but not on others, so you have to set the correct properties there too. The other one is host aggregates; a lot of the metadata on host aggregates is used by the scheduler. Also the default security group: we have customized security rules for each VPC. Networks: we have both an overlay network and a bridged network, so you have to configure the right network and subnet, otherwise your VM will have no connectivity. And also the volumes: we run both production and dev/QA. Production has to be compliant and dev/QA is non-compliant, so the block storage provider is different.

The other category is resource leaking. If everything is good and there are no bugs in the code, it shouldn't happen. But in reality it happens, especially when you run a very large-scale cloud over and over for a long time; you end up with all these resource leaks. Here's the list. First, the orphans: orphaned VMs and orphaned disks, where a VM is deleted but its disk file is still there. Another issue we ran into is with Cinder volumes: sometimes Nova thinks a volume is attached, but Cinder thinks it is not. We also found orphaned ports in Neutron: sometimes the VM is deleted but its port is left over. Also, in production we have DNS entries; when you create a VM we give it a DNS name, an FQDN, and sometimes you end up with one IP that has two DNS entries, or two IPs with the same DNS name. And since we use NSX as our SDN provider, sometimes the data in Neutron is not consistent with NSX. All of these things happen in reality.

There are also inconsistent states caused by RPC timeouts. An RPC timeout is when one service sends a message to another service to do something and waits for the response, but somehow the response gets lost, maybe because of MQ issues, so it never arrives and the call times out. Service A thinks the operation failed, but service B actually did the job; it just never got the response back. We also use multiple cells. Cells try to synchronize the databases: when a status changes, like a VM status, it should sync from the compute cell up to the API cell. Sometimes that doesn't happen.

For the misconfigurations, I already covered some of the details. Mostly it comes down to a manual process: the administrator has to set up a lot of stuff by hand, and as long as we are human, we are bound to make mistakes. To give you one very simple example: say I have a set of SSD hypervisors and I want a flavor that means "SSD", so that if you choose this flavor your instance has to go to the hypervisors with SSD drives. The way you do it is to set an extra spec on the flavor saying ssd is true, and then set the same thing on the host aggregate side: you create a host aggregate with metadata saying ssd is true, and the Nova scheduler tries to match them.
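To make that concrete, here is a minimal sketch of that flavor/aggregate pairing using python-novaclient. The credentials, flavor name, aggregate name, host name and the ssd=true key are illustrative placeholders, not the actual production values from the talk.

```python
from novaclient import client as nova_client

# Credentials and endpoint are placeholders; adjust for your cloud.
nova = nova_client.Client("2", "admin", "secret", "admin",
                          "http://keystone:5000/v2.0")

# 1. Create the flavor and tag it with an extra spec the scheduler can match on.
flavor = nova.flavors.create(name="m1.large.ssd", ram=8192, vcpus=4, disk=80)
flavor.set_keys({"ssd": "true"})                       # extra spec on the flavor

# 2. Create a host aggregate carrying the *same* key/value as metadata,
#    and put the SSD hypervisors into it.
agg = nova.aggregates.create("ssd-hosts", None)        # no availability zone
nova.aggregates.set_metadata(agg, {"ssd": "true"})     # must match exactly
nova.aggregates.add_host(agg, "compute-ssd-001")

# With the AggregateInstanceExtraSpecsFilter enabled in nova-scheduler, instances
# of this flavor only land on hosts in aggregates whose metadata matches the
# extra spec; "true" vs "True" is already a mismatch and nothing will schedule.
```

The point is simply that the extra spec and the aggregate metadata have to match exactly, which is exactly where manual typos bite.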
But since this is a manual process with a few steps, creating the flavor, creating the host aggregate, setting the extra spec, you may have a typo: in the flavor the "true" might be lower case while in the host aggregate you have an uppercase "True". All those things happen. Or you may add a host to the wrong host aggregate. This is a very simple example, but when people make mistakes, it just won't work.

Here is a more complicated scenario. Currently the VPC is managed by the administrator; it's not self-service. That means if a department comes to us and asks "can you create a VPC for us?", we need to configure a lot of stuff. The box in the middle, the VPC itself, is actually not an OpenStack object; it doesn't exist as one. The VPC is really a kind of tag across all the other resources, and we have to hook everything up: we have an admin tenant for the VPC, we need to define the shared networks for the VPC, we need to configure the security group for the VPC, we have two special images for the VPC, and then we need to create host aggregates for the VPC. There is a lot to do. All of that configuration can be done through the OpenStack CLI, but when you are doing 10 or 15 steps, you will make mistakes. So for misconfiguration I just gave two examples; there are others. Whenever an administrator has to do something manually, mistakes happen.

The second part is resource leaking. When you run OpenStack for a very long time, resource leaks happen. They impact your capacity: say a very large disk file, a 30 GB disk file, is left behind on a hypervisor, so you cannot use that space anymore. Sometimes they cause operation failures. We have volume issues, instance issues, port issues, and sometimes you are not able to correct the state through the REST API or the CLI because the state is already messed up. The way you correct those issues is: if you know the database schema behind it, you can go into the database and edit something, and you can run a shell script on the hypervisor to correct the state there, for example to delete the orphaned disk files. Those are not standard OpenStack operations, but you have to do them. It's just like having cron jobs that restart services in case something goes wrong. Yes, we know it's important to find and fix the underlying root cause, but as an operator, as a service provider, you have to fix the issue immediately. Of course the dev team needs to figure out why it actually happens, but that takes time.

Here is a list of some of the resource leaks we have seen. One is orphaned VMs: sometimes you delete a VM but it isn't actually deleted, it's still running on the hypervisor. The Nova table says the instance is deleted, but it's still running. Sometimes the instance is deleted but the disk file is still there. The next one is volumes: sometimes the state in Cinder is inconsistent with Nova, so Nova thinks a volume is attached and Cinder thinks it is not. Sometimes when you try to detach a volume it gets stuck in the detaching state forever. There is a reset-state command for volumes, but it doesn't always work. And Neutron has orphaned ports. That matters especially for our bridged network, because each port is associated with an IP, which means a port leak is an IP leak: eventually you run out of IPs and cannot provision any new VMs.
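As an illustration of the hypervisor-side cleanup mentioned above (finding orphaned disk files), here is a minimal sketch, not the production tool from the talk, that compares the instance directories on one compute node with what Nova believes is running there. The SSH call, the default instances path and the credentials are assumptions for the example, and it only reports candidates rather than deleting anything.

```python
import subprocess
import uuid

from novaclient import client as nova_client

# Hypothetical admin credentials; replace with your own.
nova = nova_client.Client("2", "admin", "secret", "admin",
                          "http://keystone:5000/v2.0")

def find_orphaned_disks(hypervisor, instances_path="/var/lib/nova/instances"):
    """List instance directories on one compute node that Nova no longer knows about."""
    # What Nova currently thinks is running on this host.
    known = {s.id for s in nova.servers.list(
        search_opts={"host": hypervisor, "all_tenants": 1})}

    # What is actually on disk (default libvirt/KVM layout: one directory per instance UUID).
    out = subprocess.run(["ssh", hypervisor, "ls", instances_path],
                         capture_output=True, text=True, check=True).stdout
    on_disk = set()
    for name in out.split():
        try:
            uuid.UUID(name)          # skip _base, locks, snapshots, etc.
            on_disk.add(name)
        except ValueError:
            continue

    return sorted(on_disk - known)   # candidates for review, not blind deletion

for directory in find_orphaned_disks("compute-ssd-001"):
    print("possible orphaned disk directory:", directory)
```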
We also have some issues between Neutron and the NSX controllers. The other one is DNS. DNS is not an OpenStack resource, but in our production cloud it is very important, because all the other services reference a VM through DNS; mostly they use the FQDN instead of the IP. So if your DNS entries are messed up, your application may not work. We have actually been trying Designate, and we are one of the contributors to Designate, but before that we also had an in-house service to do the DNS binding and unbinding. The idea is that whenever you create or delete a VM, Nova sends notifications to the message bus, and we have a service that just listens for those notifications. The DNS binding and unbinding are asynchronous, and sometimes they just don't happen.

I already talked about the inconsistent states caused by RPC. We found more of these issues in the multi-cell environment. When you use cells, a command sometimes has to go through a lot of services via RPC calls, and any of those RPC calls can fail or time out. OpenStack code mostly has no transaction control, which means that if an operation has five steps and steps one, two and three pass but step four fails, some state is left inconsistent. We also found database inconsistencies; for example, some information in the block device mapping was not in sync between the cells.

To resolve those two kinds of issues, we are developing two projects to support our operations. Both are still ongoing, but the pieces are already there: some of them are Python code, some are scripts, some are cron jobs. We want to put all of those tools together into two bigger projects.

First, we have the Cloud Builder, which should resolve the misconfigurations. We want to eliminate the manual steps for setting up OpenStack resources, so we try to automate the whole setup process and avoid the human errors. The way we do that is to declaratively define what we want. Instead of having a script that performs ten steps, we define "this is the flavor we want, this is the host aggregate we want", and the code configures everything for you. It's just like Puppet: if you change the config file on the box, the next time Puppet runs it will overwrite your change, so Puppet always keeps your config file consistent. The Cloud Builder does the same thing. If I define a flavor in my config file, say "this is the flavor for this VPC", and somebody goes to the CLI (you need admin access to do that, but say somebody does) and changes the flavor, the Cloud Builder will figure out that this thing changed and will overwrite it, just like Puppet does.

The other project is the Cloud Sweeper. We try to put all those neat little tools together to fix all those resource leaks, and we are adding more and more tools. Another benefit is that since everything is in one place, we can have a report, a dashboard, to view the results: what we cleaned, I mean what went wrong. That is very handy for an operator.

For the configuration, we try to have a blueprint, which means everything is data-driven. We don't want you to write one-off scripts.
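To illustrate the idea, here is a toy sketch (not the actual Cloud Builder code) of a declarative flavor blueprint and a reconcile step that overwrites drift the way Puppet would; the JSON shape, flavor name, extra specs and credentials are made up for the example.

```python
import json

from novaclient import client as nova_client
from novaclient import exceptions as nova_exc

nova = nova_client.Client("2", "admin", "secret", "admin",
                          "http://keystone:5000/v2.0")

# Desired state; in practice this would be loaded from a reviewed JSON blueprint in git.
DESIRED_FLAVORS = json.loads("""
[
  {"name": "vpc1.highmem", "ram": 65536, "vcpus": 8, "disk": 100,
   "extra_specs": {"ssd": "true"}}
]
""")

def reconcile_flavors(desired):
    """Create missing flavors and overwrite any manual drift, Puppet-style."""
    for entry in desired:
        spec = dict(entry)
        extra = spec.pop("extra_specs", {})
        try:
            flavor = nova.flavors.find(name=spec["name"])
            # Flavors are immutable, so if someone changed one by hand we
            # delete it and recreate it to match the blueprint.
            if (flavor.ram, flavor.vcpus, flavor.disk) != (
                    spec["ram"], spec["vcpus"], spec["disk"]):
                nova.flavors.delete(flavor)
                flavor = nova.flavors.create(**spec)
        except nova_exc.NotFound:
            flavor = nova.flavors.create(**spec)
        # Extra specs (used by the scheduler) are reasserted on every run.
        flavor.set_keys(extra)

reconcile_flavors(DESIRED_FLAVORS)
```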
So we have a set of JSON files which define the resources you need, and those JSON files are checked into GitHub, which means that if you want to make a change to the production cloud, you have to go through the code review process. After the JSON file is changed (and of course validated), it goes through the same pipeline: it gets pushed to production and applied automatically.

Here is an example: we have a JSON file for the VPC metadata. Each VPC can have different properties. You can customize the host name; for my VPC I want my host names to follow a certain pattern. For each VM the NTP server could be different, the DNS domain is different. You can customize a lot of things, even the image properties, for example whether you want to use config drive. So when you want to add a VPC, all you do is add an entry to this config file and it automatically gets pushed to production.

Here is another one: what resources are associated with a VPC? We define some flavors, say some for high memory and some for high IO, meaning they land on the SSD drives. For the images: different VPCs have different cloud-init scripts, so we define special image properties for their images. The other part is the default security group. Each VPC can be different, so you have VPC-based default security rules, which means a user can create their own tenant and the tenant automatically inherits the security groups defined for the VPC.

Another one is very tedious work: once you define all those mappings, flavors and host aggregates, sometimes people add a hypervisor to the wrong host aggregate. In our company we have a configuration management system called CMS. When hardware is onboarded, the serial number, the CPU, the memory, the types and models all go into CMS. So the tools look into CMS to figure out whether a hypervisor should belong to, say, production; we can figure that out from the subnets or some other properties. That way we can automatically add hypervisors to the right group. When the tool cannot decide, it gives you a list and you have to adjust some of them manually.

Then the Cloud Sweeper. As I said, we have all these little tools that try to fix the resource leaks, and we group them together. The way it works is that we have task executors: different tasks all implement the same interface. Then there is a task scheduler: some tasks should run every day, some need to run every hour. And the other part is a report tool; we are building a dashboard to view the scheduling and the task results. Currently we have four cleaners: one for Neutron, one for the volumes, another for the Nova part, and another for DNS. And then there are the drivers. Sometimes you can fix something through the OpenStack clients, I mean the OpenStack Python clients, so you call the API to do the fix. Sometimes you cannot, so we have MySQL drivers: you go directly to the database to change something. And we also have NSX drivers, so you can go directly to NSX to delete a port or fix something there.
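A rough sketch of that task-executor idea (class names and intervals here are illustrative, not the real Cloud Sweeper code): every cleaner implements the same small interface, and a tiny scheduler runs each one on its own interval and collects the results for the report.

```python
import time
from abc import ABC, abstractmethod

class CleanerTask(ABC):
    """Common interface every cleaner (Neutron, Nova, volumes, DNS) implements."""
    interval = 3600   # seconds between runs; each task overrides this

    @abstractmethod
    def run(self):
        """Do one sweep and return a summary dict for the report/dashboard."""

class OrphanedPortCleaner(CleanerTask):
    interval = 3600   # hourly

    def run(self):
        # ... scan Neutron and fix orphaned ports (sketched further below) ...
        return {"task": "neutron-ports", "found": 0, "fixed": 0}

def run_forever(tasks):
    """Tiny scheduler: run each task whenever its interval has elapsed."""
    last_run = {t: 0.0 for t in tasks}
    while True:
        for task in tasks:
            if time.time() - last_run[task] >= task.interval:
                result = task.run()
                last_run[task] = time.time()
                print(result)   # the real tool would feed a report/dashboard instead
        time.sleep(60)

# run_forever([OrphanedPortCleaner()])
```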
And we use Fabric. Fabric is an SSH tool in Python, so the Python code can run commands directly on all the hypervisors: run a shell script, whatever you need to do. And of course we have a unified DNS service, and we have a driver for that. For the storage backend we have a SolidFire driver to fix the volumes if we need to.

Just to give you two examples of how the cleaners work. For the orphaned ports, the process scans the Neutron table. If a port has a device ID, we check it; the device ID is the instance ID in the Nova table, so we look at the status of that instance. If the instance is active, everything is fine. If the instance is deleted or in the deleting state, then we need to do something. The other path is a port with no device ID, which means something failed in the middle, because the device ID was never set. In that case we check the device owner: if the owner is Nova, it means it was an instance creation. If we find an orphaned port, we remove it from the Neutron DB (we just mark it deleted), and then we run the NSX driver to remove the port from NSX. Then we scan the next port. It's a simple tool, but it really works.

The other example is for the volumes. We scan the volumes on a VM and try to find the source of truth, because sometimes one side says the volume is attached and the other side says it is detached. The source of truth is on the hypervisor. The Fabric code goes to the hypervisor and checks whether the iSCSI session exists and is active. If it is active, we check the VM configuration: we call the libvirt API directly to check whether the volume is attached to the VM in the libvirt XML. If everything is actually fine, meaning the volume is on the hypervisor and attached to the VM but the recorded state is wrong, then mostly what you need to do is reboot the instance to get the volume attached properly. If a volume is attached in libvirt but there is no active iSCSI session on the hypervisor, that volume really has a problem. If the volume is not connected, we delete the iSCSI session on the hypervisor to release the volume, and then we set the state to detached. Some of those state changes you cannot make through the API, so we go directly to the database and mark the volume as deleted. The thing is, we have to fix different DBs: we fix the Nova DB if there is a mismatch there, and the Cinder DB if there is a mismatch there. And in some cases we need to reboot the instance so that it actually reconnects to its volumes.
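Here is a minimal sketch of that orphaned-port check using the standard Neutron and Nova Python clients. For simplicity it deletes through the Neutron API instead of marking the database and calling the NSX driver as described above, the created_at check relies on Neutron's timestamp extension, and the 30-minute grace period matches the threshold discussed in the Q&A that follows; credentials and names are placeholders.

```python
from datetime import datetime, timedelta, timezone

from neutronclient.v2_0 import client as neutron_client
from novaclient import client as nova_client
from novaclient import exceptions as nova_exc

# Credentials and endpoints are placeholders.
neutron = neutron_client.Client(username="admin", password="secret",
                                tenant_name="admin",
                                auth_url="http://keystone:5000/v2.0")
nova = nova_client.Client("2", "admin", "secret", "admin",
                          "http://keystone:5000/v2.0")

GRACE = timedelta(minutes=30)   # ports with no device_id older than this are suspect

def sweep_ports(dry_run=True):
    now = datetime.now(timezone.utc)
    for port in neutron.list_ports()["ports"]:
        # Only look at instance ports, not router/DHCP/etc. ports.
        if not port.get("device_owner", "").startswith("compute:"):
            continue
        orphaned = False
        if port["device_id"]:
            # device_id is the instance UUID: orphan if that instance is gone
            # or stuck in the deleting state.
            try:
                server = nova.servers.get(port["device_id"])
                orphaned = getattr(server, "OS-EXT-STS:task_state", None) == "deleting"
            except nova_exc.NotFound:
                orphaned = True
        elif "created_at" in port:
            # No device_id yet: only treat it as orphaned once it is old enough
            # that the boot it belonged to must already have failed.
            created = datetime.fromisoformat(port["created_at"].replace("Z", "+00:00"))
            if created.tzinfo is None:
                created = created.replace(tzinfo=timezone.utc)
            orphaned = now - created > GRACE
        if orphaned:
            print("orphaned port:", port["id"])
            if not dry_run:
                neutron.delete_port(port["id"])
```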
So that's the end. Thanks for coming. Questions? Yes, please.

You're using one kind of configuration management software, and you created new configuration management software to configure OpenStack, instead of creating a module for the existing configuration software. Sorry, I didn't catch that. Okay: instead of creating a module for your existing configuration software, you're using Puppet, Chef, I don't know, why did you create new software that basically reinvents it?

So it's different. As I said, a Puppet module deploys, say, my Nova service and configures, say, my nova.conf, right? It tries to keep those things consistent so that my Nova is running. But what we are doing here is about the resources inside Nova, say a flavor.

Why do you not manage flavors with Puppet? Adding one more module is not a big deal; it's easier than creating new software. Well, with Puppet you can write your own classes, right? But all the operations we do here are done in Python code. Puppet could drive this, but it can also be driven by a cron job.

How do you handle this case: you see a port that has no owner, so you want to delete it, but how do you know it's not in the middle of a normal VM creation? That's a very good question. In the first version of our Neutron cleanup script, when somebody created a VM, the script got triggered and deleted their ports before they finished booting. So now we check the timestamp. And how do you avoid it? We also thought about writing a cleaner, but we thought we could not avoid this situation, so we didn't do it. I want to know how you avoid it. We know how long it takes to create an instance, and especially how long it takes to create a port, so we set a threshold longer than that. If a port has had no device ID for 30 minutes, there must be something wrong; your instance creation is going to fail, or maybe it has already failed. Okay, so this depends on all the operations being done by your tasks, because if there is another client that, say, creates a VM or creates a port, then you don't know. Yeah, but we have more control in our own cloud because we are a private cloud; we manage what the customer can and cannot do. It's not like a public cloud. Okay, thank you.

Given what you've shown, say you have a scale of 1,000 networks. How often do you recommend running this, as a cron job or whatever, and how long does it take to do one full sweep? Currently I think we run it about every hour. And how long does it take? It depends on how large your network is. For us, most VMs have one port, which means that if you have 10,000 VMs you have 10,000 ports to scan, so it's not a big deal. Okay. And when you're doing all this cleanup, are you logging it properly, so that your log actually says the Cloud Sweeper did this work rather than somebody else? We are working on that. Because that's pretty critical, right? You need that audit trail. Yes. The point is, if you use the CLI you have the user ID, right? Right. But if we go to the database... Exactly. That's why we need to log our own actions. Exactly. So that's work in progress, right? Work in progress.

And this piece of work, are you upstreaming it? It's not upstream. Do you plan to upstream it? That's a hard question. Some of the tools may benefit other people and some may not. I think the biggest issue is time constraints, since we are a very small team. Why not put it on GitHub and let others... We should, yeah. I think we can contribute part of those tools upstream. Okay. Thank you. I guess this is a follow-up question.
Is there a source available for this? The source is not available right now because we are still doing the work. As I said, some of it is in Python, some of it is in shell scripts, and some of it is still in cron jobs. Yeah. I mean, that's okay. There was a talk earlier today about duct tape and bubble gum. Exactly, but operators have to do things like this. And there's an organization on GitHub where operators... Yeah. That's part of what we want to do: we try to give back some of our work to that group. Thanks. Any more questions? Okay, thank you for coming.