Hello, everyone. Good afternoon. My name is Kalan Nikolov. I'm a cloud engineer at PayPal. The topic of this presentation is compute waste management for operators, or basically how to nicely reclaim what is yours in a private cloud environment. The agenda for today: an introduction; how to identify unused resources, or VMs; how to deal with those unused resources once they're identified; what strategies we have to eliminate compute waste on the private cloud; and what tools we use at eBay and PayPal. Then I'll talk about Cloud Minion, the tool we use at eBay and PayPal for capacity management on the private cloud, and finish with a summary.

For the introduction, I guess the reason most of you are here is that you have the same problem we used to have at eBay and PayPal: compute waste, and how to deal with it in the private cloud. On the private cloud, it's not that easy to identify and delete unused resources as it is in production. You've probably heard a lot of talks during the summit about how to create and deploy things; this presentation is about how to delete things. And as you know, deleting things is easy and quick, so this presentation shouldn't go long. The reason we started looking at compute waste in the private cloud is that a couple of years ago, we deployed a self-service cloud for the dev/QA environment. Soon after that, we started running into capacity issues. Users started deploying VMs, often just to try out the self-service cloud. So we started looking at solutions for how to identify those VMs and what to do with them. Compute waste can be a very serious problem for many organizations. It can cost millions of dollars, so it has to be dealt with. Usually, there are two challenges in capacity reclamation.
The first one is to identify the unused resources in the private cloud, and the second challenge is what to do with those unused resources. We're going to talk about how we at eBay and PayPal reduced that compute waste. To give you an idea of the scale of the OpenStack deployments at eBay and PayPal: PayPal is running 100 percent of its web and mid tier on OpenStack. Most of the dev/QA environments in both PayPal and eBay are on OpenStack. We have 8,500 hypervisors; actually, the number is growing every day, but that's the number for now. The number of virtual machines is 70,000 plus. We also have several thousand users in these clouds, spread across 10 availability zones.

So what causes compute waste? When we started dealing with it, we found that we had a lot of users who wanted to try the self-service cloud: basically, they log in, they click Create VM, they see that it's a cool thing, and then most of the time they forget about the VM and just leave it running. We also have VMs that were left by ex-employees; those were also forgotten and never deleted. We have VMs created by admins, simply from testing the service after failures or after a new deployment. We have temporary resources, VMs that were created by users just to test some tool, and then again they forgot they had created those VMs. Over-planning is when, for example, VMs are created by service accounts for a specific project that requested more VMs than it ended up using. The last is VMs with errors: when users try to create VMs, sometimes they run into issues and create VMs in an error state, and those VMs need to be cleaned up as well; sometimes on creation, and sometimes on deletion, the VMs error out. So how do you identify the unused resources on the private cloud? There are basically two ways. The first way is to install agents on the VMs themselves.
Those can be third-party tools or self-written tools that go on the VMs. However, sometimes there are issues installing agents on the private cloud; sometimes we cannot control what goes on those VMs. The other way is to have tools collecting metrics on the hypervisor itself. There we can use Ceilometer, for example, or we can build our own tools to collect and process the data. So on the hypervisor, what can be looked at, what metrics can be collected for identifying those unused resources? When we started looking at the metrics, we found that there is no ideal metric that can tell you, okay, this VM is unused. Every one of these subsystems has some kind of issue. For example, the CPU can be affected by either external or internal factors, such as the CPU clock on the guest or the hypervisor, and other activity on the hypervisor can skew the CPU numbers as well. Sometimes when the VM is actually idle, you can see high activity from the hypervisor's perspective; that's a mismatch. Memory also cannot be trusted. KVM, for example, basically keeps the memory once it's allocated, so if the VM was used at one point, you cannot really tell whether the VM is consuming memory now or not. With disk I/O there are also issues. You can find the disk I/O of a process, but again, that disk I/O might be affected by external factors. The network is susceptible to network noise: mostly DHCP, NTP, LDAP, and external pings. It depends on the environment; sometimes we might have more or less noise. In the end, however, we decided to use network traffic to identify the unused resources. The reason is that we can filter out some of the noise using statistical measurements, and we can also look at egress and ingress traffic independently to make an additional determination of whether the VM is used or not.
For example, if the egress traffic is very high but the ingress is zero, that means there's a problem with the VM; for example, the VM is trying to request a DHCP lease but isn't getting anything back.

How do you deal with unused resources? There are basically two ways. The first one is a chargeback or showback method: basically making the departments responsible for their usage, or just showing them the cost of their usage. However, that doesn't work very well; in a large organization there are always fights over whether that's the correct usage, and things like that. So we decided to use another way, which is smart reclamation, or basically asking nicely about the usage of the VM. What we do is identify a VM that is unused, then notify the user, and also his or her manager, that the VM is unused. If the user doesn't take any action, we delete that VM. I'll go over the whole flow a bit later. We developed our own tool for identifying and managing those unused VMs. The tool we developed is Cloud Minion, which was actually covered by Trini. We use that tool for identifying unused VMs on the self-service cloud. The tool started as a POC, just to test things out; it was a very quick-and-dirty script, but then it grew into a more complicated tool. We started with a single availability zone, and then we added multiple availability zones. The resources reclaimed by the tool exceed $3 million so far. Here are some reclamation statistics from the tool. When we first activated it, the tool identified 42% of all VMs in the private cloud as unused; those were VMs accumulated over a period of two years. Then we started notifying the users that their VMs had been identified as unused. The users decided to keep about 25% of that pool of 42%, and the remaining 75% were automatically deleted by the tool.
Another statistic: when we give users the choice to extend the life of their VMs, most of them select either never expire or one year, and maybe 30% to 40% of them decide to keep the VM temporarily, for one month or three months.

The reclamation flow: actually, Trini had a better diagram, but I'll try to explain it from the slide. When we identify a VM as unused, we mark it as unused and set an expiration date, which is 14 days from the time of detection. That means in 14 days, we're going to shut down the VM. When we mark the VM as unused, we send an email to the user that says, okay, this VM has been identified as unused, and the user has a choice to either extend its life to keep the VM, or delete the VM if he wants. If no action is taken by the user, which is what happens most of the time, the VM will be shut down in 14 days. Then we keep the VM for another seven days in the shutdown state, and after that we delete it. So basically it's a 21-day total process, and the user can always... Sorry; yeah, we can talk after the session. So we send an email to the user, and we also send reminders: two days prior to the shutdown and two days prior to the deletion, and we usually CC the manager on those reminders, in case the user is on vacation, or just for visibility as well. If the user takes no action, we delete the VM; if the user decides to keep the VM, he can extend its life, and he can keep extending it if he wants. As I mentioned before, there's no ideal metric, so we basically decided to ask the users to make the final determination on whether to keep their VMs. As for the user feedback we received: at the beginning we expected a lot of users to complain, but surprisingly we received very few complaints.
Actually, they were not real complaints; mostly, users hadn't read their emails, or some of them hated receiving notifications and wanted to block them, or they were just asking why their VM was identified as unused. The reason the users didn't complain is that we gave them the choice to make the final decision on whether to keep the VM or not, and that's basically what helped in the whole process.

So, Cloud Minion. Cloud Minion started as a POC and basically became a set of tools: it identifies unused VMs, it can set expiration dates for the VMs, it can delete or shut down the VMs, it can send reminders and notifications to the users, and it can generate reports on how many VMs have been identified as unused, and many other reports. It also provides a UI where the users can manage the expiration dates of their VMs. They can also delete VMs from that UI, which we think helped a lot as well, because the user goes to that UI and sees all of his or her VMs, VMs the user has usually forgotten about, and has the choice to delete the ones they don't need. The Cloud Minion components: on the client side, on the hypervisor, we have several tools that are used mostly for identifying whether a VM is used or not. There's CMSA, the Cloud Minion system activity tool, which collects data from different resources, for example network, CPU, et cetera, and, much like sysstat's sa, writes the data to a file. Another tool is CMSAR, which reads that data and determines whether the VM is unused based on predefined rules. I'll go over the rules later.
We also use CMSAR to generate reports about the network usage of each VM; we send that data to another database where we can process it and see all the VMs, and, for example, filter out all the VMs that have very high network utilization. On the server side, the Cloud Minion server side, it started as a temporary API, the Cloud Minion API service. It was done very quickly, using CGI/FastCGI, with the intention of replacing it with something more robust. The Cloud Minion Manager is the main part that processes all the data; basically what it does is sync the data from the OpenStack database to the Cloud Minion database, set expiration dates, send notifications, and shut down and delete the VMs. Then there's the VM expiration management tool, which is where users can see all their VMs, manage them, change the expiration dates of the VMs, or just delete the VMs from there.

Here's the block diagram. As I mentioned, on the hypervisor we have the CM agent, which talks to the API service and sends it information about all the VMs, whether they're used or not, and then the API writes to the Cloud Minion database. The Cloud Minion Manager syncs data from the OpenStack database to the Cloud Minion database. There are a bunch of other smaller tools, for example the emailer that sends the notifications to the user. It's not on the diagram, but we also use an LDAP query; the manager basically gets the real email address from LDAP using the username. And the VM expiration manager also talks to the API, and it can also be used to shut down and delete VMs. Here are some of the rules we use for identifying unused resources. As I mentioned, we primarily use network traffic for identifying unused resources, and we use network traffic in all the rules. However, in the rules we can easily add other subsystems, like CPU or disk I/O.
The rules we currently use: the network traffic stays below X megabytes for 14 consecutive days (the exact value depends on the environment and its noise; we usually set it to four megabytes, or sometimes even less); or the standard deviation of the network traffic is less than X (usually around 100 kilobytes); or the ingress traffic is zero bytes. If any of these holds, the VM is marked as unused. Once again, we arrived at these numbers by trial; it depends on the environment, some environments are noisier than others, and these numbers can easily be changed.

As for Cloud Minion's integration with other tools: for identifying unused resources we have our own tools, but we can easily integrate with other tools for extracting metrics, for example Ceilometer. We had some issues with Ceilometer in the past, which is why we stopped using it; we're trying it again, but we're not ready yet to switch to collecting data from Ceilometer. Basically anything that can collect data and generate reports, even sysstat's sar, can be used. As far as integration with OpenStack: when we started this project, we didn't plan for any integration, and currently there's no integration with OpenStack. There are plans to integrate with the OpenStack dashboard, which would be really helpful so users can use one dashboard for everything and don't have to use the separate UI, which was supposed to be temporary but is still being used. And also integration with OpenStack Ceilometer. On that entire area, I'd like to call for help if the community is interested in the tool. We open-sourced the tool, and if the community is interested, we'll be very happy to work with the community on rewriting it, because it has been patched multiple times since we started writing it and it's not in very good shape. It works now, and that's what we use currently.
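Before moving on, the detection rules just described could be sketched roughly as follows, assuming one total-traffic sample per VM per day, in megabytes, plus per-day ingress byte counts. The function name and thresholds here are illustrative, not the actual CMSAR code:

```python
from statistics import pstdev

# Illustrative thresholds; the talk notes these are tuned per environment.
TRAFFIC_THRESHOLD_MB = 4          # "below X megabytes", often 4 MB or less
STDDEV_THRESHOLD_BYTES = 100_000  # roughly 100 KB standard deviation
WINDOW_DAYS = 14                  # 14 consecutive daily samples

def is_unused(daily_total_mb, daily_ingress_bytes):
    """Mark a VM unused if ANY rule fires over the last 14 daily samples."""
    if len(daily_total_mb) < WINDOW_DAYS:
        return False  # not enough history yet
    window = daily_total_mb[-WINDOW_DAYS:]
    # Rule 1: traffic stays below the threshold for 14 consecutive days.
    if all(mb < TRAFFIC_THRESHOLD_MB for mb in window):
        return True
    # Rule 2: traffic is nearly flat (std dev below ~100 KB), which
    # filters out constant background noise like NTP or DHCP chatter.
    if pstdev(mb * 1_000_000 for mb in window) < STDDEV_THRESHOLD_BYTES:
        return True
    # Rule 3: zero ingress traffic over the window, e.g. a VM
    # requesting DHCP and getting nothing back.
    if sum(daily_ingress_bytes[-WINDOW_DAYS:]) == 0:
        return True
    return False
```

A VM with genuinely bursty, varying traffic fails all three rules and stays untouched; the final keep-or-delete decision still goes to the user, as described above.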
It's been patched up to work, but it really wants to be rewritten. There's also a need for integration with OpenStack, especially with the dashboard. In summary, I can say that capacity reclamation in a self-service cloud can be challenging, but it can also be rewarding once you start dealing with your unused resources; as I mentioned, we saved millions of dollars with it. The smart reclamation approach we adopted has proven effective for us, and the Cloud Minion tool helped us reduce the unused resources we had accumulated, and it continues to identify unused resources and clean them up. It's available on public GitHub if you guys are interested; we'll be happy to work with you on redesigning and rewriting it. And that's pretty much all for the presentation; I said it would be short. If you have any questions, you can use the microphone.

When you're sampling the traffic to do the statistical analysis to mark machines for deletion, do you do that at random intervals, or at different times every day, to detect, say, a bursty workload?

Currently, we take a snapshot of the network traffic once a day. However, we can take snapshots at smaller intervals; we're planning to switch to one-hour instead of 24-hour snapshots. But once again, you can take snapshots at different intervals and then adjust the numbers in the rules however you want, so that's pretty much up to you. We found that once a day works fine for us.

I was wondering if you have some other tools to manage reclaiming floating IPs?

No; actually, once you delete the VM, the deletion cleans up the floating IP as well. We have automated the whole process: once you delete the VM, it's also pulled from the LB pools, along with the volume the VM is using. There are also additional features we'd like to implement for things that don't have that integration: cleaning up other resources that are not removed by the delete.
We're planning to add features that can automatically clean up other resources as well, but we don't have that right now.

I'm just wondering about the deprecation period choice. I can understand that you'd want to turn off a VM and see if anyone screams. But why a total of 21 days? Wouldn't it be more useful to do something like 30 days, or a quarter, something that corresponds to a business cycle?

It's up to you. For us, we started with 14 days: within 14 days we shut down the VM, and then we wait another seven days. At some point, we decided to be more aggressive, so we even shortened that period. It's up to the organization to decide how long you want to keep that VM. Usually, we have 14 days of inactivity before the VM is detected as unused, and then the 14-day grace period after that.

But you might have a VM that's only used for a batch job that runs once a month, or once a quarter.

Exactly. That's why the final determination is made by the user; that's why we ask the user. We try to do as much as we can to identify unused resources, but there's no ideal way to tell whether a VM is in use. A user might use it only once a month, and we'd flag it. That's why we leave the final decision to the user.

Okay. The other question: isn't there potential to do reclamation on things like the object store? And if so, have you done some what-if analysis to figure out what percentage of your objects may be unused?

We were planning to, but we didn't have time to do that. Currently, no. Currently the tool is still kind of a POC; it just handles VMs, and most of what we have is VMs. But there are plans to add more features, and once again, if the community is interested, we'll be happy to work on it. Okay. Thanks.
I guess just a follow-up on that: do we have all the information we need in OpenStack to track whether a volume has been mapped or used in the last three months? Same for images, because users love to create lots of images that just take up space in Glance and never get used again.

We haven't looked at that, as I mentioned, but yes, it's possible to add that feature to the tool to cover volumes as well. If there are no other questions, thank you for the presentation.
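To recap the 21-day reclamation flow described in the talk, here is a minimal sketch of the expiration state machine, assuming one daily pass over tracked VMs. The class, field, and method names are illustrative, not the real Cloud Minion implementation:

```python
from datetime import date, timedelta

GRACE_DAYS = 14   # from detection to shutdown
HOLD_DAYS = 7     # from shutdown to deletion, 21 days in total

class TrackedVM:
    def __init__(self, name, owner):
        self.name = name
        self.owner = owner
        self.state = "active"      # active -> unused -> shutdown -> deleted
        self.expiration = None

    def mark_unused(self, today):
        """Detection: set an expiration date 14 days out and notify the owner."""
        self.state = "unused"
        self.expiration = today + timedelta(days=GRACE_DAYS)
        self.notify("identified as unused; shutdown on %s" % self.expiration)

    def extend(self, today, days):
        """The user chose to keep the VM; extensions can be repeated forever."""
        self.state = "active"
        self.expiration = today + timedelta(days=days)

    def step(self, today):
        """Daily pass: shut down at expiration, then delete 7 days later."""
        if self.state == "unused" and today >= self.expiration:
            self.state = "shutdown"
            self.expiration = today + timedelta(days=HOLD_DAYS)
            self.notify("shut down; deletion on %s" % self.expiration)
        elif self.state == "shutdown" and today >= self.expiration:
            # This is also where LB pool members, floating IPs, and
            # volumes would be cleaned up, per the Q&A.
            self.state = "deleted"

    def notify(self, message):
        # Stand-in for the emailer, which also CCs the manager on reminders.
        print("to %s: VM %s %s" % (self.owner, self.name, message))
```

With no user action, a VM moves active, unused, shutdown, deleted over the 21-day window; calling extend at any point before deletion returns it to active, matching the "ask nicely" flow from the talk.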