Hello everyone. Today we'll talk about how we tackled some operational and capacity utilization challenges at Workday. For those of you who haven't heard of Workday, it's a software-as-a-service company that offers a single solution for finance, human resources, planning, and analysis. Our presenters today are Bogdan Katinski, Silvano Wubak, and me, Imtias Choudhury. Here is our agenda for today. First, I'll give some background on the capacity management and operational challenges that come with growth. Then Silvano and Bogdan will discuss some of the technical solutions that we came up with to address these challenges.

Now let's get to the story. OpenStack continues to grow within Workday. Over the last few years, we've seen more and more applications running on the OpenStack platform. As of now, we have approximately 39,000 VMs running on it, in over 45 clusters spread across five different data centers in the United States and in Europe. Just as great power comes with great responsibility, bigger clouds come with tougher challenges.

One of our challenges is to come up with an accurate capacity usage and growth forecast. We need accurate capacity information from all of our clusters, not only at the cluster level but at the level of each hypervisor. This is important for two reasons. First, to be able to accurately predict whether, in our weekly maintenance window, we can support any potential changes to the number of VMs we are launching. Second, we need a reliable growth forecast so that the platform can support the growing capacity needs of applications. Accurate forecasting is important because ordering hardware takes time, and even more so during a pandemic. Bogdan will explain how we address this.

Another challenge we face is optimizing resource utilization. We run different types of workloads on our OpenStack clusters: some are memory intensive, others are CPU intensive. With such different types of workloads, we noticed our hypervisors were underutilized. We needed an optimization scheme which maximizes capacity usage while reducing the blast radius in case of a hardware or hypervisor failure. Silvano will elaborate more on this.

We also need to improve operational efficiency by reducing the time it takes to apply patches to a server. Servers often require periodic firmware and software updates, and these updates often require servers to be rebooted. Since rebooting a server affects any running workload, we wait for a long maintenance window, usually quarterly, for such operations. Trying to reboot thousands of servers in a short maintenance window is non-trivial: it's risky, and it slows down the rollout of important firmware updates. But what if we could tweak the scheduler so that VMs avoid hypervisors that require maintenance, allowing us to apply changes to smaller batches of hypervisors more frequently? Silvano will elaborate more on that as well. With that, I hand it over to Silvano.

Before I start to talk about the changes in the scheduler, I want to talk about two metrics related to VM allocation. One important requirement for our VM placement algorithm is to maximize the high availability of our applications. In short, this means avoiding deploying all instances of a project on the same host, so that if the host dies, the impact is minimized. We measure this as blast radius, and I want to show how we calculate it at Workday. First, let me show how to calculate it for a single host.
For example, for a given host X, if that host dies, project A loses 20% of its capacity, because project A has 1 of its 5 VMs on host X. That is the blast radius of host X for project A. The blast radius of the whole of host X is the maximum blast radius over all projects running on that host; in this case, it is 0.4, or 40%. At Workday, we don't calculate the blast radius for projects with 4 or fewer VMs in the cluster, which is why project D doesn't have a blast radius. To calculate the blast radius of the whole cluster, we use the average and the maximum of the blast radii of all individual hosts. Note that a lower blast radius is better, since the metric measures the drop in application availability.

Fragmentation is another metric that we aim to reduce. Fragmentation can be measured for any resource, but at Workday our main concern is memory. Memory fragmentation measures how the free space in the cluster is spread over the hosts. For simplicity, and to make the metric meaningful to a broader audience at Workday, we measure fragmentation as the space available to deploy a VM that requires 240 gigabytes of memory. When comparing different VM placement strategies, the one with less fragmentation is the one that leaves more room to deploy those big VMs. Note that in this diagram, in deployment case 2, although the hosts have the same capacity as in case 1, we can't deploy the second VM of project B because the available space is split between two hosts. Also, the blast radius of both hosts is 1, or 100%. This example emphasizes how the order in which we launch VMs can change these metrics for the worse or for the better. To give some perspective on how important fragmentation is: in the past at Workday, we had a cluster with around 300 compute nodes and an average memory utilization of around 80%, and we couldn't deploy a VM with 240 gigabytes of memory because all the available space was fragmented across the hosts. Note that 20% free capacity in a cluster of this size is around 30 terabytes of memory, yet there was no single host with 240 gigabytes free.
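To make the two metrics concrete, here is a small Python sketch of how they could be computed. This is an illustration, not our production code: the data layout, the helper names, and the counts for project C are assumptions made to mirror the slide example.

```python
from collections import Counter

MIN_PROJECT_VMS = 5   # projects with fewer VMs than this in the cluster are skipped
BIG_VM_GB = 240       # reference flavor size used to measure fragmentation

def host_blast_radius(host_vms, cluster_vms):
    """Blast radius of one host: the worst-case capacity drop among the
    projects running on it. Inputs are Counters mapping
    project -> number of VMs (on this host / in the whole cluster)."""
    radii = [host_vms[p] / cluster_vms[p]
             for p in host_vms
             if cluster_vms[p] >= MIN_PROJECT_VMS]
    return max(radii, default=0.0)

def cluster_blast_radius(per_host_vms, cluster_vms):
    """Average and maximum blast radius over all hosts in the cluster."""
    radii = [host_blast_radius(h, cluster_vms) for h in per_host_vms]
    return sum(radii) / len(radii), max(radii)

def big_vm_slots(free_gb_per_host):
    """Fragmentation proxy: how many 240 GB VMs still fit. More slots
    for the same total free memory means less fragmentation."""
    return sum(free // BIG_VM_GB for free in free_gb_per_host)

# Mirrors the slide: host X runs 1 of project A's 5 VMs (20%) and, say,
# 2 of a project C's 5 VMs (40%), so its blast radius is 0.4. Project D
# has fewer than 5 VMs in the cluster, so it is ignored.
cluster = Counter({"A": 5, "B": 5, "C": 5, "D": 2})
host_x = Counter({"A": 1, "C": 2, "D": 1})
print(host_blast_radius(host_x, cluster))   # 0.4
print(big_vm_slots([200, 100]))             # 0: 300 GB free in total, but no slot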
Out of the box, OpenStack already provides a mechanism to deploy different flavors while balancing fragmentation and blast radius: host aggregates. At Workday, we used the approach of one host aggregate per flavor, except that flavors with memory of 64 gigabytes or less all belong to the same aggregate. We developed a tool to adjust the host aggregates dynamically, based on the number of VMs per flavor. Using host aggregates, it is easy to achieve good packing. That is a big pro, but host aggregates have limits. For example, a flavor used by only one project that doesn't require much capacity could end up with all its VMs on one or a few hosts, increasing the blast radius. Another problem arises when the capacity of a host aggregate shrinks or projects change flavors. Look at this example. There are two flavors and two aggregates, one for each flavor. VMs are the orange boxes and hosts the blue boxes; the letters are project names. The VMs of projects A and B are in the same aggregate because they use the same flavor. But if project A decides to change to flavor 2, let's see what happens. Since the capacity needed for flavor 1 decreased and the capacity needed for flavor 2 increased, we need to move a few hosts from one aggregate to the other. But moving a host doesn't change the VMs running on it.

In our example, project A was redeployed as part of its capacity change, but project B had no reason to redeploy, so its VMs ended up sitting inside the aggregate for flavor 2. Host aggregates work well if you only ever grow capacity. If your capacity changes frequently, they are hard to maintain, requiring a lot of operational work and a lot of migrations, meaning we need to keep moving VMs from one host to another. But if we plan to change how VMs map to hosts, how can we ensure there are no capacity issues? How do we ensure we are not increasing fragmentation and blast radius? We developed a simulation tool that mimics what the scheduler does, which makes it easy to load the data from a cluster and simulate different placement strategies. The simulation tool was developed in JavaScript to be more interactive: it's just a static page with JavaScript, it doesn't require any back end, and filtering and weighing are simple to write in JavaScript. A Python script extracts the cluster information in JSON format. The tool receives as input a cluster definition, that is, the list of all hosts, all flavors and all VMs running on the cluster, plus the scheduler multipliers. The x-axis shows the hosts, three in this example. The y-axis shows how much of the resource is in use without overcommit, so it can go over 100%. The bars show the amount of memory consumed by the VMs on each host, and the line shows the amount of CPU. The bars have different colors depending on the flavor. In the side panel, we can simulate creating and destroying VMs. Any create is executed using the new multipliers. You can simulate deploying one or a few VMs, but you can also redeploy the entire cluster with the new parameters.

After many tests and simulations using our simulation tool, we found a way to balance blast radius and fragmentation at Workday, with no more host aggregate management and no more VM migrations. Note that one important goal of this implementation is to have an algorithm that is simple enough to understand why a host was chosen. In our proposal, all suitable hosts are available for each deployment, different from host aggregates, where only a limited set of hosts is available. This maximizes the capacity available and reduces the blast radius. To achieve this, we use a feature of the scheduler called weighers, where hosts can be preferred but not restricted. Because we want to minimize fragmentation, Workday uses the memory weigher with a multiplier of minus one to stack the VMs. Note that this alone increases the blast radius, because the scheduler will try to stack all VMs on the same host. To reduce the blast radius, we combine the memory weigher with our own implementation of a soft anti-affinity weigher. Different from what OpenStack offers out of the box, our weigher is configurable to accept up to four VMs of the same project on a host before starting to spread. Also, it doesn't require a server group: we use the project ID to decide whether two VMs should be spread apart. This weigher has a higher priority and a multiplier of minus two; what we developed is actually soft affinity, and the negative multiplier inverts the order. I will explain why we chose this scale of multipliers later.
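To give a concrete picture of what such a plugin can look like, here is a minimal hypothetical sketch of a project-based soft anti-affinity weigher written against upstream Nova's weigher interface. It is not our production code: the class and attribute names follow upstream Nova conventions (BaseHostWeigher, host_state.instances, spec_obj.project_id), but exact signatures vary between Nova releases, and the threshold handling shown is an assumption.

```python
# Hypothetical sketch, not Workday's actual plugin.
from nova.scheduler import weights

MAX_STACKED_VMS = 4  # tolerate up to 4 VMs of a project per host


class ProjectSoftAntiAffinityWeigher(weights.BaseHostWeigher):

    def weight_multiplier(self, host_state):
        # Negative multiplier: the raw weight below grows with stacking
        # (soft affinity), so negating it turns it into anti-affinity.
        return -2.0

    def _weigh_object(self, host_state, spec_obj):
        # Count VMs of the requesting project already on this host;
        # no server group is needed, only the project ID.
        same_project = sum(
            1 for inst in host_state.instances.values()
            if inst.project_id == spec_obj.project_id)
        # Neutral until the host already holds MAX_STACKED_VMS of the
        # project; beyond that, more stacking means a bigger penalty.
        return float(max(0, same_project - MAX_STACKED_VMS + 1))
```

A host with three or fewer VMs of the project weighs the same as an empty one, so the memory weigher is free to keep stacking; only the fifth VM of a project starts getting pushed away.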
Here are the results after changing the scheduler. Note the big improvement in capacity due to the reduced fragmentation. For most clusters, the average blast radius increased a little while the max blast radius decreased. And because the solution stacks VMs, the number of hosts needed is lower than before. Just to clarify, the capacity figures for the 240 GB VM are measured after the whole deployment has been executed.

Periodically, some patches need to be applied to the hosts, like security patches. As the number of compute nodes increases, this is becoming harder, because our maintenance window has not grown. But this procedure can be fully automated if the scheduler is also aware of maintenance. Because this is not maintenance of one specific host, the decision is based on host uptime: hosts that have been running longer get lower priority for deployment, so after some time they become empty. Weekly, all empty hosts are patched and restarted. After the restart, the uptime is zero and the host gets higher priority for deployment, and the next group of hosts with longer uptime drains and becomes empty. The uptime information is cached in the scheduler to avoid performance issues. The impact of the uptime weigher on blast radius and fragmentation is negligible, because the uptime weight can only assume a few discrete values. For maintenance of a specific host, we still use the regular procedure of disabling the host. But this solution gives us the maximum number of empty hosts to patch without compromising capacity.

Our first idea for weighting hosts based on uptime was to add, for example, a weight of plus 1 for every 60 days of uptime, so that there is no difference between a host up for 10 days and one up for 50 days. But after we analyzed the distribution of uptime across all compute nodes in all clusters at Workday, we realized a more sophisticated formula was necessary to ensure the uptime weigher assumes no more than three distinct values. Those three values divide the hosts into three groups based on uptime, minimizing the impact on blast radius and fragmentation. On the next slide, I will explain that better.

This slide explains why we chose the multipliers 1, 2 and 3 for the weighers. I'm oversimplifying here, but hopefully this gives everybody an idea of how to make the choice. First, note that all weighers have their values normalized between 0 and 1. In these diagrams, focus on the hosts with uptime weight 0. Because the uptime weight is multiplied by 3, the winning host is in this group. But within the group with uptime weight 0, uptime is no longer relevant, since all hosts there have the same uptime weight. The same idea applies to anti-affinity: a few hosts have more than four VMs of the same project running, and the rest do not. So the memory multiplier makes the final decision within the last subgroup.
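To make the multiplier choice concrete, here is a small standalone Python sketch (not Nova code) of how the three normalized weighers could combine. The bucket boundaries and the example numbers are illustrative assumptions, not our real configuration.

```python
# Standalone illustration. All raw weights are assumed normalized to
# [0, 1], as the scheduler does; the uptime bucket boundaries are made up.

def uptime_weight(uptime_days):
    """At most three discrete values, so the impact on blast radius and
    fragmentation stays negligible. Freshly restarted hosts get 0."""
    if uptime_days < 30:
        return 0.0
    if uptime_days < 90:
        return 0.5
    return 1.0

def host_score(uptime_w, anti_affinity_w, free_ram_w):
    # Multiplier magnitudes 3 > 2 > 1 make the decision roughly
    # lexicographic: uptime picks the group, anti-affinity subdivides
    # it, and memory stacking breaks the remaining ties.
    return (-3 * uptime_w          # freshly restarted hosts win
            - 2 * anti_affinity_w  # spread a project past the threshold
            - 1 * free_ram_w)      # less free RAM preferred, i.e. stacking

candidates = {
    "fresh, fairly full": host_score(uptime_weight(10), 0.0, 0.2),
    "fresh, empty":       host_score(uptime_weight(10), 0.0, 1.0),
    "old, fairly full":   host_score(uptime_weight(400), 0.0, 0.2),
}
print(max(candidates, key=candidates.get))   # fresh, fairly full
```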
Let's talk about some of the challenges and requirements for capacity management at Workday. First of all, multiple independent microservices run on the platform, and each may need to increase its capacity at a different time, for example when releasing a new feature. A central deployment plan controls the service deployments. In the past, the deployment plan was stored in a SQL database and requests to update it were submitted via JIRA. This process was error-prone. Moreover, before updating the deployment plan, the engineers responsible for capacity management would only check whether the total amount of RAM, CPU or disk was enough to increase the footprint of a service; they didn't take into account how fragmentation, and the order in which services are deployed, affect capacity. In the past, we had issues during our weekly maintenance window caused by a too-large increase in the capacity of one of the services.

Because of those issues, we made it a requirement that the new capacity management process validate whether there is enough capacity, taking into account fragmentation and also how the deployment order may affect capacity. On top of that, when a capacity issue occurred during a maintenance window, it was difficult to find which services had updated their capacity most recently. So we also made it a requirement that the new process make it easy to find and revert the most recent capacity changes. Another good-to-have feature was to make capacity management self-service: instead of submitting JIRA requests, which had to be copied into the deployment plan by the infrastructure team, we let service teams make changes in the deployment plan directly, while still giving the infrastructure team the power to approve or reject them.

At the beginning, we considered building a new tool with a web UI, but then we realized we already had the solution available. We picked Git, with Gerrit as the code review tool and Jenkins for CI, as it fits the business requirements perfectly, and most users already know and use Git every day, so we didn't have to reinvent the wheel. Additionally, treating capacity changes like code gives us all the benefits of a CI pipeline, including tests. I will walk you through that later.

Okay, so let's dive deep into how the new Git plus Jenkins workflow works in our capacity management system. This diagram shows the process of changing the capacity of one of the services. It all starts with a member of the service team updating the capacity in a YAML file with the deployment plan. This change triggers the validation job running on Jenkins, which checks whether there is enough capacity to increase the footprint of the service. Finally, the Ops team, which is the gatekeeper of capacity requests, gets a chance to review and approve the change. Merging the change to the main branch triggers another job on Jenkins, which updates the SQL database used by the deployment tool, so the next service deployment will use the updated deployment plan.

Let's now have a closer look at how the capacity check from the previous diagram works. For this check, we use the same simulation tool that we used to simulate the changes in the scheduler, which Silvano talked about. Because we run Jenkins on Kubernetes, we can package the simulation tool and all the helper scripts needed to run the capacity check in a Docker container. The data on the physical capacity of our clusters is stored in a central datastore and is periodically refreshed by a cron job which runs on each cluster. The validation job pulls the cluster capacity data, reads the updated deployment plan from Git, reformats the plan to match the input format of the simulation tool, and runs a series of simulations. We simulate the best-case deployment ordering, the worst-case ordering, and a number of random deployment orderings. If all simulations are successful, the job gives the change a plus-one approval, which is a required but not sufficient condition to merge the change. If, on the other hand, at least one of the simulations fails, Jenkins returns a minus one to Gerrit with an error message explaining how much capacity needs to be added to the clusters to make room for the new deployment plan.
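To illustrate the idea of checking multiple orderings, here is a toy, self-contained Python sketch. The real job drives the JavaScript simulation tool, which mimics the actual scheduler; the plain first-fit placement, the helper names, and the numbers here are stand-ins to keep the example short.

```python
# Toy sketch of the ordering check, illustrative only.
import random
from dataclasses import dataclass

@dataclass
class VM:
    project: str
    ram_gb: int

def first_fit(vms, host_free_gb):
    """Place VMs first-fit by memory; return False if one doesn't fit."""
    free = list(host_free_gb)
    for vm in vms:
        for i, f in enumerate(free):
            if f >= vm.ram_gb:
                free[i] -= vm.ram_gb
                break
        else:
            return False
    return True

def validate(vms, host_free_gb, n_random=20):
    """Best-case, worst-case and random deployment orderings must all
    fit for the change to earn a +1."""
    orderings = [sorted(vms, key=lambda v: v.ram_gb, reverse=True),
                 sorted(vms, key=lambda v: v.ram_gb)]
    orderings += [random.sample(vms, len(vms)) for _ in range(n_random)]
    return all(first_fit(o, host_free_gb) for o in orderings)

plan = [VM("A", 256), VM("B", 128), VM("B", 128)]
print(validate(plan, host_free_gb=[384, 128]))
# False: biggest-first fits, but smallest-first strands the 256 GB VM,
# so the change would be rejected with a capacity shortfall message.
```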
In addition to approving or rejecting changes, the simulation tool generates data which can be used to visualize and summarize the state of all clusters after the new deployment plan is applied. The data is again in JSON format, and we use the Jenkins archive artifacts plugin to archive it together with a static HTML file and JavaScript code containing an application to browse and visualize the data. Since the simulation tool is written in JavaScript, we can reuse the same modules to generate graphs and render tables with the results directly in the browser. The graphs are interactive, with a dynamic legend and labels which appear when hovering over a data point. We also present tabular data with some key metrics related to capacity and the deployment.

So what are some of the key lessons that we learned? First, we learned that capacity optimization is a tough problem; deciding what to optimize is itself the biggest challenge. In our case, we defined the notions of blast radius and fragmentation to determine what to optimize for. Simulating against real data from all our clusters helped us come up with a good optimization algorithm. Second, when you're optimizing, account for all types of scenarios and resources, for example IP addresses. If someone requests to increase the number of VMs they're launching on your cloud and you don't have enough IP addresses for that project, that by itself can be a problem. Also take into account the deployment model of an application or microservice. For example, if someone uses blue-green deployment, they may require double the capacity for a short period while rolling out a new version of the application. Finally, we learned that the Nova scheduler itself is very extensible. We gave you examples of how we managed to optimize resources by writing a plugin without requiring any changes to the main Nova code.