All right, I guess everyone is here, so we can go ahead and get started. My name is Ernest De Leon. I work for Mirantis as the chief cloud solution architect. What I work on is a couple of different things: obviously deploying and designing clouds, but also migrating workloads into the cloud. There's my contact info; it's been up there for a bit. I'm going to try to squeeze this presentation into 40 minutes. We're the last session, so I'll stay afterwards if anyone has questions, and I'll also try to leave some room at the end. But this is a pretty big discussion. This is a huge problem that people try to tackle, and we don't often get to see the insides of how it gets done, so I'm going to try to elaborate as much as I can in the time we have. So bear with me.

The overview is pretty simple. This isn't rocket science; it's actually fairly straightforward mathematics. It's just a matter of how you apply it, how you grab those numbers, and how you massage them into what ends up being capacity management. Some of the things we'll cover: what is capacity planning and why do we need it; small and large scale clouds and how they differ; and end-to-end lead times for capacity. I have some anecdotal numbers for you. As you know, we work with clouds; our customers span the gamut from very small 5 to 10 node clouds up to thousands and thousands of nodes and dozens and dozens of clouds that span the globe across data centers, all the way up to telco scale. So we'll go into some of those numbers. We'll also cover capacity inflection points and triggers, which are terms I use for how the capacity planning process is then used to actually acquire capacity and bring it into the cloud. And the last thing is developing the process for capacity planning. That's mainly because it's going to differ by what kinds of workloads you run, the scale of the cloud you're running, and your architecture; as you know, OpenStack doesn't have a single reference architecture. It's super flexible, so you can deploy it many different ways. So we'll dig into the process, and then you can take that and apply it to your own data center.

So the first thing is: what is capacity planning? Well, it's fairly simple. It's estimating the storage, compute hardware, software, and connection infrastructure resources required over some future period of time. Because of time constraints, I've geared this more towards the Nova side, the compute side, but we use the exact same methodology to approach storage. Whether it's object or block, it comes out much the same way.

So why do we need it? Fairly simple: the cloud grows at a rate that you don't control. Your users have the ability to dynamically spin up and spin down resources, their workloads shift depending on what they're doing, and you may not expect spikes in workloads. This becomes incredibly difficult the smaller the cloud is; the larger the cloud is, the easier it gets, and we'll show that when we dig into the anecdotal numbers.

Unlike the public cloud, and this is huge, and you can debate whether Amazon does this or not, when you're running a private cloud based on OpenStack, you have access not only to the underpinnings of the infrastructure, but you also usually have a closer relationship with the users that are running workloads on your cloud, specifically the large demand users.
Using a demand planning and forecasting methodology there helps you, because you gather more data points than you would normally get if you could not see the underlying infrastructure and the usage patterns, and could not talk to that customer or user directly and figure out what they're going to run now, in the next six months, in the next year, et cetera.

And the last thing: one of the biggest mistakes we make as operators of these clouds, if we do not do capacity planning correctly, is we try to use overcommit to solve these issues. That ends up backing you into a corner at a certain point, because inevitably you're going to get a workload in this cloud that is either CPU or RAM dependent, where you cannot have overcommit. And granted, you can create other availability zones or host aggregates or whatnot to move those workloads into their own niche corner, but that's not the right way to do it. What we don't want to do is put ourselves in a position where we're using those kinds of things to make up for the fact that we've incorrectly managed our capacity in the cloud.

So small capacity clouds are very difficult to plan for, for a very simple reason: a user workload spike can break your model. This happens all the time. Anybody who's run a cloud gets a user on there, their application use grows, there's a spike in usage, and all of your available capacity is gone. This is one of those things you want to avoid. Again, on a small cloud it's hard; on a large cloud it's a lot easier. Large is easier just because, like it says here, the ability for a single user to spike to the point where they eat up all of your capacity is small, or at least limited. Now, you could have multiple users do that, but it's still unlikely.

So we have to be diligent in monitoring the growth pattern on a regular basis. When we get to the anecdotal numbers piece, we'll discuss that, but essentially we're constantly gathering data. In most of our customers' environments, we're gathering data on a daily basis from the back end of the cloud that tells us aggregate instance counts and utilization. We know the server counts, obviously, because we built the cloud and we've been adding capacity to it.

And the last thing here: you never want to take a short-term solution to a long-term problem. Capacity planning is always a long-term solution. You're always looking at least six months, if not a year, out. Primarily, that's because of lead times, and we'll get a little more into what composes an end-to-end lead time for the purpose of managing capacity here.

So the total end-to-end lead time includes a couple of things. The first is the manufacturer lead time to get the servers to your door, which depends on who the manufacturer is, the type of hardware you standardized on, and other things that are not under your control. It could take anywhere from three to four weeks to three to six months to get servers just to your door at the data center, and that doesn't even count provisioning them into the cloud. The next thing is burn-in and testing, if you do that internally, which many of our customers do; some have it done by the vendor or the VAR who's in between. Rack, stack, and cabling in the data center takes time, especially if you're running at scale.
If you've got huge clouds, it's going to take several days for a team to rack, stack, and cable all of those servers. The last piece is probably the easiest, at least if you're running Mirantis OpenStack: Fuel automatically provisions and pulls all the hardware into the cloud for you. But you still have to take that into account; it takes time, and it takes a person or people, depending on your method. And if there are problems, someone has to fix them. A lot of times, an automated tool like Fuel will expose other problems you have in the infrastructure, like incorrectly cabled boxes, or servers that have the wrong size or number of hard drives in them. These things pop up in these tools, and you have to go back and fix them.

The most volatile of all of these is the manufacturer lead time. Delays can happen for any kind of internal component problem, and the scale at which you're operating matters too. You can go to the manufacturer and say, hey, I need 500 new servers, and I need them two weeks from now, and they may or may not be able to deliver. Or you might have a situation where a natural disaster takes a factory that makes hard drives offline, and then there's a shortage and you can't get them. So you have to understand what that lead time is, and you can work with your vendor or your VAR to understand the typical lead time for any of this and build that into your capacity planning.

And the last thing is: always look at this like you're a developer. Always pad the lead time. If you know that your end-to-end lead time is going to be six weeks, then make your plan eight or ten weeks, so you always have a buffer. When we get into the numbers, you'll see why that becomes important.
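As a rough illustration of that padding rule, here's a minimal sketch in Python. None of this is from our actual tooling; the component durations and the pad factor are made-up placeholders you'd replace with your vendor's quotes and your own internal averages.

```python
from datetime import timedelta

# Hypothetical component estimates -- substitute the numbers your
# vendor/VAR quotes you and your own measured internal averages.
lead_time_components = {
    "manufacturer":        timedelta(weeks=4),  # most volatile, quote-driven
    "burn_in_and_testing": timedelta(days=5),
    "rack_stack_cable":    timedelta(days=3),
    "cloud_provisioning":  timedelta(days=2),   # e.g. Fuel pulling nodes in
}

PAD_FACTOR = 1.35  # always pad: ~6 weeks of work becomes an ~8-week plan


def padded_lead_time(components, pad=PAD_FACTOR):
    """Sum the component lead times, then add a safety buffer on top."""
    total = sum(components.values(), timedelta())
    return total * pad


if __name__ == "__main__":
    total = padded_lead_time(lead_time_components)
    print(f"Plan for roughly {total.days} days ({total.days / 7:.1f} weeks)")
```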
So here's the first of some anecdotal numbers, and I have three or four of these charts, I don't remember exactly how many. I took a snapshot from March 2014 to March 2015, because these are representative of our customers' clouds, and we're far enough out that this isn't a trade secret of any kind. I've also averaged the counts across five different clouds; the numbers are very similar across them and their workloads, so we're not giving away any information here that no one else would know. And it gives you a picture of one year of utilization growth, end to end, which is big, at least in the OpenStack arena. Most of the time you see snapshots of weeks or months, but this gives you a full year. Next year, if I happen to do this again, we can go for two years and see what the utilization looks like.

The data is from a single region within a large cloud. These numbers came from customers running medium to large clouds, and this is a single data center, a single region of many; most of these customers are running dozens of these. And other regions' numbers, like I said, were similar, so there's no big discrepancy where a small cloud is pulling down the numbers from an incredibly large one. As you can see here, the chart starts very close to zero and goes all the way up to just shy of 12,000 instances. One region, one zone, one data center.

Next, the server count over time. Most of our customers, at least during this time frame, were standardized on 20-core servers. Part of the reason is that it's kind of a best practice. If you were with VMware back in the day, then Eucalyptus, then Mirantis, you saw servers start coming out with massive core counts, 40 cores or more. You don't want those kinds of servers in your cloud, because a failure of any one of them, especially if your workloads have any kind of autoscaling built in, generates boot storms on the remaining nodes in your cloud. So you don't want the smallest server possible, but you don't want the biggest; you want something right in the middle. Usually about 20 cores is where we put the limit, and you set your RAM requirements according to that. So if your customer tells you that your instance sizes are going to be of a certain kind, and you need at least four gigs of memory per physical core, then you adjust the sizing parameters of your servers. As you can see, most of these started off with anywhere from 10 to 15 servers; it was a small pod. Over time, it grew to just shy of 1,000 servers in the region in one year. That tells you right there: you've scaled from zero to 1,000 servers in the course of a year, which gives you a good idea of how fast you have to order hardware.

This is the capacity graph. What this shows is, based on these customers' instance types and flavors, how many instances this cloud could hold at any given time along this path. The next graph will plot these three against each other, so you can see how this looks in real life. And, you know, big spoiler here: we managed to keep capacity ahead of demand, and that's kind of the idea. So again, it starts at very few instances we could hold and goes all the way up to, I think it actually tops out at just over 18,000 instances that can be hosted in this one region.

And so this is where we plot the three against each other. The server count, as you can see, follows a fairly linear growth pattern. The instance count moves up a little more aggressively. And the capacity supported moves in the same general direction, but you can see some differences in there. Just after January of 2015, the instance count takes a more aggressive jump, and the reason is that at this point, more and more of the internal users of these clouds are becoming comfortable with the cloud and starting to use things like autoscaling. Whereas before, they were still doing the old lift and shift off of VMware and bare metal, they're now getting into autoscaling groups, because they've figured it out. So now the growth is, you know, the ship has sailed; you just have to keep up with it.

You can also see in the yellow line, right before January of 2015, that it loses its aggressive slant. The reason is that when you're on a small cloud, you're aggressively growing, so the point at which you hit an inflection point, a trigger to kick off the capacity planning process, or rather the hardware provisioning process, comes a lot sooner. And the amount of hardware you're ordering, as a percentage of the total hardware you have in the cloud at that point, is much bigger. When you get up to a certain scale, and you can see there that it's just shy of 10,000 instances, give or take, you can back off that number.
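To make that capacity line concrete, here's a minimal sketch of the arithmetic behind it: the instances a region supports are bounded by whichever of CPU or RAM runs out first for your average flavor. The server specs and flavor mix below are illustrative stand-ins, not the actual customer numbers from the chart.

```python
# Minimal capacity estimate: how many instances of an "average" flavor
# fit in a region, bounded by whichever of CPU or RAM runs out first.
# All numbers below are illustrative placeholders, not real customer data.

servers           = 1000   # nodes in the region
cores_per_server  = 20     # the ~20-core standard discussed above
ram_per_core_gb   = 4      # 4 GB of RAM per physical core
cpu_overcommit    = 1.0    # keep at 1.0 -- don't paper over capacity with overcommit

avg_flavor_vcpus  = 2      # weighted average across your flavor mix
avg_flavor_ram_gb = 4

vcpu_capacity = servers * cores_per_server * cpu_overcommit
ram_capacity  = servers * cores_per_server * ram_per_core_gb

by_cpu = vcpu_capacity / avg_flavor_vcpus
by_ram = ram_capacity / avg_flavor_ram_gb

print(f"Region holds ~{int(min(by_cpu, by_ram)):,} instances "
      f"({'CPU' if by_cpu < by_ram else 'RAM'} bound)")
```

Run this per host aggregate if your hardware isn't homogeneous; a single average flavor only makes sense within one pool of identical nodes.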
So whereas, and I'll show you another slide with those numbers roughly laid out for you, up to that point you're right at about a 50% utilization trigger, where you need to go and start the provisioning process again. Once you get to a scale where you're up around 10,000 instances, you can start backing it off to about 80%, so you're keeping a 20% buffer instead of a 50% buffer in your cloud. Which, again, is why it's easier to manage capacity on a larger cloud than on a smaller one.

So, capacity inflection points. How do we look at this in a semi-mathematical way, not a knee-jerk "oh crap, we're out of capacity"? There are points at which the infrastructure utilization rate indicates that more hardware should be procured; this is what starts the process of provisioning additional hardware and additional capacity into your cloud. These inflection points are calculated based on, number one, the total size of the cloud; the user growth patterns; and the total end-to-end lead times for provisioning. We put those three together to generate our capacity planning model.

On small clouds, less than 50 nodes, we're looking at about 50% utilization, so the inflection point is 50%: once we've utilized 50% of the available resources in the cloud, we trigger the process to provision more hardware. Then we add a 90-day growth number: we look at what the growth has been over the last 90 days and add that on top. Typically, that growth is about 50% on a small cloud, so we were basically adding 50% of the original number each time; it would be 50% growth, not 100% growth. On a medium cloud, 50 to 500 nodes, which is still kind of small at this scale but we consider it medium, it's the same approach at about 65% utilization: once you hit that number, you trigger the process again. On a large cloud, over 500 nodes, this is where it gets a lot easier: you can go to 80% utilization, which means you can wait until your capacity buffer is only 20% before you trigger the process. And a lot of times, at least at the 80% utilization number, we found that a 20% growth factor on the number of servers was just right.

These numbers are all based on observation of numerous customer clouds, and they can vary depending on the workloads. But if we go back to this slide, you can see that right after November 2014 is where we started backing off: from the 50% trigger we backed off to 80% and left the 20% buffer. And from that point forward, instead of adding 50% capacity at every trigger point, we only added 20% capacity at every trigger point, and that still allowed us to keep ahead of the total capacity needed in this cloud by quite a bit. That blue line ends at just above 12,000, and we've kept the available capacity just above 18,000, so we have capacity for about 6,000 more instances, and these are instances, not servers. I can tell you that over the six or seven months since then, those lines have come closer together, but we still manage to keep ahead of the curve, and we've not had to get into a position where we used overcommit ratios to solve this problem. So, a pretty easy way to keep track of it.
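Here's a minimal sketch of that trigger model in code. The utilization thresholds and the small/large growth factors are the ones from the slide; the medium-cloud growth factor is my own interpolation, since the talk only gave growth numbers for small and large, and the node and unit counts in the example are hypothetical.

```python
def capacity_trigger(nodes, used_units, total_units):
    """Return (should_order, growth_factor) per the inflection-point model:
    small  (<50 nodes):    trigger at 50% utilization, add ~50% capacity
    medium (50-500 nodes): trigger at 65% utilization
    large  (>500 nodes):   trigger at 80% utilization, add ~20% capacity
    """
    if nodes < 50:
        threshold, growth = 0.50, 0.50
    elif nodes <= 500:
        threshold, growth = 0.65, 0.35  # growth factor here is a guess; the
                                        # talk only gave small and large numbers
    else:
        threshold, growth = 0.80, 0.20
    utilization = used_units / total_units
    return utilization >= threshold, growth


# Hypothetical large cloud: 800 nodes, 13,000 of 16,000 capacity units used.
should_order, growth = capacity_trigger(nodes=800, used_units=13_000,
                                        total_units=16_000)
if should_order:
    print(f"Trigger procurement: order ~{growth:.0%} more capacity")
```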
So, developing the process. How do we approach this? The first thing is to make sure you have a viable way to gather and store historical data about the usage of cloud resources. Almost all of this information you can get out of Ceilometer, so there's no special trick to it. The one thing you may have to pull out separately is your physical infrastructure data, to understand how many servers you have and whether you're using different kinds of servers. These numbers can also be broken down by flavor types. If you have host aggregates with different types of hardware, entirely different hardware, then this capacity planning process has to be applied to each of those separately. The numbers here assume homogeneous hardware, which is usually what we recommend; however, your workload may not fit that. You may have workloads that fit a very generic homogeneous environment, and you may also be running VoIP or media encoding of some kind that has to sit on a certain type of high-performance hardware. You manage those capacities differently: same process, just different numbers.

Next, establish a demand planning process. This is actually one of the most important things we found. We could sit there blind and just use the numbers to say, okay, every month we're adding this much capacity and using this much capacity, and just try to keep those lines separated all the way through that graph. What's a lot easier is to go to our big users and say, hey, we noticed you're one of our major users, you used 1,000 instances this month; do you have any planned growth for next month? What about three months from now, six months from now, a year from now? Are there any major initiatives happening that are going to cause your usage to spike? And some of these spikes were actually huge. We had situations where some of these applications gathered data from mobile access points, and several times a year those access points were deployed at things like this summit or other large conferences, gathering all of this data. The normal workload for that customer might have been 1,000 instances, but whenever these large shows or summits or gatherings happened, it could spike to four or five thousand instances for a week or three days and then drop back off. Now, on a monthly scale you're not going to see that in these numbers at the end of the month, unless they happen to have those instances running when we grab the data, but you will see it the day it happens. That's the reason you need to keep those lines at least somewhat apart: you can't have them riding right on top of each other.

The next thing is to understand the full end-to-end lead time to bring capacity into your cloud. Fairly simple: talk to your VAR, talk to your manufacturer, and ask, based on this specific build, this bill of materials I want for my servers, what's the average lead time and what's the longest lead time you've seen, and try to work in the middle of that, at least for the hardware procurement part.

Then set automatic triggers to kick off the new hardware provisioning process when capacity utilization reaches specified markers. That's very simple: you've already decided where these inflection points are based on the size of your cloud, and at that point you trigger the acquisition process for new hardware.
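For the historical data and 90-day growth pieces, here's a minimal sketch, assuming you've already been exporting daily aggregate instance counts (from Ceilometer or your own collector) into simple (date, count) records; the figures below are synthetic.

```python
from datetime import date, timedelta

# (date, aggregate_instance_count) records -- assume these come from your
# daily Ceilometer/collector export; the values here are made up.
history = [(date(2015, 1, 1) + timedelta(days=i), 8000 + 25 * i)
           for i in range(120)]


def trailing_growth(history, window_days=90):
    """Growth over the trailing window, as a fraction of the starting count."""
    end_date = history[-1][0]
    window = [count for day, count in history
              if day >= end_date - timedelta(days=window_days)]
    return (window[-1] - window[0]) / window[0]


growth_90d = trailing_growth(history)
print(f"90-day growth: {growth_90d:.1%}")
# Feed this into the trigger model sketched earlier: when the trigger fires,
# size the order from the current footprint times (1 + growth factor),
# padded for your end-to-end lead time.
```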
And you also know that you've got the provisioning process afterwards, bringing all this hardware into the cloud. The last thing is to revisit all of the above points on a regular basis and adjust the process as necessary. This goes back to when those lines diverged on us, when they went opposite ways for a second there: we realized that our 50% planning model was not what we needed at that point, so we went back in and adjusted it, and that's where we backed off to 80%. You'll have to go back in and do that on a regular basis. I think once a month is enough, but if you have the time to do it once a week, that's even better; it makes it easier for you.

So the last thing is: any questions? I'm sure there are going to be some.

Absolutely, you know, some of our customers did that. They built their own tooling and their own algorithms to monitor this and trigger it on its own. That does work. The only issue is that when you're building an algorithm like that, you're on a set path: you're either on the 50%, the 65%, or the 80%, and you can go in and adjust it at any time, but it's harder for an algorithm to recognize something like a seasonal spike. We had a spike, for example, in November and December, and in January that utilization is going to drop significantly. However, the algorithm will see that spike as part of a projected path and automatically trigger a provisioning process for hardware that you may not need. You'll understand your customers' needs and their usage patterns better than that. In a very large cloud, especially one with workloads that deal with retail, an algorithm would need a lot more logic in it to understand the retail cycles, the expansion and contraction of workloads servicing retail online outlets, versus somebody who's running a bunch of Hadoop workloads or something like that. So you can absolutely do that, and we have some customers who did. You just have to keep in mind that one algorithm is not going to suffice for every type of workload and every type of customer you have out there. There's still a human touch required, at least for now. I think at some point we'll get to where we can model this better with just algorithms.
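To illustrate that caveat, here's a minimal sketch of the failure mode: a naive projection that extrapolates the last month-over-month delta will chase a November/December retail spike, while a version that's told which months are seasonal tracks the underlying trend instead. The data and the seasonal window are entirely hypothetical.

```python
from datetime import date

# Hypothetical monthly instance counts: steady ~5% growth, plus a
# retail spike in Nov/Dec that drops back off in January.
counts = {date(2014, m, 1): int(5000 * 1.05 ** (m - 1)) for m in range(1, 11)}
counts[date(2014, 11, 1)] = 14000  # seasonal spike
counts[date(2014, 12, 1)] = 15000

SEASONAL_MONTHS = {11, 12}  # months a human operator knows to discount


def projected_need(counts, ignore_seasonal=False):
    """Naive projection: last observed count plus the last month-over-month delta."""
    points = sorted(counts.items())
    if ignore_seasonal:
        points = [(d, c) for d, c in points if d.month not in SEASONAL_MONTHS]
    (d1, c1), (d2, c2) = points[-2], points[-1]
    return c2 + (c2 - c1)


print("naive projection:", projected_need(counts))        # chases the spike
print("seasonal-aware:  ", projected_need(counts, True))  # tracks the base trend
```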
So, yes and no. There are logical limits to what these services can handle; we find them all the time, especially on large-scale clouds. So we kind of know where that demarcation is, where we have to say, okay, you've not hit the capacity, but you're about 20% off, so stop putting stuff in this cloud; build another one, or build another region, and keep going like that. Most of our customers have kept well below those limits, just because their needs lean towards scaled-out clouds, having many of them, as opposed to one or two huge ones. But yes, if you start seeing performance problems, or you understand where the performance limits are for any of these services, that's absolutely a logical boundary where you know you need to build another cloud.

Yes and no, it depends. For example, we've noticed that regions seem to hold together fairly well as they approach about 2,000 nodes (well, this is a question slide). Now, that's very, very anecdotal, because again, these are very homogeneous clouds with similar if not identical hardware across all of them, and the workloads, while varied, are not dependent on things like CPU pinning, heavy media encoding or decoding, VoIP, anything like that. Those same rules do not apply elsewhere: we have customers with heavy media encoding and decoding workloads and VoIP, and those numbers absolutely do not apply to them. Their limits are much lower in terms of the performance threshold where they have to break off into another cloud. So that's part of it. Cinder and Swift are also very subjective; it depends on what you're backing the service with. Ceph usually does really well up to a certain point, but in terms of high performance, bringing in an all-SSD backend array from, let's say, SolidFire or someone like that will get you further along in speed and performance than Ceph will. And of course, a lot of our large enterprise Fortune 500 customers are still trying to utilize SANs in their environments. Those will get you so far, but they're limited in scale as well; in most cases, the limit of a SAN will come well before the limit of the Cinder service.

Yes, absolutely, the answer is yes. If you're running Neutron without DVR, and you're not using any kind of ML2 plugins for hardware offloading or anything like that, there's absolutely an upper threshold to that. In most of our clouds, we found that approaching about 10 gigabits per second of throughput is hitting the ceiling; actually, we hit it before that, but getting close to it is the ceiling for an x86 machine trying to act like a network device with real ASICs in it that are routing or passing traffic through. Other things come into play as well: shared networks cause all kinds of problems if you use them, so try to get your users onto networks that use floating IPs, as opposed to shared networks where one tenant can affect the performance of another tenant. But I think anything up to approaching 10 gigabits per second is okay; in general, we've been fine with that, though we did see problems as we approached it, at least with some of our customers.

So it does, and it's actually good you brought that up. Although it's not part of this talk specifically, how you manage IP addresses in your cloud is critically important. Number one, we don't recommend that anybody take floating IPs and expose them directly to the internet. But the other thing is how you set up those IPs. One of our recommendations in general is to take every cloud, no matter the size, and put it behind some type of layer three routing device, because you then get full control of the network behind it and can do whatever you want back there. Your only limitation will be whether you're running a double NAT scenario, or however you're getting your traffic out to the internet; not inside your corporate network, because you fully control that. But if you're exposed to the internet at any point, your only limiting constraint is what you can get from your ISP in terms of publicly routable addresses. So that's important.
And as you know, those are super limited at this point. We do have customers, especially in the large telco segment, with hundreds and thousands of these publicly routable addresses, who have to pay special attention to how they're handling that. But that's more of an education issue with users. If you have a lot of users coming from AWS, they've typically, at least prior to the last couple of years, come with the conditioning that every instance has a public IP of some sort. Now VPC and all these other things came along, where they got their private networks and didn't have to do that anymore, but the mentality that a lot of these things should have public IPs never went away. So you have users putting floating IPs on things that don't need them. I think it's two things at once: managing those publicly routable IPs in a better manner, and educating the users as to when to actually use a floating IP that maps to something or not. That's a better approach than only worrying about the number and the utilization without looking at why they're using them or how they're approaching it.

Anything else? All right, thanks. Looks like we finished on time.