So, welcome everyone. Did you notice the fog last night? That kind of played into my cards, because what I'm talking about today is something that has been dogging OpenStack for quite a while. I have seen it myself. I go to customer sites regularly, and one question that I get asked by pretty much every management member at one of the sessions is: "So how am I going to see what the cloud is doing? I'm going to pay two million bucks for this. What do I get for it?" And this is something that OpenStack at the moment cannot answer.

So I came up with a forecast system. Initially it started as just an exercise for myself to make my work a little bit easier. Then Colin, my co-worker, who is unfortunately not able to come today and who has a big part in this, came up with the idea of actually casting that into code and asking the cloud for advice.

Clouds are being built by architects and engineers. It has been that way since I joined Mirantis four years ago. In the beginning we were talking mostly to technical people, and we were building relatively small clouds that were, let's say, tests, to sort of dip your toes into the pool of cloud computing. Nowadays we are talking to very large customers that are building million-dollar clouds. These clouds obviously have to be paid for by somebody, and the managers want insight, and we want to give it to them.

So this is what I've seen at quite a few customers. You make a little Excel spreadsheet, you get a couple of values from different days, and then you make your curves and you come up with "okay, this looks about right." And if you present this to management, management is going to say, "Yeah, and who tells me this is right? It looks good, but I don't really have confidence that this is going to be the right forecast." So, well, you want half a million bucks?
You only get 150,000. And the other thing is, obviously, you just put a couple of values in wrong and the whole forecast is going to be unusable.

So, next thing: who would have the values that you are looking for? The values, of course, are in the cloud. The cloud knows what is being used: how much of the resources that you theoretically have is actively being used, how much is still open, and how long it will take to get to the point where your resources are not going to be enough anymore. The most important keyword here is obviously "automated", because the last thing you want is something that is very complicated to use and that nobody is going to use for that very reason.

So, big challenges. We have a challenge in tooling. We have a whole bunch of different software that helps us see what the cloud is doing right now. Mirantis OpenStack has StackLight, and there are a couple of other companies who also make tooling for that. But this is always only a snapshot. You have an operator who looks at the cloud and sees "everything is healthy" or "this is broken, we need to fix that." So you cannot really use that for reporting, other than, again, copying values out of it and putting them into a spreadsheet. In the best case you will get a little bit of history, but it will not help you with the forecast.

So what happens in reality is this. You just take the value and say, "Okay, this year's budget was X, next year we want a little bit more," and that's how you come up with the number for next year. And in many cases you know that a certain department is going to be onboarded, they have a lot of compute resource requirements, and putting that into your forecast makes the thing exponentially more complicated.
So you have that curve that you're plotting in Excel, and then you have to somehow add that workload jump to get to the point where you again have an accurate, or reasonably accurate, forecast.

What this leads to is that projected budgets are usually wrong. Most of you have probably seen that: you forecast X amount of money to buy something, the forecast was incorrect, you actually need more, and then it goes to a rush order. Instead of a six-week procurement cycle you have to have this in a couple of weeks, which costs a lot of money, and obviously management is going to be rather unhappy about that. And the impact is almost always only seen when it's too late. You know, "Oops, we are getting very close to the ceiling," but by the time you know that, the last procurement cycle is usually already over.

So the business effects, and this is again what the managers are going to be interested in, are these. Budget overruns: you spend more money on stuff than you would have if you had ordered it in time. Late delivery of products or services: if a department can only be onboarded three months late because you don't have the hardware, they simply cannot do their job until the onboarding happens. Outages: if you run a whole bunch of systems at peak, the risk of an outage goes up drastically. You have errors due to overstressed operations staff, things like somebody pulling the wrong hard disk or somebody nuking the wrong server for onboarding. And in the end, what this all leads to is lost or unrealized revenue. Nobody really wins in this scenario. Management does not, you do not, and what we get to is that everyone is angry at everyone without knowing what really went wrong.

So this is the ancient history.
This is what I built myself to calculate clouds for customers. I can see at least somebody is kind of interested in that. Basically, this got me down from doing calculations for a week to a couple of hours or less to project a reasonably accurate cloud model for a customer. This obviously only works at the very beginning, when I'm starting out, not at a later point in time when I calculate something for an existing requirement. So what do we do?

We introduce Weather Station, and this is supposed to address the problem that we have here. StackLight is great for monitoring, but what we really need is a StackLight over time, and this is what Weather Station is. The name comes from the fact that we have clouds: we are monitoring clouds and forecasting clouds, and this is what the weatherman does. So instead of building for operators and technology people, we are now tooling for business. We project the cost of the cloud scale-out, and we predict when hardware will be necessary. I'll show slides on that later on. We can compare scenarios, and obviously also integrate with ERP systems and provide the reporting data. Say you need to prove to management that you hit your 99.95 or 99.99 percent SLA: you can just pull the report and say, "This is how long we were down, this is why we were down, and this is still less than what we are allotted to meet our SLA. So here you go."

How does it work? We have a core unit. This is a Flask server that takes data from the InfluxDB of StackLight. This was something that was rather useful for us when we started developing this: instead of deploying another agent to each and every one of your hundred, two hundred, five hundred cloud nodes, we are just using the data that is collected anyway for monitoring, and we use it as historic data on one side and as the basis for predictive data on the other side. We keep values longer than StackLight does, and we have our own database to put the data in.
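The idea of reusing monitoring samples as a compact long-term history can be sketched roughly as follows. This is only an illustration, not the Weather Station code: the function name `thin_to_daily`, the choice of one row per day, and the mean/peak aggregation are all assumptions, and the samples are plain Python tuples rather than real InfluxDB query results.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def thin_to_daily(samples):
    """Collapse (timestamp, value) samples into one (day, mean, peak) row per day.

    Hypothetical helper: the talk does not say how the history is reduced,
    so daily mean and peak are assumptions made for this sketch.
    """
    by_day = defaultdict(list)
    for timestamp, value in samples:
        by_day[timestamp.date()].append(value)
    return [(day, mean(values), max(values)) for day, values in sorted(by_day.items())]


if __name__ == "__main__":
    # Made-up vCPU usage samples for two days.
    raw = [
        (datetime(2017, 1, 1, 0, 0), 10),
        (datetime(2017, 1, 1, 12, 0), 20),
        (datetime(2017, 1, 2, 6, 0), 30),
    ]
    for row in thin_to_daily(raw):
        print(row)
```

A reduction like this keeps the long-term store small, because its size grows with the number of days kept and not with the sampling rate of the monitoring system.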
These are thinned-out values, so we do not have a database that will explode and become terabytes and terabytes of reporting data. We only keep the values out of the StackLight database that are going to be necessary for our different reporting data points. We have configuration, of course, and this all ends up in a graphical user interface.

I initially thought about doing a live demo today. Unfortunately, that came a little bit late, and getting a cloud on the plane was not doable anymore. So we basically said, okay, I'm going to provide pictures of what this looks like.

First of all, we have the input/output cycle of Weather Station. We have the project manager and the architect. The architect communicates with Weather Station by adding, for instance, that at a certain point you expect a rise in utilization, and he also gets the values out of the cloud to analyze for future purchases. For instance, it will show you that you are getting very close to the limit of vCPUs but still have plenty of memory left. What you can conclude from that is that the servers you have may have more memory than they need to be optimally balanced. You have, for instance, a server with 768 gigabytes of memory and only 24 CPU cores. So you can see that your workloads are not that memory-heavy, and you can buy cheaper machines with less memory.

On the other side we have the project manager, who obviously knows about the requirements for the cloud. There is also a direct line missing here between the project manager and the architect, where data comes in and goes out. And on the other side we have our operations center, and all the data that goes into Weather Station comes from there.

So the idea is to support the business, not the technology. The technology is supported enough by other tools, by StackLight. What we want to be is the cloud monitoring for business, and what is important
is ahead-of-time planning and configuration tuning. As discussed, you have a command-line interface if you like automating things and want to pull data from Weather Station, or you can use the graphical user interface.

First of all, pre-deployment capacity planning. Weather Station can actually do that: you can plug in a new cloud and plan what the cloud parameters that you need to give to the customer are going to be. To do that, you also have to be able to store and compare cloud configurations, so you can see, "If I do this compared to that, this is the result."

The downside to this, of course, if you remember the spreadsheet that I showed before, is that the configuration is not trivial. It looks pretty easy, and I have given it to a bunch of co-workers who wanted it to do their own planning and then found out that changing one parameter can have very far-reaching consequences, and there is no safety net. There's nothing in there. We have put some safety net into the capacity planning for Weather Station, but it is not yet at the point where you can just give it to anyone, let's say to the project manager, and say, "Okay, plan my cloud."

Okay, so this is an input panel for the cloud. As an input we always have an assumption of what the workload is going to be like.
In many cases it's not as easy as it looks. You can say, "I expect most of my workloads to be something like this: two vCPUs per instance and four gigabytes of memory per instance." Of course this varies very much with what you are actually doing with the cloud. We're currently working on a configuration that uses Hadoop, where the average instance is two-thirds of the size of the machine that it's running on. So it really depends, and it pays to sit down with all the stakeholders that are going to be in this cloud environment and come up with a reasonable estimate of what the average is going to be.

One thing that we have also seen time and time again is that we have stakeholders who have very lofty ideas of what this is going to entail. Let's say, "We need a thousand instances with 16 vCPUs and 64 gigabytes of memory." One of the things that you can do then is... there's a slide missing, sorry about that. You can figure out a rough estimate of how much this is going to cost. So the idea is to say, "Okay, yes, we can support your thousand instances with X vCPUs and Y memory, but the cost goes to your cost center, and as you are taking up two-thirds of my cloud, you're going to cover two-thirds of the cost." In many, many cases you will find afterwards that the estimates become quite a bit more realistic. And this is what this tool is good for: being able to predict a rough estimate of what it would cost with a realistic estimate and with the lofty estimate, and to say, "This is the delta. This is what you need to cover." We have shown that to customers, and in many cases the result was that people were actually sitting down.
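The cost argument above can be illustrated with a small calculation. This is a sketch under stated assumptions, not how Weather Station computes cost: the function `cost_share`, the rule of charging by the larger of the vCPU share and the memory share, and the cloud size and price in the example are all hypothetical. Only the instance sizes (16 vCPUs / 64 GB versus 2 vCPUs / 4 GB, a thousand instances) come from the talk.

```python
def cost_share(instances, vcpus_per_instance, mem_gb_per_instance,
               cloud_vcpus, cloud_mem_gb, cloud_cost):
    """Return (share_of_cloud, cost) for a requested set of instances.

    Assumption for this sketch: a tenant pays for the larger of its
    vCPU share and its memory share of the whole cloud.
    """
    vcpu_share = instances * vcpus_per_instance / cloud_vcpus
    mem_share = instances * mem_gb_per_instance / cloud_mem_gb
    share = max(vcpu_share, mem_share)
    return share, share * cloud_cost


if __name__ == "__main__":
    # Hypothetical cloud: 24,000 vCPUs, 96,000 GB of memory, cost of 3,000,000.
    cloud = dict(cloud_vcpus=24_000, cloud_mem_gb=96_000, cloud_cost=3_000_000)

    lofty_share, lofty_cost = cost_share(1000, 16, 64, **cloud)
    real_share, real_cost = cost_share(1000, 2, 4, **cloud)

    print(f"lofty estimate:     {lofty_share:.1%} of the cloud, cost {lofty_cost:,.0f}")
    print(f"realistic estimate: {real_share:.1%} of the cloud, cost {real_cost:,.0f}")
    print(f"delta to cover:     {lofty_cost - real_cost:,.0f}")
```

With these made-up numbers the lofty request takes up two-thirds of the cloud, which matches the situation described in the talk, and the delta is the amount the requesting cost center would have to justify.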
This is the result that I actually want to see: not people replacing one rough estimate with another rough estimate, but people actually sitting down and examining their workload. In most cases you have pre-existing workloads. You have, let's say, a server farm that you are running your instances on, so they go back to that and actually look at "What is this going to look like in real life, and do I really need what I said I would need?"

So this is a set of comparable cloud configurations that we have made. This is a little bit hard to read, but I can point it out. You have three configurations that you're comparing in terms of a lot of metrics. For instance, with the different configurations you would see that you need 58 nodes, 62 nodes or 70 nodes. "Instances based on CPU" shows you the delta: in this case the estimate for CPU was different between those three configurations, so you can see that you could run 15,000 instances on this cloud with the vCPU forecast that I have made, versus 9,500 instances with another forecast metric.

The important thing, of course, are the constraint values. For instance, you look at my instances: I have a value for total vCPU, a value for total memory, a value for total storage capacity and a value for storage performance. Then you look at those four values. It says, for instance: on vCPU I can run 10,000 instances on the cloud, on memory I can run 6,000 instances, on storage I can run 9,000, and on storage performance
I can run 5,500. So what you find out is that you have two parameters that are low, and the other parameters are high. In this case you can say, "Our storage is going to be too slow for what we are doing here, so we either need to come up with more spindles or find another way to make the storage faster." The same thing goes for memory. I have more CPU capacity than I have memory capacity, so after I fill up my cloud with workloads I will see that I still have headroom on the CPU side but no headroom on the memory side. If I add X memory to this configuration, I can remove that bottleneck, but I still have the storage performance bottleneck.

Conversely, if you do not have the requirement for these 10,000 instances, it might be useful to say, "Instead of boosting my storage performance, I just give the cloud a new ceiling of 5,500 instances, and instead of raising my memory, I reduce my CPU count." Then I have a certain balance now, and later on I simply scale blocks of this out. I scale out my storage and I scale out my CPU and compute; for instance, I put in a fourth rack, a fifth rack. And as I maintain the balance, the money is better spent than if I spend a lot of money on optimizing one parameter and later hit the other limit before I can use my expensive CPUs. Especially with CPUs that's very important, because high-performance CPUs are usually very expensive. If we compare, a mid-range CPU has half the performance of the top-end CPU but only costs 20% of its price.

Okay, so the next thing that we want to do is capacity planning and reporting after deployment. This is where the meat of the matter comes in. You have a cloud that's already running, and you want to know: when will I hit the ceiling?
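The constraint comparison described above (10,000 instances on vCPU, 6,000 on memory, 9,000 on storage capacity, 5,500 on storage performance) boils down to finding the smallest of the per-resource limits. A minimal sketch follows; the function name `rank_constraints` and the dictionary keys are illustrative assumptions, and only the four numbers come from the talk.

```python
def rank_constraints(limits):
    """Sort per-resource instance limits from tightest to loosest.

    `limits` maps a resource name to the number of instances the cloud
    could run if that resource were the only constraint.
    """
    return sorted(limits.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Instance limits per resource, as quoted in the talk.
    limits = {
        "vcpu": 10_000,
        "memory": 6_000,
        "storage_capacity": 9_000,
        "storage_performance": 5_500,
    }
    ranked = rank_constraints(limits)
    bottleneck, ceiling = ranked[0]
    print(f"cloud ceiling: {ceiling} instances, limited by {bottleneck}")
    for name, value in ranked[1:]:
        print(f"  headroom on {name}: {value - ceiling} instances")
```

The first entry is the effective ceiling of the cloud. The second entry shows which limit comes next if the first bottleneck is removed, which is the memory versus storage performance trade-off discussed above.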
In this case we can compare hypothetical scenarios. Let's say I want to see what happens if I change my memory overcommit, or my disk overcommit, or something like that, and it will tell you when you will need to do your upgrades. The tie-in to ERP and workflow systems is at the moment still in development. We are working on this very actively. The project is six months old, and it is in some respects still a bit incomplete, but it is already very usable, and we will work on the deficiencies and build up from there. A lot of you know from the early days of OpenStack how long it took for some projects to reach maturity. I think we are going to be quite a bit faster than that.

So this is another interesting thing. This is where you have your SLA reporting. In this cloud we had no downtime, so we were obviously hitting our SLA, and all the individual OpenStack APIs reported no downtime during the cycle that we are looking at. We have 87.5 percent of CPU used and about 50 percent of RAM used. What you can again see here is that in our cloud, CPU is going to be the bottleneck. So in future expansions we can either use bigger CPUs or less memory.

The interesting thing is also this pie graph, which you see when there is downtime. Unfortunately, the slide that I have does not show any downtime.
It would show, for instance, that a quarter of the downtime was due to one API, a quarter of the downtime was due to the Cinder API, and so on. So you basically have an explanation of what went wrong and where you can or should make adjustments in the future to improve that. The pie chart is visible on my monitor but not on the projector. There's another patch out there; unfortunately it is a light gray, and apparently the light gray did not show very well.

We are not really doing anything with this cloud. This is basically just a demo cloud with a couple of workloads that have no bearing on a realistic cloud configuration. You can see it has seven instances and ephemeral storage for seven instances, so this would be, let's say, more useful if you had a larger cloud.

The next thing is that we can add capacity. Before we actually do our expansion, we can predict it. We can say, "On the first of January we're adding two new servers; on the 17th of July we're adding three new servers," and then the graphs will show that differently. I will show the graphs in a minute. In the same way we can add a usage plan list, where we can predict the onboarding of instances. We can say, "On this date we are going to onboard 50 instances; on this date we are onboarding 20 instances," and you will see that the graph also changes for that.

And here we are: capacity planning. At the left here we have the threshold. We have how many total vCPUs are available, and this obviously changes as we add servers to the configuration. And we have the actual use of those vCPUs. You can see this is climbing here. You can change the prediction model in the software.
This is an average graph, and you can change it to linear; you can have different models there. What this shows us in this case is the allocation threshold. This is the amount of CPU that we have, and we have, let's say, 80% of that as the threshold. We want to know when we hit that, and according to this prediction we will hit it on this date, at this time. You obviously know that you need four weeks to get a server in; it could be longer. So you have to order ahead of time, and at this point you have to add CPUs, simply because you are going to hit your set threshold that you do not want to pass.

The other thing is, of course, that you are planning for this duration. Let's say this is three months. You want to see what happens at the very end of those three months, because that will tell you how much you need to order to move the threshold that's currently here to up here. So out of this you get the order quantity. You have the delta here, and you need about 13 vCPUs in addition. Of course the numbers will be much larger for your cloud, but for demonstration purposes that's fine.

So how do the workload onboarding predictions come in? This is the one. If you have more workloads, if you start onboarding, your curve will change: it will move away from the straight line and get steeper. So of course in this case the predicted passing of the threshold is earlier than it would be in an environment where this workload was not onboarded.
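The forecast logic described above (fit a trend to the vCPU usage, find when it crosses the 80% allocation threshold, and work out how much to order for the planning period) can be sketched as follows. This is an illustration under assumptions, not the Weather Station implementation: the function names, the plain least-squares line as the "linear" model, the day-based time axis and all sample numbers are hypothetical.

```python
import math


def fit_line(days, usage):
    """Ordinary least-squares fit; returns (slope, intercept) of usage over days."""
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(usage) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, usage))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def projected_usage(day, slope, intercept, onboarding=()):
    """Trend value on `day` plus every planned onboarding (start_day, vcpus) due by then."""
    planned = sum(vcpus for start_day, vcpus in onboarding if start_day <= day)
    return slope * day + intercept + planned


def first_day_over(threshold, slope, intercept, onboarding=(), horizon=365):
    """First day within the horizon on which projected usage reaches the threshold."""
    for day in range(horizon + 1):
        if projected_usage(day, slope, intercept, onboarding) >= threshold:
            return day
    return None


def extra_capacity_needed(projected, capacity, fraction=0.8):
    """vCPUs to add so that `projected` usage stays at or below fraction * capacity."""
    return max(0, math.ceil(round(projected / fraction - capacity, 6)))


if __name__ == "__main__":
    days = [0, 1, 2, 3, 4]
    used_vcpus = [40, 42, 44, 46, 48]          # made-up history
    capacity, fraction, horizon = 100, 0.8, 90  # 80% threshold, three-month plan
    slope, intercept = fit_line(days, used_vcpus)

    print("threshold hit on day:", first_day_over(fraction * capacity, slope, intercept))
    print("with onboarding on day 10:",
          first_day_over(fraction * capacity, slope, intercept, onboarding=[(10, 20)]))
    end_usage = projected_usage(horizon, slope, intercept)
    print("vCPUs to order for the period:", extra_capacity_needed(end_usage, capacity))
```

The date on which the threshold is reached, minus the procurement lead time, gives the latest order date. A planned onboarding is modeled here as a step added to the trend, which moves the crossing earlier, as described in the talk.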
So what you can do here is take the new value and see that the projection is different, and you may also need to pull your orders forward so you do not overrun your current capacity. This here is the adding of the capacity. If you do not add the capacity, then at this point you will not only overrun the threshold, you will actually hit the capacity, and your cloud will be full. You will not be able to onboard anything more unless you change something.

So in that case we added workloads. In this case it's the other way around: we have added servers at a certain point, and this is where the capacity is added. The other threshold line is still missing there; we have that now. So basically this was the old threshold line, and we would have hit it here, so we would have gone over the 80 percent. Instead we are adding capacity, so the new threshold line is here, and we hit the threshold line farther out because we have added capacity to the cloud.

Okay, so what we are still planning to do: one, there are going to be a lot more parameters. At the moment we have CPU capacity.
We have memory capacity, storage capacity and storage performance, and there are going to be more parameters that are managed. We are also going to come up with a graph, and this has been more of a graphical user interface problem than a technical problem, where we have both memory and CPU in one graph, so you can compare: "On CPU we still have headroom, but on memory we are going to hit the threshold earlier," and then you can adjust your configuration accordingly. This is not done yet, and we are working on getting it done.

The one thing to remember is that this is meant as an add-on to StackLight, because at the moment we use the StackLight infrastructure to extract the data from the cloud. In the future we will also be able to plug other metrics engines into that, so we will be able to use other monitoring tools, but at the moment this is what we have to live with.

The other thing is access when you are interacting with Weather Station. Let's say an engineer is not supposed to make the forecast for architecture. When you log in, you can get different levels of access, to ensure that people do not produce forecasts and calculations that do not reflect the true state of the cloud.

Okay, thank you very much. I hope that it was useful for you. I hope that somebody will be interested in trying out Weather Station and also in giving us feedback on what should be done better, what should be improved in the future, and what additional predictions you would find necessary. So thank you very much for coming, and have a great OpenStack Summit.