Well, thank you for that introduction. I guess that makes my first slide and my second slide redundant. Yep, I am Dave. I'm from Stark & Wayne. I'm a cloud engineer, so I come from the ops background. We're going to try to do the business thing, but really what it's going to be about is using the resources that you have effectively and making sure that you can get the most out of what you have. As all of you are here, I'm sure you know Cloud Foundry can be expensive. You roll out your foundry, and then you get that first bill from your IaaS, or your vSphere team comes to you and says, what did you just do? How did our bill just go way up? Well, that's because you're accelerating the process. And in order to talk about cost, we have to talk about relative versus actual cost. Relative cost is what your app teams care about. You're showing that to them via showback or chargeback, some form of showing your app developers what they're using. They care about how much their apps are costing, whereas the operators care about the actual cost. They care about that IaaS bill at the end of the month that's coming regardless of what you're doing; it's how much you're using, and that's what it is. So the first thing we're going to talk about is that relative cost: the cost per app. How do we bring that down? This might be a bit counterintuitive, but the answer is: run more apps. The more apps you run on the platform, the cheaper each app instance is, and that's because of economy of scale. Cloud Foundry has an overhead to it, because you're running the routers, you're running the Cloud Controller. The things that make Cloud Foundry wonderful and allow you to just do that cf push sit on top of what you need to run the app itself in terms of resources. So what does overhead look like in Cloud Foundry? Well, this is kind of what it looks like.
If you're running a foundry with only two cells, assuming the cells are four CPUs and 16 gigs of RAM, the usual standard, 87% is overhead: the thing that is allowing you to propel your business, but not really the economy of scale you're looking for. The real break-even point is about 15 cells. Once you're at about 15 cells, you're doing pretty well. At that point, your actual app resources are more than the platform overhead itself, which means you're really starting to take advantage of it. And once you hit 100 cells, you're at almost 90% efficiency, which is really, really good. So what does that look like in terms of cost per app? This is for compute cost: how much does it cost for that CPU, for that one gig of RAM that I'm using? Well, it looks sort of like this. When you're at two cells, each app is at about $180 in compute costs alone. But once you go from two to four cells, it doesn't drop by half, it drops by more than half. You get down to about $45, and the theoretical minimum is right about $11 in compute cost, because that's how much it costs to run a one-gig app on that instance. You've almost gotten all the way down to just the bare cost of the resources. So the logical thing is: why don't I just scale out to 1,000 cells? This graph shows that if I have 1,000 Diego cells, my costs go way down. Well, your relative costs go down, but you're not using those resources, you're just wasting them. And that's where your monitoring comes in. Sometimes when we come into a place, we'll see Prometheus dashboards that look like this, where you're at 5% utilization: you have 200 gigs of RAM and you're using 10, or you have 2,000 gigs of RAM and you're using maybe 100. You've way over-provisioned what you need. And that might be because you have a really busy holiday season and you need that.
You need that for those couple of days where you've got this peak load, but throughout the rest of the year you're idling and you don't realize it. The cool thing about Cloud Foundry is that you can scale up and scale down as you need to with respect to your app instances, but you also have to do it with respect to the platform. Because if you don't, you're just throwing money away by running things that aren't going to be used. So what would a better dashboard look like? Well, it's probably somewhere around this if you're not doing a ton of really volatile peak scaling: somewhere between 60% and maybe 70% utilization on Diego cells. In terms of disk capacity, we can do even better than what we have here. This is over-provisioned; we can scale that back. Disk is cheap, but that doesn't mean we have to waste it. It's all about how we manage our resources. So let's say we scale back all of our Diego cells and we've got that about right, but are we utilizing each VM to its full capacity? That's where we have to talk about resource-based sizing: how do we size our VMs appropriately in order to make the most of our hardware? If you've ever been on GCP, you've probably noticed something like this when you go into the dashboard, telling you that you can save $145 a month by doing something with this instance. Well, what does that mean? Can I just click a button and it'll save me money? Why don't we all just do that? If you click it, you'll see something like this. It'll say, hey, we recommend that you resize this instance because you're not really utilizing it effectively. We see that you've got three gigs of RAM and you really only need 1.7, two gigs of RAM; you can cut that down. But in a foundation where you have 1,000 VMs, trying to do this manually one by one isn't really effective.
And even if you go and do it one by one, BOSH is going to realize that something changed, and it doesn't like that. BOSH controls everything, so you can't do it out of band. You have to tell BOSH, because if you don't, BOSH is going to put it right back for you; that's what it's really good at. So how do we do that? Well, you have to look at each instance type in your cloud config and make the appropriate changes. For the IaaSes that support it, defining custom resource types for the resources you're looking at is really important. You can say, well, here's our base, it's an n1-standard-1, but we can customize it further. We don't necessarily need, or can't fully use, a pre-baked size, so let's cut it down. When you cut that down and deploy, BOSH will handle all of those VMs, and as you scale, it will apply that sizing appropriately across all of your VMs, so you don't have to do it one by one. So now that we've got that, what if you're not on GCP? What if you're on vSphere or AWS? The whole world runs on many, many different platforms. Well, then you have to look at your monitoring, because your IaaS isn't doing it for you. That's where a tool comes in; I'm going to use Prometheus because it's what I have and what I've been using, but there are plenty of monitoring systems that'll show you more or less the same thing. How much CPU am I using? How much RAM am I using? Am I using it effectively? In this dashboard, you can see CPU utilization is like 2%. Well, if it's 2%, what's the other 98% doing, and why am I paying for it? You probably shouldn't be, and you should scale appropriately. So look at each VM type and ensure that you have enough for bursts and for scale, but also that you're not just throwing resources away. On this one, we may be a bit over-provisioned on RAM at 75%. That's good utilization, but if you get a big spike, you might not be able to handle it. That's where your heuristics come in. You can see over the past month, here was my peak.
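On BOSH, the cloud config change described here might look something like this sketch. The vm_type name and the sizes are illustrative, not a recommendation; the `cpu`/`ram` custom machine type properties are the Google CPI's way of requesting a non-pre-baked size.

```yaml
# Sketch: a right-sized vm_type in a BOSH cloud config on GCP.
# Values are illustrative; check the Google CPI docs for your version.
vm_types:
- name: small-custom
  cloud_properties:
    # Instead of machine_type: n1-standard-1 (1 vCPU, 3.75 GB),
    # request a custom machine shape closer to measured usage.
    cpu: 1
    ram: 2048            # MB; the "you only need ~2 GB" case above
    root_disk_size_gb: 20
```

Once an instance group references this vm_type, BOSH applies the size to every VM of that group on the next deploy, so the resize never has to happen one VM at a time.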
All right, let's scale to a bit less than that; or, during this time of year we have to scale up by this much because of last year. You can use those heuristics to better figure out what your sizes should be. And then, same thing: go into your cloud config, make your changes accordingly, and just keep deploying and iterating. If you start by going all the way down to what you think you might need, you might run into some trouble, because you may have underestimated. So take it one step at a time, and make sure that everything keeps working as you go and that you're still fulfilling your needs. But at this point, what if that isn't enough? What if you've cut things all the way back, you're down at one CPU and one gig of RAM, and you can't go smaller? What can you do? That's a real concern for environments like sandbox, or maybe non-prod, where you're not getting a ton of real active use, but they have a very important role and we don't want to skip them. What you can do is co-locate Cloud Foundry components. You can take, say, one job that used to run on its own VM and co-locate it with another job that runs on its own VM and has a complementary resource profile. Maybe you have one that uses a ton of RAM but very little CPU, and one that's high CPU and low RAM. They may go well together, and then you can bring your utilization to a more manageable level. So what does Cloud Foundry look like? Well, Cloud Foundry has a lot of VMs by default. It has a lot of components, and this one only has three Diego cells. What does the picture look like as you scale out? Some components scale pretty horizontally, like Diego cells; they will just scale completely out. They're the one thing that you really have to keep on top of. But there are other components that scale too, just maybe not as fast. So for your sandbox environments, the easiest way to cut cost: remove them all.
Anything that's highly available, instance it down to one, and you've just saved a ton of money, but you also lose your high availability. For production use, that's not really viable, but for some sandbox environments that have to stick around to test upgrade paths and things like that, it may be a good option for the things you're not testing, where all of the different components just need to come up at once. But what if this isn't enough? What if you've scaled everything down to one instance but you can't scale any smaller? This is where co-location comes in. You take all of those VMs and you push them onto logical boxes that kind of make sense. Once you're at this point, you are as small as you can get with Cloud Foundry. If you wanted, you could co-locate every single piece onto one box. I don't really recommend it; it's not a great solution, because with everything on one box you can't scale your components and test different things. This approach will still allow you to scale the components that need to scale, because each co-located component is related to the other components it lives with, and when you need to scale one, chances are you'll need to scale the others at a similar time or very shortly afterwards. But at this point, what if you want to leverage this and also make it highly available? What if you want the benefits of compressing the size and co-locating your jobs and making things more efficient, but you also want that high availability? Well, it scales just like this. You scale, but you still keep the density, and you can keep pushing this out all the way up into your environments that require the high availability, but maybe not the granularity of configuration that production needs. So we've gone through a lot, but what does this look like compared to the initial setup? What's the actual change?
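As a concrete sketch of that co-location, here is what one merged instance group might look like in a BOSH deployment manifest. The job and release names follow cf-deployment conventions, but this particular grouping is an illustration, not a prescribed layout:

```yaml
# Sketch: related Cloud Controller jobs co-located onto one instance
# group instead of separate VMs. Grouping is illustrative; pick jobs
# that genuinely scale together in your foundation.
instance_groups:
- name: api
  instances: 1          # sandbox sizing; raise for high availability
  vm_type: small-custom # a right-sized vm_type from the cloud config
  jobs:
  - name: cloud_controller_ng
    release: capi
  - name: cloud_controller_worker
    release: capi
  - name: cloud_controller_clock
    release: capi
```

Scaling the HA variant is then just `instances: 2` (or more) on the same group, which is what keeps the density while restoring redundancy.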
Well, there's a big difference in terms of VM size, especially at scale. But let's put some actual numbers on it. We all like graphs, so let's see what they look like. This is what your cost looks like as your Diego cells scale out. What you'll notice is that between your high-availability setup and your smallest co-located environment, it's more or less a constant savings. You'll save a constant amount of money, and as your Diego cells scale all the way out, you're not necessarily saving more as you scale. That's because in those big environments, the cost of your Diego cells outweighs the cost of any of the overhead, because you've shrunk that all the way down. You're saving as much as you possibly can, and you're using all of the resources that you possibly can at any given time. So what are the drawbacks? There have got to be drawbacks to co-location, otherwise it would be the default. Well, the drawback is that it's difficult to scale individual components according to usage. For example, if you have one component of Cloud Foundry that's struggling, and you have it on a co-located box with seven other components that are totally fine, well, now you scale that out to a whole other VM and you're running way more components than you actually need to. So it's about figuring out which components logically work well together and need to scale together; that's how you strike that balance. The other thing is that scaling becomes based on the first component that needs to scale. It's the minimum viable unit, and from there, that's what determines when you need to scale, rather than something more granular and averaged. So that's pretty much it for co-location. But there's something I think a lot of people have heard about: spot instances. They're way cheaper. Why don't we just use them and immediately save a whole bunch of money?
So, for those of you who may not be familiar with spot instances: AWS is where the term really got coined. You bid for a price on a VM, and it's lower-cost capacity. You get instances as bids: you say, I want to pay this much, and as that price goes up, you may lose those VMs. They are not dedicated to you, but they're good for ephemeral workloads, and pricing varies based on demand. So let's say I'm running a foundry and you want to run one too; there may not be enough VMs for both of us, and you're willing to pay more, so I just lost my instances. If you lose an instance, you need it back, and that's what BOSH is really good at. You've got the Resurrector. If you lose a VM, the Resurrector comes in and says, wait a minute, that thing disappeared, let me try to bid for another one. And it will keep trying, and it will keep building that back out. You may experience some downtime, but for lab environments where that's okay, this can be a great option. Something similar is true on GCP, though GCP is a little bit different. There you have preemptible instances, which are a flat 80% discount. There is no variable price across regions; it's flat 80%. The real kicker is that preemptible instances are terminated after 24 hours regardless of use. You get 24 hours and then they're gone. But if you time that accordingly and have BOSH rebuild in an environment where you don't care, you're saving a flat 80%, and that's great for testing. It's great for ephemeral things: I want to push it out, I want to see if it works, it doesn't need to stick around for a long time. This is how you can save money on those ephemeral environments. So what are the gotchas? Well, with this type of bidding you're saving a bunch of money, but those instances can go away, and there may not be any instances left. So you have to have a use case where the environment can be ephemeral and downtime is okay.
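In BOSH terms, both options are requested through CPI cloud properties. A sketch, with illustrative values; the property names follow the AWS and Google CPI documentation:

```yaml
# Sketch: vm_extensions in a cloud config requesting cheap capacity.
# Apply to instance groups only in environments that tolerate VM loss.
vm_extensions:
- name: spot               # AWS CPI
  cloud_properties:
    spot_bid_price: 0.05   # USD/hour (example); VM is reclaimed if the
                           # spot market price rises above the bid
- name: preemptible        # Google CPI
  cloud_properties:
    preemptible: true      # flat discount; terminated within 24 hours
```

When the IaaS reclaims a VM, the BOSH Resurrector notices the missing instance and recreates it under the same properties, which is what makes this workable for ephemeral environments.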
And that's nice for upgrade strategies, where you just want to deploy something, deploy the upgrade, then tear it all down and make it go away. That's where this can really excel. And from here, it's pretty much on to the conclusion. Basically, the conclusion is that you need to look at your monitoring; that is the key to figuring out what you're actually using. A lot of times you'll build this thing and it will work, and it will work great, but you don't know what you're actually using. Or you may look at it and go, look at all this green, look at all the stuff that I have, look at all the capacity that I have. Well, if you've got an unlimited budget, run everything, scale to the wall, because you can. But in smaller environments, this will allow you to run a sandbox where you otherwise may not be able to. It will allow you to have those intermediate environments that don't need to run 24/7 but will still give you that peace of mind that, hey, this upgrade's going to work because I've tested it, before you go and roll it out to your developers, to your production environments, those kinds of things. The other thing is that capacity planning is important. Figure out with your app teams what they actually need, and look at that year over year. It's not something that you can just do once; it's something that you have to do constantly, a couple of times a year, or even a couple of times a quarter, depending on the volatility of your use case. And the last point: if you're in an environment where costs are really, really tight, scaling is what is either going to make or break you, because you're going to need to be able to scale in order to have that business. Being able to utilize your resources to the highest capacity possible will allow you to do things that you may not otherwise be able to do. That's pretty much all I've got. So if people have questions, I'd be happy to take them. Thanks for the presentation.
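For reference, the utilization views the talk keeps returning to could be fed by Prometheus recording rules along these lines. This is a sketch: the metric and label names assume the bosh_exporter from prometheus-boshrelease and should be adjusted to whatever your exporter actually emits.

```yaml
# Sketch: recording rules for platform capacity dashboards.
# Metric/label names assume the bosh_exporter; verify against your setup.
groups:
- name: cf-capacity
  rules:
  - record: cf:diego_cell_mem_percent:avg
    expr: avg(bosh_job_mem_percent{bosh_job_name=~"diego.*cell.*"})
  - record: cf:diego_cell_disk_percent:avg
    expr: avg(bosh_job_ephemeral_disk_percent{bosh_job_name=~"diego.*cell.*"})
```

Tracking these over a year is what gives you the "here was my peak" heuristics for the next sizing pass.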
You mentioned the BOSH Resurrector can be used to bid on prices in the spot market. Is that some type of common add-on library? How do you get started with that? So that's actually built into the native BOSH CPI config. It's in there under cloud properties: you can set that you want a spot instance, and you can set what you're willing to bid up to. I know AWS is changing some things with how bidding is going to work, and I'm sure the CPI is going to follow whatever those changes are. When you showed those statistics, was that for a production system? I have four different environments, right? So I'm pretty sure the lower environments are using the least amount of stuff. Is there some way that I could combine them, or combine the instances? Yeah, absolutely. Where I could be running my dev and test environments maybe in one container, or whatever you call it. Because I think what's happening is I'm getting charged for each one of those environments, and they're probably all underutilized, because I don't have the traffic in them like I do in my stage and production environments. That's where I'm at, is that right? Right, absolutely. So what you can do there is use that co-location, or the ability to size your resources according to what you need. For your sandbox and maybe your dev environments, where you're not getting all of that traffic, scaling your VM sizes down from the default can really net you some big gains, and looking at your monitoring and figuring out what you're actually using is really key there. Because if you don't know, you can't plan. So being able to look, then plan, then implement, then iterate, and having that cycle; the tighter that cycle is, the more tuned you can get down to the lowest cost you can possibly pay. Anyone else? Yeah, absolutely.
So using orgs and spaces instead of completely separate foundries, and planning that according to your security as well as your upgrade strategies, making sure that all of those line up. Yeah, that's really key in terms of planning your strategy according to your budget. Yeah. Yep. We've done some interesting things with shutdown scripts to manage that cost. If you shut down the BOSH director first and then the subsequent endpoints and components after that, and later you bring those back up and then bring BOSH back up, it allows you to save a little bit of cost in your lower environments too, if you're not worried about having them turned on all the time. Yeah, absolutely. There's one caveat there. Depending on the IaaS you're on, when you shut down your instances, you lose your ephemeral disks. And then when you bring your VMs up, they don't have any configuration on them, which means they won't come back happily. So just be careful which IaaS you're on, and look at the particulars. But yeah, that's absolutely a great approach on the IaaSes that support it, like vSphere, especially the on-prem ones. That's where it will really net you the biggest gains. So to what extent are the BOSH VMs auto-scalable? What do you mean by BOSH VMs? So, for example, auto-scaling Diego cells based off usage? So I have seen some work done with automation and metrics, where you can automate deployments and things like that. I don't believe BOSH has anything built in, but you can absolutely have a monitoring system that watches thresholds and then modifies manifests and pushes accordingly. The real trick there is to make sure that that's very well tested, especially as you start to get into upper environments, because if you run into a bug with your monitoring, which can happen, you don't want it to scale down all of your Diego instances and crash your foundry. That's the danger with automating that scaling at the platform level.
Okay. Is there anything that you're aware of, either open source or something we can use as a reference to get started there? In terms of that, I've mainly seen pipelines and internal tools. I don't know of anything that's publicly available on that side of things. Okay, thank you. All right, thanks a lot, Dave. Perfect. Thank you very much.