[inaudible] …are at different levels of cloud native maturity. But what that also means is that we get a peek behind the scenes of many, many different companies. We can see what works and what doesn't work, and the problems that other companies have faced and are facing. This year, many of our conversations have been primarily around cost. We've had questions like: our cloud bill is too expensive, how do I reduce it? [inaudible]

How many of us have been here? Monday morning, and we get that dreaded email from the CFO asking just why our cloud spend has increased by nearly 75% over the last three months. [inaudible]

When our conversations focus on value rather than cost, we think more broadly than a single dollar figure or an amount. Our focus isn't just on what we are spending, but on the return on investment for our spend. Now, that's not something to understand overnight. It takes a lot of time to build those metrics that tie our spend to value. The typical examples that many businesses will look at are cost per click and cost per user.
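To make that concrete, here's a quick illustrative example (the numbers are made up, not from the talk): if a service costs $12,000 a month to run and serves 40,000 monthly active users, its unit cost is 12,000 / 40,000 = $0.30 per user per month. Watching how that figure trends over time tells us far more about value than the raw bill alone.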
But as we change our language and find metrics that work for our business case, it enables a shift in focus. That also helps us to think differently about how we choose to invest our spend. For example, a reduction in over-provisioned development environments can be reinvested in improving the resiliency or latency of critical production services. So, a shift of focus from savings to value. But do not fear: this does not mean we need to abandon all hope of greater cost efficiency. Part of cost optimization is avoiding costs that we don't need and getting a good discount on the things that we do. One way to start thinking about this is reducing cloud waste. If some areas of my spend are more valuable than others, then where am I wasting money and resources?

Many of us will have heard concepts such as FinOps, but where does that fit in? It depends who you ask, really. Some people strongly advocate that cost optimization is about savings and that FinOps is very much its own thing. But the FinOps Foundation itself lists cost optimization and FinOps as different words that mean the same thing. We won't go into great detail on that today. But whichever term you use, there's a shared focus on business value: reducing what you use, and finding discounts on what you do spend. When we talk about FinOps in particular, we're talking about a shared cultural practice, a focus on collaboration between different business personas, and using data to drive our cost decisions. There's a graphic on screen, and there are many more resources on the FinOps Foundation's website. So, cost optimization: we are shifting our focus from savings to value, a focus on getting that value at the lowest possible price point, and we can lean on FinOps practices to bring this into our organizations.

Great, now where do we actually start? I'm going to say it again: not with savings. Why am I hammering this point home so much? Well, if you need any more convincing, then these words from Google Cloud give a clear warning of the risk we take when all our focus is on cutting costs: "If you're feeling intense pressure to save on cloud costs you may be inclined to take drastic optimization decisions. Don't. Making decisions without visibility and context into cloud spend can create unnecessary chaos. It may compromise productivity and contribute to a subpar customer experience." In other words, if we focus only on cutting costs, it can cause real harm to our business, our speed of development and our reputation.

Visibility and context are key. Making decisions about our cloud spend without first understanding it, observing what it is doing, is like playing a strange and really expensive game of whack-a-mole, or Jenga, with your cloud bill. The alternative is to invest time in visualizing your cloud spend, and the FinOps Foundation specifically calls this out as part of the inform phase of their cycle. In its most basic form, many of us may already be doing this without realizing it: periodically reviewing our cloud spend in cost management or billing dashboards. That gives us a starting point for identifying cost spikes and trends, but it's usually a manual and reactive process. All too often it may only be visible to privileged members of the platform team, or even management. So how do we go about getting this visibility in our teams? It starts with cost allocation, and what cost allocation means is taking our cloud bill and splitting it up amongst the different teams that use the cloud.
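As a minimal sketch of what that allocation metadata can look like in Kubernetes (the team, cost centre and namespace names here are hypothetical), a namespace might carry labels identifying ownership, which cost tooling can then use to split the bill:

```yaml
# Hypothetical example: namespace labels used to attribute spend to a team.
apiVersion: v1
kind: Namespace
metadata:
  name: payments-dev
  labels:
    team: payments            # who owns the resources running here
    cost-center: cc-1234      # which budget the spend rolls up to
    environment: development  # lets us compare dev vs prod spend
```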
To get this information, it starts with good organization: that might be our AWS accounts, our Google folders and projects, our use of namespaces. Once we've got these good groupings, we need lots of metadata, so tags and labels, so that we know ownership and which application is using those resources. This empowers teams to make decisions proactively, rather than waiting for that awkward reactive email. It is most useful when it's as near real time as possible and familiar to software engineering teams. So perhaps we use Grafana dashboards, notifications in Slack or Teams, or we go down the more dedicated cost observability tooling route. Now, once we've got this information, we need to show it. It helps to visualize cost per application and per team, efficiency and utilization, but perhaps the most powerful metric we've seen to display to our teams is one we've already talked about: cloud waste.

To illustrate this point, here is David's bestie, Barney the dog, getting very comfortable on the sofa. He's here to illustrate the point that we get very comfortable with what we spend. In a non-technical confession: several years ago I got really into Breaking Bad, so obviously I got a Netflix subscription. Then a new series of Game of Thrones came out, so I needed a subscription to HBO. Before I knew it I had quite a few of these subscriptions dotted around, some I used, some I didn't, but I got used to spending money on them. It was only one day when I looked at my bank account that I realized I was wasting my money in some cases, because I hadn't used them for months. How does this bear any relevance to the cloud? Well, in the same way, we get used to a base level of spend: this is what we normally spend every month, and if that stays stable we're doing well. But what's lurking behind the scenes? Kubernetes clusters running at as low as 10% utilization in production. Hundreds or even thousands of persistent disks lying there unattached. Wasted resources. Surfacing metrics around cloud waste is a really powerful motivator for teams, because we don't like to be inefficient. I first heard this mentioned by Andy Bergin, who's done some great talks on FinOps, including at PlatformCon earlier this year. Cloud providers have tooling that helps to surface this waste, so disused IPs, unattached persistent disks, workloads with low utilization, making it easier for us to identify and remove them.

So, cost allocation, and then we present information on efficiency, waste and value. There is actually something quite important missing. I don't know whether anyone's noticed what it is, but maybe our first confession gives you a clue: the absent alert. So you wake up in the morning, Monday morning, before your coffee, you see that Slack has lit up with notifications, and the QA team are reporting that their test environments aren't working. You start investigating and find nothing is working at all. You continue searching for the root cause and notice production isn't looking healthy either. Panic. You notice that this has been going on since Friday afternoon, and there are now 16,000 deployments and over 100 extra nodes in your cluster. It's like, how did anybody not notice that? Here's a graph to explain the problem. So clearly we had metrics in place to tell us that there was a potential issue, but we simply lacked the alerting. Had we had alerting in place, such as the one on screen now, would we have been able to identify that sooner, reducing the wastage of those 100 extra nodes? And the 16,000 images being pulled; it all adds up.
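The rule on the slide isn't captured in the transcript, but an alert of that kind might look something like this minimal Prometheus sketch (the expression and threshold are illustrative assumptions, not the exact rule shown):

```yaml
# Illustrative Prometheus alerting rule: fire when the node count grows
# unusually fast. The 50% threshold and 1h window are assumptions; tune
# them for your cluster. kube_node_info comes from kube-state-metrics.
groups:
  - name: cost-anomalies
    rules:
      - alert: NodeCountSpike
        # Compare the current node count with the count one hour ago.
        expr: count(kube_node_info) > 1.5 * count(kube_node_info offset 1h)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Cluster node count has grown by more than 50% in the last hour"
```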
Well, we want alerting rules around those anomalies and unexpected cloud costs, and in many cases such anomalies might not be directly spend related. An unusual increase in the number of replicas, or extra nodes in the cluster, is a clear indicator, and definitely something to be thinking about in your alerting plan. At a basic minimum, we should be creating alerts for spending significantly more compared to our previous month, or for a sudden uptick. Cloud providers have given this a clear focus over the last 12 months, but you can also do this with Prometheus and Alertmanager, or a managed offering such as Datadog. There are many open source tools, as I was mentioning, and other free tool tiers. These can do the heavy lifting for you, talking to AWS, Google and Azure and getting that information to give you a true value, so you don't have to do the heavy work. Some examples: Infracost can be put into your CI/CD and gives you really good insight into how a change is going to affect your spending. Kubecost gives you a great observability platform, including cost alerting and even governance to make sure that people are not overspending their budgets. OpenCost is a great tool as well, and it powers Kubecost at its core. Many of us will probably already be using Prometheus; it's a well-established observability tool, and all of these tools hook into it or use it as a back end. And we've got Crane as well, which creates a platform directly around FinOps. We'll cover some of these today.

So, our second confession: worker node sizing. During a routine review of the platform, you happen to notice that the cluster in dev has over 100 nodes. You review the number of applications within each environment and discover you're running one replica per node. I'll repeat that: one replica, one pod, per node. If that is what we are doing with our bin packing, something has probably gone seriously wrong. Now, this could be the result of all sorts of things. Maybe some seriously unfortunate pod anti-affinity, maybe we've got very small node sizes, maybe we've been very ambitious with our workload sizes. Whatever it is, it brings us very nicely to our next theme, because it's time to talk about right sizing; we might have heard quite a lot about this this week.

One of the original selling points of the cloud is that you pay for what you use. But in reality, this is not quite so simple. In terms of compute, we pay for what we provision, and our Kubernetes workloads are scheduled based on the CPU and memory that we request, regardless of our actual utilization. This means that, from a cost optimization perspective, we need to get the size of our workloads and nodes, of everything, just right. A really interesting report that came out earlier this year was Google's State of Kubernetes Cost Optimization report. What do you think, after studying many clusters, they said is the first thing that we need to do? We have to go all the way back to basics, to our container requests and limits. As many of you may know, when you define a pod, it might look something like this. For each container, we can set requests and limits. But do we all, really? And how often are they omitted entirely? Best practice states that we should set these as our baseline, or standard operating values, to perform at the given level that we require. These values are critical during scheduling to find the most applicable node to execute our pods.
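The slide itself isn't in the transcript, but a pod with the requests and limits described in the next section might look something like this sketch (the pod and image names are hypothetical):

```yaml
# Sketch of the kind of pod definition described: requests are what the
# scheduler uses to place the pod; limits cap what it may consume.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
    - name: app
      image: example-app:1.0   # hypothetical image
      resources:
        requests:
          cpu: 250m       # 250 millicores
          memory: 100Mi   # the "100 meg" mentioned
        limits:
          cpu: 2500m      # 2.5 cores
          memory: 1600Mi  # roughly the "1.6 gig" mentioned
```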
In the given example, we're asking for 250 millicores and 100 meg of memory. But we can also set those limits as well. There's been a tremendous amount of contention recently within the community about setting limits, but when it comes to cost optimization, limits are a great resource to set. They prevent our workloads from consuming resources indefinitely; without them, a greedy container can cause your other workloads to be evicted, resulting in instability and excessive nodes within our cluster. In this example, we're capping at 2.5 full cores and 1.6 gig of memory. So capping our maximum usage, whilst maintaining those requests, is critical.

All that being said, whilst getting our container requests and limits just right reduces waste, it's worth mentioning that our resources do still need to be enough for our workloads to run, and run well. In a well-meaning attempt to cut costs and inefficiency, we can actually end up introducing risks, and everyone's seen an OOMKill, right? In the same way, rushing to size down our nodes can be a huge risk. If our workloads do not have any requests or limits set, they can be evicted at any time, in any order, and that's where quality of service classes come into the mix. We won't go heavily into those, but we highly recommend understanding the right quality of service class for each of your workloads and what that means for their reliability in production.

But there is more than one way of optimising too far here. Whilst container requests and limits might be the more obvious one, there's another danger, illustrated by the confession of the diligent time sink. When I was newer to the industry, I received an email from my CTO asking us to reduce our monthly Kubernetes cloud spend. Determined to find the most cost-effective solution, I invested time in a detailed analysis of different node pools and costings, diligently answering questions, including the possibility of migrating to a more managed alternative. In the end, the salary cost of the time I spent on that activity probably cancelled out the savings I made for the first year. No one seemed to mind, because the cloud bill went down, but I'd approach it very differently now.

This sounds like right sizing, not done right. What went wrong here? Where's the visibility? Solutionising straight away, by reducing the size of nodes or changing the node type, misses the opportunity to do better with over-provisioned workloads. As this person later realised, the time we invest has its own cost too. Only once we've applied our right-sizing learnings across our platform should we then update our machine types to better fit our workloads. Using that information is critical to ensuring that we don't over-optimize, and that we maintain a reliable, performant, stable platform. Whether our applications and workloads are CPU-intensive, memory-intensive or a mixture can play a critical role in selecting the correct family of instances, and we can also use that data to choose the best instance sizes from the family available.

If we're thinking about size, we should also be thinking about scaling, particularly safe opportunities to scale down, or to scale in, when demand is low. For example, at a cluster level, if we have teams that are just across a couple of time zones, then do our non-production clusters need to be running 24/7? Scaling down outside of office hours can save considerable cost, as long as we do it in a reliable way. It can be as simple as a Kubernetes cron job.
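Here's a minimal sketch of that cron job approach (the names, namespace, schedule and service account are assumptions for illustration):

```yaml
# Illustrative CronJob that scales every deployment in a dev namespace to
# zero at 19:00 on weekdays. A matching CronJob with a morning schedule and
# --replicas=N scales back up. The service account needs RBAC permission
# to patch the deployments' scale subresource.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-out-of-hours
  namespace: dev
spec:
  schedule: "0 19 * * 1-5"   # 19:00, Monday to Friday
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scheduled-scaler   # hypothetical SA
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command: ["kubectl", "scale", "deployment", "--all", "--replicas=0", "-n", "dev"]
```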
Giving the right RBAC to a service account, we can scale down out of hours and then scale back up again. It can save a significant amount of cost. As mentioned, there are a number of tools out there to achieve this; pick the solutions that work for you and your workloads, because you know them best. VPAs (Vertical Pod Autoscalers) can be a great tool to reduce, and give recommendations on, CPU and memory requests and limits. But there are other great tools out there as well. Fairwinds' Goldilocks gives you a really great visual representation of those VPAs, and also helps with the workloads where you don't set them yourself. There's the HPA to scale based on CPU or custom metrics, and KEDA provides a really powerful event-based scaling solution. There are also some great features coming along in Kubernetes 1.27: resizing CPU and memory resources in place, without restarting the container. That's a huge step from our existing VPA implementation, which does need to restart pods.

Finally, application profiling. We really need to truly know what and where our application workloads are using resources; profiling can provide significant insight into their inner workings and give engineers a key focus for improvement. For example, here we can see a cluster where we've set our own VPAs on the services that we want to manage. But what about the other workloads in the cluster? We can't see how well optimised they are. If we deploy Goldilocks to that cluster, it will do a really good job of finding all of the services that we run, and we can clearly see that the cluster's usage, and its waste, could be significantly reduced. Here's what Goldilocks can provide us. It not only gives us a great visual representation of where we can scale various resources up and down, but it also empowers those engineers that are maybe not too familiar with this; we can get example YAML out of it for the various quality of service classes.

We've done a whole load of work on optimising our workloads. How do we keep these things in check, and how do we help our engineers not to stray too far from that path? Confession 4: the Wild West dev project. I spun up a GKE cluster to test something that Minikube or kind could probably have handled, but maybe didn't really fit the bill for. I did my testing over a few days or weeks. Obviously I didn't scale anything down when it wasn't in use, and then my brain moved on from the matter entirely. Four months and $700 later, I finally discovered that the cluster, and a bunch of detached persistent disks, were still provisioned, lying there idle. The best part? My org, to this day, still hasn't asked a single question about it.

What could have helped here? There are several things. Yes, we could talk about alerting, we could talk about some other things, but we could also introduce some guardrails. Guardrails done well make it more difficult to make expensive mistakes and to leave resources like testing environments just hanging around unused. For example, we could create a registry of Terraform modules, or your other infrastructure-as-code tool of choice, that bake in cost optimization and other best practices for teams to consume: things like preferred compute and preferred ways of doing our networking. We can also offer tooling to automate scaling down, or periodically deleting, resources in non-production environments. One great example of this is Infracost, which has a really nice feature leveraging OPA for policy enforcement.
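As a rough sketch of what that CI integration can look like (the workflow layout, version tags and paths here are assumptions; Infracost's own docs cover the supported setup and OPA policy configuration), a GitHub Actions job might compute a cost diff on each pull request:

```yaml
# Illustrative GitHub Actions job running an Infracost diff on pull requests.
name: infracost
on: [pull_request]
jobs:
  cost-diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.base.ref }}  # base branch first
      - uses: infracost/actions/setup@v2
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}
      - name: Baseline cost from base branch
        run: infracost breakdown --path=. --format=json --out-file=/tmp/base.json
      - uses: actions/checkout@v4   # now the PR branch
      - name: Cost diff for this PR
        run: infracost diff --path=. --compare-to=/tmp/base.json
```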
This not only makes the cost of infrastructure more visible to our software engineers, but it can also fail PRs where there are large changes in spend. That's at the infrastructure layer. At the Kubernetes layer, we can put in place admission controls like resource quotas, to restrict resource usage across a namespace, or limit ranges, which can be used to restrict workload resources or set sensible defaults.

Picture the scene: we've been through all of this as engineers, creating and providing useful cost visibility for our engineers and for the business. We've engaged in some cost avoidance by reducing waste and doing some serious right sizing. But what about from a business and finance perspective? Remember our CFO email? As well as the excellent tooling that we've created, we probably need to work with them some more, and we need their help in the form of some rate reduction. Because everyone loves a discount, and the world of cloud is no exception, with providers offering different discount strategies, which means that we can get the services we want at a lower rate. But a word of caution: not all discounts are suitable for all use cases, and we could end up damaging the reliability of our services, or paying for things that we don't actually use. There are a lot of things that we can do here, but we are going to focus on two strategies: spot instances and commitments.

Here we can see prices for spot instances across three of the largest cloud providers: AWS, Azure and GCP. Hopefully you can see that, generally, you can get almost a 90% reduction compared to on-demand prices. The catch, though, is that those instances are cheap because they can be reclaimed at any time, and very quickly. So they're only suitable for fault-tolerant workloads that can handle those interruptions. We've seen over the last six months, maybe a bit longer, that these instances have successfully reduced the cost of development environments, and in some cases achieved some great reductions in production, where those workloads can actually tolerate the terminations.

But what about when we need reliability? This is where commitments come in. Here we're using Google Cloud as an example, and the idea behind commitments is that you commit to a fixed-term spend in order to get a discount on those cloud resources that you need. Typically you have a few choices of how long you want them and how flexible you need those commitments to be. The longer the commitment, and the less the flexibility, the higher the discount. To get the most out of these discounts, we need to understand our current and future resource usage. However, that becomes a very complex topic quickly. These sorts of discounts are best managed centrally, by a team who can leverage and apply them across a larger cloud estate and monitor their usage regularly.

There you have it: cost optimization in roughly 30 minutes. Visibility, right sizing, scaling and discounts. Where do we look next? Well, I've already mentioned it, but Google Cloud's State of Kubernetes Cost Optimization report came out earlier this year and it's a really interesting read. The book Cloud FinOps gives an in-depth picture of cost optimization and FinOps principles and practices, and the FinOps Foundation is a great resource for training, certification and a comprehensive bank of resources. And if you'd like to know a bit more about our work on cost optimization, we've been doing that under a large banner that we've termed fleet ops.
And if you'd like to hear about some of the things we're working on and thinking about, then you can find our latest blog here. So that's it. We are Jetstack Consult. If you're looking for consultancy, training or advisory services, or you just want to talk to us about Kubernetes, we'd love to hear from you. If you have questions, comments or feedback, we would love to hear those too. There is a lovely feedback link where you can tell us what you thought about this talk, and we really do want your feedback on that. We hope that you've had an awesome KubeCon, and thank you for listening. We've got about two to three minutes for questions if anybody's got some, and there are two mic stands there. Don't be afraid. Perfect. That's all right. I can't even tell the time anymore; it's nearly 4:35. But thank you very much for coming. Thank you.