Hello, my name is Matt Ray and I'll be your host for today's session. I'm the Community Manager for OpenCost. The CNCF Silver Member and sponsor for today's talk is my employer, Kubecost. Today, I'm going to tell you about cloud cost monitoring with OpenCost and Kubernetes optimization strategies.

The complexity of operating Kubernetes efficiently is real, but you already knew this. Kubernetes brings technical complexity, with higher-level abstractions moving you further from the underlying resources. These abstractions provide a standard interface for building and deploying software, but there are a lot of moving parts, and engineering teams have to deal with lots of new concepts. There are tons of open questions: Where has our cloud spend been going? What are the cloud costs associated with building our product? How can we optimize our dev clusters? Who is the owner of this performance problem? And how do we know our team is provisioning only the resources we really need?

Luckily, help is available. OpenCost is an open source CNCF sandbox project for measuring and allocating infrastructure and container costs in real time. Built by Kubernetes experts and supported by Kubernetes practitioners, OpenCost shines a light into the black box of Kubernetes spend. OpenCost is both a project and a written specification for modeling current and historical Kubernetes spend and resource allocations. You can view your Kubernetes costs by service, deployment, namespace, and much more. Right now it supports Kubernetes clusters on AWS, Azure, and GCP through their on-demand pricing, and on-premises Kubernetes clusters are supported as well. OpenCost is the engine for the Kubecost commercial product.

I'm talking about OpenCost because it's a tool for digging into what's going on in our complex and dynamic Kubernetes infrastructure. We want to be able to slice and dice our cloud bills by Kubernetes primitives, seeing who's using what and which services are costing the most. This will be the foundation for future optimization.

The OpenCost spec is a vendor-neutral open source specification for measuring and allocating infrastructure and container costs in Kubernetes environments. The original specification came together with input from a wide range of Kubernetes practitioners, among them Adobe, AWS, Google Cloud, Kubecost, Red Hat, and SUSE.

First, we start with shared definitions. Total cluster costs are everything associated with our Kubernetes deployment, and they break down into direct and indirect costs. Indirect costs are cluster overhead costs, measuring the overhead required to operate all the assets for a cluster; for example, the cluster management fees you might see in your AWS or Azure bills. Direct costs further break down by allocation and usage: allocation costs are generally things with an hourly rate, while usage costs are billed by the amount consumed, such as storage, jobs, or the number of units processed. The asset costs are the individual nodes and overhead that make up our Kubernetes cluster. Resource allocation and usage costs generally cover the node's CPU, RAM, operating system, and potentially GPUs; persistent volumes and their associated file systems; and possibly load balancer and network egress costs as well. This is a simplification; each of these breaks down much further in the OpenCost specification.

An example of an asset cost would be a node, where the CPU allocation cost relates the cores, the duration of use, and the price to a total cost. RAM allocation costs are calculated the same way. When a Kubernetes node is deployed on your cloud, it now has a total cost.
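To make that concrete, here's a back-of-the-envelope version of the spec's allocation math for a single node. The core count, RAM size, and hourly prices below are illustrative numbers, not real cloud rates:

```
CPU cost  = cores × hours × price per core-hour
          = 4 × 24 × $0.0425          ≈ $4.08 per day
RAM cost  = GiB × hours × price per GiB-hour
          = 16 × 24 × $0.0057         ≈ $2.19 per day
Node cost ≈ $4.08 + $2.19             ≈ $6.27 per day
            (plus any GPU, persistent volume, load balancer, and egress charges)
```

On-demand prices like these come from the cloud provider's pricing APIs, which is how OpenCost turns raw node metrics into dollar figures.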
Workloads are the actual applications and containers allocated and running on our Kubernetes nodes: the containers, pods, deployments, persistent volumes, et cetera. These are calculated by querying the Kubernetes API, cAdvisor metrics, the kernel, and billing data. They may be usage costs or allocation costs, but they are directly managed by the kube-scheduler. We want to measure these at the lowest level possible so we can track this data along any dimension.

Any unallocated costs on the Kubernetes cluster are cluster idle costs. You're paying for them, but not directly using them to run your application. When we talk about idle costs, we're talking about capacity that has been requested but is unused. Sometimes this is called waste, but it's impractical to expect a cluster to be 100% allocated. Most Kubernetes usage is dynamic and some headroom is needed, so we want to be as efficient as possible without spending too much. Workload costs are committed, allocated costs; this is what's happening inside your Kubernetes cluster. They've been requested from the Kubernetes cluster, so you're paying for them.

From the cloud provider billing API, we get the numbers on the left: the raw metrics that you're paying for. OpenCost allows you to view these costs by the workload aggregations on the right. You can see CPU usage by labels, GPU by deployments, however you want to query your costs. You can also stack these, so you could see things such as container per namespace.

Let's take a look at the OpenCost UI. This is a fresh deployment of the OpenCost UI; it's been collecting data for a while, and this is the first page you're going to see. We're looking at the breakdown of the last seven days by namespace. This is a rather small cluster, just two nodes on EKS on Amazon. It's been running for several weeks, but right now I'm just looking at the last week; I could go to just yesterday if I wanted. We can also view by namespace: here we see I've got three namespaces, kube-system, opencost, and prometheus. I can also look at individual controllers. Looking at last week by the pod data, we can see how efficient or inefficient our workloads are by CPU and RAM, and how much they're costing in total. We can also download the results if we need to.

The OpenCost architecture is fairly straightforward. OpenCost uses Prometheus as both the source and destination for data. A Prometheus node exporter runs on each node in the cluster, exposing node metrics, and OpenCost writes its cost data out to the Prometheus data store. The OpenCost service queries the cloud provider's API for the cost of each service used. The OpenCost UI or API may be used to query the information stored in Prometheus's time series database.

Deploying OpenCost is fairly straightforward; it's a relatively simple service. OpenCost relies on metrics scraped by Prometheus. For an express installation of Prometheus, we use the prometheus-community Helm chart, though you can use an already-existing installation if you want. To deploy OpenCost itself, the current recommended installation method is to simply kubectl apply the YAML manifest. If you prefer a Helm chart, the community has recently created one and started exposing all the configurable options. It's under very active development, so if there's something you need, be sure to join in with feedback.
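As a rough sketch of what that looks like in practice, the commands below follow the quick-start pattern from the OpenCost docs; the namespaces, URLs, and flags may have changed since this talk, so treat them as illustrative and check the current documentation:

```sh
# Install Prometheus from the prometheus-community Helm chart, with
# OpenCost's extra scrape config so Prometheus collects its metrics
helm install prometheus --repo https://prometheus-community.github.io/helm-charts prometheus \
  --namespace prometheus-system --create-namespace \
  --set prometheus-pushgateway.enabled=false \
  --set alertmanager.enabled=false \
  -f https://raw.githubusercontent.com/opencost/opencost/develop/kubernetes/prometheus/extraScrapeConfigs.yaml

# Deploy OpenCost by applying the YAML manifest
kubectl create namespace opencost
kubectl apply --namespace opencost -f https://raw.githubusercontent.com/opencost/opencost/develop/kubernetes/opencost.yaml

# Port-forward to reach the API (9003) and UI (9090) locally
kubectl port-forward --namespace opencost service/opencost 9003 9090
```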
Once you've deployed OpenCost, you'll start collecting data almost immediately. Showing costs by any dimension enables action, and fitting allocation, tagging, and mapping into developer workflows is very powerful. Cluster cost efficiency is a great starting point for Kubernetes cost optimization; we'll get back to that in a second.

The OpenCost API exposes cost allocations for Kubernetes workloads and the cloud infrastructure supporting them. There's a Swagger JSON in the OpenCost repository, and we're fleshing out additional API documentation now. You've seen the web UI. kubectl cost is a CLI plugin for interacting with the OpenCost API. And if you really want to, you can always access Prometheus directly.

The OpenCost community continues to expand, and we're always eager to help new folks. We're mostly in Slack and GitHub, with fortnightly OpenCost working group calls. There's also a Google group and LinkedIn if you're so inclined. There are links there for the OpenCost calendar and the OpenCost meeting notes if you want to get involved in the OpenCost working group.

So now that we've got these numbers, what are we going to do with them? Let's talk about Kubernetes optimization strategies. The first step is to make sure you have a FinOps practice. FinOps is an evolving cloud financial management discipline and cultural practice that enables organizations to get maximum business value by helping engineering, finance, technology, and business teams collaborate on data-driven spending decisions. The FinOps Foundation, part of the Linux Foundation, provides guidance on cloud financial management through best practices, education, and standards. Their FinOps Framework is a set of organizational recommendations for building your FinOps practice. While it's not focused on Kubernetes, these guidelines help establish patterns for applying those cloud cost savings.

In large organizations, cloud costs are rarely isolated to a single team. Many of the cost optimizations at the cloud account level are more effective when centrally managed by a cloud operations finance team, given their visibility into consumption patterns across the organization and their ability to quantify opportunities for savings. You want to have folks on the same page when it comes to optimizing your cloud and Kubernetes usage.

We take a top-down approach to optimizing Kubernetes infrastructure. Improving the efficiency of containers affects the efficiency of pods, which in turn affects the efficiency of clusters. Once you've started optimizing your Kubernetes infrastructure, you can investigate higher levels of cost reduction like reserved instances and other commitment-based savings. This is an iterative process; you're never quite done.

Workloads are the applications running on your Kubernetes cluster. Workloads may be pods, Deployments, ReplicaSets, StatefulSets, DaemonSets, Jobs, or CronJobs. Optimizations at this level relate to the resource requests the workloads make of Kubernetes itself, and whether those workloads are still in use. When specifying pods, containers may be assigned requests and limits for resources such as CPU and memory. If these are not assigned or are over-provisioned, containers may be allocating more resources than they actually need, costing you money. Under-provisioned containers may cause CPU throttling or out-of-memory errors, leading to poor performance.
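To make the requests-and-limits idea concrete, here's a minimal sketch of a pod spec with both set. The workload name, image, and numbers are hypothetical; in practice they should come from measured usage:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server            # hypothetical workload
spec:
  containers:
  - name: api
    image: example/api:1.0    # placeholder image
    resources:
      requests:               # what the scheduler reserves (the allocation you pay for)
        cpu: 250m
        memory: 256Mi
      limits:                 # hard caps; exceeding the memory limit gets the container OOM-killed
        cpu: 500m
        memory: 512Mi
```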
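To check whether those requests line up with what containers actually use, the kubectl cost plugin mentioned earlier can pull the numbers from the OpenCost API, or you can query the API yourself through the port-forward. A sketch, assuming the plugin is installed; flag names and API paths may vary by version:

```sh
# Cost breakdown per namespace over the last week
kubectl cost namespace --window 7d

# Or hit the allocation API directly
curl -G http://localhost:9003/allocation/compute \
  -d window=7d -d aggregate=namespace
```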
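And if containers in a namespace tend to ship without any values set, namespace-wide defaults can backstop them. Here's a hedged sketch of a LimitRange (the namespace and values are hypothetical); I'll say more about this in a moment:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
  namespace: dev              # hypothetical namespace
spec:
  limits:
  - type: Container
    defaultRequest:           # applied when a container sets no requests
      cpu: 100m
      memory: 128Mi
    default:                  # applied when a container sets no limits
      cpu: 200m
      memory: 256Mi
```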
There are several options for working with these recommendations, and there may be valid reasons for not adjusting requests and limits, whether with respect to available resources or to ensure pods are never constrained in their performance. To ensure that all containers in a namespace have default CPU and memory requests and limits, you may provide a LimitRange, like the one sketched above. You may provide limits and/or requests for CPU and/or memory as necessary. Containers with limits and requests provided in their manifests will work as expected, but defaults will be provided for containers not explicitly setting those values. Container utilization is fairly straightforward to measure and verify with OpenCost; all the data is captured and exposed via the API, CLI, and UI.

Optimizations at the Kubernetes level relate to the configuration of the Kubernetes cluster itself. When the Kubernetes cluster is over-provisioned or using inefficient node sizes, there are opportunities for improving the effectiveness of workloads on the cluster. An over-provisioned Kubernetes cluster may have nodes that are primarily idle, making them strong candidates for reducing the number of nodes in the cluster or decreasing the resources allocated to them. Workloads may be redistributed across the cluster to ensure availability while reducing overspending on unused capacity. There may be situations where the capacity of the cluster's nodes is over-provisioned, but a minimum number of nodes is still required for workload durability. In this case, it's recommended to deploy smaller cluster nodes and then drain the over-provisioned nodes in favor of their more efficient replacements. Once nodes have been removed from the cluster, the cluster should continue to operate more efficiently, and additional nodes may be added later in response to increased workloads. OpenCost makes it easy to investigate your cluster's nodes for efficient utilization.

Operating systems: these are changes at the operating system level, below the Kubernetes control plane. Once again, always look for resources associated with your cluster that are unused. Check the allocations of the machines you've provisioned for hosting your Kubernetes cluster. Orphaned resources may still show up in your cloud bill, and we don't want to keep paying for those. Resize local disks with low utilization, and see whether you can launch new nodes with smaller disks on the next node turnover. There are many use cases where Arm CPUs are highly capable replacements for x86 processors, with significant cost reductions. Not every workload is migratable, but most Kubernetes and open source tooling is Arm-compatible at this point. OpenCost makes it easy to compare workloads across nodes with different architectures.

Cloud infrastructure optimizations: now this is where you should probably talk to your FinOps team. These savings are generally most useful when you have a holistic view of all your organization's cloud usage and can make informed decisions. OpenCost works with on-demand pricing, so it won't reflect these kinds of savings.

So in summary, remember: this is an iterative process. There are no silver bullets, but there may be low-hanging fruit. As you work through optimizations at each level, there will continue to be savings opportunities below. Improving the efficiency of containers affects the efficiency of pods, which in turn affects the efficiency of your clusters; coordinate your savings across the org.

Thanks. If you have any questions, please contact us at opencost@kubecost.com.