[Mic check] Test, test. Yep, this is much better. Gerardo, I want you to try it — you'll probably just have to hold it like this and talk. Testing, one, two, three. We're just waiting for a few more folks to trickle in and then we'll get started.

All right, I think we're going to get started. I'll give just a quick intro on our speakers. We'll be talking about Cilium and observability. We've got Gerardo Lopez-Valcon and Fabrizio Segura — I hope I didn't butcher the names. Gerardo is a Google Developer Expert based out of beautiful Costa Rica. He's also an organizer of Cloud Native Costa Rica as well as KCD Costa Rica, and he's interested in blockchain technology, cold chain, and pharma apps.

Hello, people. Good afternoon. I want to be closer to you, so I will be over there. Nice to meet you again — it's a big pleasure to be here, thanks for coming. How many of you have experience with Kubernetes? I see the faces; I see sad faces. I hope this session is useful for you. Today we want to explain how to elevate security and observability with Cilium. This is the agenda: a little bit about eBPF, what Cilium is, what technology we can use to set up the environment and start to troubleshoot Cilium, and a couple of demos on Cilium observability and Cilium security. So again, I'm Gerardo Lopez — it's a big pleasure to be here. I'm a Principal Engineer at Veritas Aromata, a superfan of soccer, movies, and BBQ, and my colleague is Fabrizio Segura, Chief Engineer. Again, very happy for this chance. OK, thank you.

Let me give a little context. In the landscape, everything around Kubernetes microservices is a big topic. Focusing on networking, security, and of course network observability in Kubernetes is paramount due to the exponential growth in usage in production environments, thanks to Kubernetes. The higher complexity of distributed architectures means we need to pay attention to different things: first, performance issues; second, possible vulnerabilities. That is the reason we need to enhance security and improve observability — it allows us to identify and mitigate possible vulnerabilities and performance issues. Practically, we can detect and resolve network and performance problems, and this contributes to maintaining the integrity, reliability, visibility, and availability of applications deployed on Kubernetes. So again, working with a distributed architecture is a big challenge. It's beautiful — you can do many things — but we have challenges to resolve in the backbone: the network. In the next slide we can explain the main topics that we are using, and, why not, you can join us in this adventure.

So what is eBPF? eBPF is an interesting technology. It's an open-source, flexible framework that, in simple words, lets you dynamically insert snippets of code into the Linux kernel in a safe and secure manner. You don't need to modify the kernel at all.
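(As a minimal flavor of what "inserting code into the kernel without modifying it" looks like in practice — this assumes the bpftrace tool is available, which compiles such snippets to eBPF and loads them through the kernel verifier:)

    # Trace every file opened on the host, live, without touching kernel sources
    sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s -> %s\n", comm, str(args->filename)); }'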
You add the code dynamically at runtime and, in simple words, add new functionality. Overall, eBPF is a versatile technology that empowers developers and administrators to extend and enhance the capabilities of the Linux kernel in a safe and efficient manner, opening up new possibilities for performance optimization, observability, and, of course, security enhancements.

In the next slide we can see the different use cases people are building on eBPF. We can start with three important pillars that we believe every application needs. First of all, networking — it's very important, and right now there are open-source and private projects incorporating network policies, enforcing the network, and of course security itself. So networking, security, and observability and tracing: those are the pillars eBPF covers in the cloud-native ecosystem. If we go to user space, you will see we have projects and SDKs. The most important projects: BCC, of course — you can keep writing code thanks to BCC and load it into the kernel — plus Cilium, Falco, Katran, and Pixie. There are a lot of other projects we could use, but if you look at the cloud-native architecture right now, these are the ones people are using. About the SDKs, we have an important range of technologies. How many of you have experience with Golang? C++? Yeah, it's normal, no problem. Rust? OK, Rust. Those are the SDKs eBPF has. Why? It's totally normal, because these programming languages are focused on the low level. For that reason, different SDKs — and probably in the next few years new programming languages will appear on this list. Of course, we can optimize applications using tracing, profiling, and monitoring. About the kernel runtime, we can use technology like the verifier, maps, and others.

So again, observability is not the same as monitoring. Thanks to observability we can go beyond. Why? Thanks to eBPF, Cilium, and these kinds of technologies, we can make decisions. Why decisions? Because if you don't have metrics, if you don't have measurements, how can you make decisions? It's hard. I need data, I need measurements, in order to make decisions.

In the next slide: thanks to eBPF, new technologies like Cilium appear. Cilium is a powerful networking and security solution for Kubernetes environments. Regularly you will hear people say, "Ah, Cilium is just another CNI." Cilium is not just another CNI, because of all the capabilities it offers. We could discuss Cilium versus Calico or Cilium versus Flannel — we can do that later over coffee so we can continue with the presentation; obviously, if we have time, we can discuss it a little. But Cilium goes beyond the CNI, because Cilium offers advanced features at Layer 7. It means we can capture flow metrics at Layer 7 and Layer 4, which is a great thing because you can capture everything that is happening in the network. By leveraging eBPF technology, Cilium provides efficient packet processing and scalability while ensuring robust network security and observability. It enhances Kubernetes networking capabilities, enabling seamless communication between microservices while enforcing security policies at scale. So we have it all in one.
We can improve security, observability, and of course everything related to monitoring. A little high-level explanation of how Cilium works. It's totally common right now in Kubernetes environments for applications to use an operator; the operator is responsible for managing all the components, and Cilium has one. Thanks to the operator, Cilium can communicate with all the nodes. Cilium uses the DaemonSet approach, meaning that on every node in the Kubernetes cluster it installs an agent, and this agent is responsible for creating and monitoring the network communication between pods on that node. So if I have 10 nodes, I will have 10 agents, and the component responsible for maintaining and updating the whole Cilium lifecycle is the operator. How does the operator communicate with the agents? Through a key-value store, similar to etcd in Kubernetes. That is the high-level architecture of Cilium: two important components, the operator and the agents.

In the next slide we can see the advantages Cilium gives us: network visibility, service discovery, rich metrics. I will show you in a couple of minutes how we can get metrics from the kernel thanks to eBPF and expose them in Grafana. With these metrics, again, we can make decisions: how is the performance of our application, how is the performance of our network? Layer 7 visibility and, of course, flow visualization. I don't know about your case, but in mine I sometimes see developers who need more tools to see what is happening behind the scenes, because Kubernetes is complex. So Cilium offers a UI — an interesting service map UI that I will show you in a couple of minutes — and developers can see what is happening between different components and make decisions. The next slide is another high-level view of how Cilium works: Cilium intercepts all the network traffic happening behind the scenes between our applications or pods and builds tools for us, tools like the service map and Hubble UI, so people can more easily see what is happening in the network. That is the little diagram — I hope we explained it. Cilium is the heart, the backbone: it gathers the information and builds these views for us.

OK, in the next slide I will move to the terminal. In the meantime, my colleague Fabrizio will help me create the environment. We want to recommend two technologies. First, if you want to work locally, you can use kind. kind is a good tool for creating Kubernetes on your computer; it uses Docker to simulate the nodes. It's beautiful. In this case we are preloading different images so everything works quickly. The second technology — and we're not getting any promotion to say this — is Rancher. Rancher uses RKE2, and RKE2 is a good technology because you can specify in a YAML file which CNI you want to use before creating the cluster.

OK, let me get started with the demo. We have this script. If you look — bless you — under the networking section, we are disabling the default CNI from kind by setting that value to true. We want to create a cluster with three nodes: one control plane and two workers.
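(As a sketch, the kind configuration being described looks roughly like this — the node counts match the demo; the rest of the file is assumed:)

    # kind config: three nodes, default CNI disabled so Cilium can take over
    cat <<EOF > kind-config.yaml
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    networking:
      disableDefaultCNI: true      # no kindnet; nodes stay NotReady until a CNI is installed
    nodes:
      - role: control-plane
      - role: worker
      - role: worker
    EOF
    kind create cluster --name cilium-demo --config kind-config.yaml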
But the idea with this example is to create a Kubernetes cluster with the CNI disabled. We execute. No problem about that — when the session is over we can share all the scripts with you if you want to run them locally. Right now we're creating the control plane, installing the storage class, joining the worker nodes, and waiting a little to see what happens. It creates the cluster very quickly. If we clear the screen, we then load the different image versions. Preloaded images — why? To respect the demo time, we need to accelerate things. The beautiful thing about kind is that you can load images into the nodes ahead of time; when you create things, the nodes use the preloaded images and everything comes up much faster. Right now we are viewing the YAML that the cluster created, and after waiting a little bit, all the images are preloaded into the kind cluster.

What does that mean? It means that in the next script — right here, thanks Fabrizio — if we run kubectl get nodes, you will see that the three nodes are in NotReady status. Why? Because the installation of the CNI is pending. If you look at the pods, there are no Cilium pods yet, so everything is pending — mainly CoreDNS, because CoreDNS needs the CNI in order to work. The next command is cilium status. Cilium has a CLI, which is beautiful: the CLI runs commands behind the scenes, communicates with the cluster, and shows you the status of Cilium. In this case, errors — of course, because nothing is installed yet. With the next command we install Cilium. You can install Cilium very easily using a Helm chart; right now we are enabling the basic Cilium features. We run the command, and thanks to the preloaded images this installation is very fast. While that is happening, you can run other commands — kubectl get nodes, for example. If you run it, you will see the status is now Ready; all the nodes are ready to use. If you run kubectl get pods again, you will see the Cilium pods, and if you add -o wide you can also see which node each pod is running on.

Now cilium status — can you scroll up, please, so we can see it? Clear the screen and run cilium status again. Right now Cilium is OK and the operator is OK. Some add-ons are not installed yet — for example, everything my colleague will enable very soon. But the important things: for the cilium-operator deployment you see desired, ready, and available pods; Cilium is checking that the operator has two pods installed. The Cilium DaemonSet shows desired: three. Why three? Do you remember I mentioned at the beginning why we install three Cilium pods? What is the answer? DaemonSet — appreciated, excellent. Thanks to the DaemonSet we have three nodes and we install one agent per node. And that's it: for the Cilium operator we can see all the information, the Helm chart and image versions. So it's very easy to use, but of course this is just the beginning. My colleague will explain more about observability and security. So, Fabrizio.

I will run commands, because I like to do that, and I know that you like to see them. I will use this ice cream. I hope that you can hear me — yes?
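(For reference, the install that was just shown boils down to something like this, assuming the Cilium CLI and Helm are available; the exact chart values in the demo scripts may differ:)

    # Option A: Cilium CLI (wraps the Helm chart)
    cilium install
    cilium status --wait

    # Option B: plain Helm
    helm repo add cilium https://helm.cilium.io
    helm install cilium cilium/cilium --namespace kube-system

    # Nodes flip from NotReady to Ready once the CNI is up
    kubectl get nodes
    kubectl get pods -n kube-system -o wide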
So we installed the kind cluster. Now it's time to set up Prometheus and Grafana. What we will do is create dashboards — or rather show dashboards that we previously installed — because Cilium is instrumented for observability. So, like Gerardo started with the cluster, I am going to add more commands, and for this I will comment this Helm command that installs, initially, the Prometheus and Grafana stack. We have a gist here that is in our repo. All of the scripts I'm using are available in a repo that we are going to share with you as soon as you add us on LinkedIn, because there is a price to pay. It's shareable — it's a gist we have, and everything works. Basically the 00 scripts are the ones that install the kind cluster, and the rest are similarly ready to use; it's something that works almost perfectly. So I will run that 03 script, and then this one. I'm using a modified version so you can see colors, but basically this other command enables the relay for the Hubble UI, so you will see that we can instrument and observe, in the dashboards and in the Hubble UI, the things that are happening. The idea is to install the Prometheus stack — it will use the content of the previous script that I showed you — and then enable the service map, which is what we will use to see the output in Hubble.

After completing the part related to the service map, we can port-forward Grafana. I will just show you that the pod should be running here; otherwise we failed at something. In the monitoring namespace, Grafana and Prometheus are available, so we have our observability embedded in kind — which is always a good way to learn how these tools work. We installed both Prometheus and the service map; now it's time to port-forward Grafana. You can see I have the command ready, to save time, but basically we are going to expose the service endpoint that Grafana creates with a port-forward on the same port, 3000. I left that visible — it can be omitted, but it's something everyone will appreciate. And I should be able to open a browser somewhere. Let me find the browser, point it at port 3000, and share the Grafana output we prepared. So I'm going to open Grafana here — there is no trick, it's really the Grafana I installed, running on localhost:3000.

We already have tools that can give us metrics. If we go here, this is already embedded in Cilium, so you don't have to bother building your own dashboards or querying the data — just install Prometheus and send everything. We can scroll down. It's probably not very readable at the moment, but there are a lot of metrics related to eBPF, to the network itself, and much other information that can be really useful when you are troubleshooting — whether you're doing DevOps or anything else with the CNI and eBPF. As you can see, it's a lot of data; you just have to get familiar with it, and it can really surprise you. Apart from the Cilium metrics we have here, there are other dashboards related to the Hubble UI. We will see that the UI is something that comes up after running a command, and you will see it's very useful. So I will not go into the detail of all the dashboards we have here.
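(A sketch of what that setup amounts to — the chart names and values here are the common ones from the Cilium and kube-prometheus-stack documentation, not necessarily the exact flags in the demo scripts, and the Grafana service name depends on the release name:)

    # Prometheus + Grafana in a "monitoring" namespace
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm install kube-prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace

    # Turn on Cilium metrics, the Hubble relay, and the Hubble UI
    helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
      --set prometheus.enabled=true \
      --set operator.prometheus.enabled=true \
      --set hubble.relay.enabled=true \
      --set hubble.ui.enabled=true \
      --set hubble.metrics.enabled="{dns,drop,tcp,flow,http}"

    # Reach Grafana on localhost:3000
    kubectl -n monitoring port-forward svc/kube-prometheus-grafana 3000:80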
But as soon as we use the stack we created, it will be filled with a lot of information that is obviously very important for anyone working on this kind of detection. So if we come back here, now it's time to expose the Cilium Hubble UI. I will use the typical cilium hubble ui command, because it's probably the fastest way. I will switch terminals here: cilium hubble ui. It should open something if I have no typos. OK — I will share the screen of what it opens. It's a fairly primitive interface — they are working to improve the content — but basically we have all the namespaces that we can query, and here you see the components that are running and their status, which is interesting for the next section, the one about security. I will be very brief on security, because you can do a lot of things with the rules you can write for Layer 7. I can enlarge the screen, but there isn't much more to see — I can do Ctrl-plus — but really it's something you should experiment with by downloading the scripts. The content of the Hubble UI basically gives us a picture of what is happening across the pods running inside the cluster.

Are you familiar with Hubble? Anyone? No one. OK. Cilium essentially started with Hubble embedded; it was the initial idea for giving visibility into everything happening among the different pods running inside the Kubernetes network. At the time they didn't advertise it much, but it's still a valid option for filtering what is happening. In the column where you see green, when we run the commands to hit a rule that forbids the traffic, you will see a red one. Or we can also use the hubble observe commands with the Hubble CLI. But for the moment, let's leave it as it is and go install a pod.

Now I will use another script, this one, and I will install a pod that is just running Nginx. Let's see what it is: this script just installs the classic pods you find in the Kubernetes documentation; I kept it easy to read for that reason. And I will also use a policy. Initially, let's go with the Kubernetes one, this one. If you are familiar with network policies in Kubernetes, you can see I'm using a classic, easy one: I'm allowing ports 80 and 443, and I'm also allowing DNS traffic. The difference I want to show you is at Layer 7. It's very simple and basic, but you can do much more with rules at the HTTP level. For example, you can filter Nginx routes to a specific page — allow index.html but not about.html — with filters based on labels, annotations, or the other strategies you can find in the Cilium documentation for rules. It's really flexible. What you can do in Cilium, you cannot do with Kubernetes network policies.

So if we use this — let me see, because I can't see what I'm typing — the Cilium policy, yes. This is the difference. As you can see here at the bottom, I'm specifying that only the endpoint google.com is queryable through DNS. After applying it, we will not be able, from that container, to query any other endpoint. That means I can filter, at Layer 7 — the application layer — for querying a specific DNS name, without blacklisting it in any other way.
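(A sketch of what such a Cilium policy looks like — the pod labels and namespace here are placeholders, and the structure follows the toFQDNs / DNS-visibility pattern from the Cilium documentation rather than the exact demo file:)

    cat <<EOF | kubectl apply -f -
    apiVersion: cilium.io/v2
    kind: CiliumNetworkPolicy
    metadata:
      name: allow-google-only
      namespace: http-test          # placeholder namespace
    spec:
      endpointSelector:
        matchLabels:
          app: web-nginx            # placeholder pod label
      egress:
        # Let the pod talk to kube-dns, and make DNS visible so FQDN rules can be enforced
        - toEndpoints:
            - matchLabels:
                k8s:io.kubernetes.pod.namespace: kube-system
                k8s-app: kube-dns
          toPorts:
            - ports:
                - port: "53"
                  protocol: ANY
              rules:
                dns:
                  - matchPattern: "*"
        # Only google.com may be resolved and reached on 80/443
        - toFQDNs:
            - matchName: "google.com"
          toPorts:
            - ports:
                - port: "80"
                  protocol: TCP
                - port: "443"
                  protocol: TCP
    EOF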
This is not possible with Kubernetes network policies, so it's a big difference. As I said, you can also do this at the HTTP level. You can see here that there is a specific DNS reference in the YAML — I hope you can see the characters, but probably not; let's see if I can make it bigger. Not by much, but I want to show you the specific entry describing the ability to use kube-dns on that port, and down here you can see the DNS entry, so I can specify the actual name of the endpoint I want to query.

Now, we created the Nginx pods, so I will run a couple of commands, but before doing that I will show you the commands. We will use the classic "is Google responding," and we will also query another endpoint that is familiar to everyone: Microsoft. It is responding. At this point I will apply the Kubernetes policy. Nothing will change — it's the same policy, translated to Cilium later. I'm applying that first policy I showed you, and you will see that Google answers and Microsoft answers too. Now let's delete the Kubernetes policy — I can use my specific command line — we deleted it, and now we apply the other policy, the Cilium policy with the NS suffix, because it's based on namespace. I forgot to mention that this policy is specific to the pods I'm using for Nginx here: a namespace called http-test and the web Nginx pods. At this point I run the two queries again. One to Google — you remember it should work — and the other one is not answering. But look at the message: it's not curl hanging, it's "could not resolve host." So it is acting exactly on kube-dns: it is not resolving that specific host.

That said, we could look at the browser where we have these specific entries, and here we see that in this case it still shows forwarded; probably I need to refresh. Or maybe I will use the Hubble UI and the hubble observe commands, because for this specific thing we will be able to see a red log line. So let's go here — you see, it's full of green "forwarded" messages. If I call again the commands we were using previously — I should have them saved in this terminal; no, I will open another terminal — and run the two curls you saw earlier, after some time it should show something that is traced: not "forwarded" anymore, but shown in a different color. That means the behavior is no longer forwarded; it's caught and sent to trace, because it's observed in the dashboards we saw before. Everything we just did is, obviously, also available in the Grafana interface, which should still be open down here, and we should be able to see a lot of information related to the tracing and the activity we generated in the Cilium metrics. You can see that data has been filling in for all the time we were running commands, confirming that we can observe what is happening on the network.

So with that, I think we completed the operational part of the demo. We can come back to the slides and skip the part with the black boxes here, but I would like to focus on one point: how you can create these Cilium network policies. I've been using the editor that Isovalent makes available online. You can go to the page, and you'll see this is not something you can run locally.
As far as I know, I've not been able to do that, but on the Isovalent site you can use this specific interface, a page, to create your own rules. They will be available there both as a Kubernetes network policy and as a Cilium network policy. So I decided to include the same YAML files that I used and generated there. As you can see, it's very visual and easy to use: if you go there and you are a little familiar with namespaces, isolation, and similar concepts, it will produce an output at the bottom that you can copy-paste into your cluster and repeat the same experiment. I think that is everything I can share with you for today. So Gerardo, I will give you the ice cream.

Thank you, Fab. Before we go to the questions — I hope there are some — Cilium is very proud to say it is sidecar-less. The sidecar is a good pattern — I like it, no problem with that — but the sidecar pattern means an extra pod, or a specific extra container, for example for a proxy or for collecting things, typically alongside the famous init containers. Thanks to eBPF, Cilium doesn't need a sidecar. Why? Because everything goes directly through the kernel — similar to what Envoy would do as a proxy, but with Cilium we go directly to the kernel and apply things there. So it's an important point we want to mention: availability doesn't depend on a sidecar. So — questions?

I'll go around with the microphone so we can record this. Is this on? So, is this bypassing iptables and kube-proxy to do its networking altogether — just going directly into the kernel, creating a VM there, and that's where you're getting all this telemetry information?

Nice question. Effectively, for people who work with firewalling: iptables, ebtables, nftables — all those layers that are already embedded in the Linux network stack also run inside the kernel; you know that modprobe for iptables and so on is necessary to perform those activities. eBPF gives Cilium the ability to write its own kernel-native use of that interface, so you are filtering the information, thanks to eBPF, between the kernel and the network stack. The example I showed you, Google versus Microsoft, is not to say one is better than the other, but to show that we can add a layer that in iptables would require name resolution, because iptables works only at the IP layer. It's not Layer 7 of the OSI model — in iptables we don't know about the application layer, while here we do. It's not really a replacement; it's an additional resource that Kubernetes can count on thanks to eBPF. Anyone familiar with eBPF will also tell you that you can use it outside of Kubernetes; it's not a must. But for Kubernetes, the advantage is that you can really replace and be more specific about firewalling. If you have used a platform like AWS WAF, the web application firewall, the concept is similar: you add a sort of proxying — not a real proxy — and, although I didn't show it, there are rules you can add at the HTTP layer, for example on routing: /index.html allowed, /about.html not allowed, for a pod running in Kubernetes, filtered by annotations.

This is Sumad here. My question is, in a service mesh world, sidecars are critical. With Cilium,
how is that going to impact the service mesh kinds of use cases? So, service mesh in this case is something Cilium is able to provide as a native capability, with the ability to use resources like the Gateway API and similar. With specific Cilium network policies you can effectively filter a resource, but that's not the same as what you can do with, for example, the Cilium Gateway API. The Gateway API gives you the ability to use the service mesh also for canary deployments. We haven't covered that here because this session was expressly focused on network security, but obviously with a service mesh, the observability running in Grafana would also track the service-mesh activity you get when you create a Cilium service mesh. It's another custom resource definition in the Cilium API extensions. If you go to Kubernetes and look at how many CRDs Cilium adds, you'll see the operator basically runs a lot of controllers over each of those resources, doing reconcile logic. When you add a service-mesh resource among the Cilium CRDs, it immediately triggers everything related to observability, but not necessarily to network policy, unless you label or annotate something that way. So they are independent but come from the same stack. One more thing we could add: Cilium is also a replacement for Istio multi-clustering. We're not covering that here, but the service mesh and multi-clustering are also isolated from the kind of policies you saw; you can rely purely on the network, as you would with iptables or a Squid proxy filtering a specific route. What we cannot do with a specific HTTP route is the URI — yet. You can filter the whole URL and the route, but not the URI, so anything after that with query parameters, we cannot.

One Kubernetes flavor is CoreOS, OKD. Will Cilium run there, on CoreOS? So, on CoreOS — I don't know who of you used and knew CoreOS before Fedora. CoreOS was initially a very good operating system for running containers, initially with the rkt engine; then it was acquired by Red Hat and maintained alongside Fedora. Initially it used a stack called Tectonic, and it was a very good alternative for running Apache Mesos. If you run Kubernetes on top of CoreOS, it's like running Kubernetes on top of Fedora Atomic: you rely on a lot of features implemented in the network stack by Red Hat to run containers, and it's quite hard. In my case I haven't tried it recently, but I've used CoreOS a lot over the last 10 years, specifically for Mesos, not for Kubernetes. The rkt-based runtime, the fact that they adapted to Docker, and the fact that they now use Podman and CRI-O as container engines should work with Kubernetes, but for Cilium it can be tricky to get it up and running. And there is something we were discussing earlier with Gerardo about Cilium's network stack: it's easy to run it on top of an existing RKE2 Kubernetes cluster, but when you go to EKS, AKS, and CoreOS you can face specific installation problems that come exactly from kernel and driver differences, and it's probably not yet the time to run Cilium on top of CoreOS as a Fedora CoreOS distribution. If you do, you will probably find some challenges to face.

When are you guys going to start supporting URIs? — Sorry, can you speak up?
Can you hear me? Yes. When you move to implement URI support, will you be able to see headers and things like that, or is it just going to be the path, as it seems to be today? So, in our case, I'll give you a quick background on how we came to adopt Cilium. Our background was Calico and Istio with the Envoy proxy — so everything related to sidecar containers. You know that Envoy has the ability to filter on URIs, for example in canary deployments for a service mesh working with weighted (say 99/1) or blue-green deployments. But sincerely, we are not using that feature yet. I can say that Cilium is able to provide the layer to implement URI filtering when you are using the service mesh, not the network policies. If you use the Cilium service mesh, you can effectively work with canaries: you have two versions up and running in the same namespace, with a percentage establishing which of the services responds more than the other. You don't yet have instrumentation there to trace it — you would install Kiali or Jaeger like you do with Istio — and at that point you can see the URI filtering. The big advantage with Cilium is that you see almost the same thing, but you don't install Istio, which is very painful; and second, it's straightforward — it's really something you can set up with a Helm install for that kind of resource, and it's transparent.

Any other questions? Go ahead. Do you want to take this one? This one was mostly about performance between the alternatives: if you're running Istio versus running Cilium, is there a lower footprint and lower overhead with Cilium than with Istio, besides the sidecars? So, basically comparing performance between Istio and Cilium. Here we can say, really easily, that you can try it on kind and you will see that Istio overwhelms the system — it creates a lot of resources. When I used the command line earlier, I wasn't really looking at the screen, because the commands were there and I was using auto-completion; it was really easy to work with. With Istio I would probably have been constantly switching terminals to look at what I was actually typing. And then there is the fact that it uses Envoy: good because it's C++, but painful because it's a sidecar container, so you lose control of the stack — it's not an init container. The only thing I feel they still have to fix or find a solution for is that in Istio, when you do canary deployments, you use Kiali and Jaeger; here those are not ready to go. For multi-clustering, Istio is very complex and requires implementing a sort of additional layer, like a VPN. Is anyone here familiar with Kubernetes multi-clustering? It's a challenge. Imagine you have to connect two different clusters; you have two challenges. One is the one we face in our current reality, which is peering one cluster to another for a distributed system — not a cooperative cluster like Istio's, where you have a pod here and another running in China, one in the US, doing the same activity but balanced; that is Istio multi-clustering. In Cilium, what I can do is have a peer running in China and another peer running in the US, and they are peering — not a cooperative, master-slave clustering, but really distributed.
So in our use case — for example, what I work on more in daily activities — I can do blockchain thanks to Cilium inside Kubernetes. That is still a big challenge, but it's not possible with Istio. So, to finish the answer: Istio has to work on removing the Envoy dependency and doing something more lightweight, and in any case Istio has been paired with Calico and Tigera for a long time, so you will depend on the Calico stack for block affinity of the nodes, and that adds more complexity. Calico is doing the same, to tell the truth — it's not doing something different — but the point is that you have to edit the YAML files to do things correctly, while here it's very straightforward and there aren't tons of custom resource definitions to deal with; it's just something you go and do. It's easier. So for me, if you have Istio, start doing the Isovalent courses that are available for free, and you will realize in one day that it's very simple. One thing I want to add before finishing: our idea is not to say Istio doesn't work — Istio is a good project — but it depends on your needs, on what you need in your company, in your job. Remember, Istio recently graduated; Istio has its way. So compare, and enjoy. The reason we are using kind is that you can destroy your local cluster, no problem. I hope you enjoyed the presentation. Destroy the cluster — yes, with kind it's magic: you can delete the cluster and recreate the whole thing. So again, it's a pleasure, thank you so much. These are our links. Thank you.

Hello — yeah, that works. I think we're going to get started; people will continue to trickle in. So we welcome Eamon Ryan. Eamon Ryan is a field principal engineer with Grafana Labs, and he brings a pretty impressive background from his VMware and Pivotal days as well — something we'll talk about later. Originally from Ireland, he's been in the US for 10 years, slightly up north in the Bay Area in Oakland, and he'll be talking about scaling Grafana, right? Prometheus, sorry.

Hi folks. Yeah, so this is a series of fortunate events: open source solutions for scaling Prometheus. This is Grafana's little mascot, Grot, helpfully with a logo from the con on his hat. As Faisal said, my name is Eamon Ryan. I'm a senior principal field engineer at Grafana Labs — been at Grafana for almost four years now, literally like two weeks away from four years, which is an eternity relative to the age of the company. But yeah, we get to work with a lot of cool technology there. Big fans of Prometheus — Grafana actually employs over 40% of the Prometheus maintainers, so we're big, big fans of it. We build a lot of solutions around it and to work with it, lots of them open source, of course, which is what I'm here to talk to you about today.

So I have a little disclosure up front. Obviously I work at Grafana Labs, and one of the main solutions I'm going to talk about that helps you scale Prometheus is a project that's built and maintained by Grafana Labs. That might create concerns of obvious bias, which is why I'm not actually going to recommend a particular one at the end. I'm going to showcase a couple of the main contenders, which include ours, but I'm not going to tell you to use a specific one.
I'm going to tell you a few things about the different ones to help you make a choice, but I'm not going to say one is objectively better, because they're not. If I said ours was the best, you'd say I was biased. If I said somebody else's was the best, you'd think I was not planning to have a job for very long, probably. So I'm going to try to keep it unbiased, just talk about objectively true things, and then you can take that information and make your own decision from there.

In case you walked into a talk about scaling Prometheus and you haven't actually touched Prometheus before, just a quick refresher. First of all, who uses Prometheus in the room? Everybody — okay, almost everybody; not everybody's hand was up. Prometheus is an open-source time series database solution. It implements a multi-dimensional data model: all time series are identified by a metric name and a set of key-value pairs, and then obviously a value for the metric. PromQL is the query language that lets you slice and dice the collected time series data so you can generate ad hoc graphs, tables, alerts, and all that good stuff we need to keep things running. Prometheus has — whoops, too far — a few different modes for visualizing data. It has a built-in query browser UI, which is a bit more limited, but it does the job if you're just trying to verify that a query is usable and returns something. The most common way I see people talking to Prometheus is via Grafana — not the only way, but Grafana gives you a lot more visualization options than you would otherwise have. And of course you can address Prometheus directly via its own APIs and pull out whatever data you want that way as well.

It's got pretty efficient storage: it stores time series in memory and on local disk in a really efficient custom format. You can scale it using Prometheus federation, which I'll talk about in a little bit, but that has limitations. Each Prometheus server you run is independent, so from a reliability perspective they rely only on themselves and their local storage. It's all written in Go and very easy to deploy: on Linux, in a container on Docker, in a Kubernetes cluster — doesn't really matter, it works on all of them. Although don't run it on NFS storage; it doesn't like that. It will warn you if you try to start on NFS, and it will run, but it will tell you that your data will get corrupted, and it's not lying — it does do that. Alerts are defined in Prometheus's PromQL language and maintain all the dimensions, and you can send alerts to Alertmanager and deal with notifications and silences and all that. It's got a ton of client libraries if you want to instrument metrics for Prometheus — over 10 different languages are supported — and you can add custom libraries and do your own custom instrumentation pretty easily. And there are a bunch of existing exporters and integrations that let you pull third-party data into Prometheus. So if you have something that does not output data in a Prometheus format, and it's something at least other people use — not something you built totally in-house — there's probably an exporter out there that will convert its data into a Prometheus format you can then ingest into Prometheus or something Prometheus-compatible.
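(For a concrete feel of the "query it directly via its own APIs" point — a sketch against a local server; the endpoint and metric names are just examples:)

    # Instant query via the HTTP API
    curl -s 'http://localhost:9090/api/v1/query' \
      --data-urlencode 'query=rate(http_requests_total{job="api"}[5m])'

    # Range query over the last hour at 30s resolution
    curl -s 'http://localhost:9090/api/v1/query_range' \
      --data-urlencode 'query=sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))' \
      --data-urlencode "start=$(date -d '1 hour ago' +%s)" \
      --data-urlencode "end=$(date +%s)" \
      --data-urlencode 'step=30'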
So that's all the good stuff a lot of you probably already know about Prometheus, but it has some limitations, and these limitations aren't necessarily deficiencies of the project; they're just non-goals. They're not things the project ever set out to solve, leaving them to be solved by other solutions — which is why we're here today.

The first is around size. Prometheus is a single process on a single machine, which means you can only scale up: you can add CPU, you can add memory, but you can't scale out. You can't run a cluster of Prometheus servers. You can run an HA pair, but that's not a cluster. So you might be in a position where your metric needs are growing — at work, at your company, in your house, maybe you have a really big house with a lot of stuff — and you keep adding CPU, you keep adding memory, and now it's too big for one machine. Or you're running it inside a cloud provider like AWS and you keep moving to bigger and bigger instances, maybe all the way up to something like the r7a.metal-48xl with 192 CPUs and over one and a half terabytes of RAM, and that's still not enough, and it's still catching fire, exploding, and falling over. I have seen this. I have spoken to a customer who did this and was literally at the largest AWS instance you could get, with some number of millions of series inside it, and it was still falling over — and then they came to talk to us about it. So people do this, but eventually you run out of headroom; there's only so far you can go.

If you've hit the top of what fits on one machine, what do you end up doing? You end up having more than one. So you have a second server: maybe you send half your metrics to server B and half to server A, and that works okay. But maybe that turns into a scenario where now you have two, or four, or six, or eight, and you connect all eight of them to one Grafana instance. Now what happens when you're trying to find metrics from those servers — which data source in Grafana are you going to query? You're left looking at Grafana going, "Oh, which one do I pick? Oh no, please help," and it makes for a sad experience. This is also something I see pretty commonly: people add loads and loads of these servers, and then they either carry the overhead of figuring out which one to query, or they end up running some kind of proxy in front of them — there's a project out there called promxy, like Prom-proxy. But the problem with that is that if any of the servers are slow, your response is always slow, and it becomes hard to identify where the issue is, because you're trying to query n servers at once. So also not really a great setup, and one people commonly run into.

The third limitation is retention. If you've used open source Prometheus, you know the retention is usually going to be a week, two weeks, three weeks. You can set it longer, but it doesn't really handle that very well — it's not designed to. So if you wanted to run a query inside Grafana that went back more than a week, you'd end up with a graph like this, where you only have as far back as your retention actually goes, and you're standing there looking around for the rest of the data inside your graph, and it's just not there.
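(For context, retention on a single Prometheus server is just a startup flag — a sketch with illustrative values; size-based retention is a separate, optional cap:)

    # Keep roughly two weeks of data on local disk
    prometheus \
      --config.file=/etc/prometheus/prometheus.yml \
      --storage.tsdb.path=/var/lib/prometheus \
      --storage.tsdb.retention.time=15d \
      --storage.tsdb.retention.size=200GB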
This took an unreasonably long amount of time to get that gif in there properly. The next part is tenancy. You have your Prometheus over here at the bottom; you might have team one up here, team two, three, four, five at your company, and they're all sending metrics into the same Prometheus server, but there's no segregation and there are no limits — it's either on or it's off. Aside from rules you can make around filtering of metrics, but that's not really tenancy, that's not a team-limiting thing. So you're effectively operating with the floodgates open all of the time, pouring all these metrics into the one server, which means you can easily run into noisy-neighbor situations: one team blows up the amount of metrics being sent into the system, and now everyone else has to suffer because of it — they knock the whole thing over.

The next limitation is HA and resiliency. I briefly mentioned clustering before: you might have two Prometheus servers that you run as an HA pair, but as I mentioned on one of the early slides, they run totally independently of each other. They don't know they're in a cluster. They're basically sitting there with a brick wall between them: they have their own config and their own scrape targets — even if they're the same scrape targets, they're maintained independently. They have separate disks, separate TSDBs, and I say "no backfill" here — there are methods of backfilling data into Prometheus, but an HA pair does not sync itself; there's no automatic syncing of data. So if you take one down to upgrade it, it's missing the data for the time it was down when it comes back up; the other one doesn't fill it in. That's something you have to manage yourself as a totally separate activity.

So what do you do? What are you supposed to do to solve this problem? If you go to prometheus.io, there's a page of integrated remote storage solutions for Prometheus. It looks like this. There are a lot of them, and they're not all good candidates, and they're not all up-to-date candidates either, so it could be really misleading if you just looked at it at a glance. As a kind of zeroth-round elimination, the ones I'm not focusing on here are, first of all, things that are not open source — because we're at an open source conference — and things that are SaaS-only, because that usually goes hand in hand with not being open source, so you can't run it yourself. That rules out things like Timestream, Azure Data Explorer, BigQuery, Splunk, Wavefront, all those kinds of things, so I'm not really going to focus on any of them. There are some items in here that I am going to mention, because otherwise somebody at the end is going to go, "But what about X?" — so I'm going to head that off up front.

So here's the first round of eliminations — the other one was the zeroth round, but here's the first round. CrateDB is a SQL-oriented database. It has a Prometheus adapter, which means it's able to accept data in Prometheus format and translate a PromQL query into the SQL it needs to return the data, so on paper it should work fine. From what I could glean — I haven't used it — there's one maintainer, and it's on a 0.5.1 release version, which doesn't mean it doesn't work; it just means it hasn't existed for terribly, terribly long.
I have never spoken to somebody who was using it, so I didn't go into further evaluation with it, because there just aren't many references of people actually using it for something. Maybe a future one, which is why I have it up there, but it's not part of the larger analysis here. Next one is Elastic. You can actually write Prometheus-format metrics into Elastic, but you can't get them back out in Prometheus format. They'll go in in Prometheus format, but then you have to query them in Elastic's query language, which doesn't feel like a solution for scaling Prometheus; that's a solution for moving stuff out of Prometheus, essentially. And Elastic, as people in the observability world probably already know, changed their license to one that isn't OSI-approved, so it doesn't really feel in the spirit of open source to have it on the list either. Noki, or Nochi — this one lets you write data in but doesn't let you read it out, kind of like Elastic, so that's not ideal. There is a Grafana plugin for it, but it's really, really old; it hasn't been updated in forever, so I wouldn't say it even still works. Did you know you could write Prometheus metrics into Graphite? I wouldn't recommend it, but you can. You can write them in, but again, you can't read them back out as Prometheus. I don't know anybody moving onto Graphite — I only know people moving away from Graphite — so I don't think that's a good direction to go in anyway, but you could. GreptimeDB is an interesting one; I had never heard of it before I started writing this talk, but they are working towards full PromQL compatibility — they're at 82.12% now. It looks pretty promising, but by their own admission they're not ready for production prime time until they reach that point. It does look like a cool project, so maybe the next time I do a version of this talk it will be further inside the deck. You can write Prometheus metrics into InfluxDB and query them back out — there are plugins for that — but Influx decided that clustering and HA were no longer going to be in their OSS version, so that kind of rules it out as a valid option for scaling Prometheus, I thought. M3DB, I think, previously used to be a really good candidate for this — it's the one originally created at Uber — but they haven't had a release since April 2022, and that's unfortunate; obviously you don't want to use something that hasn't been updated in two years, so I don't really think that's a good choice. Similar for OpenTSDB: it hasn't had a release since 2021, probably not a great option either; it doesn't look like people are still developing it. And Promscale is definitely not still being developed — that one actually has an announcement saying it's done now, whereas the previous two don't, they just don't have anything more going into them; Promscale said, "We're done, sorry, goodbye." So that's all the first-round ones.

Before I get into the main four I ended up evaluating, I want to talk about Prometheus federation. Is anybody here using federation? Nobody — good. I'll talk about it a little and why it's not really the best solution for scaling. It has its uses, but for just scaling to a larger setup, it's only useful if you have a very specific setup in mind.
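(A sketch of that specific setup, described next: a top-level Prometheus scraping only pre-aggregated series from two leaf servers over their /federate endpoints — the job name, target names, and the match[] selector are illustrative:)

    # prometheus.yml on the top-level ("global") server
    cat <<EOF > prometheus.yml
    scrape_configs:
      - job_name: federate
        honor_labels: true
        metrics_path: /federate
        params:
          'match[]':
            - '{__name__=~"job:.*"}'     # pull only recording-rule aggregates
        static_configs:
          - targets:
              - leaf-prom-1:9090
              - leaf-prom-2:9090
    EOF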
So you might have a bunch of Prometheus servers, and you could have a top-level Prometheus that reaches out to those servers on their /federate endpoints and chooses which series it wants to pull from the others into itself. That's great in a scenario where the top-level one is meant to hold things like global aggregates and the other servers hold the more localized, granular data — that's exactly what it's for. But if you were trying to go for a scenario like, "Oh, I just want to pull everything up into the top-level one from all my leaf Prometheus" — which is the correct plural, by the way — then what's going to happen? It's still just a single process, so it reaches out to all of these, pulls in every series from all these Prometheus, and it's still just one machine, so you end up in the "please help" scenario again. It'll just fall over. It's not some super magical thing; it's still data going into a system, and that system only scales vertically, so the runway is only so long. Federation is a lot of management and overhead at best. If you only have a small setup it's okay, but can you imagine trying to figure this out for 20-plus Prometheus servers — which things are being rolled up and which things are not? It becomes an overhead nightmare. So that's why I don't really consider it a good contender here either.

So the actual contenders I ended up with — there are four: Cortex, VictoriaMetrics, Thanos, and Grafana Mimir. As I said at the beginning, I'm trying to do this in a really unbiased way; I'm not even going to recommend a particular one, because they're all very valid choices. I'm going to let you know their factual status on several criteria that I think are important when you're considering a production rollout of one of these, and then you make your own decision. All the information I pulled is coming from the public docs of each solution or from something I found in the relevant GitHub project, which means if I'm wrong, it was wrong up there. There are no opinions here; this is literally what I could find online, so don't come and attack me if you're really attached to one of these products. But if I am wrong, do come tell me — I actually have, in my speaker notes, the link for where I found every bit of info, so I'll link it to you and you can go fix it.

One other thing: performance, which you might think is the most important thing here, isn't enough. There's a very recent article on motherduck.com called "Performance is not enough," and there were two lines in it that I thought were really pertinent in this scenario. One is that performance in general, and general-purpose benchmarking in particular, is a poor way to choose a database. I think that's so accurate, because databases tend to commoditize further and further the longer they exist. There isn't some secret arcane art that some programmers know and others don't that makes something a hundred times more efficient than everybody else and that nobody else can replicate. If somebody finds a novel technique or a novel way of accomplishing a particular task that's way more efficient, their competitors are eventually going to do it too, because that's just the natural way this stuff works. Somebody figures something out, people emulate it in their product, so the performance ends up converging over time.
So solution A might be 30% faster in their newest release because they figured something out, but then B, C, and D will all move towards that level of speed in the next few releases anyway; that tends to be how it goes. So choosing a solution purely on who's fastest right now isn't necessarily going to help you, because whatever choice you make, you're probably going to keep it for at least a few years, and all the other ones are going to be at the same level anyway, or at least close to it. So don't make performance your only criterion. The second line is that you're better off making decisions based on ease of use, ecosystem, velocity of updates, and how well it integrates with your workflow. Can you use it in conjunction with your other tools? Is it still being actively updated, and does stuff actually get fixed? Is it being updated to run on the latest versions of whatever dependencies it uses, for security purposes as well? Is there a good ecosystem of plugins around it? Is it portable? Does it introduce any kind of technical debt or lock-in? All these things, I think, can be more important than performance, as long as the performance is at least at some baseline of okay. And this came from Jordan Tigani at MotherDuck — I thought that was a great excerpt; I came across it days before I wrote this and said, cool, I'm going to borrow that.

So these are the criteria I'm going to go through for each of the four, and I'll explain why for each one. First is the operational mode: whether it's centralized or uses some other kind of model. This is just so you know how to actually deploy it and how it works — pull, push, all that kind of stuff. The second is how the long-term data is stored: on disk, in object storage, on SD cards — how is it actually stored long-term? This has implications for reliability and for cost, and those are worth considering because you're going to be paying for this infrastructure anyway. The third is whether it supports OpenTelemetry-native ingestion. There are solutions for converting Prometheus metrics into OTel format, but maybe you're pulling in metrics in OTel-native format anyway and you'd like to keep them that way, which is a totally valid ask, so you want to know: do any of these backends support it, only partially support it, or what's happening there? Good for future-proofing. The next is PromQL compatibility. Some of the solutions, including the commercial SaaS-based ones, will tell you that they accept Prometheus metrics and let you query in the Prometheus query language, but they don't actually have full PromQL compatibility, which means you can't run all the same queries — some just don't work. Any time you have to deviate from full PromQL compatibility, you accrue technical debt, because you're going to change your queries to match the other syntax needed for it to work, and if you ever move off that solution, you have to change them back. This can be a very dangerous spot to end up in if they're not fully compatible. Next is known scale. I said performance is not enough — and scale isn't everything. You might know what scale you need as a maximum, and maybe all of these are larger than you would ever need, in which case it doesn't matter; but maybe it does. So I tried to pull some kind of known online reference that says how far these can scale, at least on the write side.
By write I mean write load. Querying is much trickier, because while you can say, oh, we can ingest 15 million series into our solution, we don't really have a standardized benchmark for the read path. Like, we can publish benchmarks that say we did these kinds of reads, but then the other solution says we did those kinds of reads, and the data can be different and the queries are different, and so it's hard to measure those directly. So I don't have that as one of these, but it could be something to consider as well. The next is multi-tenancy. I said in one of the limitations why that's a problem or why it might matter to you. I have spoken to plenty of people for whom it doesn't matter — they just want to dump everything in one tenant and they don't care. But if it does matter to you, I have that here. The next is: can you utilize limits on a per-tenant basis, because you want to prevent team A from influencing team B. I've got a couple more. Native histograms support. Native histograms are the updated Prometheus histograms — these are the ones also called exponential histograms. I think there's another name, but it escapes me now; native histograms is the official name. These are histograms that can have as many buckets as you want and only correspond to a single time series. That feature is actually still experimental in OSS Prometheus, so any solution that has implemented it still officially can only call it experimental, because the underlying feature is experimental — but some of them have it and some of them don't. Downsampling: this is when you want to reduce the resolution of the data over time. That's important to some people for different reasons. To some people it's important for querying — it's easier to query fewer data points if you're trying to query over, like, years. And for some people it's for cost: if we reduce the resolution of the data, we reduce how many data points we're storing, and so it's cheaper. Or that's the thinking, anyway. And the final one I have here is how many releases they have done in the past two years. I did it on minor releases — I didn't count patch releases, minor releases only. Not saying everybody does semver correctly, but I counted it on minor releases. It's also not the most accurate metric, because some people put a lot into a minor release and some people don't, and some people put breaking changes into minor releases and some people don't. So it's kind of not the best gauge, but it's an indicator, so I included it. So let me jump into them. We'll start with Cortex — not for any particular reason. The mode that this works in is a centralized cluster. It actually switches the Prometheus model around from scraping to a push model. That means you have to send data into it via the remote write protocol, which Prometheus supports. You can send that from not just Prometheus but anything that can speak that protocol — you can write your own curl script that does this if you want. But the main things people do this with are regular Prometheus servers, which can be configured to do that. They do that with the Prometheus agent, they do that with the Grafana agent, you can configure the OTel Collector to write in remote write format — there's an exporter for that. You can do this with Telegraf, you can do it with tons of things. So as long as it's in that format, this solution can accept it. I think they all can, actually. But this one accepts it in that way, and it becomes a central cluster.
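To illustrate the "there's an exporter for that" point, here's a minimal sketch of an OTel Collector config using the contrib prometheusremotewrite exporter; the endpoint URL is a placeholder for wherever your remote-write push endpoint lives.

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}
exporters:
  prometheusremotewrite:
    endpoint: "http://metrics-backend.example.internal/api/v1/push"  # placeholder endpoint
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```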
You push the data out to it instead of it reaching in and scraping, so it inverts the model. The second here is that the most recent data in Cortex is stored in block storage, so it's on disks, and then the long-term data is stored in object storage, which makes it really cheap to store. The long-term data is just pulled in from there — there's fancy indexing going on and all that kind of stuff. There's not usually a huge concern with that, but if you are deploying, say, on-premise in your own data center and you don't have object storage available, well, this will probably be a problem for you, so you should look into getting that. The next is OTLP ingestion. For this one, it's a work in progress — I was able to find an issue that indicated they are working on it, but it's not there today. PromQL: 100%, perfect, so nothing to worry about there. I will say, for the PromQL indicator I have here, there is a third-party company called PromLabs, and the last time they ran a big battery of these was a couple of years ago, but they tested a ton of backends with the same exact compatibility test, and that's how we get these results. So they basically said all these ones are 100%, these ones are not, and so on — and I'm pulling from there. The vendors can submit their own result to that site and have it published, so I'm just going off of the latest one that's up there. Known scale for this one: I couldn't find a definite number. I found it to go up to millions of series, which was fine, but I didn't get a specific number. They had a case study published from a company called Gojek and they had listed millions, but not how many. So that's all I had to go on. Multi-tenancy: it is there, and they do it via headers, so you use the X-Scope-OrgID header. There isn't necessarily auth around any of this, for any of these solutions, by default. You can add your own auth layer — like, you can put a proxy in front, have it authenticate, and then have that attach the header — that works, but this doesn't have any auth by default; the tenants are just separated by this header value. So if you write metrics with this header set, it just goes into that tenant, and if a tenant doesn't exist, it will exist once you write something to it, automatically. It does have per-tenant limits. You can set all sorts of limits, like how many series can be ingested in each tenant, a maximum number of labels on every metric, how big they can be in various different types of sizes, an ingestion rate — you can set all sorts of things. Native histograms are not supported in Cortex yet, but it does seem to be under active development — it looked pretty far along, so that'll be there soon. They're also working on downsampling; I think it wasn't as far along, but it is something that's also currently being worked on in that project. And I found that they have about two minor releases per year over the last two years. So, pretty good. The next one here is Victoria Metrics. A bit different in a few ways. The mode is the same as the previous one — central cluster, remote write, accepts all the same things, works fine. The storage mode here is different: there's no object storage used in Victoria Metrics, everything is done on disks. And this can be viewed as a good thing or a bad thing.
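One more aside before digging into Victoria Metrics — going back to that X-Scope-OrgID point, here's a minimal sketch of what tenant selection looks like from the sending side, assuming a recent Prometheus that supports per-remote-write headers; the URL and tenant name are placeholders.

```yaml
remote_write:
  - url: "http://cortex.example.internal/api/v1/push"  # placeholder push endpoint
    headers:
      X-Scope-OrgID: "team-a"   # everything sent by this Prometheus lands in the "team-a" tenant
```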
On that disk-only storage: you could say it's a good thing because there's no dependency on object storage — that's one less thing you have to deploy. Some people might say it's a negative thing because object storage has a much higher reliability score than disks do; disks fail more often than object storage does. So it depends on how important that is to you. Regular disks can also be more expensive than object storage — object storage is historically quite cheap, though of course that depends on how you use it. So, a good thing to consider. They have OTLP ingestion already released, so that's great. The PromQL compatibility is 74.16% on that last PromLabs test. The way this differs from the Nokia one that I mentioned in the prior eliminations is that while that one is working towards 100% compatibility, this appears to be a deliberate decision, which is kind of a more unique take. If you follow the GitHub issues, you'll find that what seems to have happened is the authors of Victoria Metrics didn't like the way PromQL did some things, and so they essentially have a partial fork of the language called MetricsQL where they did the things differently that they thought were better — which is fine, that's a fair take, but it does mean that if you move to it, you have to change some query stuff, and if you move off it, you have to change it back. So if you like how they do it, sure, go for it. On known scale, I did find an article on Medium from Criteo — I don't know how to say it — Engineering, where they said they got it up to a billion active series, which is really cool. So pretty big scale. Multi-tenancy is there, but they gate the multi-tenant rules behind their enterprise edition. So you can send stuff to different tenants, but if you want alerting rules on a per-tenant basis, or recording rules, you have to pay, which I thought was an interesting choice. Per-tenant limits: same thing, they only exist in the enterprise edition, as does getting statistics on a per-tenant basis. Native histograms are not there, and I couldn't find anything that indicated they were planning to add them. There was a GitHub issue where they were suggesting people use their own histogram version — it was called Victoria histograms or something like that — but I couldn't find any indicator that they were working on this. Downsampling is there, but it's in the enterprise edition. Their velocity is really high — I found between five and ten minor releases per year. So a very active project, which is pretty cool to see. On to the third one, Thanos. This is probably the most well-known one that I'm aware of — I don't think anybody who talks about scaling Prometheus doesn't know of Thanos. So it definitely has the most name recognition. This one has two modes. It can run as a sidecar attached to Prometheus processes, where it queries those Prometheus directly and sends data up to the cluster — that means recent queries are actually live-queried from these leaf Prometheus servers — but you can also send data directly into it in a centralized model via the Thanos receiver mode. And you can run both modes at the same time, depending on your use case, and maybe you have limitations around network traffic directions and firewalls and all that stuff. So it's kind of versatile from that perspective. Storage is similar to three out of the four: block storage for recent data, object storage for other data, which is fine — there's actually a lot of code shared between Thanos and Cortex and Mimir around how data is stored in object storage.
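Since sidecar mode just came up: a rough sketch of what that extra container looks like alongside Prometheus in the same pod. The image tag and paths are placeholders, and the objstore config file would hold your bucket settings; receiver mode skips this entirely.

```yaml
# Hypothetical sidecar container added to a Prometheus pod spec.
- name: thanos-sidecar
  image: quay.io/thanos/thanos:v0.34.0                    # placeholder tag
  args:
    - sidecar
    - --prometheus.url=http://localhost:9090               # talk to the local Prometheus
    - --tsdb.path=/prometheus                               # same volume Prometheus writes blocks to
    - --objstore.config-file=/etc/thanos/objstore.yml       # bucket config for long-term blocks
```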
That code sharing is kind of cool to see — very, very open-sourcey. OTLP ingestion doesn't seem to be being worked on right now. I found an issue where somebody said, I would like to work on this, and nothing else, so I don't know where it stands — it's certainly not anywhere I could easily find. PromQL: 100%, perfect. For scale, they have a case study published on the Thanos site from a company called Medallia where they say they got it up to a billion series, with a crazy architectural diagram. But they said it got up to a billion — I'm not here to validate the numbers, I'm just telling you this is what I could find. So that's cool; people have run this stuff up to a billion, and you can do it, but it's work. Multi-tenancy: they have it in Thanos, but it's done slightly differently — it's just using external labels. So it's a little looser: you're storing everything in one big TSDB instead of separate ones per tenant. It still functions as tenancy, but it does depend on who has control over your metrics infrastructure. Is it all controlled by a central observability team, or is it controlled by individual teams? Because if it's individual teams and it's only done via labels, then there's room for them to do things they shouldn't really be able to do. But it totally depends on what you need for your environment. Per-tenant limits: they have them as experimental, but only for the receiver mode, because of the nature of it. Receiver mode is receiving stuff via push from remote write, so they can push back; but if you're doing sidecar mode, well, now you're scraping, and that's an entirely different setup to have to put limits around, so it's a bit trickier there. Native histograms: in there, already done. Downsampling: in there, already done. So you can just take that and use it straight away. And the velocity is about three to four minor releases per year over the last two years — so pretty good velocity as well. And on to the last one, Grafana Mimir. The mode is similar to most of them: it's a centralized cluster, you remote write up to it — Prometheus agent, Grafana agent, Telegraf, whatever, doesn't matter. The storage: block storage for recent data, object storage for the older data. Nice and cheap. OTLP ingestion: already there. PromQL: 100%. Known scale: a billion series — we have a blog up on our site about running it up to a billion, and there are at least a couple of customers that have run it up to a billion. I don't know if they do it all the time, but they have done it. Multi-tenancy: this is the same as for Cortex. If you're not familiar, Mimir was actually a fork of Cortex, so it inherits a lot of things from that. That happened in 2022 — so in 2022 they were the same and they've diverged since then — but the tenancy system existed prior to that, so it's the exact same system. As a result, it's the same system for per-tenant limits; we might have different limits now, or have added more, but it works the same way. Native histograms are already in. We don't have downsampling today, but it is something that we do plan to add. We actually put off working on downsampling for some time, because most people who wanted it that we spoke to wanted it for cost-saving reasons, but with the way that Mimir works and its use of object storage, the storage cost is actually less than 10% of the TCO.
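To picture the external-labels style of tenancy mentioned a bit further up: each Prometheus stamps everything it sends with identifying labels in its global config. The label names and values here are just examples, not anything from the talk.

```yaml
global:
  external_labels:
    tenant: team-a          # hypothetical "tenant" label used to slice queries later
    cluster: prod-us-east   # also handy for telling clusters or replicas apart
```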
So storage isn't the area to spend a lot of time optimizing, because the cost saving isn't going to be that high. Now, the other use case — which is, if I downsample the data, it will make long-term queries easier because there's less data to iterate over — that's more valid. That's why we are going to add it anyway, but we definitely pushed it out because it wasn't going to help with TCO very much. And the velocity is about four or five minor releases per year over the last two years, where 2.12, I think, is about to come out — I saw a release candidate up there already. So yeah, here's the big table, which was fun to try and fit all on the same page. I knew I'd see a few phones go up for that. Yeah, so this is the big table. This obviously is a point in time — this is how they are today, or this week. This will obviously change over time. Some of these features could move from enterprise into OSS, or some of these numbers could change, or architectures change, anything like that. But as of this week, this is all true from what I could find online. Again, if I'm wrong, it's wrong online — don't blame me. So there are lots of good options here. There is no single best answer, really. It depends on what's right for your needs and what's important to you in your environment. There are things that you might care about that we don't care about, and things that we care about that you don't. And the big thing is we're all just trying to use Prometheus and scale Prometheus, because we like Prometheus and just want to keep using it, stay in that ecosystem, keep everything working well and our monitoring functioning properly. All of these solutions work with Grafana, which is great — love hooking up everything to Grafana as much as we can, and this keeps our little mascot here very happy to see it in use. So that brings me to the end. I can do some Q&A. Just wanted to highlight, in the top left, my colleague Ananya is doing a talk on incident response with OpenTelemetry at 6:15 today in room 107. We do have a booth in the hall — if you haven't been in the hall, come to the booth. I'll probably go there for a bit until Ananya's talk, so after I leave here, if you want to come ask me stuff, you can. And yeah, if you want to check out some of our other open source stuff, it's on there. We've got something for logs and for traces — metrics is what I talked about today, but we have logs and traces as well, so I can talk about any of those with you. Other than that, I'll do questions. I'll come up to you so we can record the questions as well. Is this working? Okay, go. First of all, is this slide deck available anywhere for us? Not right this minute, but I probably could, yeah. Okay, I was just curious. And then the second question — thank you. First of all, great talk, love getting the overview of the whole ecosystem. Why do you think so many of the solutions you presented only use remote write as their ingestion method, whereas Thanos has the sidecar alongside remote write? So that wasn't super loud because that mic is kind of weird, but it's okay. The question was: why do I think so many of the solutions use remote write as the ingestion method, and Thanos is the only one using the sidecar method? Which is a great question. I know why that is for Cortex and Mimir: the reason is that Thanos existed before them.
Cortex was created in part to deliberately have a slightly different model to what Thanos did, because we like Thanos, but we thought there were things about it that didn't work as well. So the model that Thanos has does function, but the fact that it creates this reliance on all these leaf Prometheus servers means that if you have an issue with those, you don't have a full picture of your data. If there's an issue out in the network where one of these Prometheus lives and you want to query data that was coming from over there, it's not going to be present. Whereas if everything is pushed up to a central cluster, you always have access to all the data that's in there. It does also mean your query time can be affected by one of these outer Prometheus servers being slow. So when Cortex was being created, we wanted to not do that, and we did it the other way — it was a deliberate choice. I can't speak to Victoria Metrics; I don't know why they did it that way. Thanks for putting this together. I have two questions. The first question is: do all four of the contending solutions use the Prometheus adapter? Do all these use Prometheus — what, sorry? Prometheus adapter? Prometheus adapter, no. I think the Prometheus adapter, when I looked, was only intended for — it was like OpenTSDB and Graphite, and I think one other one. That's what's listed on prometheus.io for the Prometheus adapter. But do you mean just for writing metrics into them, or do you mean how they're translating everything? So there's the custom metrics scaling? If you want to scale off of Prometheus in Kubernetes specifically, is it compatible with, you know, Mimir or the others? Oh — sorry, I was thinking of the Prometheus adapter, which is different to the metrics adapter, a totally different thing. So the question is: can I use the metrics adapter so I can use the Pod Autoscaler within Kubernetes? You definitely can for Mimir and Cortex, because I was working with Mimir before, when it was still Cortex, and they both supported it. I don't know for Victoria Metrics, because I haven't tried, but it should just be passing through queries like normal, so as long as your queries were valid, I think it should still work. I don't know how it handles the Prometheus-but-not-quite-PromQL part — that part could be a little different, or maybe it just passes them through normally and it works fine. But the other three definitely do work. Okay. And then the second question was: I noticed that the Grafana agent and the OpenTelemetry Collector are really similar, but they're not quite the same, and I think there was a fork at some point. Is there going to be compatibility between the two of them going forward? So, compatibility between the Grafana agent and the OTel Collector? Yeah — could we just use the OTel Collector instead of the Grafana agent? So the short answer is you can do whatever thing you want — there are lots of ways to do it. What happened was we originally created the Grafana agent as something lighter weight than a full Prometheus server for when you want to remote write metrics out. Because for many people, if they're going to remote write metrics out, they're not actually going to query that Prometheus server by itself, so that Prometheus server having its own TSDB and query engine is redundant and a waste of resources.
So the Grafana agent started out as Prometheus, but we ripped out the TSDB and the query engine and just kept the service discovery and the scraping, and said, that's the Grafana agent right there. It's come on a lot since then, but that's what it was like in 2020. And as a side note, the Prometheus agent was actually created after the Grafana agent, as a subset of the Grafana agent. We went to the Prometheus project and said, hey, do you want an agent? And they were like, we don't want all this extra fluff you put in there, but we'll take the core bit — and that's the Prometheus agent. So it actually came in reverse order, from Grafana out to Prometheus. The OTel Collector is interesting because it was a totally separate thing, but the Grafana agent now actually includes a ton of OTel Collector components that you can optionally use. So you can use the Grafana agent and only use Prometheus-based components, or you can use the Grafana agent and use a ton of OTel components. Some of these are direct code copies from the OTel Collector project, so they work identically; some of them aren't, because some are tweaked slightly to work better with our solutions with less work for the user. And so it's kind of a mishmash of whatever you think you'd want to do. There's an effort on our side where we're shifting towards the Grafana agent basically being our distribution of the OTel Collector, because it includes all these pieces. So you can use either — you can certainly write with the OTel Collector directly to our stuff, you can use the Grafana agent, you can even do a mix. I've seen people do a setup where they are collecting with the regular, what do you call it, vanilla OTel Collector close to their workloads, and then they're forwarding that into a Grafana agent, doing some modification and enriching of the data in there, and then sending it out to Mimir or to our cloud or to something like that. So you can combine these things together in whatever Lego-brick fashion you really want to, and everything works. Yeah — do you see concern with the split between Cortex and Mimir? Do you see concerns with the future of Cortex development? Concerns with the future of Cortex development — so that's a great question. If you were to go to the GitHub page today, you'd see that the maintainer list is four people now. It used to be a lot larger when we were directly on it, but when we forked it — I don't directly code on it, but all the Grafana people moved off of it after the fork happened. I think three of the maintainers work at Amazon, because I think it is the basis for Amazon Managed Prometheus, and then one of them is at Adobe — I could be wrong. There is an issue I found where the CNCF was asking them questions, and they said that they have a lot of extra contributors but they haven't added any new maintainers in a bit. So I don't think I'm in a position to make a lofty statement there, but I think that is a valid question. It definitely is still getting releases, which is why I included that they are doing two minors a year — so it's still moving. Looking back at the table, though, it doesn't seem to have the same velocity, certainly. It's doing releases, but not as many.
And I would say some of the major stuff they have in flight has been in flight for quite a while. So I'd say it's a slower-moving project, but I wouldn't say it's in massive trouble at the moment. Any other questions? Thank you. Thanks, folks. Okay, well, it's on and I'm not getting any. Oh, no, I am. I think I just need to put it on my shirt. Hello? Hello? Test — or how's that? Is that picking up? I don't think so. Hello? I'll use that. Hello? Yeah. We have a holder too if you want it. Oh, okay, perfect. That means your face is holding it. Yeah, that should do the trick. Thank you. Thanks for joining us. Tyler is visiting us from New Mexico, and I'll let him go on and introduce himself and his topic. Thanks, Tyler. Sure, thanks, Steve. Hi everyone, it's late on Saturday, so I appreciate you all being here. I'm thrilled to be here — it's my first SCALE, and I'm honored to be presenting to you today. So, yeah, I have an intro slide I'll get to in a second, but today I'll be presenting "Self-Healing Clusters: Game of Nodes and Scaling the Throne." It's some horrible wordplay, but it might resonate with some of you. I'll mostly be talking about a couple of different open source projects that we can dig into to understand how to make more resilient, stable Kubernetes environments. And feel free to interrupt me at any point if you have questions. A little bit about me: I've been working hands-on with Kubernetes and Go since 2019. I'm in charge of SpectroCloud's advanced projects team, so I like getting hands-on with all the latest Kubernetes tech, building POCs, and also working on our backend, and when I'm not doing that I try to get outside and play with rocks. Yeah, that's me — feel free to connect on LinkedIn. So today I'll just go over some of the challenges that people face — what makes clusters unstable, and common pitfalls — and then talk about the heroes of the story, these three open source projects, and then we'll have a demo, and maybe there will be a shameless plug at the end; we'll see. All right, so the challenge at hand. Basically, stability is key, and as everyone knows, with Kubernetes getting more and more traction we are now seeing production-grade workloads running mission-critical software in a couple of different categories. AI/ML use cases are big, video streaming, we have gaming platforms running in production, medical imaging — and when you have that kind of situation, you really don't want any downtime. We're also seeing a proliferation of clusters: bigger clusters, more clusters, more stuff going on as the ecosystem explodes. I don't know when the last time was that you looked at the CNCF landscape poster, but there's just so much to choose from, and it leads to increasing management complexity and growing pains. With that being the case, how can we prevent service outages and degradations? There are a few simple things we can do. A lot of the time you look into an environment and realize that every single pod is receiving the BestEffort quality of service, because people don't necessarily even know about quality of service — and that just relates to whether or not pods have specified resource requests and/or limits. So there are three different quality of service classes: Guaranteed, Burstable, and BestEffort.
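As a quick sketch of what those three classes look like in a pod spec — the image and numbers are placeholders, and the next bit of the talk explains the same rules in words:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo            # hypothetical pod
spec:
  containers:
    - name: app
      image: nginx          # placeholder image
      resources:
        requests:
          cpu: "500m"
          memory: 256Mi
        limits:
          cpu: "500m"       # requests == limits on every container -> Guaranteed
          memory: 256Mi
# Drop the limits (or make them differ from the requests) and the pod becomes Burstable;
# drop the resources block entirely and it's BestEffort -- first in line for eviction.
```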
But in a situation where everything's receiving BestEffort quality of service and you start encountering node pressure, anything could get evicted. So if you want a stable environment, you should probably be delineating what needs a higher quality of service versus what might not be so mission-critical. That's just one very brief example of something to be aware of. And I'll dig into it a little more: if you specify limits and requests for a pod and they equal one another, you'll receive what's called Guaranteed; if you specify at least some requests or limits but they don't all match, that'll be considered Burstable; and BestEffort means you just didn't specify anything. Then, when there's node pressure and things start getting evicted by the kubelet, you will be evicted sooner if you're in the BestEffort category — there are lots of things that get taken into consideration, but that's one of them. Okay, so what else makes a cluster unstable? Really, when you start thinking about stability, you have to understand pod eviction and pod preemption, and who's doing that, when. When node resources get low — aka node pressure — the kubelet will start evicting pods, and when it does that, it takes into consideration things like priority, quality of service, and many other things. Pod preemption, on the other hand, is where, if your nodes have filled up and the scheduler is trying to place additional workload, it needs to make decisions about how it's going to do that. It may evict running workloads that have already been online because there's a new workload that needs to be scheduled and it might have a higher priority — so priority gets taken into consideration with preemption as well. And there's a bunch of different stuff in the cluster that you can configure to influence the decisions that get made by the kubelet and the scheduler when you're running into a resource crunch. Some of that I've already touched on, but understanding priority and priority classes is a good place to start. By default, everything is going to receive the same priority, but you can assign a priority class name to any pod, which associates it with the integer-valued priority of that class — and in an ideal world you're segregating your workload based on how important it is to you, and assigning a higher priority to workload that you don't want to be preempted. And then resource quotas — they're double-edged swords. They're important because you might be operating multi-tenant Kubernetes clusters: a common pattern is to have a team per namespace, or you might be running third-party workloads in a truly multi-tenant cluster. In that case you want to prevent noisy neighbors — you want to ensure that namespaces receive a specific amount of capacity — and you do that with a resource quota. You can define the total number of objects, like config maps, services, and secrets, that can be provisioned in a namespace, and you can also dictate the total CPU requests, memory requests, storage, et cetera. But in order for resource quotas to be enforced, you need every pod to define requests — and if it doesn't, it won't get scheduled.
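A minimal sketch of that resource quota idea, against a hypothetical team-a namespace; the numbers are arbitrary.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"        # total CPU requests allowed across the namespace
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"                # object-count limits work too
    configmaps: "100"
# Once this is in place, pods in team-a that don't declare requests are rejected at admission.
```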
So this is another common pitfall: an application developer is trying to apply a resource into a cluster, an administrator has applied a resource quota to the namespace they're trying to schedule their pod into, and it's not getting scheduled and they don't know why — and it's because they haven't defined requests, or they've simply exceeded what's available, and that can take time to debug. That's where limit ranges come into play. You can create a LimitRange, which is a namespaced resource that defines defaults that get auto-assigned to every container in every pod scheduled in that namespace (there's a sketch of one after this paragraph). But that's also not necessarily ideal — it's a good catch-all to ensure that you're tracking everything and defining requests and limits, but you probably don't want the defaults all the time, so it's something to be aware of. Then network policies — that's another double-edged sword. If there are constraints around which pods should be talking to which other pods, then network policies are how you express that, because the default in Kubernetes is that everything and everyone can talk to everything else — which is fine in dev, but if you're trying to achieve any sort of segregation, you end up implementing these. And anyone who's dealt with a misconfigured network policy knows they can be tricky to debug, and when things stop talking to each other because of some policy change, that's going to lead to instability. Stateful applications — I could spend this whole talk on this, so I'll just gloss over it — but if you're doing things wrong, you're going to lose data, and there are entire solutions I'd recommend you look at, because if you want to roll your own in-house you have to understand the snapshot controller, backing up your persistent volumes, and best practices around StatefulSets. There are paid tools like Portworx that do a good job there, and open source solutions, but the long and short of it is: if you have a stateful app and you haven't configured some of these other things and it gets evicted, you're in trouble. Lastly, logging and monitoring — this one's just obvious: you need to know what's happening in your cluster, and by default Kubernetes doesn't have a cluster-level logging solution. With a vanilla Kubernetes cluster you're going to lose logs over time; not everything is retained. When a pod gets killed, the logs for the totality of its lifespan are lost unless you're aggregating them somewhere upstream. And then one last thing: resource quotas can be scoped, and that relates to priority classes. This is getting a little deeper, but if you want to define quotas in a namespace and you don't want to apply those quotas equally to all the workloads, you can stratify it by priority class, for example. So that's another, more advanced usage of resource quotas. Okay, so now what can we start to talk about for how to achieve stability? Node Problem Detector is one of our three heroes that we'll talk about shortly, but what it does, at a high level, is perform real-time health checks of low-level processes that would normally fall beneath the observability of the Kubernetes control plane. It can be useful for understanding the health of things like the container runtime you're running on your nodes, the kubelet, and other things, and you can surface events and node conditions which can tie into other alerting systems or even automated remediation systems.
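Here's the LimitRange sketch promised above, again against a hypothetical team-a namespace; the default values are arbitrary.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:     # filled in when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:            # filled in when a container omits limits
        cpu: 500m
        memory: 256Mi
```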
And then for topology management there's a whole suite of different solutions. Cluster Autoscaler, Vertical Pod Autoscaler, and Cluster Proportional Autoscaler all fall under the Kubernetes parent project — the autoscaling sub-project within Kubernetes. I'll be talking about Cluster Autoscaler a fair bit, but I'll just touch on these others first. A hint, if you're trying to understand your workload, is to use the Vertical Pod Autoscaler in its "Off" mode (there's a sketch of that after this paragraph). It has four modes, and its default auto mode will actually adapt your workload based on the resources being consumed — but that's a disruptive behavior, because if your initial spec is not in alignment with what's being consumed (say you're running a Java app and it's using a lot of memory), then it'll actually get bounced: the deployment gets resized and does a rolling replacement, so there's potential downtime, which is disruptive. I have a link here, and I can share these slides later, but there's an open pull request to take advantage of a new Kubernetes feature as of version 1.27, which is in-place vertical pod scaling — basically the ability to patch CPU and memory specifications online without having to do a rolling replacement of the workload. And, like I mentioned, you can run VPA in "Off" mode and it'll purely generate recommendations for how to size your workload, without actually doing anything disruptive — so that can be a good way to figure out how to right-size the things you have in your cluster. The recommendations just show up in the status field of the resource, and you can pull that out later and subsequently change your manifest. KEDA is another thing that, if I'm giving a talk about Kubernetes scaling, I kind of have to mention. It stands for Kubernetes Event-Driven Autoscaling, and it's also a whole talk unto itself, but it basically builds off of the Horizontal Pod Autoscaler, which is a native Kubernetes primitive, and allows you to scale based on external metrics. It also has scalers that relate to internal cluster state, but primarily it integrates with dozens of external systems — all the public clouds, for instance. With AWS SQS queues, you might want to scale based on the queue length metric, and it supports scaling to zero, which is cool: if there are no messages in that queue, you can bring your deployment down to zero replicas and then bring it back up accordingly. With regard to internal scalers, there are two worth mentioning: the Prometheus scaler and the Cron scaler. KEDA can scale workloads on a fixed cron schedule, which can be useful if you're trying to operate workloads on a follow-the-sun model or what have you, and the Prometheus scaler lets you scale using cluster-internal Prometheus metrics. So that's KEDA. Cluster Proportional Autoscaler is in beta. It's kind of an interesting concept, though: you take a fixed allocation for a workload, and then as the cluster size changes, the number of replicas for that workload — and also the requests they have — get scaled out proportionally. So if it had a request for one CPU and there was one node in the cluster, and then you added a second node, it would go down to half a CPU and there would be two replicas, one on each node. So it's an interesting concept, but it's in beta.
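Here's the VPA "Off" mode sketch mentioned above, assuming the VPA CRDs and controllers are installed and a hypothetical my-app Deployment:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app          # hypothetical workload to observe
  updatePolicy:
    updateMode: "Off"     # recommendations only -- nothing gets evicted or resized
# Recommendations then show up under the VPA's status for you to copy into your manifests.
```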
And then the final building block is the descheduler, which is another of our three heroes, so we'll hear a lot more about it. Is anyone here familiar with vSphere DRS? Yeah — distributed resource scheduler. I would say the descheduler is basically analogous to DRS. A lot of people come to Kubernetes and wonder why there isn't a solution to automatically rebalance workload, and that's where the descheduler shines. All right, a few more building blocks. Policy enforcement: another big change with Kubernetes, not so recently — as of version 1.25 they removed pod security policies, and the new thing is Pod Security Admission and Pod Security Standards, but they're sort of woefully insufficient for advanced use cases. There are three buckets: baseline, privileged, and restricted. Privileged is anything goes, restricted is you can barely do anything, and then there's sort of a midline option — but this doesn't meet the needs of a lot of enterprises, so what they end up doing is disabling it completely and going with a policy-as-code solution like Kyverno, OPA, or jsPolicy. I'm a big fan of Loft Labs, which is why I threw in jsPolicy, but it's not as mature as the other two. I would recommend Kyverno if you aren't already using Rego, which is what OPA uses. Kyverno is the most Kubernetes-native; it's not quite as mature as OPA, but it doesn't require using Rego, so that's nice. These just allow you to define very granular access control policies via webhooks, and it's way more granular than Pod Security Standards. Logging and observability: everyone knows about Prometheus and Grafana, I think — or if not, I'm sorry, but it's very common — and you need cluster-level logging like I mentioned on the previous slide. Fluentd is an open source project that does a great job of that, basically shipping the logs for all of your pods to a central location for querying. If you don't know what was happening when your pods were dying or things were crashing, you won't be able to debug anything, so that's something that's mandatory in production. And then, if you think you're doing everything right — or at least doing a pretty good job — and you want to understand whether you can actually live up to the SLOs you claim to provide, you might want to start playing with chaos engineering. Chaos Mesh is a project in that space that can basically just break things in your cluster: it can simulate node failures, take workloads offline at random, all sorts of different things, and of course you'd simultaneously monitor uptime for all of your mission-critical services. Okay, on to the heroes of the story. The first one is NPD, the Node Problem Detector. This is also within the Kubernetes project, so open source, and it has been seeing more traction recently. You can install it as a DaemonSet; you can also run it directly on the nodes — in some cases that's best, though it's a little more complicated to provision. The advantage in that case is that it's not subject to the availability of the container runtime, which it's actually monitoring, so in a lot of senses it makes sense to do that; it's just more work. What it does is submit events and node conditions to the API server, and it can also export metrics — so it has a concept of exporters.
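Quick aside before going deeper on NPD — to picture those baseline/privileged/restricted buckets, Pod Security Admission is driven by plain namespace labels; the namespace name and chosen levels here are just examples.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject pods that violate "restricted"
    pod-security.kubernetes.io/warn: baseline        # warn (but allow) on "baseline" violations
```

Okay, back to Node Problem Detector and its exporters.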
The default, or kind of the main use case, would be to export events and node conditions to the API server, but you can also export metrics to Stackdriver and Prometheus. Events are for less severe or temporary issues, and node conditions are interesting because they tie into the synchronicity of these three projects, which we'll talk about shortly: if you set node conditions on nodes, that can inform the descheduler to take workload off those nodes, and then Cluster Autoscaler can kick in after a certain point and decommission the node completely. So that is Node Problem Detector — and a little bit more: it has a plug-in system. The system log monitor watches logs on the nodes where it's running, there's a kernel monitor and a container runtime monitor, and one example node condition generated by the kernel monitor would be a kernel deadlock. Basically, you have a list of node conditions you want to set on nodes, and you set those conditions when you find log lines in certain files matching certain regex patterns. This can be useful, for instance, for detecting that a kernel deadlock happened on a node and then annotating that node with that condition, which lets you take action upstream and realize that the node needs remediation — you might be able to react faster that way, versus just waiting for the node to completely die and the kubelet to stop responding. The health checker monitors the kubelet and the container runtime (Docker or containerd) and likewise can produce conditions on the nodes like kubelet unhealthy or container runtime unhealthy. Then there's a custom plugin which allows you to run basically any shell script you want — they have an example in their repo for indicating that a node has an NTP misconfiguration. So it's very flexible: anything you can do with bash, and any condition you want — that's the other thing worth mentioning. These are just examples, but you can come up with your own that are meaningful to your own organization. And then lastly, the stats monitor. By default it populates metrics exposed via a standard Prometheus scrape endpoint, covering storage, CPU, memory — all the typical things you'd want to know about the behavior on that node — and they can also be exported to Stackdriver. The second hero of our story is the descheduler. It's basically DRS for Kubernetes — if you're a vSphere fan, you can think of it that way. What a lot of people don't know is that the scheduler doesn't actually evict pods for rebalancing purposes; it just does one-time scheduling up front, and then you get what you get. In a lot of cases that's fine, if the cluster isn't experiencing any disruptions, but what might happen is that a node goes offline and you end up with all of your workload on one of your nodes, and it's way over-utilized. And then, if you introduce a new node because your original node failed, nothing happens — you're going to have to manually go through the effort of reallocating workload or bouncing pods. The descheduler will do that for you. This prevents bottlenecks, enhances efficiency, and can save you money. It has a bunch of different strategies we'll get into, because there are different ways you might want to use it. That's the descheduler in a nutshell, and you can install it like anything else in the Kubernetes world — it has a Helm chart. You can run it in different modes.
In the demo we'll see it running as a CronJob, so it runs every one minute; you could also just run it continuously. Six default policies ship with the descheduler, and you can also write your own, but these are sensible defaults you might get value out of. They're all pretty self-explanatory, but low node utilization means the descheduler detects how busy nodes are, and if it deems them over-utilized it evicts pods to try to reduce the utilization — the goal being to balance the workload across the entire cluster. High node utilization is the opposite: it wants to detect under-utilized nodes and then evict workloads off of them to bin-pack the workload, basically. And that ties in with Cluster Autoscaler, because if you have a descheduler configuration of that nature, Cluster Autoscaler can start taking those nodes offline once there are so few pods on them that it determines it's okay to get rid of them. I'm not going to dig into the anti-affinity stuff, but it's there. And then this last one, remove pods violating node taints — this is where the sort of trinity of these solutions comes into play. Right now there are limitations, but there's been active work on this linked pull request here in the last few months. Basically, what happens today is that the node controller in Kubernetes will only mark a node with the no-schedule taint if certain conditions are set, which are sort of the default conditions set by the kubelet — PID pressure, memory pressure, disk pressure, et cetera. And in that case, if that taint is applied, the descheduler can remove pods from nodes with the no-schedule taint. That's super useful, but what we want to do is use NPD to set node conditions that are not the same as these kubelet-based conditions — we want custom things like a kernel deadlock — and that's not quite ready today, but it should be merged soon, and at that point we'll be able to leverage these three tools together. It's not that Node Problem Detector isn't useful today, it just can't specifically be used to achieve that goal. All right, Cluster Autoscaler. This runs on the Kubernetes control plane, typically in the kube-system namespace, and it has integrations with all the major clouds. If we want to do this bin-packing thing I've been talking about, one thing to consider is how you've configured the NodeResourcesFit plugin for the kube-scheduler. By default it uses the least-allocated strategy — this is how the kube-scheduler decides where to put incoming workload: it picks a node that has low utilization. But if you choose most-allocated, it does the opposite, which basically increases the aggressiveness of your bin-packing strategy, because the descheduler will have to do less work in that case. What Cluster Autoscaler does, at a high level, is change the sizes of the node groups that make up your cluster, whether that means adding or removing nodes, and it's configurable through annotations as well as a lot of other options I'm not going to get into — though some are good to know about. If you want to prevent a node from being taken offline completely, you can give it the scale-down-disabled annotation; pods likewise can be protected with the safe-to-evict true/false annotation, and daemon sets with the enable-ds-eviction annotation.
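To make the low node utilization strategy concrete, here's a rough sketch of a descheduler policy in the v1alpha1 format (newer releases use a profiles-based v1alpha2 format instead); the threshold numbers are illustrative, not the ones from the demo.

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:         # nodes below ALL of these counts/percentages are "underutilized"
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:   # nodes above ANY of these are "overutilized" and get pods evicted
          cpu: 50
          memory: 50
          pods: 50
```

And the protections mentioned just above are exactly that, annotations: cluster-autoscaler.kubernetes.io/scale-down-disabled set to "true" on a node, and cluster-autoscaler.kubernetes.io/safe-to-evict set to "false" on a pod.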
And then another thing is the pod priority cutoff. Some workloads you might just want Cluster Autoscaler to disregard completely, and you can do that by setting a priority for those workloads that's beneath the cutoff. The cutoff is configurable, but it defaults to minus 10, so you'd want a priority class of, say, cluster-autoscaler-ignore set to minus 15 or whatever, and then you could throw 30 pods with that priority class onto your cluster and Cluster Autoscaler wouldn't increase the size of the cluster even if some of them are pending. And how exactly does Cluster Autoscaler do what it does? Well, it depends on whether we're talking about scaling up or down. With regard to scaling up, it has this concept of an expander, and there are a few different strategies for expanders, but the gist of it is: if there are any pending or unschedulable pods in your cluster, Cluster Autoscaler will try to make a decision about how to add capacity to one of your node groups. The default is random. Also, node group isn't a Kubernetes-native concept — there's no primitive called a node group; it's more meaningful with managed Kubernetes, for instance GKE or AWS. A node group would correspond to an autoscaling group in AWS or a managed instance group in Google, but it also relates to Cluster API, which I'll talk about a bit. And you can get complicated there, like considering price or priority — for instance, if you have a GPU node group and you want to make sure that new nodes that get scaled out aren't GPUs, because you don't want to spend a bunch of money, there are ways to do that. And then Cluster Autoscaler scales down nodes when certain things are true. All of these things have to stay true for, by default, 10 minutes — you can configure that threshold as well — but basically a node has to have no blocking annotations, it has to have all movable pods, and the sum of all of the resource requests for all of the containers in all the pods running on that node has to be below another threshold, which is configurable. And then, for a pod to be movable, that ties into pod disruption budgets: by default, all the pods running in the kube-system namespace are considered unmovable unless you've specified a PDB that says, no, actually it's okay, you can evict these pods. So it can get a little complicated, and you might be in a situation where you're wondering why Cluster Autoscaler won't scale down your node — these are things to think about, and we'll see a demo of when and why that wouldn't happen. Lastly, it's compatible with 25-plus cloud providers and Cluster API, and I'll talk a little bit about an example architecture with Cluster API. For those who aren't familiar, Cluster API in a nutshell means you do declarative orchestration and provisioning of Kubernetes clusters from within a Kubernetes cluster, using custom resources. The management cluster is where you apply these resources, which define the specification for the target clusters you want to provision, and then there are different Cluster API providers for different backends, like vSphere or AWS — any cloud, any infrastructure provider you can name, there's probably a CAPI provider for it. And here in this reference architecture we would use Cluster API Provider Helm to basically finish off the provisioning process, because one caveat with Cluster API is that out of the box you don't get a fully conformant Kubernetes
cluster: you get a cluster that has everything it needs except the CNI, which is, you know, unfortunate — you have to get a CNI in there somehow before you can start scheduling all of your other application workload. That's where Cluster API Provider Helm comes in; it's basically an add-on system for Cluster API that lets you specify Helm charts to put on top once the cluster comes up, and then you might layer in the rest of your application workload with Argo CD via a GitOps approach. This is just a high-level picture of Cluster API, but Cluster Autoscaler in this case sits on the management cluster — and that's also configurable: it could be on the workload cluster, or it could be somewhere else completely. Cluster Autoscaler really shines with CAPI because you can do things like scale to zero and just perform declarative management of a whole fleet of workload clusters. We'll look at a little bit of YAML: basically we have a deployment for Cluster Autoscaler, it's in kube-system, and what we have here is the kubeconfig for a theoretical target cluster, capi-dev, mounted as a secret into the Cluster Autoscaler container so that it can talk to that target cluster and understand what's going on within it. Then there's this concept of node group auto-discovery — so, you know, what is a node group? It's not a Kubernetes primitive, but in this case we're telling Cluster Autoscaler to just consider all of the machines within this capi-dev target cluster as one node group. That can be more nuanced: you can break it down by labels and namespaces, et cetera. All right, so we have this deployed in the management cluster, and all we have to do is add a couple of annotations to the corresponding CAPI resource that was reconciled on the management cluster to create the target cluster. We just define the minimum and maximum size of that cluster, in this case one and ten. There are a bunch of different ways to define clusters in CAPI — it could be a machine set, deployment, or pool, and I won't get into the differences — but basically these are custom resources that define the size of the target cluster and the types of nodes that will comprise it. Some CAPI providers have native support for scale-from-zero, but even if they don't, you can add additional capacity annotations that basically say: even though the provider doesn't support scale-from-zero, this is the size of the node that will get provisioned, in terms of its CPU capacity, memory capacity, storage capacity, et cetera. And when you annotate your machine set, for instance, with all of that information, Cluster Autoscaler has everything it needs to understand when to add and remove nodes, which is pretty cool. All right, let's make them work together. So basically I have taken a few shortcuts. I have an environment in Google Kubernetes Engine that I'll show you in a moment, and the first thing I'll do — I have a single node running in it — is deploy enough pods to create resource pressure. Then we'll watch as Cluster Autoscaler provisions a new node, and subsequently the descheduler rebalances pods to get an even spread between the two nodes, which wouldn't happen by default. Then we'll update the descheduler config to do the reverse — basically to bin-pack all of our workload — and I'll also delete some of the pods I created to initiate that resource pressure, and then we'll watch as the pods are all allocated on a single node and Cluster
Autoscaler de-provisions the extra node. And then lastly we'll write some messages straight to the kernel message log and watch as NPD updates the custom conditions I've configured, and we'll see that in the Google console. I'll start doing things in K9s — I can't see this here, I can only see it there, so that's a bit weird. Here's my SCALE 21x demo cluster. I've got one node, and I'll just show you a few GKE-specific things first. There's this concept in GKE of an autoscaling profile; by default it won't be set to optimize-utilization. This basically tells Cluster Autoscaler — because there's a managed Cluster Autoscaler that's being configured and operated by Google on my behalf, so I didn't have to install it in my cluster; it's as simple as clicking a button in GKE to enable this feature — and now, because I have this profile enabled, it will be aggressive in terms of decommissioning and commissioning nodes. It just happens faster; it reduces some of those thresholds. And if I go to the nodes and click into this default pool — basically I'm just trying to show you that it's as easy as a click of a button with GKE, but it's being really slow — I've configured it to have a minimum of zero and a maximum of three nodes, and instead of having you watch that spinner I'll just move on. Up top here I'm showing all the nodes in the cluster, and down below I've got all the pods, and I've created a bunch of pod disruption budgets. Like I mentioned earlier, certain things will not get evicted by default unless you create a PDB — things in kube-system are considered unmovable — so unless you apply these PDBs, you'll see Cluster Autoscaler failing to scale down your environment; that's why these are all here. And I've got the descheduler installed, running as a CronJob, so if I get the cron jobs you can see it's scheduled to run every one minute. And if we look, I've applied an annotation which basically says Cluster Autoscaler should not care about any of these descheduler pods — they don't get terminated right away as the jobs complete, so they start to accumulate, and I just want Cluster Autoscaler to ignore them so it can be as aggressive as possible in decommissioning. I've also got NPD running — I only have one node and it runs as a DaemonSet — there we go, we can see Node Problem Detector is running, and so far it hasn't really done anything other than check for a bunch of conditions and say they're false. We can see that by going back to the nodes, describing this node, and going down to its conditions: Node Problem Detector has added things you typically wouldn't see here, like "no corrupt Docker overlay" and "kernel has no deadlock" — these are all added by NPD based on the fact that it hasn't detected any problematic log lines — and then we have the standard conditions the kubelet creates, like ready, memory pressure, disk pressure. Lastly, we have the descheduler configured via this ConfigMap, which is unreadable for humans, so I won't show it to you that way, but this is the current descheduler config. I'm not going to go over it in excruciating detail, but basically I have low node utilization enabled, and we have some thresholds: any node whose total resource consumption falls beneath the specs here — and it has to be for all three — will be considered under-utilized. So fewer than twenty pods, less than a hundred percent CPU utilization, less than ten percent memory: that's under-utilized. And then any
Any node that has all of those same metrics above these values will be considered overutilized, and if that's the case, de-scheduler will evict pods from it to try to reschedule them onto a less-utilized node. So that's what we have. Now I'm going to create some arbitrary workload, and pretty soon we'll see pending pods. Once these are pending, that's going to trigger cluster autoscaler to scale out the capacity for the node group; we only have one node group, but it won't take very long, and then we'll have another node. Then we'll change the policy and flip it; actually, we'll wait a second to let de-scheduler balance everything first. Let's give that a minute, and maybe while we wait this will have loaded. In GKE it's as simple as ticking this box and specifying your threshold. This is similar to the annotations I was showing you for Cluster API, but it's a white-glove integration, and you can view the cluster autoscaler logs in Google's Log Explorer. What I'm going to do is hopefully show you the scaling decision that gets made. Oh, I think that just happened; can anyone see this? How's that? We just triggered a scale-up. Cluster autoscaler triggered a scale-up because it saw this pod that was in Pending, and if we go back, yep, it's there. Not only that, but it's actually balanced; by default we would have had forty-some-odd pods on that first node. I apologize, because I was being too slow and we missed it, but we can go back and find it in the events. If we go down to the latest events we see a bunch of evictions based off of taints; everything you see here with taint-manager eviction was workload being evicted because a node was tainted. Let's go to the de-scheduler pods and look at some of their logs. The last one ran a minute ago; okay, it didn't do anything useful a minute ago, we might have to go further back. Maybe I'll just show you on the scale-down, since I was a little too slow, but the gist, as you can see, is that this rebalance happened. What I'll do now is flip the policy. If I show you the de-scheduler "high" policy, it's just the inverse: LowNodeUtilization is disabled and HighNodeUtilization is enabled, and that will cause the workload we currently have balanced to become bin-packed. So I will apply that; oh, that's not what I wanted to do. Okay, that's been reconfigured. Then I'll go delete a few of these things: the high-memory and the low-cpu deployments I'll just kill completely, because if I don't get rid of them there will still be too much going on in the cluster for the node to be deprovisioned. And the high-cpu deployment I'll just scale down to one. What we should see soon is that cluster autoscaler will add a condition to the node that gets rebalanced, mark it for deletion, and it'll get taken down; but this time we'll watch de-scheduler hopefully do something interesting. So de-scheduler detected an underutilized node and evicted some pods, and you can see the ratio has shifted; if we give it another minute that ratio will shift even further, so we have to sit tight for thirty seconds. Once the number of pods on the node gets down to about nine, the sum of the resource requests, the CPU and memory requests on that node, will be sufficiently low that it falls beneath the threshold and the cluster autoscaler considers it a candidate for deletion.
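The de-scheduler policy being described might look something like this sketch; it uses the v1alpha1 format, the threshold numbers simply mirror the ones stated in the talk, and newer de-scheduler releases use a profiles-based format instead:

```yaml
apiVersion: descheduler/v1alpha1
kind: DeschedulerPolicy
strategies:
  LowNodeUtilization:
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:        # a node below all of these is considered underutilized
          pods: 20
          cpu: 100
          memory: 10
        targetThresholds:  # a node above these is overutilized; pods are evicted to rebalance
          pods: 20
          cpu: 100
          memory: 10
  HighNodeUtilization:     # the "flip" applied later to bin-pack instead of spread
    enabled: false
```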
Another thing we can do is just look at what's on that node. Fortunately for us, these de-scheduler pods won't get considered at all because of the annotation I added, and what remains is almost entirely default pods that are going to be there all the time because they're kube-system workloads for the control plane. This is again to reinforce where the PDBs come into play: if I didn't have the PDBs, this node would be blocked from a scale-down event just because these workloads are here; because I do, it was able to drain that node. And if I go back, it is starting to thrash a bit, but if we give it another minute I'd say it will mark that node as a candidate for deletion. While we wait I'll tee up the final thing I want to show you, but before we do that, one more thing: some different cluster autoscaler logs. Sometimes cluster autoscaler isn't making a decision, and you can debug that by expanding some of these log lines where we see "no scale down". Okay, why did no scale-down happen? In this case there was no place to move pods; that's one example of why a scale-down event wouldn't occur. And this is another example, which I didn't touch on earlier: you don't ever want pods in your cluster that aren't controlled by a replication controller. If a pod isn't backed by a Deployment or a StatefulSet, pods of that nature can also block scale-down events, because there's no guarantee that workload will get rescheduled, so cluster autoscaler is hesitant to take a node down if that type of workload is found. So if there's a pod that was scheduled on a node as an ad hoc pod, like you just did `kubectl run nginx` or whatever, and it's a one-off with no parent controller, then there's no way for cluster autoscaler to know that if it evicts that pod it will get recreated. In that case it will not initiate a scale-down on any node that has a pod of that type. All right, there we go: it marked the node as to-be-deleted, it now has a bunch of different taints, and in a moment it'll get deleted, but I won't make you watch and wait; I'll move on. Google's killing my demo session, so I want to make sure I get the node that's not about to get killed, and I'm going to dig into the underlying VM instance and open an SSH connection. Let's see while we wait... okay, it's still there... okay, it's gone. So yeah, changing the policy caused the bin-packing to occur, and cluster autoscaler took that node out. One last thing really quickly; I might move on to my takeaways while we wait for this and then circle back. Well, we have ten minutes; any questions while we wait? Oh man, this was working during the last session. Totally. So I will elevate privileges, and then over here, I did that earlier. Okay, we're going to tail the NPD logs, and we'll see, the instant we write this message into the kernel log, that NPD picks it up. I think I copied some other... no, okay, I got it. I basically simulated a kernel panic, or sorry, a kernel deadlock, and NPD picked that up. Obviously this is contrived, but if that happened, NPD would have noticed, and if we go to the node and look at its conditions, it now has KernelDeadlock set to true. If we cut back over to Google we can see that got surfaced in their UI, and now things are not good. That's just one example of what you can do with NPD, and shortly, hopefully in a few months, we'll be able to apply taints to nodes directly based off of conditions created by NPD, and then de-scheduler can act on those taints.
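For reference, the condition NPD flips in that demo shows up on the Node object roughly like this; the reason and message strings are illustrative, since the exact values come from NPD's kernel-monitor rules:

```yaml
# excerpt of `kubectl get node <node-name> -o yaml` after the simulated deadlock
status:
  conditions:
  - type: KernelDeadlock
    status: "True"
    reason: DockerHung                                          # illustrative rule reason
    message: "kernel: task blocked for more than 120 seconds"   # illustrative matched log line
  - type: Ready                                                 # standard kubelet-managed condition
    status: "True"
```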
All right, so we want stable Kubernetes clusters, and there are a lot of things getting in the way of us achieving that, but there are also a lot of ways we can achieve it. It just takes understanding a lot of these Kubernetes primitives and the pieces available to us in the ecosystem. Configuring things like NPD, de-scheduler, and cluster autoscaler can get us there, and we should be proactive by understanding how to configure these tools and what they have to offer. If we combine that with PodDisruptionBudgets, resource quotas, and limit ranges, we'll hopefully have a more robust cluster. Lastly, if we can leverage the power of the API server to provision our clusters declaratively using Cluster API, then I think we're in a really awesome place. Shameless plug: that's what we do at Spectro Cloud. We're a Kubernetes management platform; we have an abstraction called a cluster profile, which you see here, which models the cluster including de-scheduler, cluster autoscaler, and NPD, and you can deploy it with the click of a button to basically any cloud. It's built on CAPI, and we upstream a lot of stuff to the various CAPI providers, like CAPI vSphere and the Cluster API provider for MAAS, which we contributed. So if you're interested in CAPI and autoscaling, that's us, and thanks for your time. Thanks, Tyler. Does anybody have any questions? I've got one: is it possible to implement your own custom plugins to do these health checks of things that are running in the cluster, or even health checks for cluster nodes if they're on physical hardware, for example? Yeah, so NPD's custom monitor plugin, I believe, supports any shell script: if the shell script exits with a particular status, it'll set a particular condition to either true or false. So you can do that with NPD. De-scheduler is also highly extensible; I just showed the six out-of-the-box default policies, but you can write your own. Cluster autoscaler, not so much, but there are ways. Everything with Kubernetes, and this is one of the cool things about it, is very extensible. Okay, anybody else got a question? Okay, join me in thanking Tyler for a great presentation. Next in this room, in about fifteen minutes, we actually have another talk on Kubernetes resource management, so if you found this talk useful there's a fair chance you'll find the next one interesting as well. Thank you for coming. Thanks, everyone. I still don't want to go over, but I appreciate it; they're good chocolates. Okay, is this thing on? Not yet, just testing. I'm in Seattle. Is this thing on yet?
Sounds like it's kind of on. I'm going to get the height adjusted, because part of this is going to involve me having hands on the keyboard and I'm not going to be able to hold it. Actually, that's as high as it goes, so I guess that's it. This is the last thirty seconds before we officially hit six fifteen for the last talk of the day. Okay, I think we're there. Join me in welcoming Reed, coming down here from Seattle, to give the talk on Kubernetes resource management. I'll let him proceed with introducing himself and let us know whether he wants to take questions during the presentation or hold them to the end. Awesome, yeah, thank you very much. I'll address that question up front: since it's such a small audience, absolutely feel free to engage me during the presentation. It's timed, without any interruption, to last about forty-five minutes; it might end up getting tweaked with your engagement, but I'm okay with that, so definitely feel free to ask questions. A lot of it is going to be a live demo, though not as live as Tyler's demo before mine; if you saw that, I was highly impressed, very brave doing a demo over the Wi-Fi live at a conference. My demo is going to be minikube; it's going to stay on my own laptop because I don't trust the Wi-Fi. So with that, welcome to Demystifying Kubernetes Resource Management: everything you've always wanted to know but were afraid to ask. Agenda-wise, I'll give you an outline of how I'm planning to address this topic. I'm going to spend a little time up front level setting and making sure we're clear on the question: why do resources matter in Kubernetes? Most people who have run their own clusters or run a platform probably have a pretty good idea of the answer, but I'll tell at least one anecdote that one of my colleagues and I have been using as a basis for putting this together. Once that baseline is established and we've quickly reviewed requests and limits and the basics, the majority of the talk is going to be two hands-on live experiments or demonstrations: one on CPU resources specifically and one on memory resources specifically. There are of course other kinds of resources in Kubernetes; I think I mentioned them in my abstract, and then once I started trimming the talk down to forty minutes they unfortunately didn't get a lot of airtime. They'll get mentioned a little later, but it's mostly CPU and memory today. Then there's a little bit at the end: okay, great, we talked about some of the problems and what happens during contention, so what can we try to do, what did my buddy Rafa do, to point everything in the right direction and do as well as you can given the realities of the resource abstraction. A little bit about myself: my name is Reed. As was mentioned, I'm up in the Seattle-Tacoma area. I've spent about twelve-plus years working primarily on IT automation problems, and I've been doing Kubernetes for a little over two years now at this point. Prior to that I mostly worked with Puppet, and I loved the transition from one declarative management tool for more traditional
single-operating-system Linux stuff into: hey, this is the same philosophy, but it's more, what's the word, where stuff doesn't change, I can't remember it right now... immutable, that's what my brain was going for. Outside of work I like to do a lot of hiking and a little bit of mountaineering, because Seattle is just a beautiful area for that, and I try not to work too hard; I recommend the same. So, getting started: the Kubernetes resource abstraction, again for level setting. Resources in Kubernetes are things like CPU, memory, and ephemeral disk, which is another built-in one; GPU is not built-in, but it's the most common one besides those. In Kubernetes the idea is that nodes have resources, pods need some of those resources to run, your cluster is a whole giant collection of nodes, and Kubernetes' job is basically to figure out how to place pods on those nodes in such a way that all of the resource requirements are satisfied. Great, so that's what it is; but why does it matter, or why is it worth putting together a giant talk about, and I say giant, it's forty minutes? Story time. At StormForge I work closely with a friend and colleague named Rafa, Rafael Brito. In 2016 he was tasked with leading a team at a large bank to implement a new Kubernetes platform. This platform was to serve a wide variety of internal customer teams. Rafa's responsibility was to provide the platform, but he wasn't going to be responsible for running most of the workloads directly; there were lots of individual teams, and he was just the platform manager. This being early days for Kubernetes at the bank, and him still figuring out how all this stuff worked, his initial strategy for dealing with resource requests was pretty simple. He called it benign neglect. He figured he would use some built-in resources to point his users down a path by providing, or suggesting, some generous default request settings for CPU and memory, and then step out of the way and let the developers figure out, manage, and maintain their own resource request settings all by themselves. Mind you, this was early-days Kubernetes: it was not on a cloud, it was on-prem, so autoscaling wasn't really a thing and cluster capacity had to be pre-planned to some degree and rationed. Okay, so he onboards the first few teams. Shockingly, within a couple of days, and this is not even the whole set of customers, just the first pilot teams, he started getting calls from people complaining that their pods weren't being scheduled due to a lack of resources, mostly CPU resources. So he went to talk to his VMware team providing the nodes, and they told him he was averaging about fifteen percent CPU utilization across the entire cluster. I've actually talked to a couple of people over the conference who told similar stories; ten percent was the number I mostly heard, so at fifteen Rafa was actually doing pretty well. But in 2016 Rafa's policy of benign neglect had revealed one of the two major problem categories that come from poor resource management in Kubernetes: in this case it was cost, from over-provisioning of resources for the workloads you're running. The other major category of problem from poor resource management is
under-provisioning, and arguably that one's worse. Under-provisioning would likely have shown up in the form of poor reliability or unpredictable performance of the workloads he was running. So he kind of got lucky with cost: it's expensive, but at least things were working; well, not everything was working, because you couldn't schedule the workloads. Moving on. That's just one story; almost everybody I talk to about this who's been doing Kubernetes a while has a version of their own, so it's not an isolated problem, but I wanted to give at least one for people to latch onto. All right, diving into resources, now that we know they're important for a couple of reasons, cost or reliability. Most of the slides I put up are going to be talking at one of a few levels of abstraction. I've labeled one "the kube API": the user-facing abstraction for requests and limits, the stuff you see in YAML for the most part. I've labeled one "kubelet", but really this is just nodes: okay, a workload got to a node, and how did it get to that node based on resources; this basically relates to pod scheduling. And finally I've labeled one of the layers "cgroup", or cgroups: the workload is running on the node, but there has to be some implementation of that abstraction, and it turns out for CPU and memory it's basically all cgroup-style stuff. So sometimes when I'm talking about what's going on, it'll be at that level: there's a process running on Linux, it has a cgroup, and so forth. Final bit of level setting, resource basics, as a reminder for anybody who hasn't seen it: requests are the minimum resources a container is asking for guaranteed access to, and that's an input to the scheduler, which is going to try to find room for those requests somewhere across the cluster. Limits are a maximum that a container should be prevented from exceeding in the event that it tries. The two are not the same; I just wanted to emphasize that. As I said earlier, requests are the ones that matter in terms of allocation per the scheduler. The scheduler is going to try to make sure it puts workloads on nodes in such a way that it never exceeds the node's total capacity: if the node has two CPUs, the sum of requests of pods running on that node is never going to exceed two CPUs. The scheduler never over-provisions according to requests. And I'm getting into over-provisioning because resource management is really simple if you never over-provision, so that's mostly what we're going to be talking about. Over-provisioning is technically possible whenever the requests and limits aren't equal, including when limits are not set, as for many workload types, including Rafa's earlier on. Over-provisioning to some degree is usually desirable for cost optimization for most workload types, though there are exceptions. So the focus of the rest of the presentation and the demos is going to be: how do we over-provision, should we over-provision or not, when and why, and what happens when you do. The reasons are that over-provisioning leads you toward some savings on cluster costs, but you're getting that by sharing resources, to a degree, in terms of who they've been allocated to. The flip side is reliability: resource exclusivity is the far end of that spectrum, but it can potentially cost quite a bit, and money has gotten very expensive in the last few years.
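As a minimal refresher of what those two settings look like in YAML, here's a sketch of a container spec with both requests and limits set; the names and values are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-basics                   # illustrative name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    resources:
      requests:        # minimum the scheduler reserves on a node for this container
        cpu: 500m
        memory: 256Mi
      limits:          # ceiling enforced at runtime; not used for placement decisions
        cpu: "1"
        memory: 512Mi
```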
So, last slide before we dive into a demo and start focusing on this question: is over-provisioning safe? What are the consequences of over-provisioning for CPU, and what are the consequences of over-provisioning for memory? I think that's about it for me talking without actually showing anything. There are a couple of different ways, or philosophies, for doing demonstrations. In the talk before mine, Tyler started by saying here's what I'm going to do and what I expect to see, and then he did it. I'm going to do the flip side: I'm going to run an experiment without necessarily telling you what I expect to happen up front; feel free to make your own guesses as we go, and after it's done I'll explain what happened and what went into it. I noticed in the last talk there are two philosophies here and I've definitely chosen the opposite one; we'll see how this goes. My lab environment is minikube. It's running on my laptop, at least I hope it's still running on my laptop; I'll tab over in a second and we'll find out. I've got a monitoring stack, because I want to show you what workloads I have running and what resources they're using as I go, and I've also got a dedicated node to run these workloads on. This is important because I can force exactly the resource contention and allocation situation I want. A real cluster is a complex system; you don't often get to force everything under a magnifying glass and see it exactly where you want to see it. So one note is that I'm showing everything in a microcosm; when you blow this out into a complex system, the interesting thing is that you can't really predict exactly what's going to happen all of the time. Can this still work? That still sounds like it's working, excellent. I also wrote myself a little driver script, because I'm terrified of typing too many kubectl and yq commands, so what I'm going to do is run `go run ./demo.go`. I'll show you what it does for me: a help listing with a bunch of different steps written out. I'm going to start by walking you through the inspect-nodes step just to prove what kind of cluster I have running; actually, let me clear that out first. I have a demo cluster, and if I run `kubectl get nodes`, excluding my control-plane node, which is also running my monitoring stack, we see the one node, demo-m02. That's my test node for the workloads, and it's ready to go. Because I'm talking about resources, I want to show you how many resources this node has, so I'm going to show you allocatable and capacity for this node. Capacity is basically the raw measure of this node's resources: it's got two CPUs and about eight gigs of memory, and I'm going to ignore almost everything else for the purposes of this talk, because I'm already twenty-five minutes in... actually, no, I'm not, I started at six fifteen. Okay, I'm going to slow down, we're fine; in all my practice sessions I started on the hour. So capacity is the raw resources the node actually has. The difference between capacity and allocatable is that the kubelet typically reserves a little bit of capacity for itself and potentially other system-level resources, so that when workload pods get scheduled onto the node they're not interfering with that reservation slice. Allocatable is therefore typically different than capacity. This is minikube with a dedicated worker node, and I realized
after I put the talk together that minikube doesn't do that: minikube basically makes the entire capacity available for running workloads as well. This is bad; I'll explain why when we get to the memory demo. All right, so that's an overview of the environment. Moving forward, experiment number one, focusing on CPU resource settings and CPU contention. Again, we'll do something interesting and then explain the results. For the CPU demo I'm basically going to start by loading up some resources. I also realized, watching Tyler's demo, that I should probably learn K9s; I don't operate that way, I just run kubectl commands. Maybe that's weird, I don't know, I should do a poll; I'm seeing some head shaking, that's good. So I've got a directory full of resources for CPU. Basically I'm going to create a couple of Services, because I've got to talk to the pods, and three Deployments. The first deployment is what's called the best-effort workload: it has no requests and no limits. The second is one I called requests-with-no-limits, so it's a burstable workload. And the last one has requests but also limits, and they're not equal, so it's not guaranteed. I'm not going to talk a lot about quality-of-service stuff until we get to memory, but just tag that in your head if you're familiar with it. They should be running by now; we can see that they are, they're all up. Just to confirm, they're all running on the same node; this is important for the sake of the experiment and doing science, because if they were on different nodes this would not be very interesting, or at least I couldn't force contention. And I said some have requests and some have limits, so let's take a look at what those are. A lot of these commands, by the way, I've scripted because there's way too much yq; I don't want to show you all of the YAML, but I do want to show you the actual values. So here we go: kubectl get pods, and then ferret out just the resource requests for these things. There's the best-effort one, it's got no requests for CPU as expected; the two that have requests are both requesting 500 millicores; and the last one also has a limit set of one core. I promise I won't keep it all CLI and YAML; for the next piece we'll actually take a look at a graphical interface. I'm going to show you the monitoring stack so we can watch the consumption of these as we go. I've got to start that port-forward, so let's run that down here; that's forwarding, and this URL is what I want to take a look at: some workloads. I think Tyler also mentioned Grafana and Prometheus, and cAdvisor; wonderful tools, you've almost certainly heard of them if you're using Kubernetes at this point, and if you haven't, definitely check them out. Let's switch over to CPU. This is basically just a page where I can see how much CPU these workloads are using and how much memory they're using on this node. They're currently using next to no CPU and have a baseline set of memory requirements. These pods, by the way, are a tool called resource-consumer. Resource-consumer is part of the Kubernetes project; it's used for testing, I assume because the kubelet cares about resource usage; I actually don't know what they use it for, I just know what I use it for: it lets me tell these pods exactly how much CPU and memory to consume.
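As a sketch, the third of those deployments (requests plus a non-equal limit) might look like this; the image tag is an assumption about the Kubernetes resource-consumer test image, and the other two deployments differ only by dropping the limits or the whole resources block:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: requests-and-limits              # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels: {app: requests-and-limits}
  template:
    metadata:
      labels: {app: requests-and-limits}
    spec:
      containers:
      - name: consumer
        image: registry.k8s.io/e2e-test-images/resource-consumer:1.13   # tag assumed
        resources:
          requests:
            cpu: 500m        # the per-pod request used in the demo
          limits:
            cpu: "1"         # one full core, deliberately not equal to the request
```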
On a whim, before I switch over, I'm going to set this to just the last two minutes, because we're going to be looking at really closely spaced events here. The next thing I'm going to do is load these things up with some CPU usage. Every single one of these pods is going to have a baseline consumption of almost half a core; again, the node has two full cores available. Those are the curl commands, but basically it's 450 millicores of baseline consumption for each of these, and we can immediately see that consumption begin over in Prometheus. Total CPU usage over here is about 1.4, and I've got a stacked chart showing each of these workloads, so they're each consuming right around 450 millicores. No contention so far: they're all under their request, if they have a request, and the node has available CPU. Okay, let's make something interesting happen. I'm going to have one of these containers start over-consuming, the one that has no limits. I'm going to ask it to consume an additional two full cores of CPU. Remember, the node only has two cores and it's currently already using about one and a half of them, so when this pod starts trying to over-consume, something's got to give: either it won't be able to do it, or it'll take resources from one or more of the other pods, and if it does, which ones can it take from, and how much? I'm just going to fire that off and see what happens. All right, that's running, and this is refreshing every five seconds, but I'll hurry it along a little bit. Okay, so we just saw that consumption start; the blue one on top is the over-consuming pod. This will run for about fifteen seconds before resuming the normal 450-millicore baseline. Notes on what happened: this top one is consuming up to the node's limit of close to two. There are some measuring artifacts; if it goes above two, it's not actually using more than two, that's just some Grafana and Prometheus summing stuff. It consumes up to two, but it's only getting about 1.48 itself. And while it's doing that, what happened? The pod with no requests at all looks like it's getting a tiny bit of CPU time, but it's almost starved; it's getting very little. The one with requests and limits is actually cruising pretty evenly at its baseline of 450 millicores; it doesn't really look like it was affected throughout this entire thing. And then, fifteen seconds later, the over-consumption ended, and this blue pod went back down to 450. Interestingly, the one that had been starved actually started over-consuming for a minute; different workloads will do different things, but note that this one was like, oh no, I didn't have any CPU, now I'm over-consuming, before leveling out again. Interesting. I'll show one more thing and then explain what was happening. The last thing I'm going to show is something similar, but I'm going to ask the one that has limits to over-consume. This one is limited to one core, recall, so in theory it shouldn't be able to actually get up to one and a half even if it tries, which it will. All right, that started, and we can see something very similar on the graph. It's not going as high; it is limiting itself at right around one during this period. It's still taking CPU away from the green one at the bottom, the best-effort pod, the one with no requests and no limits, and after it finishes, that green one spikes again before going back to normal. That's the experiment; it's repeatable, it's fun in minikube. So, going back to why this matters: what exactly
happened here, what was that behavior, and what can we learn from it, or use from that knowledge, to inform what we do with our clusters? I'm going to start by deleting all of that, because I'll be coming back in a minute and I don't want those pods eating my CPU. Review: no requests meant essentially no CPU time during contention, ish; there's going to be an asterisk on almost every one of these, by the way, because this stuff is complicated and I'm trying to keep it simple for forty-five minutes. If you were using less than your CPU request, it didn't look like there was any interruption, not really even during contention; you mostly got what you asked for. And one of those workloads, when it got starved, seemed to over-consume later. The keyword there is that CPU is a compressible resource: if you don't get it now, you might get it later, assuming some is available. That might suck for your workload, it might not meet your service-level requirements, but it will happen. So what's going on here? Diving down to the cgroup level; cgroups are where all the rubber meets the road in terms of how requests and limits are actually implemented in Kubernetes. Cgroups and the Completely Fair Scheduler, the Linux process scheduler, or task scheduler, I'm not sure what the exact technical term is, are what actually run processes on the cores. CFS is a proportional scheduler: you can assign shares to any cgroup that basically say you get this percentage share of the available scheduling time in the event that you try to use it. When Kubernetes does this allocation, it says: if you've asked for one full CPU, I'm going to translate that to 1024 CFS shares when I implement your container on the node, and Kubernetes assigns CFS shares according to the request you made. If nothing else on the node is messing with cgroups, this abstraction model can basically result in: okay, you said you need one core and there are two cores available, that's fifty percent of the node set aside for you. If anything else is creating cgroups and assigning shares, by the way, all of this goes out the window; it doesn't work that neatly. Quick illustration: say you're on a two-CPU node and you've got six total Kubernetes workloads assigned to it; two of them have no requests and the other four have some level of request. In this case each of these is assigned a number of shares relative to the number of millicores requested. Because nothing else is running on the system with shares, that has an interesting effect: the one that asked for 150 millicores gets 153 shares, and it's actually going to be given priority for up to about 284 millicores of CPU, as long as these are the only processes running on the node. Because it's proportional, it's not actually about the number of shares or millicores you requested; it's about the proportion of shares you hold on the node based on that request. This is how Linux would decide priority allocation for these processes. The other two that made no requests, there's an asterisk here: I literally learned during this conference that Kubernetes probably assigns two shares to things that have zero requests, which is really interesting, as opposed to a flat zero. There's a lot of hard-coded stuff, by the way, in how Kubernetes implements this that you just happen across; I'll mention some more later.
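Here's the arithmetic behind those numbers, as a sketch; the 150-millicore request is from the slide, and the total of roughly 1,077 shares across the node's pod cgroups is an assumed value chosen so the result matches the ~284-millicore figure quoted in the talk:

```latex
\text{shares}_i = \left\lfloor \frac{\text{request}_i\ \text{(millicores)}}{1000} \times 1024 \right\rfloor,
\qquad
\text{shares}_{150\,\mathrm{m}} = \left\lfloor \tfrac{150}{1000} \times 1024 \right\rfloor = 153
```

```latex
\text{guaranteed CPU}_i \approx \frac{\text{shares}_i}{\sum_j \text{shares}_j} \times \text{node CPU}
\approx \frac{153}{1077} \times 2000\,\mathrm{m} \approx 284\,\mathrm{m}
```

The point of the second line is that the guarantee is relative: the same 153 shares buy more or less CPU depending on what every other cgroup on the node has requested.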
Those two workloads with no requests are, metaphorically, what I like to call flying standby: if there's a seat available, if there's CPU time available, they'll get scheduled and they'll do great. We saw that early on, when the node had two cores and each process only wanted 450 millicores. But if anything that has shares is asking for those CPU slots, these guys get basically nothing; that's why we saw the process with no requests get completely starved out. This has implications for how we assign requests to pods running on a cluster, especially if we're in Rafa's position and have to tell all these developers what they need to do. No requests means no priority, so be careful there. Limits are complicated enough that in a forty-minute talk I wanted to put a QR code on the screen for an excellent article that walks you through all of the interesting caveats and provisos that come with them. They basically work, but for workloads that are latency-sensitive there are some interesting implementation details that mean they don't work quite as well as you'd hope, and you can get really poor performance from workloads that are being throttled as a result of the limits implementation. Limits also come from a cgroup setting, CFS quotas, but quotas are not evaluated continuously; they're evaluated at intervals, and if you exceed your quota in one interval you might not get scheduled again until the next interval occurs, which is an unexpected lag time, hence latency issues and so forth. There are a lot of good articles about this; they're all very long and the diagrams are all very complicated. For our purposes, at the level of this talk, what it really comes down to is that requests are pretty important. If you have no requests you're basically flying standby: you're not guaranteed any CPU time at all, especially once contention starts occurring on your node, and you're potentially subject to near-complete starvation. If requests are all set, you're basically guaranteed a minimum amount of CPU time, the amount you requested. This has an interesting reflection on limits, because limits are often implemented to try to mitigate noisy-neighbor issues, where one workload could take cycles away from another. If requests are all set correctly, you're never going to be able to take away from a workload the resources it has properly requested, so limits aren't necessarily that critical for noisy-neighbor situations as long as requests are set properly. A lot of people will argue, and take the position, that you just shouldn't set CPU limits at all as long as requests are right, and that's largely the side of the fence I land on; I'm open to debate if anybody wants to talk about it.
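A quick sketch of the throttling mechanics being referred to, using the default CFS period of 100 ms; the half-CPU limit is an illustrative value:

```latex
\text{quota per period} = \text{limit} \times \text{period} = 0.5\ \text{CPU} \times 100\,\mathrm{ms} = 50\,\mathrm{ms}
```

So a container limited to 500m that burns its 50 ms of quota in the first 20 ms of a period sits throttled for the remaining 80 ms, and a request that arrives during that window simply waits for the next period, which is where the unexpected latency comes from.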
All right, so that's CPU; put a bookmark on all those memories, we'll talk... dang it, that was an unintended pun. We're going to talk about memory next, before we circle back and ask: now that we've learned a couple of things about these core resources, what do we do with it? Experiment number two: resource settings and memory contention. We're going to do the same kind of thing. I'm going back to the live demo; we still have this observability stack running, but I'm going to fire it up with some memory workloads. Let's just take a look: I'm applying a directory full of YAML, a couple of different Services for the Deployments, one best-effort deployment with no requests and no limits, one that has requests but no limits, and one that has both requests and limits. Those should be running; they are, and they're all running on the same node, so there is potential for contention. Let's take a look at the specific values they're requesting: again, best effort has no requests; the one with requests but no limits is requesting one gigabyte of memory; and the next one down is requesting a gig and also limiting itself to one gig. This node, recall, has eight gigs total. We'll start ramping this up in a minute and get closer to eight; we're not very close right now. Let's run those. I guess we'll go over here first and take a look at what's going on; switch over to the memory namespace. Right now the baseline consumption is small, about 85 megs each, so I'm going to bump that up to something more interesting, about 500 megabytes on each of these workloads. In this case I'm actually just restarting these things with an environment variable to set that baseline consumption, and we can see that happen here. Tangent: one of the most annoying things to figure out in this live demo was how to make these Prometheus queries show only the pods that are running and get rid of the ones that aren't, because cAdvisor keeps them around for about thirty seconds. Sometimes you can see the restart and sometimes you can't, because the timing is that tight in the way the metrics work; in this case you can't actually see them restart, but I guarantee you they did. So, first experiment: one of these things has a limit of one gig. What if we take the requests-and-limits deployment and have it consume an additional gig on top of its existing 500 megabytes for up to thirty seconds? I'll give this one away: it's going to get killed, real fast. There it goes; I post the requests, refresh this, it starts consuming just a little bit, it's that yellow one, and that down-spike was it dying. We can confirm that over here: I'm going to take a look at the pods for that requests-and-limits deployment; note the one restart, thirteen seconds ago. If we describe that pod, and I'm just going to grep for the container information at the very end of the kubectl describe output, we can see a couple of things. The current state is Running, that's great; the last state is Terminated, reason OOMKilled. Another bookmark, a memory bookmark: OOMKilled. It told us exactly what happened, it was very clear, love that; it doesn't always do that, and we'll probably see it not do that in a minute. I say probably because there's some non-determinism here. Exit code 137. Anybody surprised? It exceeded the limit, it got killed. Cool, that's like the easiest one to demo. Next, just to demonstrate that the node has lots of memory: if you exceed your requests but don't have limits, you're not going to have a problem as long as there is memory, although you are creeping into uncertain territory.
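The container status being grepped out of kubectl describe corresponds to roughly this fragment of the pod's status; the container name is illustrative and timestamps are omitted:

```yaml
containerStatuses:
- name: requests-and-limits        # illustrative container name
  restartCount: 1
  state:
    running: {}                    # the replacement container is up again
  lastState:
    terminated:
      reason: OOMKilled            # the kernel killed it for exceeding its memory limit
      exitCode: 137                # 128 + 9, i.e. terminated by SIGKILL
```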
So I'm going to ask each of these guys to consume an additional gig: this is the one that has no requests and no limits, and this one has requests but no limits, and visually we should be able to see that. Here we go; here's where one got killed earlier, and now we have two of these pods consuming an extra gig, using about one and a half gigs each, for the no-requests-no-limits one and the requests-but-no-limits one. This just demonstrates that it works, mostly because it means containers running on your nodes can use as much memory as they want, as long as they're not going to hit their limit and the node has the memory available. The next interesting question is: what if the node doesn't have the memory available? There are a lot of things that could happen here, and I'll explain some of them later, but just to show the simplest one, I'm going to change the baselines. I'm going to change the one that has requests to request a lot more: 2.2 gigs, with a baseline consumption of two gigs, so it's requesting a bunch and using some, but staying within its requests. We should be able to see that; there we go, it bumped up, now it's at 2.1 gigs, again requesting 2.2. I'm also going to set the requests for the one that has requests and limits: I'll bump its request to 5.2 gigs and set its limit higher, at six gigs. That gets updated; it's not consuming yet, because I didn't change its baseline, just its requests and limits, and we can confirm it restarted to pick them up. Now we're ready: I'm going to have it try to actually consume all that memory it requested, and this is going to put it real close to the node's actual limit of available memory. It's going to try to consume an additional 4.4 gigs. That started. There are actually two or three different things that can happen when I do this; it's usually one of them, and we'll see if it's the one I expect, in which case my slides will match; it has happened before that it isn't, in which case I'll call it out on the slides. We're climbing, we're rising, we're getting close to the eight gigs of memory the node physically has available, and that was actually a lot faster than usual. What happened here? Suddenly almost everything disappeared. Timing-wise, the no-requests-no-limits pod actually got killed slightly before the other one, which, interestingly, got killed too. And it looks like we're getting a restart now; it took a minute, but those containers are running again. That's interesting, one of them is running again; oh, this is, remember I said it's non-deterministic, actually one of the more unusual things that can happen. There we go, okay, now one of them has restarted and is using its baseline, and there goes the other one. Okay, cool, now we're back to the situation I expected it to eventually get to, but basically a bunch of chaos just happened. And yes, there's only one workload node running in this cluster, but everything seemed to happen on the same node, and even though some stuff crashed it stayed on the same node; we'll get to that. But first I'm going to shut all these down... no, I'm not, I'm going to tell you something about them first. This is why I write myself a demo script; there are too many commands. I wanted to take a look at that kubectl describe pod output again, at the container status. We see again that the current state is Running and the last state is Terminated; remember I said to remember OOMKilled? This doesn't say OOMKilled. Interesting. But it is exit code 137, which, trust me on this, is an OOM kill. All right, let's talk
about what happened and what that means; let me see if I'm right about... I think I do kill it now, yeah, now I kill it, all right, there it goes. So, the review: if you have memory limits and your container exceeds them, it will be killed; at least that works as expected. Unlike CPU, if a container doesn't get memory that it tries to allocate, something is going to get killed. Memory is not a compressible resource; you're not going to have that ability to just do it later. In this case, what gets OOM-killed and how, that's the interesting thing from a cluster stability standpoint, because if you start running out of memory, this is not great: you don't have a lot of determinism in exactly what's going to happen. So let's talk about what level of determinism you could have and why you don't always get it, starting at the kubelet, so not the Linux kernel yet. Best-case scenario, if your node starts running out of memory because workloads are misbehaving, they're over-provisioned and the requests aren't quite being honored correctly, the kubelet might notice. I say might, because the kubelet really only looks for this kind of thing about once every ten seconds, and oftentimes in this over-consumption situation ten seconds is way too big a gap. Another fun thing is what the kubelet is actually doing: it checks every now and then to see if the node's available memory has reached an eviction threshold. The eviction threshold defaults to something tiny, like 100 megabytes; if you're within 100 megabytes of having a problem, then it'll start evicting things. On modern nodes, that's nothing. I don't recall specifically what the various cloud providers end up setting this to; I don't believe they leave it at 100 megabytes, but it was interesting to experiment with the defaults. In the event the kubelet notices this, it will start performing evictions. Evictions are good, because the kubelet gets to decide: what's the least important workload on this node? I'm going to kick you off, and you're going to go back to the scheduler and get sent somewhere else, probably not here, because I'm going to trigger a condition, MemoryPressure, that basically says don't schedule anything on me, I've got some problems. That's not what happened in the lab; this was too fast. And I say probably because, again, there's some non-determinism here. Question from the audience: how does it determine least important? There are a number of factors that go into that. A big one is quality of service: we have Guaranteed, we have Burstable, we have BestEffort; guess which one of those is the least important. If you have best-effort pods running on that node, they're probably going to get booted. There are other signals you can provide; off the top of my head I forget the specifics, but there's basically priority: you can assign pod workloads priorities, and those will also be taken into account in terms of what gets evicted. I haven't used those a lot personally, because, like I said, it rarely gets to this point on clusters I'm familiar with. What you usually get instead is OOM killing, where the kubelet missed its chance, the Linux kernel has stepped in, and it's just going to start taking over and killing things as needed.
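The tiny default threshold mentioned here is a kubelet setting; a sketch of what tuning it might look like in a KubeletConfiguration, where the soft-eviction values are illustrative rather than recommendations:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"      # the small default-style hard threshold described in the talk
evictionSoft:
  memory.available: "500Mi"      # illustrative: begin evicting well before the hard threshold
evictionSoftGracePeriod:
  memory.available: "1m30s"      # how long the soft condition must hold before eviction
```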
The interesting thing about determinism here is that Kubernetes does its best, with various cgroup settings, to organize prioritization for the OOM killer, but it doesn't get to pick; the Linux OOM killer, at the end of the day, is what decides what to kill. Quality of service has a huge effect on what gets killed first: best-effort stuff basically gets a prioritization influence passed to the OOM killer so it's at the front of the line, Burstable is after that, and Guaranteed is at the end. But the OOM killer also takes into consideration things like how much memory is actually being used by these processes, because its goal is to free up memory. So if it reaches that point, you don't actually get a total amount of control, especially with cgroup v1; I'll mention cgroup v1 versus v2 in a second, but in cgroup v1 there's only so much control you get, and so if you get to out-of-memory, that's just bad. Yeah, correct. So, a quick definition: a BestEffort quality-of-service pod means it has not requested any resources and has not set any limits. Burstable means it at least has requests; it might have limits, but if it does, the limits are not equal to the requests. To get to Guaranteed, which is the highest quality of service, meaning you're last in line to get OOM-killed, you need to request both CPU and memory, and the limits for CPU and memory both need to be equal to the requests. You can't just do one, you have to do both, which is interesting: if you don't think CPU should have limits, it means Guaranteed quality of service is kind of out of reach for you, unless you're going to lean into over-provisioning and higher costs. Anyway, we'll get there. So this is probably what happened in the lab. I wanted to highlight the fact that it didn't say OOMKilled, because there's a bunch of nuance about what goes on; long story short, there are some very detailed articles here if you're at all curious about what actually happened, but you have to read what sounds like a PhD thesis, or the actual code, to find out why in this situation it doesn't actually say OOMKilled. All right, the last summary before we talk about the final piece that wraps it all up. Great, we learned something about requests and limits, but what do we do with that? Memory requests are also very important, but memory isn't guaranteed by requests: one of the things I didn't emphasize in that lab test is that everything was within its requested memory, nothing was exceeding its requests, and we still got things killed. There are some things you can do to configure the kubelet so that's less of a possibility, but requests don't guarantee memory, at least today. And because over-provisioning of memory has a higher consequence than over-provisioning of CPU for most kinds of workloads, maybe we should be a little more conservative there. Limits are more helpful for memory, mostly because they give you more control over saying that misbehaving workloads are the ones that should get killed, and not the ones that are potentially minding their own business but have a noisy neighbor. One last note before we move to the last section: everything I just talked about is true when you're running a Kubernetes cluster on an OS that uses cgroup v1.
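Putting the QoS definitions from a moment ago into YAML terms, a sketch of the resources block that yields each class; all values are illustrative:

```yaml
# BestEffort: no requests, no limits
resources: {}
---
# Burstable: requests set; limits absent or not equal to the requests
resources:
  requests:
    cpu: 250m
    memory: 512Mi
---
# Guaranteed: cpu and memory requests both set and equal to their limits
resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 500m
    memory: 1Gi
```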
Cgroup v1 and cgroup v2 are the same basic system, with a few differences. Cgroup v1 has some terminology changes versus v2 in terms of CFS: CFS shares, like I explained earlier, are changed to a CPU weight. It's exactly the same thing, basically, but different nomenclature; I found shares kind of confusing and weight less confusing, and I don't know why they made the decision to change it, but that's one note. Memory gets new controls, controls that actually help you better say things like: if you are under your requests, you should not get OOM-killed, the OOM killer should go after everyone else but you, which is not true on cgroup v1. So there's more determinism in v2 once we get there. I showed v1 because most of the major cloud providers aren't using v2 by default yet, though I get the sense that we're very close; minikube is usually able to do it. So just a note on that. Finally, a call-out to other kinds of resources. We talked about the abstraction, and the abstraction is what Kubernetes is trying to present for lots of things. The built-in ones are ephemeral disk, memory, and CPU, and it's common these days to extend that to GPU, but it's a general model: you could in theory extend it to almost anything else if you wanted to, as long as you provided the implementation for it. I won't talk a lot about it right now because we're almost out of time, but I just wanted to call out that it is a generalizable abstraction. So, back to Rafa's story: what do we do? Okay, we understood conceptually before that over-allocation on nodes was going to cause some pain; now we have a little bit deeper understanding of what actually happens when that goes on. What do we do to influence our users of Kubernetes platforms to work on this, or what can we do to maximize stability and resiliency in our clusters in terms of resource requests? Sorry, it's kind of hard. As the ecosystem matures there are other options, but we'll start from where Rafa did. Most people I talk to about this who've used Kubernetes for a while typically start the same way; I was actually talking to somebody before this talk about it: when you start introducing people to Kubernetes, you might not even mention requests, because you don't need to set them to run a workload, so you might end up with a lot of pods in BestEffort. The talk before mine, Tyler's, specifically called out on one of his slides that you end up with clusters all the time that have tons of best-effort pods. So you don't bother setting them, but at some point that's going to fall down: you might start getting some performance issues, your cluster doesn't scale well, and you realize you've got to set something. Stage two at a large organization is saying everybody's got to set requests, but you don't have to think too hard: t-shirt sizing, you choose small, medium, or large, and that's good enough, and away we go. That tends to change when the cost pressure arrives, because it still ends up resulting in largely over-provisioned clusters, and when the cost pressure comes down from the top, people typically advance to something else. Stage three: manually tune every workload. Usually it's pretty irregular and it's based on signals like, oh, this one's getting OOM-killed, or this one's getting throttled, or this one's showing up at the top of the list of the most expensive workloads in the company, in which case you should go take a look at it; but it's largely manual. Through all of those stages, the options that are out there to try to help your organization do well at this are basically as follows.
There were some excellent presentations on this; the presentation before mine did a good job of running over a lot of the options. It's basically about creating policies, including policy as code, that can influence developer or application-owner behavior, to guide people into paying attention the way Rafa wanted to when he said he wanted to point people down the path of getting it right. There are built-in resources called LimitRanges that you can use to ensure things always have requests, even if they're just kind of dumb defaults. There are ResourceQuotas, which can be used so that if a group, like one of Rafa's customers back at the bank, had over-consumed, they might be forced to stop and take a look at their allocation before continuing, without affecting others, if they had a quota in place for their application. Tools like Kyverno can be used to define much more granular and specific policies; again, the objective is to influence user behavior on Kubernetes and force people to invest time and effort in looking at their resource requests. If you're going to do that, it is really important to have some kind of tooling that helps them be successful with it: even the little Grafana I was running can really help, because if you're going to set requests according to what's actually being used, you've got to be able to tell what's going on. So any kind of tooling that actually helps people understand that is pretty important. The last thing we're seeing more recently, and there will be a shameless plug from my employer at the end of this, is to try to automate this. A lot of it is work that, in theory, machines should be able to help us do, if not do absolutely for us. How would that work? The VPA, the Vertical Pod Autoscaler, is an example; I would say it's not a very mature one in terms of its full set of capabilities, but it's an example of something that should be able to observe and collect utilization data for your workloads, calculate and generate tailored request settings for all of them, and then automate actually managing that in production. This article is part of what forced me into giving this talk, honestly; it basically talks about the premise of the developer experience around this. It doesn't provide a solution, but it talks about the problem in more depth. And this is the shameless plug: the reason I got to spend all this time digging into these details is that I work for a company that's trying to create exactly that sort of automation tool. We're not the only one, there are others, but this one is my favorite. So if you're at all curious about other ways to do this: Rafa didn't have tools like this in his day, he had to go the route of trying to influence his application owners and developers to do it themselves, but I firmly believe that whether my company, StormForge, does this, or somebody else wins out, or an open-source project gets good enough, this is likely the future of resource management in Kubernetes, unless there's some unfun, insurmountable, intractable problem that I haven't seen yet. There are way too many workloads, and it's way too hard for developers to manually tune all of this, but it is really important for stability and reliability in clusters. I do think automation is where this will eventually end up, though I'm not sure when we'll get there.
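Going back to the LimitRange and ResourceQuota guard-rails mentioned a moment ago, here is a rough sketch of what a platform team might drop into each tenant namespace; the namespace name and every number are illustrative defaults, not recommendations:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: team-a                # hypothetical tenant namespace
spec:
  limits:
  - type: Container
    defaultRequest:                # applied when a container sets no requests
      cpu: 100m
      memory: 128Mi
    default:                       # applied when a container sets no limits
      cpu: 500m
      memory: 256Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"             # past this, the team has to revisit its allocations
    requests.memory: 20Gi
    limits.memory: 40Gi
```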
With that, I've almost hit my time; I don't think I've run over too far. Feel free to connect with me on LinkedIn. I'm a terrible LinkedIn person, I don't chat very often and I don't look at things, but I do promise to try to be more engaged for a little while after the conference if anybody wants to talk about any of this stuff. There is also a link if you want to try my employer's free-trial product. Otherwise, we're open for any questions people may have. There's a lot I didn't touch on, so even if your question isn't directly about something I said in the talk, I'm happy to try to tell you what I think about it, if anybody wants to take it in a particular direction. Yes?

OK, let's see if this works. I was wondering if there's any way you can provide back pressure to the process in the container to free up resources that it is using. We have a monitoring pod that, when it gets a lot of logs coming through it, blows up, gets really big, and takes up a lot of memory. But then when the traffic falls off it doesn't free up its memory; it keeps it, happy that it's got lots of memory. This is all within the limits we've set. Is there any way to have Kubernetes reach in and say "garbage collect, you idiot"?

"Garbage collect, you idiot." I like that. I'll mention first one thing I'm aware of; I don't think it's a direct answer to your question, especially depending on the application, like if it's Java or something. One cool thing about the cgroup v2 implementation is that there is an additional control where, if a pod starts exceeding the memory it has requested, the operating system will start applying a little bit of pressure to try to free up some of that memory. I don't think that will necessarily work here; I don't know what application stack your application uses. Vector? Sure. Yeah, I am not aware of a Kubernetes mechanism to do that. If you didn't catch the talk before mine, that might be of interest: Tyler did an overview of a lot of tools related to this general idea of redistributing resources, though he mostly talked about things at the pod level rather than the resources inside the pod. So I don't have anything that comes to mind myself, but there were some very interesting things in the previous talk.

If you see it's ballooned at its limit? Yeah, OK. So, for anyone watching, the core of the question is basically: if we see it doing something like that, is there any way we can go in and change its limits or requests and kind of force it to behave? When you say change limits or requests, my mind goes to the actual resources. There is a notable thing in alpha right now called in-place pod resizing; it doesn't get you what you want, but that's the only way to change limits or requests right now. At the moment, to change actual limits and requests, the pod is going to die; the pod has to get restarted. Short of general Kubernetes solutions like sidecars that just sit there and watch for things like that, I don't know of any tools that specifically address it. It does sound like an interesting use case, though, and an interesting idea for a solution: if it's over-consuming, let's just shoot it. I kind of like the aggressiveness. Any other questions?
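For the cgroup v2 memory-pressure control mentioned in that answer, here is a small, hypothetical sketch of what it looks like from the node: a cgroup v2 directory exposes memory.current and, when a soft limit has been configured (for example via the kubelet's alpha MemoryQoS feature), memory.high; the kernel starts reclaiming and throttling as usage approaches memory.high. The path used below is only an example of a typical layout and would need adjusting for your node.

```python
# Read a cgroup v2 directory's memory.current and memory.high and report how
# close it is to being throttled. Path and layout are assumptions about the
# node; memory.high is often "max" when no soft limit has been set.
from pathlib import Path

def memory_pressure(cgroup_dir: str) -> str:
    base = Path(cgroup_dir)
    current = int((base / "memory.current").read_text())
    high_raw = (base / "memory.high").read_text().strip()
    if high_raw == "max":
        return f"current={current} bytes, no memory.high set"
    high = int(high_raw)
    return f"current={current} bytes, memory.high={high} bytes ({current / high:.0%} used)"

if __name__ == "__main__":
    # Hypothetical cgroup path on a systemd-managed node; adjust as needed.
    print(memory_pressure("/sys/fs/cgroup/kubepods.slice"))
```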
For those of us who still have to do this manually: one of the things we discovered early on is that when nodes start running out of memory, things get unpredictable. Sometimes the thing the kernel decides to kill is something important to the stability of the entire node; rarely, but it would happen. Our solution was that LimitRanges have a feature where you can set the limit-to-request ratio, and workloads can't exceed it. We basically made it a requirement that memory requests must be the same as limits, to prevent that kind of situation. Is that something you see other people do, or have you seen other strategies to address that kind of situation?

Yeah, I think that's a great example. It's using LimitRanges specifically, but it's one of those policy-as-code approaches to force people into a certain behavior. We have seen people who do basically exactly that, who just force memory to have a one-to-one ratio. How they do it varies: the limit-ratio feature of LimitRange is one way, and Kyverno can do similar things as well. I started to say it's tricky, in terms of knowing when you need to increase things, but actually, the more I think about it, there aren't really any serious downsides to that, except that you're not going to be able to overprovision at all. But yes, we have seen other people doing that for sure, and avoiding the chance of killing something important is a good call-out, especially if there's any sort of non-determinism. It's one thing to kill a neighbor workload; it's something else if you accidentally kill something critical to the system, like the kubelet.

I was curious, and this might be a more basic Kubernetes question, but you touched on eviction. When a process gets killed off and gets rescheduled, perhaps to another node, how fast is that? Does Kubernetes pre-seed all nodes with all possible container images, or is there memory copying, or what?

So, working backwards: Kubernetes does not, by default, magically seed all pods on every node. This sounds to me like you're talking about fetching images, like, is it ready to start on another node immediately? Potentially, but there's absolutely no guarantee that it's doing that unless you've set that up yourself in your cluster. If something gets evicted for memory, I don't recall whether it honors any sort of graceful shutdown, whether it gives it a grace period to close cleanly; it might not. It gets sent back to the scheduler, and the scheduler is going to try to place it as fast as it can; that works as fast as it works for anything in your cluster. But yes, if it lands on a fresh node that doesn't have the image, it could take a little while for anything that gets evicted to actually spin back up again. It's still better than just a random OOM kill. Any other questions?
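The limit-to-request ratio policy described in that exchange can be expressed as a LimitRange. Below is a minimal sketch using the official kubernetes Python client; the namespace, object name, and default values are hypothetical. Setting maxLimitRequestRatio for memory to 1 means a container's memory limit may not exceed its request, and since a request can never exceed its limit, the two end up equal.

```python
# Minimal sketch: a LimitRange that pins memory requests == limits and gives
# dumb-but-safe defaults so nothing lands in BestEffort. Values/namespace are
# hypothetical examples, not a recommendation.
from kubernetes import client, config

def apply_memory_ratio_policy(namespace: str = "team-a") -> None:
    config.load_kube_config()  # use config.load_incluster_config() inside a pod
    body = client.V1LimitRange(
        metadata=client.V1ObjectMeta(name="memory-equal-requests-limits"),
        spec=client.V1LimitRangeSpec(limits=[
            client.V1LimitRangeItem(
                type="Container",
                default_request={"cpu": "100m", "memory": "128Mi"},
                default={"cpu": "500m", "memory": "128Mi"},
                # limit/request ratio for memory may not exceed 1
                max_limit_request_ratio={"memory": "1"},
            )
        ]),
    )
    client.CoreV1Api().create_namespaced_limit_range(namespace=namespace, body=body)

if __name__ == "__main__":
    apply_memory_ratio_policy()
```

A policy engine such as Kyverno could enforce the same rule at admission time; the LimitRange route has the advantage of being built in.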
I can think of one. I see that you've got a solution that tries to automate setting the resource demands; how does that work? Is it entirely based on assuming that future behavior will be pretty similar to the past? And a second question to that: is there anything that solves the problem where, if you're planning on having users give themselves their own labels for what the demands are, how do you keep people from being pigs in the organization and just saying "I want the max" all the time, if there's no cost back pressure on them?

Yeah. So, again working backwards: cost back pressure. Most organizations I've worked with that are at enterprise scale do have something in place to apply some cost back pressure. For some of the organizations I work with, though, especially if the organization has a lot of money and whatever they do is very important, the best back pressure we've actually seen comes from visibility, because typically the application owners are not the ones experiencing the pain of this; it's usually the platform owners. The platform owners are often just looking for any way to, as you said, give back pressure to the application teams, and oftentimes it's just showing them what they've requested. How do we prevent them from asking for too much? One of the best ways is to have a clean and easy way of showing them that what they've requested isn't actually being used. Visibility and transparency are some of the only things you can do if there isn't any other direct cost. But we're kind of talking about a people problem now, as opposed to a technical problem.

To keep working backwards to your other question, about how an automation tool handles this: I want to call out first that the objective is to reduce load on people. An automation tool is typically going to work pretty well for something like 80 percent of workloads. There are likely to be a few workloads, or a few circumstances, like Black Friday is coming up and there's no way we could have predicted that by looking at the last two months' worth of data, where somebody is still going to have to step in. But any automation tool's job should be to reduce the amount of work people do to only the exceptional circumstances, as opposed to everything. The way our particular tool works is that we observe about a month's worth of data every time we generate a new recommendation. There are some ML algorithms that do have forecasting and concepts of seasonality; they can tell that your workload is busier on Monday versus Saturday, and things like that. Outside of that, you'd have to have user inputs or policies that say things like "this workload needs to be as reliable as possible" versus "this one you can maximize for savings." It turns out the answer to your question is slightly complicated. An automation tool, in addition to forecasting, ideally should also be responsive: it should address things like "this thing just got OOM-killed five times in a row, I think it might need something new," even if its previous data didn't forecast that. I think you might have asked more, but that's all I could remember.

OK, last call; we might have time for one more question if there is any. If not...

One thing you just briefly touched on was autoscaling. I could imagine a scenario where, with requests and limits, you're sort of restricting supply, but of course the horizontal pod autoscaler is going to try to increase demand based on possibly a different set of metrics. It seems like you might get... maybe you could just comment on how autoscalers and these things kind of work together?

Yeah. So autoscaling, and specifically horizontal pod autoscaling, is what I'm hearing, because there are three different core means of autoscaling in Kubernetes.
What we were just talking about is the individual requests and/or limits for a pod, but when you have a horizontal pod autoscaler in place, its job is to try to make the utilization (oftentimes there are other metrics, but oftentimes it's utilization) match some ratio against the requests that have been configured. In the event you want to do both, where you say, great, I want the horizontal pod autoscaler to be working, but I also want to pay closer attention to this individual pod's resource requests and/or limits, the first place my brain goes is to remember that the VPA project, kind of the existing standard public one, specifically states that because of the potential tension there it doesn't play nicely with horizontal pod autoscaling: you pick one or the other. You don't have to, though. Commercial products, including StormForge's, and I'm sure others, do vertical pod autoscaling in conjunction with horizontal pod autoscaling. It's an interesting problem, because either you just preserve the shape of the existing behavior when you start vertically sizing pods, in which case you have to adjust the horizontal pod autoscaler's targets, or (and now I'm speculating as opposed to talking about built software) you start changing what those metrics are depending on the number of pods running. Maybe if there are more pods running you're more OK with having those pods more saturated, whereas if there are only a few running you need more overhead. All that response really tells you is that it's a complicated, interesting question, but you knew that.

OK, thank you, that was a great presentation; please join me in thanking him. This evening we've got a couple of events. There's a general-audience game night starting at 8:30 in the exhibit hall, and if you brought your family along, the first hour is deemed family hour and there are games for kids. Starting right now, if you need something to stick around for until the adult game night starts in another hour, there's a session on how to volunteer for SCALE; I don't recall the room, but I think it's in the other building. And then at 10am tomorrow, Casey Handmer, who's a physicist and scientist with Caltech and JPL, is giving a general talk on technology opportunities that sounds really interesting. So thanks for attending today; maybe I'll see you at game night if you stick around. Let me go check the room name real quick, I should have mentioned it because I figured somebody might be interested.
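As a footnote to that last autoscaling question, a tiny sketch of the interaction: the upstream-documented HPA rule is roughly desiredReplicas = ceil(currentReplicas * currentMetric / target), and for CPU-utilization targets the metric is usage divided by the pod's request, so vertically resizing requests directly changes what the HPA sees. The numbers below are hypothetical.

```python
# Sketch of the core HPA scaling rule and how request size feeds into it.
import math

def hpa_desired_replicas(current_replicas: int,
                         avg_usage_cores: float,
                         cpu_request_cores: float,
                         target_utilization: float) -> int:
    """desiredReplicas = ceil(currentReplicas * currentUtilization / target)."""
    utilization = avg_usage_cores / cpu_request_cores
    return math.ceil(current_replicas * utilization / target_utilization)

if __name__ == "__main__":
    # Same load, two different request sizes: halving the request doubles the
    # apparent utilization and pushes the HPA to add replicas.
    print(hpa_desired_replicas(4, avg_usage_cores=0.35, cpu_request_cores=0.5, target_utilization=0.7))   # 4
    print(hpa_desired_replicas(4, avg_usage_cores=0.35, cpu_request_cores=0.25, target_utilization=0.7))  # 8
```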