Hey everyone, let's get started. Welcome to this session on bin packing pods in managed Kubernetes. Thank you for being here. I'm Vinay Suryadevara, and I'm a senior software engineer at ClickHouse. My co-presenter is Jianfei Hu, who also works as a senior software engineer at ClickHouse.

On the agenda today, we'll start with a short introduction to ClickHouse and ClickHouse Cloud. Then we'll give an overview of the problem we were facing with our infrastructure and why we used bin packing to solve it. Then we'll get into the details of how our infrastructure is set up and the various approaches we explored to solve our node utilization problem. We'll also talk about the rollout we did across our fleet, the savings we achieved, and some of the learnings from this whole exercise. And we'll end with Q&A.

So what is ClickHouse? ClickHouse is an OLAP database. It's used mainly for analytics use cases: generating aggregations and visualizations on your data. It works best with mostly immutable data. It's been in development since 2009, was open sourced in 2016, and has gained a lot of popularity since then; it's one of the fastest-growing GitHub communities. ClickHouse is a fully distributed database: it supports replication, sharding, multi-master, and cross-region setups, so it is production ready. Last, and most important: ClickHouse is extremely fast. Using techniques such as column-oriented storage and state-of-the-art data compression, ClickHouse provides insights into customer data in milliseconds, which makes it one of the fastest databases out there.

Now that we have an idea of what ClickHouse is, what is ClickHouse Cloud? ClickHouse Cloud is the serverless offering of ClickHouse. It has features such as idling and auto-scaling, so you can bring down your compute when there's no activity on the cluster and, similarly, auto-scale the cluster when there are workload spikes. We also separate compute and storage, so you can scale compute independently of storage. This is how a ClickHouse Cloud instance looks under the hood: we use Kubernetes as the compute layer and object storage for data persistence. We currently still use PVCs for some metadata, but we are in the process of moving away from that. We're currently available on AWS and GCP, and pretty soon we'll be on Azure as well. Later in the year we'll also have a bring-your-own-cloud offering, in which customer data stays in the customer's VPC and cloud account, and they use the ClickHouse Cloud control plane to deploy ClickHouse into their own account; large enterprises who do not want to part with their data can use this to run ClickHouse on their own premises.

Now that we have an idea of what ClickHouse and ClickHouse Cloud are, what is the problem we were trying to solve? The title of the talk is bin packing pods in managed Kubernetes, so first: what is bin packing? In general terms, bin packing is an optimization problem in which items of different sizes must be packed into a finite number of bins, each with a given capacity, in a way that minimizes the number of bins used. If you look at this picture, on the left side we have an initial set of objects that individually have some weight, and on the right we have bins with a maximum capacity.
Using bin packing, we can fit all of these elements into only three bins instead of four, saving us the cost of one bin. In Kubernetes specifically, bin packing refers to scheduling pods onto the nodes in the cluster in a way that maximizes resource utilization and minimizes the number of nodes needed, while still satisfying all of the pods' constraints. This saves cluster resources such as CPU and memory and can lead to significant cost savings and improved efficiency.

So now that we have an idea of what bin packing is, what was the problem we were facing? In our Kubernetes clusters, which are multi-tenant and present in multiple regions, we noticed that node resource utilization, CPU and memory, was non-optimal across the fleet. This meant higher infrastructure costs, because we were running more nodes than required. We explored a few different solutions, and the one we finally settled on was to update the Kubernetes scheduler's scoring strategy to bin pack pods onto fewer nodes. We then had to roll out this solution to all of our Kubernetes clusters across multiple regions. And since the fleet is multi-tenant, with multiple ClickHouse instances on the same node, we had to find a way to roll out these changes reliably so that existing customers would not face any disruption. We'll talk about all of this in detail in the coming sections.

Before we look at the infrastructure setup for ClickHouse Cloud, some terminology that we'll be reusing across the slides. ClickHouse Keeper is our replacement for ZooKeeper; it is used as the coordination component for our server pods. ClickHouse Server pods are the compute pods for ingest and queries; these are the pods that process the data and persist it to object storage. A ClickHouse instance, or cluster, is a custom resource in Kubernetes that consists of Keeper and Server stateful sets. Both ClickHouse Keeper and ClickHouse Server have similar needs with respect to bin packing, so for the purposes of this talk we'll focus on the server pods for simplicity; all of the approaches we used for server pods transfer to the Keeper pods as well. When we refer to a node, we just mean an EC2 instance, an Azure VM, or a GCP VM. Node utilization is the sum of the resource requests of all pods on the node divided by the allocatable resources on the node. The key thing to note is that we are not looking at the actual usage of the pods but at the resource requests, i.e., the minimum resources required by each pod. And finally, kube-scheduler is the default scheduler for all pods in a Kubernetes cluster.

So, let's take a look at how we set up Kubernetes in ClickHouse Cloud. We use managed Kubernetes in each cloud service provider: EKS on AWS, GKE on GCP, and AKS on Azure. We run multi-tenant Kubernetes clusters, which means pods from two different ClickHouse instances can run on the same node. We use namespaces for isolation, so that pods from different instances do not interfere with each other, and we use Kubernetes network policies, via the Cilium CNI, to prevent cross-namespace network calls; a sketch of such a policy follows below. Continuing further, we use the Cluster Autoscaler, a well-known component for dynamic node provisioning, for node scale-up and scale-down.
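To make the namespace isolation concrete, here is a minimal sketch of the kind of policy we mean, assuming plain Kubernetes NetworkPolicy semantics (which Cilium enforces); the names are illustrative, not our actual manifests:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only        # illustrative name
  namespace: clickhouse-instance-a # one policy per instance namespace
spec:
  podSelector: {}                  # applies to every pod in this namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}          # only pods in this same namespace may connect
```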
Most of the ClickHouse server pods have a similar shape in terms of CPU-to-memory ratio, so we have a homogeneous workload in the ClickHouse server node groups. We also make use of over-provisioning, a pattern in Kubernetes where we reserve some extra capacity for workload spikes: we schedule low-priority, preemptible pods onto extra nodes, and when spikes come in, those pods are evicted so server pods can be scheduled in their place; the evicted pods then tell the Cluster Autoscaler to provision more nodes to refill the buffer. And lastly, we have a weekly release schedule for all ClickHouse pods across our fleet. This is important because it will come up later: we do periodic rescheduling and restarting of pods, since we deliver all ClickHouse image updates via this weekly release.

This is how a sample Kubernetes cluster looks. We have nodes, which are VMs, in all three AZs, because a ClickHouse instance has pods in all three AZs for reliability. We also have a few system workloads deployed on these nodes, for example so that we can monitor their usage. We have a few ClickHouse server pods already running on the nodes, and we have an operator in the Kubernetes cluster that is responsible for managing the custom resources. When a new ClickHouse instance is created, a custom resource is created and managed by the operator, which in turn creates stateful sets, and the stateful sets create the pods they manage. When another ClickHouse instance is created, the same process repeats and more pods are created. And we can see pods from two different instances living on the same node; this is what we meant by multi-tenant. So this is how the infrastructure looks in a Kubernetes cluster in ClickHouse Cloud.

Now that we have an idea of how our infrastructure looks, what was the problem we were running into? When we did a regular audit of our fleet, we noticed that node utilization across the fleet was only about 40% to 60%. That was bad: it meant we were running nearly twice the number of nodes needed to run the same workloads. A few factors in our cloud setup contribute to this. We have a dynamic fleet: since idling is enabled, ClickHouse clusters with no queries or inserts for more than 30 minutes are scaled down to zero pods. We have autoscaling, which means pods occasionally need to be restarted with a different size, i.e., different CPU and memory. And in the regular service lifecycle of creation and termination, pods are constantly being provisioned and deprovisioned. All of this means utilization keeps changing over time and is quite dynamic within our Kubernetes clusters.

So, after identifying the problem, the main goal for us was to increase node resource utilization across the fleet and, through that, realize cost savings. We explored a few approaches to improving node utilization, but before we talk about them: any solution we actually adopt needs to satisfy a few requirements. The first, and most obvious, is that it needs to increase node resource utilization across the fleet.
The second, and most important, is that for existing customers we want to minimize disruption to their workloads. Since we're a database, we might have long-running inserts and queries on pods that we do not want to interrupt. We also do not want any degradation in the experience of customers provisioning new instances: now that there are fewer nodes in the cluster, we do not want node provisioning time to become an additional burden for them. The last requirement is that the solution needs to be multi-cloud friendly. Since we run in three major clouds, we want one solution that works for all three, not a bespoke solution for each cloud. Those are the requirements any solution we adopt should satisfy. So let's take a look at the different approaches we explored.

The first was overcommitting the resources on the node. In this case, we set the pods' resource requests to be less than their limits. This means the pods are in the Burstable QoS class: their usage can vary anywhere from the resource request up to the resource limit. By decreasing the resource request, which is the minimum amount of resources required by a pod, we can schedule more pods on each node. The key assumption is that not all pods use resources up to their limits at the same time. Looking at an example of how this would work: say we have a node with 64 GB of total memory, and we have pods with memory requests of 16 GB each and memory limits of 32 GB each. We can schedule four of these pods on this node, and as long as they only use up to their resource requests, everything is fine. But if you look at the limits, we've committed 128 GB on a node that only has 64 GB. If all the pods start using up to their maximum limits at the same time, that's going to cause issues. For that very reason, we did not adopt this approach: if all the pods use up to their maximum resources at once, it leads to OOM kills in the case of memory, or to throttling in the case of CPU. That's not acceptable for us, because in ClickHouse Cloud we have a Guaranteed QoS setup: the customer sets the resource limits, and we guarantee that they get what they paid for. Changing that to Burstable QoS would not be a good experience for the customer and could lead to noisy-neighbor situations.

The second approach was the Cluster Autoscaler's scale-down utilization threshold. The Cluster Autoscaler has a setting where we can specify a utilization threshold, and if utilization on a node falls below that threshold, the pods on that node are evicted and the node is marked for scale-down. Scale-down doesn't happen immediately; the autoscaler considers various things such as pod disruption budgets, annotations, et cetera; the node is just marked as a candidate for scale-down. This is a very handy mechanism, but we didn't adopt it either, because it violates some of our requirements. Frequent eviction of pods is too disruptive for a stateful workload such as ClickHouse, which is a database. It violates our requirement of not interrupting long-running queries. And we also don't get enough fine-grained control over the evictions.
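For reference, the threshold discussed above is exposed as a flag on the Cluster Autoscaler itself; a sketch of how it would appear in the autoscaler's container spec (the value is illustrative, and 0.5 is the documented default):

```yaml
# Fragment of the cluster-autoscaler container spec:
command:
  - ./cluster-autoscaler
  - --scale-down-utilization-threshold=0.5  # nodes below 50% requested/allocatable become scale-down candidates
```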
We then looked at another component called the Descheduler, which is a Kubernetes SIGs project. It's meant to be a counterpart to kube-scheduler: when kube-scheduler decides where to schedule a pod, it looks at the state of the system at one point in time and picks the best node for that pod. But the state of the system varies over time, and after a while the original scheduling decision may no longer be a good one. kube-scheduler does not go back and reschedule the pod; this is where the Descheduler comes in. The Descheduler takes constraints we give it via plugins and evicts pods on nodes that violate those constraints. One such plugin is the HighNodeUtilization plugin, with which we set a utilization threshold, and if a node's utilization falls below it, the pods there are evicted. You can see this is quite similar to the Cluster Autoscaler approach, and it does not work for us for the same reasons: frequent evictions and interruption of long-running queries.

So we noticed we had the same problem with both approaches two and three: they focus on evicting pods and scaling down nodes, but not on where the evicted pods get scheduled. If the evicted pods just land on another under-utilized node, we are back at the same problem again, and we still have frequent pod evictions. So we decided to focus on optimizing packing during the pod scheduling phase instead of thinking about evictions.

We started looking into the Kubernetes scheduler more deeply. The way kube-scheduler works, it first identifies all the nodes the pod can fit on, then runs a scoring phase over those nodes and picks the node with the highest score. By default, the least-allocated scoring strategy is used, which means that when there are two nodes a pod could be scheduled on, kube-scheduler picks the node with the least allocation. It does this so it can evenly distribute load across the cluster. The flip side is that this makes node scale-down unlikely: nodes stay above the scale-down threshold, the Cluster Autoscaler cannot reclaim them, and we end up with an inefficient cluster that has more nodes than required. In the same scheduler configuration where least-allocated is set, Kubernetes also offers a most-allocated scoring strategy. In the scenario I just described, if a pod could go onto one of two nodes, most-allocated scoring schedules it onto the node with the higher utilization. It turns out the best solution for us was the simplest one, and this looked like a very promising candidate.

This is an example of the configuration file and how we would update this in the Kubernetes control plane: inside the plugin config section for the NodeResourcesFit plugin, we update the scoring strategy to use most-allocated scoring, and we give equal weights to CPU and memory, though that can be changed based on your requirements.
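As a sketch of that configuration, using the kubescheduler.config.k8s.io/v1 API (the profile name is whatever you choose, and the weights can be tuned):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: most-allocated-scheduler  # illustrative name
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated     # the default is LeastAllocated
            resources:
              - name: cpu
                weight: 1           # equal weights for CPU and memory
              - name: memory
                weight: 1
```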
So let's take a look at how this plays out in our cluster. Say we have three nodes with ClickHouse server pods on all three of them, and for some reason one of the pods is restarted, due to idling or autoscaling. When the scheduler tries to place this pod again using the most-allocated scoring policy, it first finds the nodes that can fit the pod, say nodes 1 and 3; since node 1 has the higher utilization, it schedules the pod onto node 1. That means node 3 can now be reclaimed by the Cluster Autoscaler, and our cluster ends up using only two nodes instead of three, with the nodes packed better.

So this looks like a promising candidate; let's check whether it satisfies all of our requirements. The first requirement was increasing node utilization across the fleet: by packing more pods onto fewer nodes, we get better utilization across the fleet. The second requirement was low impact on customers: updating the scheduler scoring policy does not immediately trigger a restart of all the pods; instead, whenever the next restart occurs naturally in the pod lifecycle, the pod is scheduled onto a node with higher utilization. The third requirement was that the solution be multi-cloud friendly, and since this is a Kubernetes-native mechanism, it applies equally to all clouds. So all of the requirements seem to be met. Great, we just need to update the Kubernetes control plane to set the scoring policy to most-allocated, and we should be good, right?

Well, it turns out it's not so simple. In AWS and Azure we do not have access to the Kubernetes control plane to change this setting. GCP offers a version of this, called optimize-utilization, that essentially does what we need, but it only works in GCP and is not portable to AWS or Azure. So, given this roadblock, how do we set up the most-allocated scheduling policy when running on managed Kubernetes? It turns out Kubernetes has a handy way of doing this: we can run a secondary scheduler, which is natively supported by Kubernetes. It runs in parallel to the default kube-scheduler, and we set a pod's schedulerName to indicate that it should be scheduled by the secondary scheduler. So the ClickHouse server pods can be handled by the secondary scheduler, while the non-ClickHouse pods are still handled by the default scheduler. We use the upstream kube-scheduler image and just tweak the scoring strategy to most-allocated, and we deploy the scheduler in the kube-system namespace so that it is protected from eviction and scale-down. This is how we set the scheduler up in managed Kubernetes.

So now that we have a solution that works for us, and we know how to set it up, I'll hand it over to Jianfei to talk about how we rolled this out in a non-disruptive manner and the learnings and cost savings we realized. Thank you.

Thank you, Vinay. Now we already know the answer to the problem; the question is how we can roll it out. Before we dive into the details, let's look at how the rollout works. We create the deployment for the secondary scheduler next to the default one.
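A condensed sketch of what such a deployment might look like; the image tag, names, and mount paths are illustrative, and the RBAC (service account and bindings) and the ConfigMap holding the scheduler configuration shown earlier are omitted for brevity:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: most-allocated-scheduler
  namespace: kube-system            # protected from eviction and scale-down
spec:
  replicas: 3                       # three pods for high availability
  selector:
    matchLabels:
      app: most-allocated-scheduler
  template:
    metadata:
      labels:
        app: most-allocated-scheduler
    spec:
      serviceAccountName: most-allocated-scheduler   # needs kube-scheduler RBAC
      containers:
        - name: kube-scheduler
          image: registry.k8s.io/kube-scheduler:v1.29.0  # upstream image; match your cluster's version
          command:
            - kube-scheduler
            - --config=/etc/kubernetes/scheduler-config.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/kubernetes
      volumes:
        - name: config
          configMap:
            name: most-allocated-scheduler-config
```

Pods then opt in through the schedulerName field in their spec, for example `schedulerName: most-allocated-scheduler` on the ClickHouse server pod template; anything that doesn't set it stays with the default scheduler.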
Just creating that scheduler is basically a no-op; we also need to update the ClickHouse pod spec to use this most-allocated scheduler. That is a one-time change, and from that point on all ClickHouse pods start using the most-allocated scheduler. The pods are restarted, but we have pod disruption budgets configured, and it's a one-time migration. For production safety, we start with some smaller clusters first to gain experience, then gradually roll out to our entire production fleet. And finally, after that is done, we do the evaluation and measure how much savings we achieved.

For the scheduler setup, we create a deployment of three pods; the three pods are for high availability. The pod itself runs basically the same Docker image as upstream; we just update the scheduling policy. We also enable leader election: the idea is that when the primary pod crashes, a second one notices and takes over the scheduling decisions. We actually did a reliability test: we constantly scheduled a lot of pods, killed the primary scheduler pod in the middle of that, and watched a secondary scheduler pod take over. It worked nicely, so we were good to go.
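The leader election mentioned here lives in the same scheduler configuration file; a sketch (resource names illustrative) so that, with three replicas, only the current lease holder makes scheduling decisions and a standby takes over when it dies:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
  resourceNamespace: kube-system           # where the lease object lives
  resourceName: most-allocated-scheduler   # must not collide with the default scheduler's lease
profiles:
  - schedulerName: most-allocated-scheduler
```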
We also needed measurements, of two types. The first is ad-hoc analysis. There's an open-source tool called eks-node-viewer, dedicated to EKS resource analysis; this is a screenshot from their web page. It shows CPU and memory utilization, the number of pods running on each node, and even the cost of different machine types based on AWS pricing data. That comes in very handy for ad-hoc analysis during the rollout. The other type of measurement is the final evaluation after the rollout is done, where we need something more sophisticated than eks-node-viewer. For this we use our internal data warehouse, and the fun fact is that it also runs on ClickHouse Cloud, because we love our product. In this data warehouse we chart cost over time, so we can compare before and after the rollout; we periodically import the AWS cost and usage data, and we use Superset, an open-source tool, as a UI layer on top of the warehouse for analytics.

All right, with all these measurements and preparations done, it was time to roll out. As I mentioned, we started with the smaller clusters first. After rolling out to a smaller cluster, the initial savings were not that significant, but we were not too worried: any running cluster has some flat cost you cannot avoid, for the system workloads and the machine types you run. The important part is that we gained confidence in how to operate this in our production settings, so we could proceed.

This is an eks-node-viewer diagram for a larger cluster before the rollout. Two things I want to highlight: first, the average node utilization is around 50%, which is not very high; second, per-node utilization ranges from 30% to 60%. This is what we wanted to improve. And this is the diagram after the rollout: average node utilization improved from 50% to 70%. During the rollout, as Vinay just showed, a ClickHouse pod gets stopped and rescheduled onto a more packed node; as this happens across the entire cluster, some nodes are freed up and the Cluster Autoscaler cleans them up. That is where the 10 to 15% cost savings from this rollout came from.

So, having done all this for our production clusters, it was time to look at the cost savings. In this diagram from our data warehouse, the X axis is time and the Y axis is the money we pay for EC2 compute machines. You can see on the right-hand side a very obvious dip, which comes from our rollout; quite promising.

With all this, you may have one question left: this is very good, but how does utilization change over time? Right after the rollout, when the cluster is freshly packed, utilization is very high, but what happens as things change? As you may remember, we have services being idled, stopped, or terminated; those processes deschedule some pods and create holes in our VMs, something like this, which drops utilization. Luckily, we have regular upgrades of the ClickHouse software, which update the ClickHouse server pods to newer Docker image versions. That is done in a controlled manner, gradually, one pod at a time, with disruption budgets; the important thing is that it moves pods back onto the most-allocated nodes, achieving the same packing effect we just saw. This brings utilization back up to a higher point, and it happens every time we do our periodic upgrades.

Now I want to share a few findings and issues we ran into; I'll go through them one by one. First, system workloads. We found that sometimes, even when a node's utilization was very low, the Cluster Autoscaler still didn't reclaim and scale down that node. Why? After some investigation, we found that such nodes run some of our system workloads, such as Argo CD and the AWS CSI controller, and these workloads use ephemeral volumes, local host-path kinds of storage on the node. Because of that, the Cluster Autoscaler is conservative and will not evict those pods, which prevents the nodes from being reclaimed and the cost from being saved. The solution is very simple, because we know the nature of these processes: we just add the safe-to-evict annotation to the system workloads, so the Cluster Autoscaler can then reclaim those nodes.
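That fix is the standard Cluster Autoscaler pod annotation, set on the pod template of each such system workload; a minimal sketch:

```yaml
# In the system workload's Deployment pod template:
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"  # lets the autoscaler evict this pod despite local storage
```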
Next, unschedulable pods. This one is quite interesting. At some point we noticed that certain pods were just stuck in the Pending state, and the Cluster Autoscaler and the most-allocated scheduler we deployed had different opinions about these pending pods. Our most-allocated scheduler thought: there's no space for me to schedule this ClickHouse pod, so it stays Pending. The Cluster Autoscaler, however, thought: I don't have to scale up, because the scheduler can just evict that lower-priority over-provisioning pod that Vinay mentioned earlier. We wanted to figure out why this happened. We searched and found a passage in the Kubernetes documentation; the summary is that when pods are created, the scheduler forms a queue and, while trying to schedule the pending pods, also tries to find victim pods with lower priority to evict in order to accommodate them. This rang a bell for us, because the documentation says the scheduler, singular. Remember, our ClickHouse pods use the most-allocated scheduler, while our over-provisioning pods were still left at the default setting and used the default scheduler. With the two kinds of pods using different schedulers, preemption did not happen across them, and the over-provisioning pods could not be evicted to make room. The solution was simple: we made the over-provisioning pods also use the most-allocated scheduler, and right after that we saw the ClickHouse pods being scheduled.

Next, cold-start time. By cold-start time I mean the time from when a ClickHouse pod is requested, waking up from idling or starting for the first time, until the pod is ready to serve customer requests. In theory, since the cluster is now more packed, a new pod coming in is more likely to trigger a new EC2 machine being provisioned, so cold-start time could increase because of the EC2 machine provisioning and setup time. We measured this and confirmed it did not happen, and we think that is because our over-provisioning buffer provides room for eviction: a pod that would otherwise be unschedulable gets placed immediately, so the likelihood of waiting on a new node is reduced.

Finally, as mentioned, GCP has a native option called optimize-utilization; at the cluster level, you just enable it and achieve the same effect. Azure, however, is similar to EKS: you cannot change the control plane configuration, so you have to deploy a secondary most-allocated scheduler yourself.

All right, time for the summary. We run a multi-tenant Kubernetes hosting environment offering a SaaS product to our customers, and our goal was to improve the resource utilization of our Kubernetes clusters to save cost, while minimizing disruption to our customers. We evaluated a few approaches and ended up selecting the Kubernetes scheduler with the most-allocated scoring strategy, figured out a few issues during the rollout, and saved 15 to 20% in cost, which is great. Some takeaway messages if you want to do something similar. First, look at the kube-scheduler scoring strategies to see if there is something you can use. Second, use Guaranteed QoS if you want to avoid noisy-neighbor situations. Third, take advantage of your software upgrade process, using rescheduling as an opportunity to improve utilization. And last, over-provisioning gives you some buffer when you have a more packed cluster. That's pretty much it. We also have a technical blog post with the details of how this was done; check it out if you are interested. Our colleague Manish will talk about Kubernetes StatefulSets and the scaling challenges there, in this very room, what a coincidence, tomorrow afternoon. And last but not least, we are also hiring; if you are interested in building a cloud offering around an open-source database, come let us know. Thank you.
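Pulling the pieces together, here is a sketch of the over-provisioning buffer described in the talk, including the scheduler fix from the findings above; it assumes the common low-priority pause-pod pattern, and all names, sizes, and priority values are illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                   # below every real workload, so these pods are preempted first
globalDefault: false
description: "Placeholder pods that reserve headroom for workload spikes."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2                # how much buffer capacity to hold
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      schedulerName: most-allocated-scheduler  # the fix: same scheduler as the ClickHouse pods,
                                               # so preemption works across them
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9     # does nothing; just holds the reservation
          resources:
            requests:
              cpu: "4"       # sized roughly like a server pod; values illustrative
              memory: 16Gi
```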
Q: Thank you. I have a question about the practical way to do it. Are you just using the schedulerName inside the pod spec, or do you have anything else?

A: You just specify the schedulerName in the pod spec, and it should match whatever name you specified for your secondary scheduler. That should be enough.

Q: So you are putting that on your ClickHouse database pods, and not on the system pods and the rest of the pods. Why is it interesting to keep the default scheduler for the rest of the pods?

A: So the question is why we even keep the default scheduler. The answer is that you have to explicitly opt in to our most-allocated scheduler. If you are deploying third-party deployments, for example Grafana and such, their default values still need to work. And their footprint is not that huge anyway, so we are just keeping it safe. We haven't felt the need to use this scheduler for the other types of workloads; the main cost savings we discussed, and the highest pod counts, come from the ClickHouse server pods.

Q: I would like to ask if this also works in a multi-cluster environment, with multiple Kubernetes clusters. Does it make a difference or not?

A: We haven't explored that exact scenario. Our ClickHouse workloads are currently not deployed in a multi-cluster fashion, so every scheduling decision is made within a single cluster's scope; that's not a challenge for us. I think it would depend on how scheduling works in that scenario.

Q: Thank you very much. I wanted to clarify about the secondary scheduler: was it a controller, or did you just redeploy kube-scheduler?

A: We just redeployed kube-scheduler. Like we mentioned, it's a deployment in which three pods run the same kube-scheduler image with a custom scoring policy.

Q: You said your ClickHouse workloads are rescheduled naturally. Do you plan, or know, how you could enforce this automatically, so you don't have to wait until new releases are done? Do you have another controller for that?

A: When you update a pod spec, for example when you update the scheduler to use the secondary one, it automatically triggers a restart of all the pods using that spec. In our case we do that in a more controlled manner, because we have pod disruption budgets and don't want to restart all of them at once; but if you don't care about that, you can do it just by updating the pod spec.

Q: OK, but that sounds like you would do it manually, and it's not automatically done somehow.

A: First of all, our release process is automated; it's managed by our release automation. And second, it's not just releases: autoscaling, node drains, or customer configuration changes that require restarts will all restart pods and pack them onto more packed nodes, taking advantage of this effect. But yes, after the pod spec change was done in our code base, we used the release process to roll it out automatically. Thank you.