We would like to start from a question: why should we optimize costs when using OpenShift? The answer is different for everyone, so there is no single answer. It could be your manager wanting you to optimize, or you may want to optimize costs yourself; it depends. In our team we have two core reasons why we wanted to optimize costs and why we did it. The first is that our service, OpenShift CI, is growing; every month and every year we execute more and more. The second is that we have a specific use case. OpenShift is nicely tailored for the default use case, but ours is a little bit specific, and this is why we wanted to optimize costs.

So why is our use case different? You can see our growth from June 2021 to June 2023. We had two build clusters, 23,000 job definitions, and 200 repositories; now we have grown to six build clusters, 60,000 job definitions, and more repositories. Our users execute half a million tests every month, and 0.1 million ephemeral, ad hoc clusters are created.

The important question is how to start optimizing your infrastructure costs, and the answer is that every cost decision should be made only after you carefully analyze your own data and understand your own use case. Every use case is different. We are giving some tips and hints here, but this is what you should do first: really understand your own data. How do you do that? There are several methods. You can analyze your metrics and review alerts; you can use Prometheus and Grafana for that, with alerts coming from the infrastructure as well as your own alerts.
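As a concrete illustration of the metrics-and-alerts approach, here is a minimal sketch of a Prometheus alerting rule that flags underutilized nodes. The metric is the standard node-exporter CPU metric, but the rule name, namespace, and thresholds are hypothetical examples for illustration, not the rules the speakers run.

```yaml
# Hypothetical PrometheusRule: alert when a node's CPU stays mostly idle,
# hinting that the cluster may be over-provisioned.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cost-hints                 # illustrative name
  namespace: openshift-monitoring
spec:
  groups:
  - name: cost.rules
    rules:
    - alert: NodeUnderutilized
      # Busy CPU fraction per node, averaged over 24h, below 20% (example threshold).
      expr: |
        1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[24h])) < 0.20
      for: 6h
      labels:
        severity: info
      annotations:
        summary: "Node {{ $labels.instance }} is mostly idle; consider scaling down."
```

An informational alert like this turns "carefully analyze your own data" into a standing signal instead of a one-off review.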
You can inspect the cloud cost explorers, for GCP and AWS for example, and you can craft your own analytics tools; we use Google Looker Studio for that.

Are there any easy wins that you can apply to achieve something relatively quickly? The answer is yes. The first, and kind of obvious, easy win is: do not execute tasks that are not needed. It may sound obvious, just don't run things, but if you are really growing and you have multiple users on the platform, it's easy to lose track of what is happening on it. For example, in our own case we used to run end-of-life tests for old releases; they were abandoned, so we detected that and disabled them or lowered their frequency.

The second easy win is to minimize worker nodes. This comes from our own use case, but it may also apply to yours: you may not need the three worker nodes that OpenShift creates by default. You can consider scaling them down, even to zero if needed, so as not to waste resources. It can be done manually, it can be done automatically with the autoscaler, or it can be done at installation time.

The next easy win is associated with AWS: you can change the CPU type. We did this for our platform; we changed from Intel CPUs to AMD CPUs, and that brought us a benefit, because AMD CPUs are cheaper on AWS and the performance is more or less the same, so nobody was hurt by doing this. This also refers back to the previous point: besides changing Intel to AMD, inspect the machine types you are running on. If a node has been running for a long time, you might discover that it is an old machine type, and you can try to upgrade to something newer that brings better price performance. This of course has an indirect influence on cost-cutting efforts.
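To make the "minimize workers" and "Intel to AMD" wins concrete, here is a sketch of the relevant fields of an AWS MachineSet. The cluster and MachineSet names are placeholders, and the specific instance types are examples; the right choice depends on your workload.

```yaml
# Sketch of an AWS MachineSet illustrating two easy wins:
#  - replicas can be lowered (even to 0) instead of keeping the default workers,
#  - instanceType can be switched from an Intel type (e.g. m6i.xlarge)
#    to the cheaper AMD equivalent (m6a.xlarge).
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: mycluster-worker-us-east-1a   # placeholder name
  namespace: openshift-machine-api
spec:
  replicas: 0                         # scale workers down when not needed
  template:
    spec:
      providerSpec:
        value:
          instanceType: m6a.xlarge    # AMD instead of Intel m6i.xlarge
```

The same scaling can be automated by pointing a MachineAutoscaler at this MachineSet instead of editing replicas by hand.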
So you will see the effect. You can of course do much more with easy wins, but let me give two more hints. The first is to benefit from savings plans, committed use discounts in the case of GCP, and reserved instances in the case of AWS. For AWS, you can also choose gp3 as your default storage device; it is a very general-purpose volume type that should be ideal for most applications.

All right, so how can you save more by using specific machine types in any cloud provider? As we all know, all cloud providers offer different instance types, but most of our users and customers don't care which type they are going to use; they simply use what OpenShift picks by default, which is a general-purpose type. So how can you optimize that area based on the operations you run? First of all, you need to stop using the default type that OpenShift is using, and then investigate how your operations use the infrastructure. Are you using CPU? Are you using memory? What operations are you doing? For example, if something uses a lot of CPU, you most probably need to run that operation on a CPU-optimized instance type in your cloud provider. How do you start? With the oc tool you can easily start monitoring your infrastructure and see exactly how you are using it, and, as Jacob said before, you will also have Prometheus data and other metrics and monitoring logs to understand exactly how to pick those types.
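The gp3 hint can be sketched as a fragment of the AWS MachineSet providerSpec; the volume size here is an arbitrary example, and gp3 is attractive because it is cheaper per GB than gp2 with independently configurable IOPS and throughput.

```yaml
# Sketch: AWS MachineSet providerSpec fragment selecting gp3 root volumes
# instead of the older gp2 default.
providerSpec:
  value:
    blockDevices:
    - ebs:
        volumeSize: 120      # example size in GiB
        volumeType: gp3
```

For the "investigate how your operations use resources" step, `oc adm top nodes` and `oc adm top pods` give a quick first look at CPU and memory usage before you dig into Prometheus.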
Consider switching to ARM. Nowadays ARM instance types are way cheaper than other types, but this comes with a hidden catch: your operations have to work on this architecture, or you have to be willing to port them to a different architecture. If you can choose ARM nodes, you will be able to save a lot of money on instance usage. Nowadays OpenShift supports heterogeneous clusters, which do exactly that: a heterogeneous cluster gives you control plane nodes on amd64 and the ability to run other machine sets on different architectures, such as ARM. You can use a heterogeneous cluster to have ARM-only worker nodes if that's feasible.

But how do you scale the OpenShift installation itself to decrease cost? The most common mistake users make is that they don't pick the infrastructure they want to spawn based on their needs; by default, OpenShift uses a specific amount of resources with a general default configuration. What if your infrastructure only needs workers? You can use HyperShift to do that. What if your infrastructure needs to maintain, spawn, and then destroy ephemeral clusters? You can use Hive to do that; Hive is really good. There is another option too: you can run single-node OpenShift, but I don't recommend that.

Another thing you can consider is node scale-down pressure on your nodes. This happens by default via the autoscaler, but there are a few cases the autoscaler ignores, and we need to identify them. For example, if your pod is using local storage,
that means the storage exists on that specific node, so the pod can't be evicted or transferred to another node. The node will stay alive with the pod running on it, and you won't get scale-down in that case. The same applies if the pod is not owned by a recognized controller, for example a bare pod that is not part of a Deployment, StatefulSet, or DaemonSet. And of course there is the PDB, the PodDisruptionBudget, which is the most annoying one: it prevents eviction of specific pods based on your needs. So you have to get your configuration right to make sure node scale-down happens correctly. You can bypass some of these cases by applying specific labels or annotations to your pods, and you can also enable high node utilization in the scheduler configuration, which is a CR that we have in OpenShift.

One of the important things is to identify the long-running processes on your nodes. This is a big trap, because if you have different kinds of long-running processes scheduled across specific nodes, those nodes will be kept alive by the autoscaler while the long-running processes are running. Once you identify them, you can use taints, tolerations, or node selectors to control where your long-running processes land; it is better to have them all on one node, so the autoscaler can do its job.

One other thing that we are actually using in our infrastructure is spot instances. Surprisingly, spot instances can reduce the cost dramatically, and we have the option in the OpenShift MachineSet to specify it.
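The MachineSet option mentioned above looks roughly like this on AWS. `spotMarketOptions` is the actual provider field; the names are placeholders, and the instance type shown is a Graviton (ARM) example that would combine the spot and ARM savings, assuming your workloads tolerate both.

```yaml
# Sketch: requesting AWS spot capacity from an OpenShift MachineSet.
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: mycluster-spot-workers        # placeholder
  namespace: openshift-machine-api
spec:
  template:
    spec:
      providerSpec:
        value:
          instanceType: m6g.xlarge    # ARM (Graviton) example type
          spotMarketOptions: {}       # empty object = request spot capacity
```

Separately, for the scale-down blockers above, annotating a pod with `cluster-autoscaler.kubernetes.io/safe-to-evict: "true"` tells the autoscaler it may evict the pod even though it would otherwise be skipped.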
There are a few benefits and a few drawbacks. There are different services that can provide spot instances; I can give Spot.io as an example, because Spot.io is a paid service with a guaranteed return on investment: they charge you based on how much money you saved. So that is really good, and you will save money. The benefits, basically, are that you can completely replace your machines and machine autoscalers for workloads that can tolerate it, and Spot.io can scale down unnecessary nodes to decrease the cost. But there are a few drawbacks, of course: if you are unlucky, the spot instances can become unavailable, and then the Machine API won't be able to fall back to on-demand instances and you will have a deadlock. So before moving to spot instances you have to do some investigation to make sure your operations work correctly.

OK, so the last question we want to ask today is how to optimize data transfer to benefit from lower rates, and here is an example. The example is pretty simple: the cost of downloading from a storage bucket. Using AWS and GCP as examples: if we are in region one in AWS and download from an S3 bucket in the same region, the cost is free or almost free. If we download from a different AWS region, the cost is of course higher. And if we download from a GCS bucket, which is on a completely different platform, the cost is the highest. Why is that? It's because of colocation: the cloud providers encourage us to host as much as possible with them by imposing charges when we consider alternatives. So if we have multiple cloud providers, as we do, optimizing costs should involve prioritizing communication within the same cloud. So how to do this?
Colocate systems and data in the same region, preferably in the same availability zone. Here are direct examples coming from our own infrastructure. We have task and job dispatchers that dispatch things based on cloud and region, we have proxies, and the one shown here is a registry pull-through cache. Let's say we have an external registry and a node, in this example on AWS, and we want to download an image. We go through the pull-through cache pod, and the first time we download the content from the external registry; along the way we also store it in an S3 bucket in region one. The next time, if we are retesting or re-running the thing, a redirect is in place that sends the request from the pull-through cache pod to the S3 bucket, and the S3 bucket serves the content directly to the node on AWS. By doing this we incur less cost, because we reach out to the external registry only once, until we want newer content. It is extremely efficient in our setup, because images change and people rewrite things, so yes, that works.

Another thing I want to briefly mention is eliminating the NAT gateway cost in cloud systems. Ingress is not subject to fees if it doesn't pass through a NAT gateway, but by default OpenShift has NAT gateways. If you want to consider this as a cost-saving option, you have to know that such modifications are not officially supported, but you can do it on AWS as well as on GCP, and I think also on Azure.

There are multiple more topics we could talk about; some of them we briefly mentioned, and some we decided not to include here because they are not such general advice. We could definitely do more in the area of cloud storage, storage instances, and cloud functions.
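A pull-through setup like the one described is often wired up by pointing nodes at a mirror; in OpenShift that can be sketched with an ImageContentSourcePolicy. The registry hostnames here are placeholders, not the speakers' actual endpoints.

```yaml
# Sketch: redirect image pulls from an external registry to an in-cloud
# pull-through cache, so repeat pulls stay inside the region.
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: mirror-external-registry            # placeholder
spec:
  repositoryDigestMirrors:
  - source: registry.example.com/images     # external registry (placeholder)
    mirrors:
    - cache.internal.example.com/images     # in-region pull-through cache (placeholder)
```

With this in place, nodes try the in-region mirror first and fall back to the external source, which is what keeps most transfer traffic inside the cheap colocation boundary.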
We could consider using bare metal machines, and explore more savings plans and reserved instances, but in the end we decided to mention only those areas in our presentation. That's it, thank you very much, and now it's time for questions.

OK, so the question is how much effort we put into it and how big the savings were. The answer to the first question is that it was an effort of the entire team plus a staff engineer, and other teams were involved as well, so the effort was really huge: 10, 15, maybe 20 people working on it. It was not that we were working on it on a daily basis; we had some topics, we identified them, and we introduced them. This initiative has been running for about one and a half years, something like that. As for the savings: plans change and our infrastructure keeps growing, but at the time we introduced the big batch of savings, in the day, week, or month we introduced them, it was around 60 percent.

So, we were talking about a big infrastructure; what about smaller infrastructures? Keeping cost in mind is an ongoing concern while you are designing the architecture of your infrastructure. You want your infrastructure to expose metrics about the resources you are using; you need some kind of monitoring and logging system to maintain that, and once you understand the needs of your infrastructure, you will be able to reduce costs from time to time, or keep taking the more efficient approach. The cost reduction comes every single day, and incrementally you improve the situation.

Now the question is what monitoring stack we are using.
We are using Grafana for dashboards, and Prometheus to expose all our metrics, with in-house tools exposing custom metrics to gather what we need. We are also using AWS CloudWatch to collect all the logs; they go through Vector, and we analyze those logs in JSON format, so we have some querying set up and are able to understand exactly what is happening in our infrastructure. From time to time we have ideas to start using something else, but since it's working, don't tell anyone, you know. Everything is also presented in a UI built with Google Looker Studio, so we have a UI with queries set up, and yes, it's working. But you can start from something simple: you can start with oc, to monitor CPU and memory, and that is also a good starting point. Yeah, although the customers don't want to use oc, you know.

OK, sorry, the question was: when is the right time to start doing savings, and when is the right time to stop? I will answer the second one first: maybe you will never stop. And the answer to the first one: it really depends on your infrastructure and your costs. If they are too high for you, they are too high; you should analyze, and if you analyze and think you can optimize something, then optimize.

Can you repeat the last part? The question was what motivated us to start. To simplify: what motivated us to start was that your manager comes and says "reduce the cost," you know. If we want to simplify, you can say something like that. It can also be that your service is growing and you reach a point where you hear that for the next quarter or year you will have to pay 10 or 20 percent more, or something like that.
And you start to think about what to change, so as not to give this money to the cloud providers, and there are options. To talk more about that, I encourage you to find us outside the doors afterwards, because it is really a long topic.

One thing worth mentioning is that once costs start to increase because your infrastructure is extending, they increase at a really high rate. One day you will see that maybe 20 percent of your cost is up because of that, and then, in a panic, you will start analyzing your data and so on. That is why we advise you to start doing this from scratch: when you want to deploy something or create a cluster, consider for yourself which parts of the default configuration you should change in order to improve the efficiency and cost of the cluster. We are encouraging you to install and operate a cluster with more in-depth knowledge.

It is also important to note that we are not doing this on a daily basis. We try to identify areas, and our users know about that; they know we are trying to identify areas, and it is nice that the users already know we have this initiative. It was nice to see, several days ago, that they started an email thread about what we can do to further optimize cost; they even have some ideas. At the beginning, if your users do not care, you are on your own, but now they are joining and giving ideas, and that's nice. And if they don't care, they can increase your cost: we had a case where a team was spawning a new ephemeral cluster every day just to run a unit test.

Yes, so the question was about the prioritization of workloads. As I said during the presentation, in our case at least some of the workloads were abandoned, I mean end-of-life releases; they were just running because nobody cared.
So the first good piece of advice is: disable them. You cannot always count on your users to do it; you sometimes have to monitor it yourself.

Can you repeat the question? The question was how we got users to care, whether we managed to make them care. The thing is, we have the initiative; we removed their jobs, like on a Saturday night, but the email thread that I saw came out of nowhere, so I didn't motivate anybody to think about that, and that's nice, I'm speaking for myself that it's nice. Sometimes they are affected, though, because they know that something will happen, and there are doubts; I think we had some doubts when we changed machines from Intel to AMD, but they were cleared after some time.

When it comes to cost, all teams are a little cooperative, them too. We did some tuning of their jobs, and our team is in constant communication with all our stakeholders, which makes it easier to communicate with them and at least let them know why we changed their cron job interval, or why we changed the architecture of specific clusters. But a lot of our users are completely free to do whatever they want, including, in some cases, attack vectors, but we trust them; we try to trust them rather than limit them.

Any other questions? All right guys, thank you very much.