Good morning, everybody. Welcome to our panel discussion on cloud computing's first economic recession. Let's talk platform efficiency. Let me start by asking a question: how many of you have a mild efficiency panic at your companies, at your organizations? All right, so you are in the right place. We are a group of end users, and we also have an efficiency panic at work, and we've been spending some time doing optimizations and figuring out the best way to make our platforms efficient. Back in the day, infrastructure was a capital expenditure: you'd have a data center, and the spend was predictable. These days, with the move to cloud, it's a pay-as-you-go model. It's really becoming an op-ex, an operational expense, and it's impacting companies greatly. I think that's why we have this increased focus on optimizing and squeezing out more; doing more with less is the mantra these days. So in this talk we're going to cover three aspects: culture, operations, and design. Those are the three broad categories we'll be talking about today. With that, we'll introduce ourselves, and I'll hand it over to Todd to facilitate the panel.

Yeah, thanks, Aparna. Yeah, go ahead, Phil.

Hello, my name's Phil Wittrock, and I work at Apple on Kubernetes.

I'm Nagu. I work at Zalando in the cloud infrastructure department, managing two teams, the Container Platform and Cloud Cost Efficiency teams.

My name is Aparna Subramanian. I'm Director of Production Engineering at Shopify, and I've been involved in platform efficiency efforts at Shopify for the past year and a half.

Hi, I'm Todd Ekenstam. I'm a Principal Engineer at Intuit, working in our core systems team, which also includes our Kubernetes platform. Well, let's get started talking about culture. Obviously, the goal is ultimately to increase efficiency and reduce costs. But where do you really start? How can you gain organizational commitment to improve efficiency and reduce costs? Phil, do you want to start?

Yeah, I can take this one. One thing I think is helpful is to start out measuring where your big wins are. Where do you want to focus? What's going to move the needle a lot but take a long time to do? What's maybe not going to move it as much, but is very easy to get done? And from there, figure out who the right folks to engage with are and which are the right teams, so you can start moving forward.

Yeah, and I can add to that. This has to be somebody's problem, right? There has to be some center of excellence or a FinOps practice, and they have to be in charge of making sure everybody knows this is a consideration they have to worry about. Too often we run into the situation where it's everybody's problem, so it's nobody's problem. So I think having that central team is really important, but it's also important to understand that it doesn't suddenly become only the central team's responsibility to make sure the platform is efficient. It has to be a collaboration between engineering, finance, and procurement, which is probably the team negotiating contracts with your cloud vendor and other vendors. It really has to be a collaboration between all these different teams, and that's how you bring about awareness and accountability.

Those are great points.
And I would like to just add on by saying you want to drive cost ownership through financial practices, as Aparna said. You have to have budgets based on accounts, teams, and business units. You can also build awareness of how important cost efficiency is, because most of the time we hear from the business that delivering a feature is more important, because that's what brings in money for the company. But you have to equate cost efficiency with your GMV (gross merchandise volume) or revenue so that they understand how it impacts the bottom line in the end. And as Phil mentioned, you should measure everything. At our company we have a lot of metrics. Some look at the resource level, like per-vCPU cost or per-gigabyte-of-memory cost. We also have unit costs based on applications, like cart or search. And we have much higher-level metrics, like connecting infrastructure cost to your GMV or revenue, so we can see whether it makes sense to invest or spend that much.

Yeah, great, thanks. Actually, let me get a show of hands: how many people know what their cloud bill is, how much they're spending in the cloud? Okay. To me, that's the first step: you need to know what you're spending. The old adage holds true, you can't manage what you don't measure, right? So first, measure that. Now let me see a show of hands: who knows how much a particular service or application costs, not just the whole bill? Okay, a few, good. I think the big challenge is taking that big cloud bill and attributing it to individual teams and individual applications, because only when you have that visibility do you know where you have opportunities to improve. At Intuit, we have a dev portal where we track all of our different software assets, whether it's a service or an application. Each has an asset ID, and that asset ID is propagated and tagged onto all the resources required to support that service or application. We then aggregate all the billing data and attribute it based on those tags. That gives us fine-grained visibility: we can put a number in front of development teams and roll that number up to the various directors, VPs, and so forth. Because it's not enough to give the top level, the CTO or the CEO, the bill, right? You need to give that visibility to the people who can actually make decisions and change how the system operates.
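As a rough sketch of what that tagging can look like on Kubernetes, here is a Deployment carrying a hypothetical asset ID as labels at both the workload and pod level, so billing and cost-allocation tools can aggregate usage by them. The label keys, values, and image are illustrative, not Intuit's actual scheme.

```yaml
# Sketch: propagate an asset ID onto a service's workload so billing
# data can be grouped per application. All names here are hypothetical;
# use whatever taxonomy your FinOps tooling aggregates on.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  labels:
    asset-id: "svc-12345"      # ties back to the dev-portal asset
    team: payments
    cost-center: "cc-commerce"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
        asset-id: "svc-12345"  # pod-level copy so per-pod cost tools can attribute usage
        team: payments
    spec:
      containers:
        - name: payments-api
          image: registry.example.com/payments-api:1.4.2
          resources:
            requests:
              cpu: "500m"
              memory: 512Mi
```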
That level of visibility is really the first starting point. When we started looking into things more closely at Shopify, we could see clearly from the cloud bill what the different projects and clusters were. But that's not exactly helpful, right? Because if you have a multi-tenanted platform, you want to know how much app A is costing and how much app B is costing. And if you have shared platforms, like database platforms or logging and observability, these are all multi-tenanted systems in their own right, so you want to be able to attribute each platform's cost to each app. I think that's where you can then drive accountability: go to that team, or that director or VP, and hold them responsible for making the changes necessary to improve their efficiency. So how else do you give visibility into cost in your organizations?

Yeah, as Todd mentioned, tagging is a great way. That's probably the only way to understand who owns a particular resource. We also try to attribute cost at the most atomic level, like the application, and then you can work backwards to teams and business units. As Todd said, you cannot manage what you don't measure, and I also believe you won't get any attention on these kinds of topics if you don't report on them. You have to create reports catered to specific use cases and users. Engineers, I believe, want real-time data, anomaly detection, and alerts, so they can react really quickly to cost incidents. For managers and management, it could be more like weekly cost-change reports. We also try to sneak this into our operational review meetings, just to make it a topic, and we have business review meetings with our cloud vendors and even with internal business units; when they're talking business review, they also talk about infrastructure costs, so the topic gets more visibility and we can react to it.

Yeah, and I'll add that once you have that visibility, it's important to get down to the level of unit cost. Because if app A is costing $500 and I'm a developer on that team, well, great, what do I do with that information? But if it gets translated into, say, every transaction used to cost 50 cents and now it costs 75 cents, then I have something to work with and to continuously track, improve, and optimize. So once you have an absolute number per application, it's very helpful to get down to that unit cost and keep tracking it continuously.

One thing I find is that sometimes we'll get cost alerts saying, oh, this resource is idle, you should go shut it down, but you can't always shut idle things down. Phil, can you talk a little more about idle resources and how you manage them?

Yeah, absolutely. What optimal efficiency looks like may be surprising. Preemption is very disruptive, for instance, or can be. You may have a run-to-completion workload that isn't checkpointing; it's 50% done, or nearly done, after running for a while, and then it gets preempted because you don't have any idle resources and you need to roll something out. In that case, having a bit of headroom so that workload doesn't get preempted in the middle of what it was doing may be more efficient, even though your dashboards don't make it look that way. Or take the passive side: if you have active-passive DR, what does optimal efficiency look like for the passive instance? Is it 100% utilization? Probably not. Same thing with HA. So idle resources may be an artifact of what the platform you're running on offers, and maybe that slack just needs to be there for your availability needs.

Yeah, and I think it's important to know what good slack looks like, right? Having that common understanding of efficiency and waste for each application, across the variety of stakeholders we talked about (FinOps, finance, engineering teams), is really important.
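Phil's headroom point is often implemented with the well-known overprovisioning pattern: a low-priority placeholder deployment reserves spare capacity, gets preempted the instant a real workload needs the room, and then triggers the cluster autoscaler to replenish the slack. A minimal sketch, with illustrative sizes:

```yaml
# A PriorityClass below normal workloads; pods using it are preempted
# first, so they act as evictable headroom.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
description: "Placeholder pods that reserve headroom and yield to real workloads."
---
# Pause pods that hold the spare capacity. When a real pod needs the
# space, these are preempted; the cluster autoscaler then adds a node
# to reschedule them, replenishing the headroom.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2                      # how much slack to keep; tune per cluster
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
```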
So I'm curious to hear your take on this, Nagu. Shopify is an e-commerce platform, and sometimes we have to reserve capacity and scale up all the way because there's a big flash sale coming. At that time you don't want to be scaled all the way down, and you don't want your autoscaler, and your cluster autoscaler, kicking in and doing all of these things. So I think there are times when you want to protect your reputation, and it's not about efficiency.

Yeah. As infrastructure teams, it's sometimes hard for us to understand what the application owners actually want to do, why they have, I don't know, a primary and a secondary, and what their HA needs are. At Zalando, especially in the case of flash sales, we use CRDs to pre-scale our clusters and to scale back down when it's done. This is really effective for us because we can ensure business continuity and availability, and at the same time make sure that when things are done, the house is clean.

Yeah, that's a really good point. Black Friday/Cyber Monday is a really big event for Shopify, and what we do is leave things scaled up for a certain period of time, because there's Christmas and the peak shopping season. But once we're able to monitor the traffic coming into the platform, the FinOps team pays close attention to making sure things go back to normal. I think that's why you need that central FinOps team: somebody is looking at this every day and reaching out to the appropriate teams to take action.
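Zalando's pre-scaling runs through its own CRDs, which aren't shown here; as a generic sketch of the same idea, a CronJob could raise an HPA's floor ahead of a known sale, with a mirror job lowering it again afterwards. All names, dates, and replica counts are hypothetical, and the service account would need RBAC permission to patch HPAs (not shown).

```yaml
# Sketch of scheduled pre-scaling without a custom controller: bump the
# HPA floor ahead of a known traffic spike; a second CronJob (not shown)
# restores it afterwards.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: prescale-up
spec:
  schedule: "0 6 24 11 *"          # hypothetical: 06:00 on November 24
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: prescaler   # needs RBAC to patch HPAs
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.29
              command:
                - kubectl
                - patch
                - hpa/storefront
                - --type=merge
                - -p
                - '{"spec":{"minReplicas":80}}'
```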
Yeah, I think that's a good segue into our next section, operations. But before we move on, I wanted to bring up that some costs are difficult to see or to manage, and one particularly difficult area is observability. Sometimes developers will just emit logs and metrics without really thinking about the fact that it costs money to store those things, and that can add up quite a bit. It's also difficult to attribute those costs back to the biggest offenders, the biggest loggers who are wasting log space or emitting too many metrics. But at the same time, there's a cost to not having those logs and metrics when you need to debug or troubleshoot the platform. But yeah, with that, let's move on to operations.

So we talked about Black Friday. At Intuit we also have different seasonal workloads. We have TurboTax, which is really seasonal around tax season. We have QuickBooks, which is very busy nine to five and not as busy outside of nine to five. So we have a mix of workloads, and because CPU, memory, and compute resources are a big component of cost, you really need to see how you can make your clusters and applications run most efficiently to minimize cost while still providing the services you need to. Anyone else have thoughts on how you optimize your compute? You talked about scaling up; how do you actually scale up for Black Friday?

Yeah, as Nagu mentioned, for Black Friday/Cyber Monday it's all about protecting our reputation. So we scale all the way up to the projected traffic, and we actually disable autoscaling for that period, because we don't want the system churning, and, as Phil mentioned, preemption and all of that is costly to the system. But at other times we do leverage autoscaling, all the way down. We lean heavily on HPA, and we don't do automatic VPA. As the platform team, we use VPA to recommend what we think the right memory and CPU settings should be, and we make that suggestion to the respective application team through a Slack channel, automated PRs, and things like that. And then it's up to the application team, because they know the specific nature of their workload, to review the recommendation and make the appropriate changes. So we leverage HPA and the cluster autoscaler, and at all times other than Black Friday/Cyber Monday, that's our mode of operation.

Yeah, so I think autoscaling is a key capability: being able to autoscale not only your application but also your cluster up and down. For cost, that's obviously the best, but it does come with some disruption. So how can you minimize that disruption? A lot of it starts with making sure the apps can be disrupted. You can't launch a pod in Kubernetes and expect that pod to live forever; that's not the intent of Kubernetes. The twelve-factor app methodology has a concept of disposability, which means any one instance of the app needs to be capable of terminating, gracefully exiting, and being disposed of at any time. When Kubernetes wants to scale down a replica, it sends a SIGTERM, and the application needs to handle that gracefully; it shouldn't be a big event. Not all apps are perfect, so there are some cases where you want to try to avoid that, but we should always strive to allow that kind of dynamic scaling up and down. What are some misconceptions about autoscaling? It seems like a panacea, oh yeah, just do autoscaling, but is it really that easy?

Yeah, I think vertical autoscaling, as you mentioned, Aparna, is a tricky one. Does the vertical autoscaler have the context to know the difference between a failover state and a non-failover state? Or is it going to see an increase in resource usage due to a failover and say, hey, we need to change the pod spec and start a rollout right now? As for the recommendations, I think there's a lot to be said for checking them in as part of your regular release process: you can roll them back, they're audited, these sorts of things. Also, a lot of metrics get measured as averages over intervals, maybe 30 seconds or so. One thing we've seen is that if you measure at a one-second interval, the P95 is actually quite a different number, and that can help with setting your limits and with how you think about bursting.

We also don't recommend VPA to our application owners, but we do use it for our cluster components, for singleton applications, because it doesn't matter if those have a little bit of downtime. When it comes to HPA, we also ask teams to be very mindful of the thresholds they set, the target utilization and the minimum replicas, because sometimes they can be a little too generous, and during non-peak hours that's going to cost a lot of money. So we ask them to be mindful of those things even with HPA.
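As a sketch of the two mechanisms just described, here is an HPA with a deliberately chosen floor and target, alongside a VPA running in recommendation-only mode, the setup where it publishes request suggestions without resizing anything, so a platform team can surface them as Slack nudges or automated PRs. Workload names and numbers are hypothetical.

```yaml
# HPA: be deliberate about minReplicas and the utilization target;
# a generous floor is paid for around the clock, including off-peak.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: storefront
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: storefront
  minReplicas: 3            # the floor you pay for during non-peak hours
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
# VPA in recommendation-only mode: it never evicts or resizes pods, it
# just records recommendations for the owning team to review and apply.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: storefront
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: storefront
  updatePolicy:
    updateMode: "Off"
```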
Yeah, maybe let's pause there. Can I see a show of hands: how many people are using HPA, the Horizontal Pod Autoscaler, or something equivalent? Okay, a lot of people are using that. What about VPA, or some mechanism to vertically scale your pods? Okay, a few less. How many people are using the cluster autoscaler? And how many are running on-prem?

So there are some different challenges between the cloud and on-prem, but a lot of the takeaways are the same for both. On-prem you have to buy servers and rack them, which takes time; you can't just order more the way you can from a cloud provider. But it turns out that when you're in the cloud, you're probably dealing with reserved instances, with a certain committed number of servers or instances you're buying from your cloud provider. So you have a similar capacity-planning problem in both the cloud and on-prem.

I have a question for you, Todd. I know you've mentioned that at Intuit you have a really nice way of rotating your clusters every once in a while. One problem we've seen at Shopify is that with autoscaling all the way down, including the cluster autoscaler, your cluster can sometimes become fragmented, and you can end up with, for lack of a better word, empty pockets of available resources that never get reaped. And I know that at Intuit you have some interesting ways of dealing with that.

Yeah, so that's bin packing, right? Over time, as applications scale up and down, pods get scheduled onto different nodes. Some nodes can end up underutilized, maybe with only one pod on them, not really taking full advantage of the node's resources. There's a project called the descheduler that will actually go and evict those pods so they get rescheduled onto other nodes, restoring the bin packing.
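For reference, the descheduler is driven by a policy file; an illustrative example of its LowNodeUtilization strategy (v1alpha1 schema, with example thresholds) looks like this. Nodes below `thresholds` count as underutilized, and pods are evicted from nodes above `targetThresholds` so the scheduler can repack them onto the emptier nodes.

```yaml
# Illustrative descheduler policy: evict pods from heavily loaded nodes
# so they can be rescheduled onto underutilized ones, restoring bin
# packing. The numbers are examples, not recommendations.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:          # below this, a node is "underutilized"
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:    # above this, a node is a source of evictions
          cpu: 50
          memory: 50
          pods: 50
```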
At Intuit, we actually terminate our nodes every 25 days or so, regardless, partly for security and compliance reasons. But it also has the side effect of forcing all those applications to get rescheduled, and it trains our developers: hey, I can't count on these pods running forever. It's okay that they terminate; it's okay that they come back up. By doing that, we've helped build a culture and an understanding of how Kubernetes works among our developers. For other compliance reasons, we also need to update the AMI and security patches on all the nodes, so regardless of the 25-day period, we'll also go through and rotate all the nodes to update them. All of these things reinforce the fact that you need to be able to scale, which is good for operations, but it also applies to autoscaling, scaling up and scaling down.

At Intuit we're also looking into developing a system that takes the recommendations from VPA, plus the historical metrics we collect for each application, and uses them to make decisions and recommendations for both VPA and HPA. Typically you can't, or shouldn't, run VPA and HPA concurrently on the same workload, at least not on the same resource metrics; they end up working at cross purposes. So we're trying to build a system that integrates those recommendations and then, as was mentioned, applies them through the pipeline using GitOps, so that if you change a pod's resources, that change starts back in your pre-prod environment, gets tested and validated there, and works its way through the pipeline to production. We don't want to suddenly change resources in production without being able to test them first. So that's the approach we're taking. And we've also open sourced Numalogic and Numaflow, open source machine learning and analysis projects that run in the cluster and can take advantage of Prometheus or other data available in the cluster to help make those decisions. So we're looking to develop that more going forward.

So I think we'll move on to the next topic. The next big area is design. We talked a little earlier about designing a twelve-factor app that allows for disposability. That's almost the first key aspect of design: making sure your applications are cloud-native, can run on Kubernetes, and tolerate disruptions like that. But what are some other ideas? What other aspects of the architecture, design, or coding of applications might impact cost and efficiency?

Yeah, so when we think of efficiency, people often think it all happens at the infrastructure layer and that it's the responsibility of the platform team or the infrastructure team to take care of it. But it truly is a partnership with the application teams as well. Something Shopify has been working on recently is continuous profiling of applications, because you don't want to just tell application developers: profile your application, make sure it's efficient and optimal at all times. To reduce that friction, we've rolled out a continuous profiling feature where every application is profiled continuously at a certain sample rate. With this, it's very easy for developers to look at their profiles and see what their application is doing and when. We've been able to identify opportunities like a static list of objects that gets created every time a process starts up. It may not seem like much, but if you think about how many processes you have across your clusters, across your fleet, and how many times that code executes, it adds up to a lot of unnecessary CPU cycles. So building tools like that and enabling application developers to make the right decisions is also a key part of efficiency and optimization.

One of the things we do at Intuit is that whenever we have a new release of our platform, we run it through FMEA (failure mode and effects analysis) testing. One part of that is a load test, a known, repeatable load test that runs a certain number of transactions; we use Gatling to kick off a bunch of transactions from a bunch of different workers. The application autoscales through HPA, and we measure how many nodes it took to handle that workload. That helps us identify performance regressions, and performance regressions are quite often cost regressions, because if you suddenly need more nodes to do the same workload, it's going to cost you more. That's one technique we've used to compare different releases. We've also used it to compare different processor types. You might be thinking about using Graviton ARM machines, or maybe Intel or AMD, and based on my experience, there's no silver bullet saying one processor type is always going to be better for all applications. It really depends on each individual application. It seems like a bit of a cop-out answer, but you do need to test and you do need to measure, and running this type of stress workload is how you can identify that: make one small change, such as the processor type of the node, rerun the test, and see whether you're using more nodes or fewer for the same amount of work processed.
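One simple way to run that comparison, sketched below, is to pin a variant of the load-test deployment to a given CPU architecture using the well-known `kubernetes.io/arch` node label and replay the same load test against each variant. The node groups must already exist, the image must be multi-arch, and all names are hypothetical.

```yaml
# Sketch: pin one variant of a load-test deployment to arm64 (e.g.
# Graviton) nodes so the same load-test scenario can be replayed per
# processor type and node counts compared.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: storefront-loadtest-arm64
spec:
  replicas: 1
  selector:
    matchLabels:
      app: storefront-loadtest-arm64
  template:
    metadata:
      labels:
        app: storefront-loadtest-arm64
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64   # swap to amd64 for the Intel/AMD run
      containers:
        - name: storefront
          image: registry.example.com/storefront:1.4.2  # must be a multi-arch image
```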
Any other thoughts about designing?

I think it goes back to thinking carefully about what you're measuring, right? When you initially measure a workload, maybe the utilization looks great; then the app developer spends a month optimizing all their code, and now the utilization looks a lot less good. Does your dashboard say things are better now, or worse? Because you don't want it to say they're worse. So thinking about the holistic picture is always important.

Great. Yeah, as platform engineers you often don't have visibility into the inner workings of the applications running on Kubernetes. So if you're working on a platform team, it's really a partnership with the development teams that are using that platform; you have to form good partnerships. And I see that we're running close on time, and we did want to leave some time for questions, so let me just summarize some key takeaways. We have the three pillars of platform efficiency that we talked about: culture, with the different aspects you can see on the screen, operations, and design. I hope that gives you a good framework for thinking about cost optimization and how you can do more with less. One last thing, a quick pitch while people are still taking pictures: I'll show that list again, and I'd also encourage you to scan this QR code and sign up for the end user meeting. Every other week we have almost the same kind of conversation you just saw here, so if you're a CNCF end user member, it's a free opportunity to join, learn, and share. So with that, we'll open up the mics for any questions. Oh, no, there.

So my question would be: the institutions you come from are quite big. Have you ever worked with spot instances in production?

So, have any of us worked at smaller companies? We're all relatively large companies. But yes, we do have a healthy usage of spot instances at Zalando. We run all our test clusters on spot, and we've also introduced a way to opt into spot instances for production as well. It's a great benefit; you save about 70% of the cost. And for workloads that can tolerate downtime, it's obviously a great choice.
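As a sketch of what such an opt-in can look like, assuming a node pool that has been labeled and tainted for spot capacity; the exact label and taint keys vary by cloud provider and cluster setup, so treat these as placeholders:

```yaml
# Sketch: a workload opting into spot capacity. The node pool is assumed
# to be labeled/tainted for spot; the keys below are placeholders, as
# each provider and cluster setup names them differently.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 10
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      nodeSelector:
        node.example.com/capacity-type: spot   # placeholder label
      tolerations:
        - key: node.example.com/spot           # placeholder taint
          operator: Exists
          effect: NoSchedule
      containers:
        - name: worker
          image: registry.example.com/batch-worker:2.1.0
          # Workloads scheduled here must tolerate sudden termination;
          # spot nodes can be reclaimed with only a short notice period.
```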
Question over there? Yeah.

Well, thanks for the talk, first of all, very interesting. I've tried to convince my developer colleagues: hey, could you optimize this and that? And I heard you have the same problem; they say, hey, we're paid to build features, not to optimize stuff. Is there any recommendation, besides showing the cost of what an individual unit, app, whatever, incurs, for convincing people that it's worth investing their time in optimization?

Yeah, I think at the end of the day it goes to your company's bottom line, right? If you're spending too much on infrastructure and cloud, that means you have less money to spend on hiring more developers or doing other useful work for your company. So that part should make sense to people: if we can save on this cost, we can use that same money elsewhere in the organization. But you're right, it's difficult to balance feature development with cost optimization. When you do take these things on, you need to approach them as something you put in the backlog and pick up as a normal part of work, not something that comes in from the side and changes all the priorities. You need to plan it as part of your roadmap and make these improvements incrementally.

Yeah, and the other pitch is that engineers take pride in building scalable, performant systems, right? Scalability and performance are good things that everybody takes pride in, and similarly, I think it's important to take pride in building cost-efficient systems, and in knowing the unit cost per transaction or per thing my application is doing. Starting that conversation within the company, saying we as a team take pride in building and operating efficient services, is another good way to encourage people.

Yeah, that's a good point, especially if you have a dashboard showing the cost. Engineers can take pride in it: last month it was this, this month it's that. You can really see the work you do making a difference in the cost. So yeah, good luck.

And the framing might be "when", right? Instead of asking whether we do this at all, ask when we do it. Is it never? And "when" probably shouldn't be when it's way too late, you know?

Yeah, great, thanks. Thank you. A question on this side?

Yeah, so my question basically is this: you spoke about the cooperation between the application teams and the platform team. Generally, where does the responsibility for the actual cost lie? Should the platform team just mirror the cost back, saying each team has this cost per service? Who is responsible for deciding that a certain service is costing more than other services and should reduce that? Is it the application teams themselves?

It's difficult. At the end of the day, it comes down to business value and what your ROI is. If you have a service that's generating a lot of revenue, obviously it's probably allowed to spend more money than something that's not as important. But not all critical services generate revenue directly, right? So it's difficult to really put a number on it; it's something you have to arrive at by consensus between business and engineering. And with that, I think we're out of time for questions, but we'll stick around, so feel free to come up and talk to us at the front. Thank you, everyone.