I just realized that we, as room organizers, didn't actually line up MCs, so I'm just going to MC myself: next speaker is me, yay. We need to figure out MCing for the others. For any speakers who haven't heard this before: if you want lapel mics, you can get miked up back there, about five minutes before your slot. Also, in the breaks, feel free to test the laptop; but this one works pretty nicely. So, on to the actual talk. This is me talking about business observability at Grafana Labs. I need to start with a disclaimer. These slides are not as polished as they usually are when I give a talk. I got a pretty substantial concussion 10 days ago and only got cleared to travel last Friday, so this was basically me putting the slides together over the weekend. Normally they look nicer. In the words attributed to Mark Twain: if I had more time, I would have written a shorter letter. They would be prettier and not as text-heavy. So sorry, but it is what it is. Setting the scene, and going in strong on purpose; you'll see why on the next slide. We have always been very cost-conscious in everything we do, which set us apart from most of the other hyper-growth, all-the-buzzwords companies. We never bought growth with the basically free money which was flowing out of the central banks. We were always quite conscious about not losing our optionality, because the thing is, if you are cost-conscious at all times, you keep the option to increase your spend or to turn your spend down; you can actually choose. If you don't have this, and you've seen this in the market, then, for the finance people: the runway is more like a highway. And we never had layoffs; we never completely stopped hiring or anything. All of this is not just me saying, hey, we are great. There is a reason why I'm going in so strongly: these two slides are basically for your manager.
Because I do have some experience talking to potential customers and others about these topics, and there are a few strongly held misconceptions. So by me saying, hey, we did the thing right, and this is what we tell you as a harsh truth, my intention is to enable you to take this back internally and tell people in your company: hey, listen to this external person if you won't listen to me. I have been to a few FinOps events, not as a speaker, at least not physically, but I've seen quite a few, and most of them have an absolute overhang of money and finance people; not engineers, not tech people. And this is a substantial strategic mistake on the part of both the finance people and the tech people. Finance people hate hearing this, but they're also kind of relieved, because it means they don't have to do the work: it is much, much easier to get engineers to do FinOps and proper cost control than to get finance people to do it. For engineers, cost is just one more thing to track and optimize for; it's not something completely new or unheard of. It's just a number you have to optimize, to put it very simplistically. A finance person, by contrast, would have to start understanding how operations works; they have a completely different starting point. But also, to be very blunt, and again, this is for taking back into your companies: same as with all the other buzzwords that have worked for other companies, it's not free. You can't just slap the new label on the old thing and hope something different happens. You actually have to change parts of what you do and how you do it, very deliberately. Ideally this flows into your culture, because otherwise it's not going to be sustainable for long.
So, speaking of culture: there is a saying that culture eats strategy for breakfast. And in my experience this is very much true. Unless you have a culture which actually points in this direction, it's going to be exceedingly hard to do more than a short-lived flash in the pan. You might have one or two wins, but it's not going to be substantial. And one of the best ways, in this as in everything else, to actually change the culture within a company, or pretty much any group, is accountability and transparency: information for everyone. It's very important that you don't hide these numbers, with every team seeing only its own number and finance maybe not even seeing everything. You need to make this discoverable for everyone who has an interest in it, and you must also push it towards the people who should have an interest. You will see this in a second. People also observe what behavior is rewarded: who gets promoted, who gets called out in the weekly or the monthly. Those are knobs which you, or your manager, can turn. Put the one engineer who did some cost saving onto a pedestal, say, hey, this person did really great work, and do this in front of all the engineers. The others will also want a piece of that cake. So what are they going to do? The thing which gets rewarded. Quite a few of you have probably heard me speak in the past, and if you did, chances are it was about SLOs and SLO reporting and how they create a culture of accountability. The same is true for cost. For those who do not know: at Grafana we have a weekly SLO report, an email which goes out every Monday morning, European time, to everyone: all the engineers, senior leadership of the company, sales people, basically everyone in a leadership position or within the engineering org. We had a great week? Everyone gets the email.
We had a shitty week? Everyone gets the email. This creates accountability and a cadence of: this is the information, this is the ground truth, across the company. And we introduced the same for cost. So every Monday morning we now get two emails, one for cost, one for SLOs. I mean, we get more emails than that, but you catch my drift. Again, this goes to all the engineers, all the company leadership, everyone. The contents look back over the last 30 days, and if you care, take a picture of this, download the slides afterwards, or watch the recording, but these are the things which we found useful to highlight. The to-dos of each team, again for transparency and accountability: hey, did things actually change and improve? Total cost. Total idle percentage: how much are we wasting on stuff we pay cloud providers for that isn't actually being used? Cost per cloud provider, and growth or decrease for all of those. You can read it yourself; you probably already did. Also, all the prices we show in this are list prices. We don't show the discounted prices. You can look them up if you want to, we don't hide them, but we don't want anyone optimizing for discounted prices. Discounts are going to change over time anyway, and we like to think we're pretty good at negotiations. Things change, like moving workloads between vendors, and technology changes; we want this to be comparable over the long term. It can also simply be unfair if one team happens to have a workload which is harder to discount for whatever reason. So everyone gets to see the list price. And the idle resources are accounted to the platform team, so the platform team also has an incentive to get things improved. It's not just magicked away; this is cost which we actually track. I see two people smiling in pain.
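To make the shape of such a weekly digest concrete, here is a minimal sketch of the aggregation step. This is not Grafana's actual pipeline; the row layout, field names, and numbers are all assumptions for illustration:

```python
from collections import defaultdict

def build_cost_report(rows, previous_total):
    """Aggregate raw per-team cost rows into a weekly digest.

    Each row is assumed to look like:
      {"team": ..., "provider": ..., "cost": ..., "idle": ...}
    where "cost" is the 30-day list-price spend and "idle" is the
    portion of that spend paying for unused capacity.
    """
    total = sum(r["cost"] for r in rows)
    idle = sum(r["idle"] for r in rows)
    per_provider = defaultdict(float)
    for r in rows:
        per_provider[r["provider"]] += r["cost"]
    return {
        "total": total,
        # Share of the bill paying for capacity nothing is using.
        "idle_pct": 100.0 * idle / total if total else 0.0,
        "per_provider": dict(per_provider),
        # Growth vs. the previous 30-day window, as a percentage.
        "growth_pct": 100.0 * (total - previous_total) / previous_total,
    }

# Hypothetical data for two teams on two providers.
rows = [
    {"team": "loki",  "provider": "gcp", "cost": 1000.0, "idle": 100.0},
    {"team": "mimir", "provider": "aws", "cost": 3000.0, "idle": 300.0},
]
report = build_cost_report(rows, previous_total=3200.0)
```

The point of keeping the report this simple is that it can run unattended every Monday; the talk returns to why full automation matters in the Q&A.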
Also, and this is not part of this report, but it is important: attribute costs to your customers as well. By doing this you automatically get into a position where you can do proper unit costs and per-customer margin calculations. All of those things suddenly flow out of the engineering org, and you will have very happy finance people who will actually work to get you more resources for your efforts. They are the people who have and control the money, and now they get information which they need; this benefits both of you. You can also identify specific customers who have X amount of resources statically assigned to them. If we see a drop in data ingestion for whatever reason, engineering and/or the account teams immediately find it, and you see conversations like: hey, why did this happen? Was there an outage? Did the customer have any issues? Let's reach out. Or: okay, this is deliberate, so we scale down, within whatever is contractually obliged and talking to the customer first, of course. Because why should we, and by extension the customer, pay for stuff which is just unused if they currently don't need it? I was talking about working with finance. One thing which is going to be very surprising to quite a few of you, and was also very surprising to me: at the end of the day, unless we're talking orders of magnitude or something crazy, finance will always choose a predictable spend over a lower spend. Always. Their issue is that they need to plan, and not just for the next month: for the next year and beyond. So they will always optimize for knowing more rather than for spending less. This is something to keep in the back of your head, because for most engineers this is heresy and doesn't make any sense; from their perspective it actually makes sense. And start by covering the obvious.
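The unit-cost and margin calculations the paragraph above describes are arithmetically trivial once cost is attributed; the hard part is the attribution. A sketch, with all numbers hypothetical:

```python
def unit_cost(total_cost, units):
    """Cost per unit of value delivered, e.g. per GB ingested."""
    return total_cost / units

def gross_margin(revenue, attributed_cost):
    """Per-customer gross margin once infrastructure cost is attributed."""
    return (revenue - attributed_cost) / revenue

# Hypothetical customer: pays 10,000/month, and their workloads cost
# 3,000/month in attributed (list-price) infrastructure spend,
# covering 50,000 GB of ingestion.
margin = gross_margin(10_000, 3_000)   # 0.7, i.e. 70% gross margin
per_gb = unit_cost(3_000, 50_000)      # 0.06 per GB ingested
```

Once these numbers exist per customer, the drop-in-ingestion conversations the talk mentions fall out naturally: a static allocation with falling usage shows up as a worsening unit cost.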
For example by technology, by team, by whatever, and break this out in a very easy way. Our finance people get the same cost report. Once you have this relationship and this trust built up, work with them and extract what they actually need, and give it to them, because honestly, you are not the expert in what finance needs to do their jobs; unless you're a finance person, and I don't think there are any in the audience, but you get what I mean. Most other parts of your org are going to be really, really bad at giving finance what they want, when they want it. So if you can automate this, if you can build a pipeline which finance comes to appreciate over time, then, again, magically you get more resources, because all of a sudden finance says: let's give those people more money and more resources. This is highly beneficial to both sides. So how do you get the data? A few cloud providers give you nice metrics on this; a few don't. I think we can say that no cloud provider has a strong incentive to give you perfect visibility into your cost structure and immediate ways to substantially decrease your spend for the long run, for obvious reasons. Some offer more, some offer less, but whatever you get, ingest it and persist it so you have it. We like to store the raw data in Prometheus and Mimir directly, because this is literally what they are made for, and it's just metrics at the end of the day. As long as the storage is available, you can do whatever you want with those metrics, and queries are really quick to run on what is, from the perspective of a proper observability tool, a tiny amount of data. You also become able to correlate with other things: your revenue, your outages. You can make more and more interconnections, much more easily, if you have everything in the same backend.
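As a sketch of the "ingest whatever the provider gives you" step: billing rows can be rendered in the Prometheus text exposition format, so a scrape target (or a remote-write shim) can hand them to Prometheus or Mimir. The metric name and label set here are invented for illustration, not what Grafana actually uses:

```python
def to_exposition(rows):
    """Render billing rows as Prometheus text exposition format.

    Each row is assumed to look like:
      {"labels": {"provider": ..., "team": ..., ...}, "usd": ...}
    Labels are sorted so the output is stable across runs.
    """
    out = ["# TYPE cloud_cost_usd_total counter"]
    for r in rows:
        labels = ",".join(f'{k}="{v}"' for k, v in sorted(r["labels"].items()))
        out.append(f"cloud_cost_usd_total{{{labels}}} {r['usd']}")
    return "\n".join(out)

text = to_exposition([
    {"labels": {"provider": "gcp", "team": "loki"}, "usd": 123.45},
    {"labels": {"provider": "aws", "team": "mimir"}, "usd": 678.90},
])
```

Modeling cost as a counter (cumulative dollars) rather than a gauge is a design choice that matters later: it is what makes `increase()`-style queries over 30-day windows possible, which comes up again in the Q&A.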
Also, in the spirit of full transparency, and Mark and Juanjo can talk about this later, so feel free to come talk to us: we are currently evaluating what precisely we are using OpenCost for, because it is relatively heavy for what we need. Heavy both in the amount of processes and code being run, and in the amount of PromQL queries being run against the backends. There is also a strong focus on Kubernetes, so things like "how much does this object storage cost" or "how much does this network cost" are not easily done. So we are currently looking at this. Now, a few stories from the trenches. When Phlare was still very secret, before anyone talked about buying Pyroscope; well, not no one, but that's a different topic. When Phlare was still a secret within Grafana and not officially launched. We have a system of hackathons, three times a year; again, as I said, these slides are not super structured. Everyone, doesn't matter if they're in finance or engineering or user success or whatever, gets a week of time to do basically whatever they want. Obviously, if they're on call and such, this needs to be taken into account, but beyond those obvious constraints they get up to three weeks per year of doing whatever they want. The basic rule is that at the end you have to present what you did and what your results were, and if you have no results, fine, you're not going to get fired or anything, but the presentation at the end is the one requirement. So one of the engineers decided to take the hackathon as an opportunity to learn profiling, because it was new and cool: let's see how this works. And they didn't really know what to profile. So they just looked at our Kubernetes clusters across all the different cloud providers and thought: Promtail is one of the things which we run the most.
Let's look at Promtail. Talking to a few Loki engineers who maintain Promtail, they dimly remembered using string-handling functions which were super inefficient, but it hadn't mattered, because it wasn't supposed to be in the hot path; it was only on the side. They hadn't had the time to optimize it and had just moved on. Well, we actually used that code a lot; it was in the hot path a lot. So this one single week of a hackathon, just giving someone the time to explore on their own how to optimize stuff, led to savings of five figures per month. I'm not telling you the exact figure, but it was substantial. And on driving culture change through recognition: that hackathon entry was one of the finalists, and we made it part of what was then the cloud fortnightly and is now the engineering weekly, doesn't matter. It was showcased. So even if someone missed the hackathon results during the all-hands, this was showcased within the engineering org again and again as an example of good work being done, to drive this behavior throughout the org. Next: check your defaults. Another team, which shall remain nameless, deployed a new workload. I see people laughing; they remember this. Other teams needed to model their workloads on the workload of this team, so they just copied the template. Unbeknownst to most people, for testing reasons the initial workload had been given extremely high IO guarantees. For those who don't know: at your cloud provider you can say how much guaranteed IO you want for an instance, and you get this as reserved resources. IO is expensive, so these reservations are really expensive. Long story short, for a while we had super-high IO guarantees on pretty much every single thing we ran across our infrastructure.
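The Promtail fix itself was in Go and the talk doesn't name the exact functions, so here is only a generic illustration of the same class of problem: string building that looks linear but copies the accumulated string on every step, plus the kind of profiler run that surfaces it. Everything here is a hypothetical stand-in:

```python
import cProfile
import io
import pstats

def build_line_naive(fields):
    # Repeated concatenation copies the accumulated string each time,
    # turning what looks like O(n) into O(n^2) once it lands on a hot path.
    line = ""
    for f in fields:
        line = line + f + " "
    return line.rstrip()

def build_line_fast(fields):
    # One pass, one final allocation.
    return " ".join(fields)

def top_functions(fn, arg, count=5):
    """Run fn under cProfile and return the stats report as text: the
    'where is the time actually going?' question the hackathon started from."""
    pr = cProfile.Profile()
    pr.enable()
    fn(arg)
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(count)
    return buf.getvalue()

fields = ["field"] * 10_000
# Both produce identical output; only the cost profile differs.
assert build_line_naive(fields) == build_line_fast(fields)
```

The broader lesson matches the story: the inefficiency was known and tolerated; profiling the deployed system, not the code review, revealed it was on the hot path after all.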
But until someone actually looked at where the money was going, and broke it out by the different aspects of how the cost is structured, no one realized this was there. A substantial chunk, an embarrassingly large amount of money per month, went into this thing, and we literally did not use it, we literally didn't need it. But we needed someone to look in order to find it, because it was just an honest mistake. And it was trivial to fix: change the template, rebuild everything, done; a matter of hours. Next, and I wanted to have a nice picture of a dog here for the spot-instance slide: most of you will have heard of chaos testing. Chaos testing can be hard, it can be expensive to implement, and it can be hard to get buy-in from upper management to allow it at all. But you can also get your chaos testing completely for free; you can actually get it in a cash-positive way, by interrupting your normal services on purpose, randomly, through the cloud provider, because you pay them less for that. So one thing we have started doing, for workloads where this is defensible, is putting part of the workload on interruptible instances, preemptible instances, spot instances, whatever the name is with your provider. For those who don't know: this means the cloud provider can switch off this machine, this container, this whatever, at any point in time. I thought even without telling us beforehand, but, okay, a one-minute warning; we have the experts here. So you get a one-minute warning, and then everything is gone. And we deliberately use this: we have a portion of our services on instances ready to be driven against the wall by the cloud provider at any point in time, just to save money.
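Surviving that warning window comes down to draining gracefully when the notice arrives. A minimal sketch, assuming the notice reaches the process as SIGTERM (as it typically does under Kubernetes; some providers also expose an instance-metadata endpoint you can poll instead). This is not Grafana's actual handling, just the shape of it:

```python
import signal
import threading

# Flag flipped when the provider tells us the instance is going away.
shutdown = threading.Event()

def handle_preemption(signum, frame):
    """Treat the interruption notice as a request to drain gracefully,
    so in-flight state can be flushed inside the notice window."""
    shutdown.set()

signal.signal(signal.SIGTERM, handle_preemption)

def drain(work_items):
    """Process work until done, or until the preemption flag is raised.
    Anything not finished gets re-queued elsewhere (not shown here)."""
    done = []
    for item in work_items:
        if shutdown.is_set():
            break
        done.append(item * 2)  # stand-in for one unit of real work
    return done
```

The design point is that nothing here is spot-specific: the same drain path serves ordinary rollouts and node maintenance, which is why the talk can describe spot usage as free chaos testing rather than extra machinery.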
And yes, you have to scale up on demand at the back to cushion for this, and yes, there are scenarios where, if there is resource contention, you're in a hole and cannot actually get all of your resources back, because they have been preempted away from you. So unless you have proper SLOs and proper SLO budgets, you cannot do this kind of thing. We do, and we are pretty good with ours, so we have this budget, and we can start investing part of this SLO budget into cost savings. This again drives a lot of cultural change, because all of a sudden you can have conversations with finance about why you should be having SLOs: because this will save you money. Those kinds of conversations, this kind of holistic win-win-win across the organization, are a good example of how you drive a change of thought throughout your company, by telling people: hey, there is something beyond this which you really, deeply want. And currently almost everyone is looking to cut cost, for obvious reasons. So this again is a good example of where you can push towards proper operating principles through something completely unrelated at first glance. So, yeah, all of this takes a village. If you don't have hackathons, you probably can't just implement them at your company overnight, but you can at least take these slides as external validation of why this might make sense within your company as well. Give it a try, maybe even scoped to cost, just to give people who have good ideas the time to actually try those ideas and implement something.
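The "invest part of the SLO budget into cost savings" argument is just arithmetic on the error budget, which makes it an easy conversation to have with finance. A sketch with hypothetical numbers:

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability in the window for a given SLO.
    E.g. a 99.9% SLO over 30 days allows (1 - 0.999) * 43200 minutes."""
    return (1.0 - slo) * window_days * 24 * 60

budget = error_budget_minutes(0.999)  # ~43.2 minutes per 30 days

# Hypothetical: measured burn from all causes, including spot
# preemptions, is 12 minutes this window. The remaining ~31 minutes
# is headroom that can deliberately be spent on cheaper instances.
remaining = budget - 12.0
```

If measured burn routinely approaches the budget, the same arithmetic says to pull workloads back off spot; the SLO is the control loop that keeps the cost saving safe.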
Also, as you saw, I work at Grafana, and I do believe that shared reports, shared dashboards, and so on build a shared understanding, a shared language, a shared mental model of your thing, whatever your thing is, across the whole company; otherwise I wouldn't be working here. The point is, as with everything else in operations and observability: once you start building a shared understanding, you start having engineers who can have an informed discussion about cost trade-offs just as part of their normal design-doc process, because they already have all this information at hand. We have a platform team which rolls out a secret number of new regions per year and basically provides a baseline, and all the software engineering teams build on top of the platform; not a huge surprise these days. And we also have a platform cost team. This is not something you do on the side. You do it in a hackathon? Okay, fine, and maybe that means you get hired into the platform cost team, but you can't do this consistently on the side. This needs focus, you need to be able to focus on it for the long term, and it needs good engineers. You can't just put a few junior engineers in a room and say: okay, figure this stuff out. You need to invest proper resources and proper engineers into this, for the long term. Again, use these slides; talk to your managers about this kind of thing, citing these slides. Also, as a more or less impromptu idea: Juanjo just opened a Slack channel, so if you want to discuss more of this, on the Grafana Slack there is now a cost-observability channel; feel free.
Speaking of Mark and Juanjo: there's going to be another talk on this topic, going much more in depth on the actual queries which we run, and on how we think about the matrix of what actually makes up your spend: the rate of your spend and the amount you're using, with the product of the two being the thing you actually need to optimize, and how to optimize both of those orthogonal vectors to get the lowest total spend. I'm making a mess of this, but maybe you get what I mean. So, anyway, I've given you enough babbling time to take a picture of this, and that's it. Now, putting my MC hat on: I don't actually know when the next talk is; we can probably optimize a little for the in-betweens, for questions. 10:25? So we have 10 minutes, so we can take five minutes for questions; there's a microphone if anyone has questions, or we just get a 10-minute break, whichever you prefer. Thank you. Going once. I mean, this is more scary to me. We don't have a runner, so come up here, or just yell and I'll repeat. So the question was whether there was a moment in time when we realized that this is something which a team needs to carry, instead of just having it on the side. Yes, there was certainly a moment; I don't know the precise moment, but I think it was pretty early. One of the things we have internally: yes, we have a lot of people who are able to do a lot of things at the same time, but we also realized that this is not ideal. One of the points of the hackathon process is to identify ideas, and a few of the cost-saving things which you've seen here, and which you will be seeing in the future, came from hackathons.
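The rate-times-usage decomposition the speaker gestures at is worth writing down, because it is what makes the two optimization axes visibly orthogonal. A tiny sketch with made-up numbers:

```python
def spend(unit_rate, usage):
    """Total spend is the product of the rate you pay per unit
    and the number of units you consume."""
    return unit_rate * usage

base = spend(0.10, 1_000_000)  # e.g. 0.10 per unit, one million units

# Cutting either axis by 20% cuts spend by 20%. Cutting both
# compounds multiplicatively: 0.8 * 0.8 = 0.64 of the original bill.
cheaper_rate = spend(0.10 * 0.8, 1_000_000)
less_usage = spend(0.10, 1_000_000 * 0.8)
both = spend(0.10 * 0.8, 1_000_000 * 0.8)
```

Negotiating discounts and moving to spot attack the rate; efficiency work like the Promtail fix attacks usage, and the two teams' efforts multiply rather than overlap.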
So we deliberately let those short flashes of motivation and innovation bubble up, but if and when we are in something for the long run, we identify an owner. In this case it was pretty clear that we didn't have anyone dedicated to this, so I think we started by hiring Juanjo for it and then built a team from there, because we realized it's not a one-person thing. Anyone else? Yeah. So, on the topic of cost optimizations: I saw that you mentioned profiling, and I'm curious. I imagine that, to get a good view of where you can save money, you'd want to profile probably everything. Is that what you do at Grafana, or do you profile just some workloads? Can you talk a little bit about that? The classic engineer answer: it depends. Yes, we have continuous profiling, and the intention is to be able to profile everything at all times, but not to always do it, because it's a cost trade-off of how much you invest. The thing you should get started on is literally to look at your Kubernetes clusters, or whatever you have: what are your heavy hitters, what do you run the most? Start looking at those few things, and once you have established some internal experience, some internal success, which also makes it easier to defend this to your manager and so on, go from there and grow into wherever you want to grow. But start with the heavy hitters, because that's where you have the highest likelihood of saving substantial amounts of money. Makes a lot of sense, thank you. So we probably have, yeah, just yell again. Sorry, can you repeat? So you're talking about the customer? Okay, so the question was about provisioning for customers, and whether this should be automatic and also downsized automatically. Yes and no. In the generic case, yes. I was careful during the talk, but probably glossed over it.
There are certain situations where we have contractual obligations to provide X amount of resources for a specific customer, because that's part of what they pay for. And in those cases, even though we have a contractual obligation to provide X amount of resources or ingest or whatever has been defined in the contract, if we see a drop, we still have this conversation with the customer. The worst thing that can happen is they say: no, this is contracted, I want to maintain it. The best thing that can happen is their usage goes down a little, we scale down a little, and maybe they get a discount or the renewal gets cheaper. So yes, it's a special case. But also, to be very blunt: while we have the ability to autoscale both up and down in many regards, I don't think that is universally true elsewhere. For the vast majority of people in this room, it's probably the case that they need to do this manually anyway. Yes. So the question is whether the weekly cost email is automated or manual. It's absolutely, 100% automated, and this is essential to the usefulness of this tool. There's this parenting tip: it's not you saying "you have to go to bed", it's the clock saying you have to go to bed. With SLOs it's very similar: it's not me saying your service level is shit, it's the clock saying your service level is shit. And the same for cost being too high. You take the human component out: this thing gets sent no matter how good or bad the state of the world is. It just happens, every week. I have whole talk tracks around the social dynamics of how you drive change with this, but the short version is: it must be fully automated, and it must be removed from any human editorializing, to enforce this cadence of: this just gets sent out to everyone, blameless, but relentless. I think the back one was first.
So the question was: this is easier to do if you have good service boundaries, good team boundaries, and good ownership boundaries; how can you do it when you don't have them? Again, I have a whole talk track around this topic. The short version: once you start producing numbers and sending them out to everyone, in particular to the people who control the money, you're not going to stop sending that report, because those people want to keep having it. So you completely flip the conversation: you think this is wrong, you think this is wrongly attributed to you, you think you're actually better than what's shown over there? Great, let's work it out, let's break out your part of this. You basically put the pressure of making those determinations and creating proper boundaries onto the wider group. And again, through this mechanism of the clock saying "this larger group is bad", you automatically drive the dynamic of everyone wanting to either improve or get out of that group. I think we can do one last one, then that's probably it. You already had one, I think. Hey, yeah, this microphone is very intimidating, by the way. Yeah, we'll probably have a runner for the next one. So, I had a question about using Prometheus for storing your cost data, and I was curious how you deal with extrapolation on increase and rate. Do you just live with it, or use gauges, or something like that? I don't know if we have counters or gauges; what do we have? Okay, it's counters. So: we deal with it. Over the long term it doesn't really matter anyway. With my Prometheus-maintainer hat on: over 30 days, it honestly doesn't matter too much. Yes, it's not going to be mathematically precisely correct; if you do the math properly in something like R or SciPy, yes, it's going to be more exact, but this is more than enough for what you need. Thank you. Thank you.
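To see why the extrapolation error is negligible over a 30-day window, here is a simplified model of what Prometheus's `increase()` does (the real implementation also limits extrapolation near the window edges and handles counter resets, which this sketch omits):

```python
def naive_increase(samples):
    """Counter delta between the first and last observed sample.
    samples: list of (timestamp_seconds, value) tuples."""
    return samples[-1][1] - samples[0][1]

def extrapolated_increase(samples, window_seconds):
    """Simplified sketch of increase(): scale the observed delta up
    from the sampled span to the full query window."""
    span = samples[-1][0] - samples[0][0]
    return naive_increase(samples) * window_seconds / span

# 30-day window whose first and last samples sit 30 seconds inside
# each edge: the extrapolation adjusts the result by about 0.002%.
month = 30 * 24 * 3600
samples = [(30.0, 0.0), (month - 30.0, 100.0)]
est = extrapolated_increase(samples, month)
```

The correction factor is `window / span`, and with scrape intervals of seconds against a window of weeks, that ratio is within a fraction of a percent of 1, which is the speaker's point: for a cost report it simply doesn't matter.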
Also, if you stand up, you can also go find those two.