Thanks for coming. This is "Telemetry Is Not Having to Hope." It'll be a little more informal, but hopefully fun nonetheless. And double thanks to all those who made it past the dramatic title.

So this is telemetry, at least it looks like it to me. And it can be pretty complex. This is actually an example of some stuff that I use day to day. And don't worry, by the end of this talk we're gonna make sure that we understand every bit of it. We're gonna start right here in the middle. That was a joke. This talk is not about being an expert. It's actually about not being one.

And so we start way, way back, when I was very fresh on the computer science scene. And this is me, in the way that exactly four bullet points can describe. As I said, our story starts with me fresh out of a coding bootcamp at my first job, a startup whose employees you could count on your fingers. We had just launched a B2C app, our first, actually. This was very exciting for us. One of the things that we had was the ability to log in with other providers — think "log in with Facebook," Google, et cetera.

Now, it's important to note that at that time I didn't have access to the degree of telemetry and monitoring that I do today. It was really just system logs, which were centralized, plus whatever stock EC2 metrics we got out of AWS. And so what's the first thing that you do? I still do this to this day: we deploy, we go live, and I test it out. Ultimately there's no real substitute for that visceral feeling of knowing that something works, and that still holds true today.

But so, great, that's awesome, let's head out. There's a beer garden down the way and we wanna kick off early. We're about halfway there when we start to get reports that people can't log in. And oh no, right? My heart sinks like a stone. I am very worried, so we all get out our phones and, again, let's try to log in. We log in.
That works. Huh. So what do we do now? For me, this is actually the point in the story where I made the wrong choice: I doubled down. I can see now how bad a decision that was, but it's something I only understood in retrospect. I did it because I felt like I couldn't afford to be wrong. I felt very much out of place in the industry at my first developer job — like whatever I was working on had to work. That's probably a topic for another talk, but it's okay to feel that way.

Coming back to it, for everyone who's curious who the culprit actually was: it was clock drift. I think everyone has this experience at some point or another. It's when two machines can't agree on what time it is, and if they're part of the same system, that can cause pieces to fall apart in ways you don't expect. In our story, what actually happened was that it fell apart in a way where a portion of users couldn't log in. It was a relatively small portion, but problematic nonetheless. And really interestingly, this kind of thing gets right by the kind of testing that we were using. Realistically, we could have logged in a bunch of times and never seen it — and we didn't see it. That probabilistic nature made it a really, really good match for getting past our defenses.

So we're back here now. This is still telemetry. But how do we make it helpful and intuitive — helpful beyond when your investor's walking by and you wanna point at something mounted on the wall? No, helpful in a way that it becomes the first thing you reach for to understand your systems. To really get there, we have to talk about telemetry as something that goes way beyond successes and errors. It can be abstract and still be helpful and intuitive. So we're gonna go over just a couple of examples that I wish I'd had, and try to relate them and set that up for the rest of this exercise.
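To make that failure mode concrete, here's a minimal sketch in plain Python — all names and numbers are hypothetical, not the actual stack from the story — of why a login token issued by a server with a slightly fast clock can look like it comes from the future and get rejected, and how a skew leeway masks small drift:

```python
import time

# Hypothetical sketch: validating a session token whose "issued at"
# timestamp comes from another machine. Without leeway, a few seconds
# of clock drift between servers rejects perfectly valid logins.
CLOCK_SKEW_LEEWAY = 30  # seconds of drift we tolerate (made-up value)

def token_is_valid(issued_at, ttl, now=None):
    """Accept tokens within [issued_at - leeway, issued_at + ttl + leeway]."""
    now = time.time() if now is None else now
    if now < issued_at - CLOCK_SKEW_LEEWAY:
        return False  # token "from the future": issuer's clock is far ahead
    return now <= issued_at + ttl + CLOCK_SKEW_LEEWAY

# A server whose clock runs 10 seconds ahead issues a token;
# with leeway, ours still accepts it.
print(token_is_valid(issued_at=time.time() + 10, ttl=300))  # True
```

Only users routed to a drifted issuer hit the rejection path, which is exactly why spot-checking a login a few times never reproduced it.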
The first one here is something we're all familiar with, whether we know it or not: latencies. The difference between a website loading like this and a website loading like this. Everyone wants to be part of, to use, to build the former, and hates using the latter. And that makes sense.

The next bit I wanna talk about is what I call forks in the road. It's whenever your application needs to make a decision to do one thing and not another. Do you have an item on hand, or do you need to go back-order it? Check if it's in inventory — that sort of thing. All of these decisions your application makes, the timing information, the frequency: they help you cross-reference and correlate and ultimately build a story. It's a sort of causality. This is how I understand that I'm a pain to be around whenever I'm hungry.

And the last one here is back pressure. How long is the line to the bathroom? You know, I'm at a conference right now and there's a pretty boring talk on — the line's probably zero, now's my time. I'll take that chance, y'all.

So these are a few examples that I kind of wish I'd had at that point. But why are we talking about this now? To be fair, PromCon is probably an aspect of that. But it's also the burgeoning complexity in every sphere, kind of assaulting us on all sides. Ultimately, everything is consumable these days. We have hobbyist sites that are highly available and distributed via content delivery networks across the globe. All these cloud platforms that we run our software on: sure, everyone has compute for sale, but now everyone also has all of these other provisionable building blocks of distributed systems — databases, queues, cloud functions which can execute thousands, millions of times concurrently. And that's just our cloud vendors. I haven't even gotten to the API economy.
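Here's a hedged sketch of those three signals for a toy order handler, in plain Python with hand-rolled structures — in a real app these would be a Prometheus Histogram (latency), a labeled Counter (fork in the road), and a Gauge (back pressure). The service and probabilities are made up for illustration:

```python
import random
import time
from collections import Counter

latencies_ms = []       # latency: how long each request took
fork_taken = Counter()  # fork in the road: which branch did we take?
backorder_queue = []    # back pressure: how long is the line?

def handle_order(item_in_stock):
    start = time.perf_counter()
    if item_in_stock:
        fork_taken["shipped_from_stock"] += 1
    else:
        fork_taken["backordered"] += 1
        backorder_queue.append("order")  # queue depth is our back-pressure signal
    latencies_ms.append((time.perf_counter() - start) * 1000)

# Simulate 100 orders where ~80% of items are on hand.
for _ in range(100):
    handle_order(item_in_stock=random.random() < 0.8)

print(dict(fork_taken), "queue depth:", len(backorder_queue))
```

The point is that none of these are "successes and errors": every order here succeeds, yet the branch counts, timings, and queue depth still tell you a story about what the system is doing.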
Think of all the SaaS companies that are consumable in a programmatic fashion. This becomes our authentication services, all of these integrations — even, in the case of my employer, telemetry and monitoring. This choice is empowering and incredibly freeing. But the cost of having everything at our fingertips is that we depend on everything. As an example, I have two 99%-available legs, and the chance that I can walk — that they're both functioning at the same time — is less than that: roughly 0.99 × 0.99, about 98%. So complexity is the reason. It enables us, but we have to pay part of that back as the cost of understanding it. And by the way, no shade at Python here. I just love this example.

And so open source is a really, really good fit. It handles this complexity very well. And don't worry, I'm not gonna get philosophical about open source here, at least not too much — I see how far the drop is from my soapbox. I'll try to stay pragmatic.

The first thing that I see here is the cost of learning. It's really, really low. Generally you're not paying, at least in a monetary sense, for sign-up fees or memberships or additional headcount; you pay a different kind of cost. And this extends through to the design: when you start with a new tool, you're probably not bootstrapping a SaaS account somewhere, you're probably running it on your own machine. It forces us to understand some degree of complexity. And that sounds wrong, doesn't it? Isn't the whole point that we abstract everything away so we don't have to deal with it? I'm not sure. I think some degree of complexity is important for us to understand. It makes sure that, at least at a high level, we're agreeing to the paradigms, the choices we're making for the new tools we adopt. And this extends into our second point, which for me is definitely due diligence: it makes things like smoke and mirrors a lot harder to hide.
You can go look at the source. And then finally, transferability — the ability to take what you learn with you. We've shown that you can pick up open source; it's a lower barrier. You can pick it up, you can learn it, you can get hired to do it, and then you can find somewhere else to do it too. You can take these things with you. And that's incredibly valuable, and not just for the individuals, for the software engineers amongst us — if you're an organization, it also allows you to hire more easily.

But okay, this is the one philosophical bit I'm gonna get into here. Has anyone built internal tools at a company before and loved them? Or you worked at a company and used an internal tool that you thought was fascinating, and then you changed organizations and you can't use it anymore. That is incredibly debilitating. Open source allows whatever you're building to persist, to see the benefits transfer across all these organizational lines. And that really lifts all of us up.

So all of these effects of open source compound into a really low barrier to entry. This makes new ideas more feasible, which is super valuable. For me, at a really small company initially, this wasn't Prometheus, but it was always open source. The idea is that we can use this to build on top of each other, and we can start building on ideas and not dependencies. It allows us to prototype, to have new garage projects that can start at such a higher level comparatively.

Before we go on to the next slide: when I was putting this one together — you can see Prometheus is at the top, and then there's a couple of placeholder ones in there — I actually had real examples in there. And then I had this epiphany that that would be the worst idea, that I'd get nerd-sniped on stage, because my choices are not the same as everyone else's. But anyways, Prometheus is pretty much at the top there. I think it's a safe bet at PromCon. I'll deal with the fallout.
And I think Prometheus hits a lot of these points expertly. It's incredibly simple to start with: you run the Prometheus server and it'll watch your infrastructure for you. And okay, what do I really have to do to keep setting that up? You can get some immediate benefits using exporters — that's generally the term for how we get data into Prometheus. It'll be a bit of code that runs somewhere and exposes something. I left a couple of examples here, which are some of my favorites, but there are too many to count, official and unofficial variants. So you can pick and choose, and start to use these more and more until you're actually comfortable and you're seeing a lot of the value from this. What I guess that means is that Prometheus wears complexity very well.

So now we're at the point where, say, we have half a dozen exporters. We're monitoring a lot of our infrastructure this way. We're really seeing the benefits of it. But wouldn't it be great if we had some of those abstract metrics from earlier, things that really expose the domain-specific logic of our applications? What if we had an exporter for our own apps? And you can do that. Go instrument a little code — write a little more code alongside the code you already write, to tell yourself the things that you want to know. And it's pretty easy.

Then you can finally bring all of these things back together and you can alert on them, let yourself know if a condition is or is not met. And that's very powerful, right? The ability to draw a line in the sand somewhere and say, past this point, I need to know something's wrong. But we can also invert that idea: if we know a point past which something is wrong, past which we need to be alerted, we can also say that everything before that point is technically still okay, right? That's good enough.
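To show how little magic an exporter is, here's a minimal hand-rolled sketch — metric and label names are hypothetical, and a real app should use an official Prometheus client library rather than this. Prometheus scrapes a plain-text `/metrics` endpoint, so a toy exporter is just an HTTP handler that writes name/value lines:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy application state we want to expose (hypothetical metric).
logins_total = {"ok": 0, "failed": 0}

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Emit one line per label combination in the text exposition format.
        body = "".join(
            f'app_logins_total{{status="{s}"}} {v}\n'
            for s, v in logins_total.items()
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

# Simulate some traffic, then scrape ourselves the way Prometheus would.
logins_total["ok"] += 3
logins_total["failed"] += 1
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/metrics") as r:
    text = r.read().decode()
server.shutdown()
print(text)
```

The client libraries add the bookkeeping (registries, histograms, thread safety), but the wire contract really is this simple, which is a big part of why instrumenting your own apps is such a low bar.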
So what I'm trying to get at here is the idea that using telemetry only for disasters, only for remediation, only for figuring out what's wrong when I have an outage — that leaves the majority of the benefits of telemetry on the table. This is one of my favorite parts about it: the ability to use monitoring proactively. Not just to use it when things are wrong, but to raise the bar when things are right. It allows us to be really effective with the limited time that we have to work on our applications.

In this example, this is a method that I'll use pretty often. I'll have some idea: oh, wouldn't it be interesting to know X, Y, or Z about my application that I don't know now? I think it might behave weirdly in this scenario, so I can go add instrumentation. And when I do that, it starts to trickle in. I start to be able to see it, to answer these questions, to understand whether I'm even asking the right questions. It doesn't always work out, right? Sometimes we go back to the drawing board, and that's fine. Sometimes it doesn't pan out — we accept that or roll back, no worries. We're learning something either way, and that's really, really valuable.

But I realize that only in conference talks are all changes purely beneficial like that. A lot of the time we have to make choices that are a little harder, a little more complex, but you can still use this methodology. So what if we have one part of our application, one component, and we decide to run more of them — a higher degree of parallelism? Well, maybe that allows us to process more work, to increase the size of that funnel proportionally. But maybe it also just pushes the bottleneck to the next stage of our system. Or what if we didn't even need to go that fast? What if the cost — the monetary cost of having more nodes, more replicas — outweighs the speed benefits that we get?
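A back-of-the-envelope sketch of that "did we just move the bottleneck?" question, with entirely made-up rates: end-to-end throughput is the minimum of the stage capacities, so adding stage-1 replicas stops paying off the moment stage 2 becomes the constraint.

```python
# Toy pipeline model (illustrative numbers, not a real benchmark):
# stage 1 scales with worker count, stage 2 has a fixed capacity,
# and end-to-end throughput is capped by the slower stage.
def pipeline_throughput(stage1_workers, per_worker_rate, stage2_capacity):
    stage1_rate = stage1_workers * per_worker_rate
    return min(stage1_rate, stage2_capacity)

for workers in (1, 2, 4, 8):
    rate = pipeline_throughput(workers, per_worker_rate=50.0, stage2_capacity=120.0)
    print(f"{workers} workers -> {rate:.0f} items/s")
# Past ~3 workers the answer stops changing: stage 2 is now the bottleneck,
# and every extra replica is pure cost with no speed benefit.
```

With per-stage metrics in place you don't have to model this at all — you can watch the stage rates and queue depths move as you scale, which is exactly the instrument-then-decide loop described above.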
We're still under SLO, so is it really worth it? You can actually answer these questions. Another one — this is kind of the canonical example I'll use for this — is when we were evaluating different compression algorithms. Say we change from algorithm A to algorithm B, and algorithm B is faster to compress and decompress. That means our writes and our reads are faster. That's great, right? Who wouldn't sign up for that? But maybe we're paying costs in different ways. It turns out the compression ratio is worse, so now we use more space on disk. And guess what we're paying for? Disk. It's kind of a complex dance, but you can instrument these things to be able to answer the question — to sometimes transcend just the engineering aspect and understand the business case of whether this is the right change or not.

Ultimately, I think it's this iterative, monitoring-driven approach — not just towards remediation, but towards development — that really gives us a lot of benefit here. And I find myself increasingly using telemetry for this, as opposed to debugging problem scenarios. Instrument first, ask questions later, I guess.

So thank you for coming to my talk, everyone. Obviously I have a lot of good things to say about Prometheus. It's definitely empowered me and changed the way I think about and develop software. But I'd love to hear any questions that you have, or anything that you're using it for. Come grab me in the hall, and I'll take questions now. I'll take a pity clap too, you know — that would work. Yeah. Thank you.