Thanks for the warm welcome, I appreciate it. Really excited to be here. As I said, my name's Keith. I'm one of these computer people, and today I get the pleasure of describing a framework that I helped develop at Slack to quantify end-to-end product health. What you can expect from my presentation today is an effort to simplify quantifying quality to the maximal possible degree. We're gonna come out with one number: bigger numbers are better, smaller numbers are worse. That's a huge simplification, of course, but that's the task we set for ourselves here. There's gonna be some actual math involved. I promise it's not gonna be bad; I'm not a mathematically sophisticated computer person, I'm one of those people that smashes things together until they work. We'll just be adding things and dividing them and stuff, but I know that math traumatizes some people, so we'll go nice and slow. I also wanna emphasize that these are relatively untested ideas. We applied this to our work at Slack with some success, and since then I've applied it at a couple of other companies I've advised, but it's new enough that I'd love to hear from other people who apply it, whether successfully or unsuccessfully, so please get in touch if you end up taking any of this seriously.

Here's a little bit about me, in terms of why you might care about my opinions on these things. Prior to sort of falling out of honest work and ending up in venture capital, I did a bunch of computer stuff. You use my code. I was one of the first employees at VMware, where I spent nine years. I was relatively early at Facebook, where I helped found Facebook AI Research, and I was chief architect at Slack. This framework comes out of my work at Slack, where we had a quality crisis that we'll be discussing in a second. Since then I've been working with startups on the venture side of things: with early-stage startups in my role as a general partner at Pebblebed, and with growth-stage startups, where this talk is probably more directly applicable, on the tech advisory board at Iconiq Growth. It's a bit of a post-product-market-fit set of problems we're discussing.

So whichever side of that barrier you're on: if you're pre-product-market-fit, it probably looms really large in your imagination. And if you're post-product-market-fit, amazing: you've made something that people actually care about and pay you money for and rely upon, and you've probably scrapped for years to make that happen. Whichever side of that fence you find yourself on right this moment, it turns out there's a bunch of dragons waiting for you on the other side. You have all these great things going for you that you weren't sure were ever gonna happen. People actually pay you money. You have a functional engineering organization that actually delivers things from time to time; maybe you wish they did it faster, but they're delivering. You have load: you're actually thinking about running out of capacity sometimes now, because people are using the thing that you built. And on the other side of that, there are all these problems that didn't seem to be problems before. You think of yourself and your team as a pretty good bunch of engineers, so you expect that you build things that are fast, and your experience is that the things are actually fast. When you play with it, it works pretty well. When you play with it, it doesn't crash.
But then you look on Twitter and people are saying this thing is slow, or it uses gigabytes of RAM on my laptop, why is that? Or it crashes for me, and so on. And this is a pattern I've seen at a bunch of companies, both ones I've worked at and ones I've advised, and it's a form of a trap. Before product market fit, when you're just building new products, your intuitive notion of quality suffices. You can basically just use your own gut, your own instinct; you play with it. If there are other users, they're probably a lot like you, because they're probably drawn from your social network. They might actually be in touch with you socially; they might write you an email or something if you break things badly enough. So you don't have to think about what's good or what's bad, you just know what's good. It's a psychological experience; your mind does it for you. All of these wonderful things happen as you make the transition to product market fit, but one of the things that's a little bit scary, and that isn't initially perceptible, is that the quality of your users' experience gets decoupled from the quality of your experience. So now it's possible, and in fact guaranteed, that there's a class of folks out there who are never gonna talk to you and who are having very different experiences of your product than the experience you have when you play with it. And this doesn't happen all at once; it's a slow process. As that gap grows larger and larger, as you drift further and further apart from your customers' experiences, you're dying the death of a thousand cuts quality-wise, because you're not seeing the problems.

So the motivation for the very radical simplification I'm about to undertake here, where I turn this very nonlinear, very human, very subjective, reasonable-people-can-disagree notion of what good is into one single objective number, is to try and unstick ourselves from this trap and provide some kind of directional North Star to motivate motion. And there's a thing to get about these problems that isn't initially visible, and that's a little bit of a threat to our ego if we built this thing from scratch. Part of our view of ourselves as builders is that we're really competent at what we do; we have a lot of pride in what we do. Everybody wants to work on things that are high quality, everybody wants to deliver things that are amazing to use. And the reality of this trap you're in is that those problems were always there. You just couldn't see them, because you weren't using the product in the patterns your users are using it in, on the devices they're using it on, on the networks they're using it on, et cetera. Another way of looking at this trap, by the way, is through an applied-psychology lens: you're deprived of feedback. When psychologists go out and ask how people learn, one of the important ingredients for successful learning is timely feedback: you do something, you find out whether it made things better or worse. When you don't have users, you get timely feedback from your own nervous system. Once you bring users into the picture, you completely open that loop up; you don't actually learn anything about what users are doing organically. So this is an effort to bring that feedback back.
And having built this up enough, the framework has a catchy name, to me at least: SPQR. I was a Latin geek in high school, so no, this is not Senatus Populusque Romanus, the Senate and the People of Rome. It's gonna end up standing for security, performance, quality, and reliability; we'll see why in a second. We want this thing to be our source of feedback about quality, speaking for all of the mute masses of customers out there who are never gonna bother to tell us why they churned, who are never gonna bother to tell us why they never used our product in the first place. Our goals for SPQR are that it's timely, meaning we find out about problems before users do; and that it's able to detect sub-perceptual quality changes, changes that might be relatively subtle, not the kind of thing you'd necessarily tweet about or have to change your planning for the quarter over. We'd also like it to be something you can compare across relatively long time horizons. That's sort of a strange desire, but imagine that you've spent a quarter deprioritizing feature work because you decided to spend a lot of energy on performance and reliability. Well, how did that work out? Were you successful? How successful? And so on, across spans of years in some cases. And if this thing's working right, we can actually use it to decide what project to do next. We can sort of bake off: okay, this project has this much of this number on offer, that project has more of it on offer, so we're gonna go after that one, and so on.

And this feels kind of impossible at first, because we're talking about a psychological event, your feeling of quality while using something. That sounds like something it doesn't make sense to quantify, and in many cases it doesn't, necessarily. And when people get this impulse to quantify quality, which is not a new impulse, that's usually when the dashboards come out. So the dashboards come out and you say, well, okay, we need a robust view of this. Let's track these 16 performance metrics: P99 of this API call, memory utilization of that thing in the API, this proxy for client performance, the mean time to complete this action. And let's get five different notions of reliability: we'll track uptime really closely over here, we'll check the rate of 500s over there, and we'll measure client crashes here. So now I've got all these nice signals about reliability and so on. And you do need to collect this data; it turns out to be the input to what I'm gonna propose. But it makes all these things almost impossible to compare. A week elapses: performance indicators two and three are up, performance indicators one and five are down, and performance indicator four didn't change. Was that a good week or not? Did the performance work we did make a difference or not?

The unifying concept here is gonna end up being a psychological one. And this is just a term I made up, by the way, so don't use "unacceptable experience" as if it were a technical term, or as if other people are gonna have any idea what you're saying when you say it. But an unacceptable experience is ultimately the event we're trying to count in this framework.
Those are the events where the user considers a different way of solving her problem. And this is an incredibly general framing; it goes beyond software or computers or whatever. If you're a car manufacturer, customers of your car have unacceptable experiences too. Concretely, drawing an example from Slack: one of the core experiences of Slack is that I send you a message and you read it. If that takes more than, let's say, 10 seconds, the thought might cross my mind: maybe I should just use SMS, or maybe I should just use WhatsApp or something. So we're talking about that cognitive event where somebody considers switching off your platform to something else. Since this is a psychological event, and since neural computer interfaces are still in their infancy, we can't directly measure it. But the things we know about why people use software, or why they don't, mean that there are proxies for it that we can measure.

There's gonna be a little bit of math on the next slide here, so if anybody is math-phobic, let's prepare ourselves mentally, take a breath together, put our thinking caps on. I promise it's really not that bad. We can't directly measure the unacceptable experiences, because we'd have to reach into the minds of our users, and we're not able to do that very well yet, though, you know, watch this space. The question is, can we still work with this? I think the answer ends up being a qualified yes. For the purposes of this framework, at least, we're going to treat the only kind of good experience as the experience where nothing bad happens, where all the bad things fail to materialize. A bad thing can happen because an operation takes too long, a bad thing can happen because my client crashes, a bad thing can happen because the operation I was trying to do failed or hit some sort of broken behavior; there's some long list of these things. For the sake of discussion, and I'm asking you to suspend disbelief for a moment, let's pretend we can enumerate them all. That's the B sub i: the list of all the problems in our system that a user could possibly hit right now. And as a further belief-suspending assumption, imagine we can actually estimate how frequently each one occurs; that's the p sub i. So the end-to-end health metric I'm proposing here is a probability: the probability that a user has a good interaction with your service or product. That ends up being the probability that nothing bad happens, and that's all this funky thing with the big pi is saying. If you're not used to seeing big products, it's just saying: pretend all the bad events are independent, and then the probability that things go well is the probability that none of those independent bad things happen, the product over all bad events i of one minus p sub i.
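To make that big product concrete, here's a minimal sketch in Python. The event names and probabilities below are purely illustrative; the only assumption is that you've somehow produced an estimated p sub i for each bad event.

```python
import math

def end_to_end_health(failure_probs):
    """Probability that one interaction hits none of the enumerated bad
    events, treating the events as independent: the product of (1 - p_i)."""
    return math.prod(1.0 - p for p in failure_probs)

# Hypothetical per-interaction estimates: slow message send, client crash,
# failed API call. The framework doesn't care what the events are, only
# that you can list them and estimate how often they bite.
print(end_to_end_health([0.02, 0.001, 0.005]))  # ~0.974
```

Even a few modest p sub i values drag the product below any individual factor, which is the brutally pessimistic behavior we'll see again when the four categories get multiplied together.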
So, going back to our original questions: can we actually do what we propose and enumerate a finite number of these problems? We kind of can, and our ability to do so is limited partly by the kind of problem we're trying to characterize; we'll get into that in a second. Can we estimate how often they're happening? We can definitely do a better-than-chance job of that, and we'll get into that in a second too. Both the estimate and the list are going to depend on the kind of problem we want to catch, and the zoo of problems I was able to enumerate seems to sort into four big piles. There might be a fifth or sixth pile out there waiting to be discovered; please let me know if you find one. But this seems to capture the bulk of the problems that people want to quantify and prevent.

For the balance of this talk, I'm going to assume that you're on board the observability bandwagon. If you have not done basic instrumentation, if you don't have some kind of log store you can query to answer questions about performance, or how many endpoints are 500-ing per minute, and things like that, please go do that and then come back to this. The vast majority of folks these days have done this already; it's one of the areas of practice that's leveled up a bunch over the last few years, which I'm really happy to see, and there are a lot of great tools out there for it now. You see a lot of people talking about tools and visualizers and things that allow you to ask questions of this store; this talk is about asking a better question of that store, not about your raw ability to do so. So please check with your instrumentation vendor about how to use that very valuable piece of software in the way you'd like to.

We're going to start with the kind of event I have the most confidence in our ability to estimate, which is raw reliability. This is roughly: you try to use the service, and it actually does what the heck it's supposed to do. We can more or less directly measure the successes here; if you're a web service, this would typically mean looking at the backend's ratio of 200-status responses. There's another slightly funny factor in here, which is raw uptime. If you take the whole site down, for instance, if you have a multi-hour outage or something like that, eventually people stop using it, so you can't believe the success counts on their own, because people are out there struggling to connect and failing. So you multiply an uptime factor in there; hopefully uptime is just one most of the time, and every once in a great while you have an outage and have to multiply it in. So this is nothing but a success ratio times uptime.

A slightly more subtle one is performance, which is tragically parenthesized as R on the slide there; it should be a parenthesized P. Performance requires that we select a parameter: we need a notion of events that complete fast enough. So for this one you have to make a kind of editorial choice about what the core thing you do is. I used the example earlier, for Slack, of sending a message and somebody else receiving it. Then you have to make the subjective choice of, well, what is fast enough? And then this is nothing but the rate of fast-enough events. If it sounds a little stressful to imagine selecting that parameter, what exactly is fast enough?, in practice it's not that sensitive. To take the example of Slack: sending you a message, if we're both on good networks, should take hundreds of milliseconds. So whether I say the message is slow after three seconds, or five seconds, or ten seconds, the metric ends up not being very sensitive to that choice, because if it took much more than a couple hundred milliseconds, it's just really slow and something's wrong anyway.
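As a hedged sketch of what those two factors could look like in code, assuming you can pull success counts, request counts, uptime, and per-operation latencies out of your observability store (the function names, parameters, and numbers here are made up for illustration):

```python
def reliability(successes, attempts, uptime=1.0):
    """Raw reliability: the fraction of requests that actually succeeded
    (e.g. 200-status responses), times an uptime factor so a full outage
    isn't hidden by the requests that never arrived at all."""
    return (successes / attempts) * uptime

def performance(latencies_ms, fast_enough_ms=3000):
    """Performance: the fraction of core operations (e.g. message sends)
    that completed within the editorially chosen 'fast enough' threshold."""
    fast = sum(1 for t in latencies_ms if t <= fast_enough_ms)
    return fast / len(latencies_ms)

# Illustrative numbers only.
print(reliability(successes=998_500, attempts=1_000_000, uptime=0.9995))  # ~0.998
print(performance([120, 250, 4800, 310, 9500], fast_enough_ms=3000))      # 0.6
```

In practice these would be queries against the log store rather than in-memory lists, but the arithmetic is the same: count the acceptable events and divide by all of them.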
The quality one's a bit of a mess; I promise we'll get through this. We're getting further and further out on the limb here in terms of our ability to actually estimate occurrence. You'll notice all of these are behavioral: all of them are based on some amount of dynamism, some amount of what's actually happening on the site. I'm defining quality here narrowly as "the software behaves as intended." So a quality problem is a bug: you click the reticulate-splines button, the even splines reticulate, the odd ones don't, you're surprised, you wonder if you should use some other tool. That's the negative event. For this one I'm using a blend. If you have a customer support organization, that's a great signal, but customer support tickets radically under-sample the actual occurrence of these problems. So you pick some big fudge factor, like a hundred or a thousand or something, and say the vast majority of people running into a problem are too busy, or not inconvenienced enough, to actually report it. Still, it's a reasonable, somewhat timely signal about quality problems that are live. And then you want something that takes into account the actual state of the bug backlog as well. The alpha and one-minus-alpha part is an editorial decision about how you wanna weight those two components: it's basically a weighted average between the customer tickets, scaled by some multiplier reflecting how often you think people actually report the problems they run into, and the bug database.

The murkiest of these is security. I feel a little bit bad even talking about security here; when I talk to professionals in the infosec world, they tell me they're still trying to figure out what a good metric for security would be. So this is really just the bug-database approach. And I feel bad about putting security in this framework in another way, because there are some security problems that you want to treat essentially like safety problems. If you're a nuclear engineer, there's a class of incidents at the facilities you're responsible for that should just never happen, and for some kinds of security issues this is not the right treatment. But for things that are survivable, that are mitigatable risks, I think this is an okay way to think about them.

Then you just multiply them all together, and that's your end-to-end health. Since these are probabilities of events we're claiming are independent, that's the right way to combine them. As we'll see in a second, this multiplication piles up really quickly. It's a brutal way to measure your health; it ends up with a very pessimistic impression of how well your service is working. That's okay. We're not really trying to accurately measure how often people are frustrated with your service; we just want something that is, first of all, directionally correct and, secondly, easy enough to measure across time. We built these as a bunch of Presto queries on the data warehouse at Slack and fished the inputs out of existing data sources. You already have a bug database, most likely. You already have an observability stack, most likely. If you're not storing those somewhere you can query them and join the results, this might be a great time to start.
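To show the shape of that blend and the final multiplication, here's a hedged sketch; the formula, the alpha, and the fudge factor are editorial choices rather than anything canonical, and every number below is made up:

```python
def quality(tickets, interactions, backlog_score, alpha=0.5, fudge=100):
    """Hypothetical quality estimate. Support tickets radically under-sample
    real problems, so the ticket rate is inflated by a fudge factor, then
    blended (alpha vs. 1 - alpha) with a score derived from the open bug
    backlog, where 1.0 means a clean backlog and 0.0 means a disastrous one."""
    ticket_term = 1.0 - min(1.0, fudge * tickets / interactions)
    return alpha * ticket_term + (1.0 - alpha) * backlog_score

# End-to-end health is just the product of the four factors.
s, p, q, r = 0.999, 0.98, quality(50, 1_000_000, 0.99), 0.9999
print(q)              # ~0.9925
print(s * p * q * r)  # ~0.972, lower than any single factor
```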
And what that ends up looking like is something like this; let me just try to read it out. The two that are very, very close to one are, let me make sure I'm getting this right, security and reliability; there weren't any massive site outages during this time. The green one that's just moving straight across is quality; at this point it was dominated by the bug-database term, and we weren't moving many things in and out of the bug database. The lower one that's blue is performance, and you can see that in the regime we were in, people were having a lot more bad performance experiences than any other kind of bad experience: 0.98 or so. And when you multiply those all together, you end up with something that's much lower than any of them, since we're not averaging these; they all multiply together to make things worse and worse. That's the red line, the end-to-end health. Since there was a lot more variability in performance, which is pretty typical, it's subject to weather, subject to load, you can see that it's the one driving the moment-to-moment movement here.

A quick aside I've found useful since creating that graphic several years back: people in the reliability business often market their systems with quote-unquote "nines." A system that is 99.99% available is described as having four nines, one that is 99.999% available is described as having five nines, and so on. It tends not to go past five nines, because at that point you're claiming something like seconds or minutes of downtime in a decade, and it's not a very credible claim. The reason people end up talking this way is that it gets confusing to talk about probabilities that are really close to one all the time, which is what we're doing here. For example, your numeric intuition might be that going from 0.98 to 0.97 is the same kind of change as going from 0.98 to 0.99, but they're actually really different: one of those cuts the problems users hit in half, and the other makes them 50% worse. Moving to this nines way of talking tends to improve your numerical intuition a little bit. A silly little Python implementation of it is just the negative log base 10 of one minus the probability. The difference between 0.98 and 0.97 is only about a sixth of a nine, whereas the difference between 0.99 and a real three nines is a whole nine, a 10x reduction in problems. This can help us gain an intuition for how much harder things get as we push closer and closer to one. I've found it a more useful way to look at the charts and to talk with people about comparing options, because otherwise you're dealing with quantities that are all really close to one and they kind of mush together in your mind.
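Here's roughly what that silly little conversion looks like, with the two comparisons above worked out; this is a sketch based on the talk's description, not Slack's actual code:

```python
import math

def nines(p):
    """Convert a probability of success into 'nines': 0.99 -> 2.0, 0.999 -> 3.0."""
    return -math.log10(1.0 - p)

print(nines(0.98) - nines(0.97))   # ~0.18 of a nine: barely any movement
print(nines(0.999) - nines(0.99))  # 1.0: a whole nine, a 10x reduction in problems
```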
So I provisionally claim this ticks off the goals as we set them forth earlier. And subjectively, one thing that definitely happens with this is that it improves the quality of discussions around these kinds of problems. It goes from so-and-so saying, "ah, the site's slow this week," and so-and-so saying, "I don't know what you're talking about, it seems fine to me, the same as it did before," to us being able to say yes, it is slower, or no, it isn't, and why. It can help us select projects to some extent. And one of my favorite uses of this kind of thing is as a confidence builder before going after something that's gonna involve a little bit of risk: if you're going into a period where you're gonna be doing a lot of aggressive feature building, it can be nice to know that you're in a good enough place, global-health-wise, to sustain that kind of effort for a while.

Something I've learned more recently is that there are bad questions to ask with this. The bad question to ask with it is a question like: how big should Bob's raise be this year? The reason that's a bad question to ask of this, and of many other metrics we'd like to keep around to understand health, is that this is a behavioral metric, so Goodhart's Law applies. Goodhart's Law, from economics: once something becomes a target, it ceases to be useful as a measure. If we wanna preserve this thing's ability to be a real North Star that we can honestly assess the health of the site with, we need to keep people honest around it and make sure we don't build perverse incentives to cheat on it. It's not that anybody is necessarily a bad actor here; it's not even that people are necessarily gonna try to subvert these numbers. It's just that once people's jobs and livelihoods and promotions and status and everything else start to depend on this, it becomes much harder to keep it honest. So you need a little bit of management discipline around which indicators we use for measuring impact in the case of promotions, and which we use to assess the global health of things.

If you're curious enough about this to take a whack at it, there are a lot of details here, particular numbers and so on, and I would encourage you to zoom out from those a little bit: just focus on one class of unacceptable experience, start counting it, and start dividing by the number of times things were acceptable. That's all you need to do to get started, and I think that first metric gives you probably about 50% of the value you'd get out of the whole thing. I'd love to hear about any experiences you have and any nuances you discover from doing this kind of work. I appreciate you taking the time to listen to me here today, and thanks so much.