Welcome to your last session of the day. Apologies for standing between you and beer. We'll try to keep this high energy and fun, and send you off to the reception in high spirits. First, hello. My name is Christine. I'm the CEO and co-founder of Honeycomb. I care a lot about observability, but a lot of the techniques I'll talk about in this talk can apply to whatever tool you happen to be using.

Software in 2023 feels more like magic than it ever really has before. We are all here at this conference to explore what new magical capabilities we can add to our applications and user experiences. And for better or for worse, it feels like every CEO and CTO out there is now turning to their teams, asking how LLMs can be incorporated into the core product. There's lots to be excited about here. It's fun to be squarely in the middle of a phase change in progress. Everything is new to everyone together. But there's also a reality check: there's a lot more demand for AI functionality than there are people with the expertise to build it, and that means software engineering teams are often the ones picking up the task to just figure it out and build it. And to be clear, from my perspective, this is something that's okay. As a generalist software engineer, the fewer silos we can build, the better; the more we can transfer existing skills, the better. So this is my bias coming into this talk, and it's why we're trying to pull in existing techniques to leverage in this new space.

Because on one hand, using and building on top of LLMs is a lot like using any other black box that you interact with via API. There are lots of consistent expectations we can set about how we make sense of them. In essence, we've learned how to develop against these black boxes and turn them into testable and mockable and reliable parts of our systems. But there's one key difference between having your application behavior rely on an LLM versus, say, a payments provider: how predictable the behavior of that black box is. And that difference, that lack of predictability, breaks apart all of the techniques we've built up over the years for making sense of our complex software systems.

We have all these techniques to ensure correctness of the software that we put out in the world. But where with a normal set of APIs you can fairly tightly constrain the inputs and reason about what inputs are valid and invalid, when we build on LLMs we are often literally inviting free-form, open user input in whatever language and words people want to use. Reproducibility becomes a lot more challenging. Regressions are going to happen, especially if you have a model team or public API that is actively improving the model that your application depends on. And with a tightly constrained black box API, again, you can reason about it: I have a bank account, I'm going to debit $5 from it, that bank account should have five fewer dollars in it. That's much easier to reason about and debug than a thing that is literally meant to simulate the full flexibility of human expression. So ensuring correctness is a lot harder. And I have heard that the right way to solve these problems is to build an evaluation system, to evaluate the effectiveness of your model and your prompts, which is absolutely a thing you can do if you have an ML team to do it. But most of us aren't ML engineers. And the promise of an LLM exposed via API is that we shouldn't have to be an ML engineer to incorporate this technology.
RAG pipelines make this even more complicated: we're adding additional context to our applications to help LLMs return better results, and that's just more work that your application does, and it introduces even more unpredictability into how these applications behave. So this turning upside down of our worldview is happening on a literal software engineering, systems engineering level, because these black boxes aren't testable or debuggable in the traditional sense anymore, and there's no solid sense of correct behavior that we can fall back to.

It's also true on a meta level. There is no environment where we can conduct our tests and feel confident in graduating that to production. Even normal product development practices have to be turned inside out. You can't just say, oh, I have this early access group and I like how they're using the product, and now I can release it to the wild and that will have been a fair simulation. When the possible range of inputs is anything a human can dream up, all you're doing by limiting the folks who are able to use this feature is constraining the set of inevitable failures you will encounter when an uncontrolled and unprompted group starts to do things in your product that you never expected them to do.

So should we give up on everything we've learned about building and operating software systems and embrace the rise of the prompt engineers as the specialist skill set? If you have been paying attention to the title of this talk, the answer's obviously no. Because we already have a model for how to measure and debug and move the needle on unpredictable and qualitative experiences: observability. And bear with me a little bit. This term has become so commonplace that it's kind of fallen out of fashion to redefine it, but as someone who's been talking about it for a while, hear me out; I think it'll help some pieces fall into place. The term originally comes from control theory. It's about inputs and outputs. It can often feel overly formal when we're talking about software systems, which it still emphatically applies to, but it pretty literally feels applicable to a system that can't be monitored or simulated with traditional techniques. Less formally, I think of it as a way to compare expected behavior against actual behavior, but in live systems.

And to ground this in things we may be familiar with: for a standard web app that you might be using, whether this is your bank website or something else, we can capture what is happening inside the software by looking at the arguments we're sending it, the parameters we're sending it, some metadata about how the app was running, and what was returned. This lets us reason about the behavior that we expected for a given set of parameters, so that engineers can debug and reproduce the issue when the actual behavior deviates from that expectation. And the thing about black boxes, these external APIs that we use, is that even when our application depends on something outside of our control, we can extend that technique: look at the inputs and outputs, and reason about what's happening and why, and how what we're feeding it influences its behavior. The approach becomes the same for LLMs, unpredictable and non-deterministic as they are. It is a blanket statement that in complex systems, software patterns become unpredictable and change over time. With LLMs, that assertion becomes a guarantee.
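To ground that expected-versus-actual idea before moving on, here is a minimal sketch of what one wide, structured event per request could look like for that bank-debit case. This is illustrative, not from the talk: the handler, the in-memory ledger, and the field names are all hypothetical, and the same shape works whether the event goes to a log pipeline or an observability backend.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("events")

BALANCES = {"acct-123": 10_000}  # toy in-memory ledger, balances in cents

def debit_account(account_id: str, amount_cents: int, user_id: str) -> int:
    """Debit an account and emit one wide, structured event describing the request."""
    event = {
        # inputs: the arguments and parameters the caller sent us
        "name": "debit_account",
        "account.id": account_id,
        "debit.amount_cents": amount_cents,
        "user.id": user_id,
        # metadata about how the app was running
        "app.version": "1.4.2",  # hypothetical build identifier
    }
    start = time.perf_counter()
    try:
        BALANCES[account_id] -= amount_cents
        # output: what actually came back, so expected vs. actual can be compared later
        event["account.new_balance_cents"] = BALANCES[account_id]
        event["outcome"] = "ok"
        return BALANCES[account_id]
    except KeyError:
        event["outcome"] = "error"
        event["error.message"] = f"unknown account: {account_id}"
        raise
    finally:
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        logger.info(json.dumps(event))

debit_account("acct-123", 500, "user-42")  # expected: the new balance is 500 cents lower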
If you use LLMs, your data sets are going to be unpredictable and will change over time. And key to operating sanely on that magical foundation is having a way to gather, aggregate, and explore that data in a way that captures as accurately as possible what the user actually experienced. And by accurately, I mean expressively, right? In terms that reflect what your application is trying to do. That's what lets you reason and build and ensure a quality user experience on top of LLMs: the ability to understand from the outside why your user got a certain response, how to prevent hallucination, and what your application can do to ensure as high-quality an experience as possible. Because observability ultimately is here to create feedback loops that let you learn from what's really happening in the code. The same way that we've learned how to work iteratively with tests, observability lets us all ship sooner, observe those results in the wild, and wrap those observations back into the development process as quickly as possible.

So, a little bit of a reality check: why are we listening to me? Besides that I'm on stage and you have to and we don't have beer yet. Well, Honeycomb is a SaaS tool, and we happened to release our query assistant, a small feature built on top of LLMs, in May 2023. I say small; I think the customer impact was quite high, but it only took us about six weeks of development. And we dedicated about eight weeks afterwards to iterate on the experience. I really like calling out this last piece, because what we were able to do was really lean into watching real users in the wild use the feature and improve it live with that feedback.

A quick overview of what this did. Our product has a visual query interface that lets folks interact with their telemetry. We believe that point and click is always easier for someone to learn than an open text box. But even so, there's a learning curve to using any UI. We were really excited about using LLMs as a translation layer from the human intent into what the GUI actually needed. And because we wanted to preserve a lot of the interactivity and explorability, we wanted folks to, of course, be able to tweak and iterate on their output. You can see: okay, you type in the thing, we do the translation, you can keep iterating. Because of this, there is no, and I think this is true of a lot of functionality built on LLMs, there's no concrete or quantitative result we can rely on that says, great, this feature is good, it did its job. It's a very qualitative sort of ask: hey, user, did this help you? Was this result helpful? Did it move you along in your investigation?

And using this philosophy of observability, really gathering feedback and thinking about things from the user perspective, we were able to capture qualitative feedback and use that to posit some higher-level product goals, like: will this help retain new users? Will this smooth the learning curve? And then measure those results and feed them back into our iteration. Spoiler alert: we hit these goals, and we were really thrilled by how this was received. And yet, we still meet a lot of teams who are in the early stages of exploring with LLMs, who are not as confident as we were to ship something live and see how people used it.
And I think that this is rooted in the fact that when you're comfortable with this approach to observing a system from the outside, measuring results, and expecting to pull them back in quickly, you can have a level of confidence in what you put out there that flexes really well to the uncertainty of this new world. As we were building our feature, these are some of the things that we learned. I imagine for anyone who's been building on their own, these things are not new. Your users will do things that are unpredictable. You will have something that you're improving over here that breaks something over there. And they all reinforce this idea that what we've been doing all along to ship high-quality, correct software doesn't quite apply. And down here, there's a blog post that goes into this in much greater detail.

So what do we do? What can we do about this? Well, step one, sort of the underlying piece of observability, is leaving a paper trail for yourself: how and why your code behaves a certain way. I think of instrumentation as being just like documentation and testing. It's another way for your code to explain itself back to you. And in this case, again, it means capturing as much as possible of the logic, the things that are unique to your application, your product, your domain, what users might struggle with, in order to validate or invalidate hypotheses: why is my code not behaving the way I expect? Why is this feature hallucinating? In a normal software system, this can let you do things as simple as, you know, figuring out why latency is going up, or figuring out why one individual user is associated with unexpected behavior. It can let you do things as complicated as figuring out which implementation of an NP-complete problem, gated behind a feature flag, behaves the best under real user workloads. So it's all about testing your hypotheses on production data.

Brought into the LLM world, the same principles apply. If you capture as much as you can about what your users are doing in their system, in a format that lets you look at overarching performance and then also debug into individual detailed transactions, you can do that zooming out to high-level trends and zooming in to individual debugging that we need to iterate quickly with unpredictable systems. Here you can see there's a whole range of things that you can capture about the interaction with any given feature. Capturing this sort of data lets you ask high-level questions about the end user experience, score the quality of the response, aggregate on that, and track progress over time. You can ask and answer high-level questions about trends in the latency of those LLMs, what it's actually like to hit that API, whether the problem is on your side or theirs. You can then group the metrics you're capturing by fine-grained characteristics of each request, which lets you draw conclusions about certain user segments who are abusing the feature or using it in new and unexpected ways, or about how the context you're pushing through the application impacts LLM performance. Being more free-form also helps: not having strict definitions for things like errors lets you capture that unexpected behavior without everything getting marked as an error just because it's unexpected. When you're iterating fast and you're working with an API that might hallucinate, sometimes it's not an error; sometimes it's just behavior that you need to handle differently. A general philosophy: I have a thing against exceptions.
Not every exception is exceptional, and not everything exceptional is captured as an exception. It's just the way it is. When you use something like traces to tie all these pieces together, these complex pipelines, this rolling up of context, you can see how all these pieces contribute to a certain user experience. Likely you cannot read this from where you're sitting, but the tiny blue line at the bottom, that, finally, is the actual call that our service made to OpenAI. Everything before that point is building up the context to send an ideal prompt over. And all this context is what we need if we have any hope of figuring out why things went wrong, why our users got a terrible suggestion, and how to iterate towards a better future.

I think the thing that I'm most excited about here is that, again, this isn't a whole new skill set. A lot of this shift in how we build software is already underway. We already have some of the muscles we need built. As a baby software engineer, I used to take a lot of pride in shipping really fast. And I wrote tests along the way because, of course, that was just the way things were and just a part of shipping good code. But in the last decade or so, maybe a decade or more, depending, we've already seen a shift in this conversation. It's not just write a lot of code and maybe write a lot of tests. We're already starting to see phrases like service ownership, putting developers on call, and testing in production enter the lexicon of what it means to be part of a software engineering team. The developer's scope is starting to include production and what real users are experiencing in the wild.

If we take that behavior shift we've built up over the years and do some tweaking, the things that developers have taken onto ourselves, driven by tests in development, can really be neatly applied to production, maybe under a different name. If we identify the logical branches that we should test for in a test case, well, we can translate that to instrumenting our code with intention so we can capture that paper trail and debug. With tests, you're constantly comparing expected versus actual and asking why these deviate; in production, whether you're built on LLMs or not, you can watch for deviations. You can see what happens when your change goes live and see whether that's, again, what you expect. And with TDD, we're expecting to act on the output of these feedback loops: fail fast, expect to iterate. That happens with production too. These are guardrails that we have already generalized for building complex software systems, and they can be applied to LLMs to even greater effect, for all the reasons we've gone through so far. Test-driven development was all about the practice of helping software engineers build the habit of checking our mental models while we wrote code. Observability is all about the practice of software engineers, SREs, DevOps, et cetera having a backstop to our mental models when we ship code. That's the ability to sanity-check the LLMs, necessary when we're never going to be able to have a mental model that is accurate enough to rely on. Like it or not, in this new gen AI world, nothing is predictable, except that there will be chaos. Taking a photo? I'll have a thing at the end where you all can download the slides if you need.
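To make that paper trail concrete, here is a minimal sketch of what tracing a feature like this could look like, using the OpenTelemetry Python SDK. The span names, attribute names, and the call_llm stub are hypothetical placeholders, not Honeycomb's actual instrumentation; the point is the shape: a parent span for the whole interaction, a child span for building up context, a child span for the model call at the end, and the loosely defined outcome recorded as an attribute rather than a hard error.

```python
# pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Export spans to the console for this sketch; a real setup would export to a backend.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("query-assistant")

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the actual model API call."""
    return '{"calculations": ["COUNT"], "breakdowns": ["service.name"]}'

def suggest_query(user_text: str, user_id: str, schema_fields: list[str]) -> str:
    # Parent span ties the whole interaction together.
    with tracer.start_as_current_span("query_assistant.suggest") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("input.text", user_text)
        span.set_attribute("input.length", len(user_text))

        # Everything before the model call: gathering the schema, assembling the prompt.
        with tracer.start_as_current_span("query_assistant.build_context") as ctx:
            prompt = f"Schema: {', '.join(schema_fields)}\nRequest: {user_text}"
            ctx.set_attribute("schema.field_count", len(schema_fields))
            ctx.set_attribute("prompt.length", len(prompt))

        # The actual (tiny, at the very end) call out to the model provider.
        with tracer.start_as_current_span("query_assistant.llm_call") as llm:
            raw = call_llm(prompt)
            llm.set_attribute("response.length", len(raw))

        # Record the outcome loosely: an unusable suggestion is behavior to handle,
        # not necessarily an exception.
        span.set_attribute("response.parsed_ok", raw.strip().startswith("{"))
        return raw

suggest_query("show me slow requests by service", "user-42", ["service.name", "duration_ms"])
```

With events shaped like this, the high-level questions from earlier (latency trends, quality scores, which user segments struggle) become group-bys over the same attributes you use to debug a single bad suggestion.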
So I want to highlight one very specific technique that is a little bit adjacent to, but a little bit different from, observability: a concept popularized through the rise of SRE, commonly associated with existing software systems and ensuring consistent behavior, the service level objective, or SLO. Just in case folks in the audience may not be familiar with this, I think of it as a way to force product and service owners to align on a definition of what it means to provide great service to users. Often it manifests in statements like these, and often they're used to set a baseline and measure degradation over time. You hear them a lot associated with uptime and performance in SRE metrics. But in our world, in the world of building on LLMs, when the LLM landscape is moving so quickly and best practices are still emerging, it's almost like the emphasis shifts. Instead of expecting degradation to be an uncommon case to watch out for, you can flip SLOs and have them help you watch out for, react to, and improve on something that you know is going to happen: the landscape around you changing and your feature behaving differently than what you expect.

I'll show you an example. When we released our feature, I mentioned six weeks in development, and we gave ourselves eight weeks to iterate. This is a thing that we posited would work, and we weren't really sure. What we did is, instead of setting a baseline and then freaking out if it went down, we set a fairly low baseline. Usually SLOs are measured in terms of nines; here, we set it at 70-something. We chose to set our SLO to that baseline and actually use it to track improvement, saying: hey, this is what we're releasing, we want to make it better, we want to make the user experience better, we want this to be more useful to people over time. And we used all that tooling around SLOs to help us identify the highest-impact things we could do to move that line upward. The sort of telemetry that we captured, the expressiveness we were able to rely on to map it back to our application logic, is what allowed the team to iterate fast, iterate confidently, and work around this unpredictable and inherently hard-to-model API. And by choosing to measure from the outside in, it enabled a level of debuggability that you don't get if you think, oh, I'm just going to instrument the LLM, or I'm going to build my own model. It allows for this decoupling and moving faster.

But you don't have to take my word for it. Two of our customers came to the same conclusion, actually independently. They were already using Honeycomb for parts of their application. Duolingo, the language learning app: being a mobile app, they cared a lot about latency. And so when they released their LLM-backed features, a beautiful use case, of course, for large language models, helping you learn another language, they made sure that they were able to instrument everything. Both the application's interactions with OpenAI, and also everything along the user path that might influence latency along the way: how services talk to each other, different types of language, which language experience they were in. Intercom is another example; they're a B2B messaging app. You might go to a website and have the chat bubble pop up, and that might be backed by Intercom. They similarly were rapidly trying to incorporate LLMs into their features, but they have a really heterogeneous set of customers and a really broad surface area to their product.
And so what they cared about was user experience while they were changing out some of their plumbing, and being able to track as many parameters as possible at a given time so that they could isolate which set of parameters impacted that ultimate user experience while introducing LLMs. And what you can see here, I mean, the fields I was able to pull out are a fraction of what they actually tracked. But this is what lets them ship confidently, attach cause to effect, and really understand how engineering work impacts what the user sees on the other side.

So in the end, LLMs break so many of the existing practices that we rely on to ensure a great software experience. But there are ways to take existing software engineering and SRE techniques to make building on top of LLMs make sense. If you're able to capture all the metadata and use it to understand what's happening in your application, you can build great experiences, magical experiences, for your users, as long as we embrace that unpredictability and measure from the outside in rather than expecting an LLM to behave like a normal API. If you're curious for more, I am, of course, from Honeycomb. You can go out there and find out a little bit more about what we do and how we've helped our customers. If you want a copy of these slides, there's a QR code. It will lead you not only to these slides, but also to a list of the links and blog posts and resources that I drew from in order to pull this together. Thank you for your time. I think I've got a few minutes for questions. Or, again, don't let me keep you from beers. Thanks, everyone. Any questions?

Users, well, I'm not trying to point at anybody personally, of course, but users very often skip giving feedback. So what do you do, in terms of how the app is developed or what you set expectations-wise in the UI, to help raise the participation rate of users?

There are many design answers I could give to that, which would be a bit of a stretch because I am not a designer. I think what I would do if, let's say, like you're saying, I look at the number of people who gave us that feedback and it's only 3% of the people who actually use the feature: I would look at other signals for what they did next that indicated it was helpful. They're running queries in our tool. If they are able to move on to their next query, and it was different, maybe it was helpful. If they just re-ran the same query again, or the one they started with, maybe not. I think GitHub Copilot last year actually released a really interesting study here, where they similarly didn't want to rely on subjective feedback. They wanted to look at other signals for how to measure success of the feature. And it's a really interesting paper, because they initially theorized, I think, that if people accept the Copilot suggestion and move on, then it was helpful. But what they found through a mixture of quantitative and qualitative feedback is that the real measure of success is whether people accepted the suggestion but then iterated on it, tweaked it. And so it's something that's going to be very dependent on how you are trying to help the user and what you're trying to unblock. I would much rather spend time there than on why is this thing hallucinating in the first place. Any other questions? Yeah? You mentioned about the... Absolutely.
This is like demo data, but a thousand percent: what you are capturing as that paper trail should evolve as your feature evolves, the same way that tests and documentation evolve. The information and the metadata that you need to understand what's actually happening is necessarily going to be tied to what you're actually doing in the application. If you have a feature flag in the feature, you probably want to know whether that feature flag is turned on or off at the logic branches. I've met some teams where they're like, oh, well, we have our set of metrics, and it should never change, and these are the blessed metrics that we use to understand our application. I feel like that is just trying to build software with one hand behind your back or a foot stuck in the ground, because you're not going to be able to adapt. Certainly, it's possible to go too far to the other extreme. As a team, you want to identify some conventions for which fields mean what and have canonical names: this is a customer ID, this is a user ID. But I think modern observability tooling allows for that flexibility, allows for, oh, I want to add this new field because I've added this new path through my application. Does that answer your question? I have lots of feelings about what people should not be able to do with data, mostly what they should be able to do, and some tools make that harder than others, but we can have that conversation while I'm not on stage trying to be vendor neutral.

Yes? Is there some tooling? Probably. I mean, there's an infinite number of cool things coming out these days. My challenge is this: if I were presented with a tool that told me it would help me figure out why my feature was hallucinating, I would say, cool! How do you know what hallucinating looks like in my application? How do you know what matters in my application? If I'm an e-commerce site, are you going to let me shove in the merchant and the inventory item and all the things that let me debug it in my systems, as well as all of the standard working-with-a-model functionality? I think in a time of great change, my approach is always: how do we empower the people who are creating the change and the people who are building the software, rather than putting too much faith in a tool to solve my problems for me? Again, it's a question of where you draw the line, but I really like seeing what teams can do when we say: here's the data you need to investigate, and really own, from beginning to end, what you want to build and whether it delivered that value. Any other questions? I'm going to take off the mic. Go enjoy the reception. Thank you all for your attention today.