Okay, it's been about another minute, so I think I'll go ahead and get going. First of all, thank you all for coming. I know it's 5:30, pretty close to dinner time, so if you're hungry, I won't be offended if you have to duck out to go eat. I'm getting pretty hungry myself and looking forward to eating after this, so I won't try to keep you forever.

This is gonna be a fun topic, I think, at least fun for me, because it sits squarely at the intersection of two talk tracks: there's the ML and AI track, and there's the observability track, and the two are very firmly intertwined in this talk. So if you're familiar with one but not the other, hopefully you'll find some stuff that's useful here. If you're not familiar with either, you can bookmark some of this for after you go learn a little bit more about one of them. Regardless, let's get going.

I call this talk "observing an LLM in production," but it's really about ways you can make large language models more reliable in production. The impetus behind it is that, as you're likely aware, pretty much every organization on the planet right now is looking to build with large language models to some extent. They're really, really powerful, but they're also kind of hard. It's not magic. They bring a lot of problems, and a lot of challenges that many teams may not have been prepared to handle. We were one of those teams, so I wanna walk through our case study at Honeycomb, where we learned a whole lot. But first I wanna spend a little bit of time on why large language models are hard in production.

I would posit that your average engineer who knows a thing or two about the product you're working on, if given an afternoon, could probably cough up some sort of prototype that does something pretty useful with an LLM. That's part of the incredible power here, and I just wanna underscore how amazing it is that we have literally the world's most powerful machine learning models available from a single API call. I could never have dreamed we'd have this kind of power available to us, but with it comes a whole set of problems we may not have known we would have. To extend that, I would say it's actually fairly easy to take a feature that uses a large language model to market as something experimental or alpha quality. If you give a team about a month, they can probably put something together that integrates with the rest of your product, solves a specific problem, and does a pretty decent job at it. But the problem is not getting something to market, it's keeping it there. That initial launch can be a nice little marketing moment, you can make a nice splash, maybe solve a problem or two, and then your users are gonna start getting used to whatever it's doing. They're gonna expect it to do more, they're gonna wanna do things that you could never have possibly predicted, and they're gonna want it all to be fairly reliable, or at least what they think reliable means. Now, unfortunately, you just put a nondeterministic, undebuggable black box into production, and that kinda goes against the whole idea of trying to make software more reliable. Well, it's not really an either-or, and I wanna talk a little bit about that.
It's my hypothesis that it's very hard for your typical software engineering team to test and debug LLMs the way they would normal software, and that's for a couple of reasons. I think the first, and probably the most important one, is that when you give your end users a blank box and tell them, type whatever you want and we're gonna try to make this work in our product, that opens up your product to inputs that you cannot possibly predict. These are modalities that your users have never used before and that you've never tested before. The English language is basically infinitely expressible, and so they're gonna infinitely express as much as they can inside of your product, and you can't necessarily predict what the bounds of that are gonna be. You can't proactively write unit tests that cover it. Now, maybe given enough time, if you spent an entire year building something, you could approximate what your users are gonna do, but a year in development is nonsense, and no business owner is gonna accept that as a proposition for building with large language models.

Once you're actually there, though, it's not just about the user inputs, because these models, depending on what you're doing, can be non-deterministic. There's a bit of nuance to that: if you're using OpenAI, like the majority of us, they are literally non-deterministic, but there are open source models where you can set the temperature value to zero and get something that actually is deterministic, and that's a piece that may change in time. But regardless, there's infinite expressibility on the inputs and no guarantee of consistency on the outputs for the same input. You could have somebody input the same thing twice and get two different answers. And that's by design; that's literally why we use these things. You'll see the phrase "large language models hallucinate" quite a bit. I don't really agree with that characterization. I would say that every single output from a large language model is itself a hallucination, and it's really just a matter of tuning the hallucinations you get to be useful for the use case you actually have in mind. If you make that mental shift, you can accept that, okay, this thing is kind of inherently unreliable, but we can rein it in a little bit.

And the last piece, which is up there in the text so you can pretty much read it, comes from my own experience with prompt engineering on something that's been live in production for the better part of a year: it's extremely easy to cause a regression. Going back to that first point, your users are doing things that you may not necessarily predict, and you're gonna have things that are actually working that you're not even aware of yet. If you fix something that somebody said wasn't doing the right thing, you may have caused a regression in something that was already working, and because you weren't aware that it was working, you've now fixed one problem and caused another. This is extremely common in prompt engineering. This is something that anybody who's in production right now with this stuff probably knows and feels quite a bit. So I wanna tell a little bit of the Honeycomb story and how we managed to make our own feature, at least our first feature that used large language models, good enough. It's not perfect, but it's pretty good.
So, in May of 2023 we launched a natural language querying feature. For context, Honeycomb is an observability product. If you're not familiar with observability, the idea is that you have a bunch of data from your systems sent to our backend, and if something is going wrong, or something is particularly slow, you can query that data and ask, okay, if I'm running an e-commerce site, is my checkout service the one that's particularly slow? There are all sorts of questions you can come to our tool with. What we found with our new users was that they had heard Honeycomb is a great observability tool. All right, cool. But they didn't know how to use it. They would come in knowing what they wanted to ask, but our UI didn't let them actually express it, and they would walk away and say, okay, I don't know how to use your UI. So one of our hypotheses was: what if we put a text box in there, have a large language model do its best to translate that input into structured output in the form of a JSON object, parse and validate that object, and then turn it into a query that we execute on behalf of the user?

We launched. It was a great marketing moment. I definitely do wanna highlight that if you have the opportunity to build something with language models right now, your marketing team is gonna be very happy with you, and that will provide good business to the company you work for, so I recommend doing it. But more importantly, we iterated a ton after we released, and we really, truly stumbled our way into what might be some best practices. There were frankly no documented best practices out there. Prompt engineering is still kind of the Wild West, but it was even more the Wild West earlier that year. We think we have a methodology and the beginnings of a tool set that can be used to rein some of this in, so you end up with not just a good marketing moment that your marketing team is happy with, but reliable software that your end users are going to enjoy using in the long run, rather than a fun little experiment that you eventually have to shut down because it's too janky.

And specifically, I didn't have the numbers in here, I completely forgot to do that, but we did see some good results from doing this, in particular with net new users who come into the product. We see a lot more retention from them: if they use the natural language querying feature, they end up querying their data a lot more in the long run and ultimately converting into paid users more than people who don't. In particular, our enterprise sales folks really love this feature, because instead of having to handhold someone through the process of learning how to use our product, they can just say, see that text box? Plug in your question and see what comes out. It'll create a Honeycomb query for you, and then you can manipulate that query afterwards. In a way, even if it doesn't do exactly what you want, it teaches you how to use the product so you can be successful. Our sales folks have basically said, this is great, we love it, it's shortened the cycle it takes to get somebody over that first hump of using the thing.
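Just to make that text-box-to-query flow concrete before going further, here's a rough sketch of the kind of thing I'm describing: take the user's question plus some schema context, ask the model for a JSON object, and parse and validate it before ever running anything. This is not our actual code; the prompt, the model name, the column handling, and the validation rules are all hypothetical, and it assumes the official OpenAI Python client.

```python
import json

from openai import OpenAI  # assumes the official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def natural_language_to_query(user_input: str, columns: list[str]) -> dict:
    """Hypothetical sketch: translate a user's question into a structured query object."""
    system_prompt = (
        "You translate questions about telemetry data into a JSON object with the "
        "keys 'calculations', 'filters', and 'breakdowns'. "
        f"Only use these columns: {', '.join(columns)}. Respond with JSON only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
        temperature=0,
    )
    raw = response.choices[0].message.content

    # Treat the model's output as untrusted input: parse and validate it
    # before executing anything on the user's behalf.
    query = json.loads(raw or "")  # raises if the response isn't valid JSON
    if not isinstance(query, dict) or "calculations" not in query:
        raise ValueError("response is not a valid query specification")
    return query
```

The important design choice is that the model's output gets parsed and validated like anything else a user could have typed, and only then does it become a query.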
So how did we get there? We practiced observability. And if you're thinking that a representative from an observability company got up on stage and told you that they did some observability for a feature and it worked really well, try not to be too cynical about that. It actually is legit. And when I say observability to the max, I really truly mean it. Whenever Honeycomb releases a new feature, we practice observability for it and see how it goes, but we really don't do it to the extent that we did for this feature, not even remotely close. Very few times, in fact, are we leaning on observability as the primary way to improve a given product. Very often, frankly, a lot of our changes are front-end changes and we can just directly ask the user, hey, did this chart thing improve stuff? And they say yes or no, and maybe, hey, this would be a little bit better. But with large language models you kind of can't do that, especially because we found that users are just gonna use the ever-living crap out of this stuff: you give them a blank text box and they're really just gonna go for it. It would have been way too much effort for us to try to talk to all the different people, let alone know whether we had an accurate sampling of their experiences or of what everybody wants, so we had to use observability.

So, starting from scratch, we captured everything as tracing data using OpenTelemetry. If you're not familiar with a trace, "distributed trace" is a fancy term, a $25 word for a 10 cent concept: it's a bunch of structured logs that are ordered and correlated with an ID, and all of that correlation happens for you automatically. The idea is that for a given transaction passing through a system, you can say, okay, the user input this thing, it hops through whatever subsystems are there, maybe it calls another service, maybe it calls a database, and eventually it finishes and the user gets a result. There's a pathway, and tracing shows you what happened along that pathway. We capture absolutely as much contextual information as we can between the point where a user hits the get query button and the point where they see a query that has actually been executed on their behalf. Strictly speaking, it's the point when we enqueue an object to run against the querying engine, but it's pretty much the same thing. We do retrieval-augmented generation, or RAG as it's often called, another $25 word for a 10 cent concept, and it's a large multi-step process. At every single one of those steps, we capture relevant information about the decision that we made and the data that we gathered, which we ultimately contextualize into a prompt that finally makes its way to the large language model. And then on the other end, when we actually get a response back, we parse it, in this case a JSON object, and turn it into what's called a Honeycomb query specification, which then gets validated against a set of rules. There are a couple of other things we do programmatically to fix things up, if they need to be fixed up, all the way until a query gets executed. And there's a little feedback mechanism where somebody can say thumbs up or thumbs down or not sure on the result that we gave.
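Here's roughly what that kind of instrumentation can look like in Python with OpenTelemetry. The span names, attribute keys, and helper functions are made up for illustration rather than being our real pipeline; the point is that every step between the button click and the enqueued query gets its own span with the inputs and decisions attached to it.

```python
from opentelemetry import trace

tracer = trace.get_tracer("query-assistant")  # hypothetical instrumentation name

def generate_query(user_input: str, dataset: str) -> dict:
    # Root span covering everything from "get query" to a validated query object.
    with tracer.start_as_current_span("generate_query") as root:
        root.set_attribute("app.user_input", user_input)
        root.set_attribute("app.dataset", dataset)

        with tracer.start_as_current_span("fetch_relevant_columns") as span:
            columns = fetch_relevant_columns(dataset, user_input)  # hypothetical helper
            span.set_attribute("app.num_columns", len(columns))

        with tracer.start_as_current_span("create_chat_prompt") as span:
            prompt = build_prompt(user_input, columns)             # hypothetical helper
            span.set_attribute("app.full_prompt", prompt)

        with tracer.start_as_current_span("openai_request") as span:
            raw = call_openai(prompt)                              # hypothetical helper
            span.set_attribute("app.llm_response", raw)

        with tracer.start_as_current_span("parse_and_validate") as span:
            query = parse_and_validate(raw)                        # hypothetical helper
            span.set_attribute("app.query_valid", True)

        return query
```

Capturing the full prompt and the raw response as span attributes is what makes the debugging workflow later in this talk possible at all.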
It's a lot of data, but it's also really important, because every single one of these things is part of the end user experience, and especially on the input side, every one of them directly influences the behavior of the large language model. If we wanna systematically understand how this model is actually doing out in the real world, we have to capture this kind of data, otherwise we're gonna be blind as to why it's doing what it's doing. I'll have a visual of this a little bit later, but there are pretty much four primary things: the user clicks the get query button; there's a pipeline of context gathering via RAG, which is fully instrumented; we make a request, in this case to OpenAI; and then we parse the response, validate it against a set of rules, and submit to our querying engine. That whole thing is 48 spans inside of a trace, so 48 structured logs, each of which represents a particular piece of the end user experience, and critically, each of which represents something that could potentially go wrong. We wanna make sure we understand, if something is going wrong, is that because the model is at fault, or are we making the wrong decisions when we gather context? Did we validate something incorrectly? That actually happened. There are all kinds of things that can happen.

We also monitor this thing with SLOs, or service level objectives. If you're not in the SRE space, a service level objective is arguably the best way to monitor software today. You're not monitoring things like, my CPU usage is fine or my memory usage isn't blowing up; you're monitoring an end user experience. In this case you're asking, what matters here? We wanna make sure that when the user hits the get query button, a query gets submitted to our querying engine. That whole thing needs to actually happen, and it should happen within a certain amount of time, because if it takes a whole minute, the user's probably just gonna walk away in frustration anyway, so it doesn't even matter whether we were correct. So these are actually two service level objectives: one where we measure that entire time slice, which needs to take less than 10 seconds, and one that we call an error SLO, though it's really more of an availability or success SLO. The idea is that regardless of what the user input, we should, the vast majority of the time, have a query that gets submitted against our querying engine and ultimately produces query results that somebody can look at. That's the job. Now, the key thing with service level objectives is that they bake in a budget of failures. If somebody enters "my favorite color is blue," well, we can't really produce a Honeycomb query based off of that, that's kind of nonsense. And that's fine; the large majority of our users are not going to be inputting nonsense, they're gonna try to get some actual value out of this tool. That's what service level objectives allow you to do: establish what your success rate over a target period of time should be, and what your budget for failures is. And critically, this measures every one of those things I mentioned, those 48 spans. Each of those can represent an event that could potentially fail this service level objective, and we want every single one of them to succeed.
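As a sketch of how those two SLOs can be fed from the instrumentation, assuming your SLO tooling keys off span status, attributes, and duration: mark the root span as errored on any failure at all, and record how long the whole thing took. The helper functions and attribute names here are hypothetical.

```python
import time

from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("query-assistant")

def handle_get_query(user_input: str, dataset: str) -> dict | None:
    start = time.monotonic()
    with tracer.start_as_current_span("handle_get_query") as span:
        try:
            query = generate_query(user_input, dataset)  # the pipeline sketched earlier
            submit_to_query_engine(query)                # hypothetical helper
            span.set_attribute("app.query_submitted", True)
            return query
        except Exception as exc:
            # Any failure, for any reason at all, counts against the availability SLO.
            span.record_exception(exc)
            span.set_status(StatusCode.ERROR, str(exc))
            span.set_attribute("app.query_submitted", False)
            return None
        finally:
            # The latency SLO is defined over this duration being under 10 seconds.
            span.set_attribute("app.duration_ms", (time.monotonic() - start) * 1000)
```

The SLIs then read straight off the trace data: "query submitted without error" for the availability SLO, and "duration under 10 seconds" for the latency SLO.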
So every step needs to happen, and it needs to happen in less than 10 seconds. The other piece to this, which I think is not really talked about in the LLM space, is that if you want to iterate with this stuff in production, you need to be able to deploy your code very rapidly. Anybody who's worked at Big Tech probably doesn't have this problem, but the majority of enterprise organizations out there are constantly trying to reduce their cycle times, because regardless of whether you're using LLMs or not, you want to be able to ship bug fixes pretty quickly. Using large language models really dials that up. In particular, you want to monitor that end user experience: are we doing the job, and are we doing it in the time we said we want to do it? Okay, but what if we fail? What if there's a case where something is not very good? That's fine, we can fix that, but we want to be able to deploy that fix and immediately look at the past 24 hours and make sure that we're still doing the job we want to do, and that we've fixed the problem we thought we just fixed, without introducing something new that could have gone wrong. That's why those SLOs are really important: they allow you to make sure you're guarding against those regressions over time. And as I said before, this is something you will find extremely challenging to put into a unit test suite. You need an observability tool that's monitoring all of this data that you've captured as traces at a very high level of granularity. Once you can set up this flywheel, where you're prompt engineering every day and fixing very specific problems, you identify a specific thing, you ship that fix, you monitor over the next 24 hours, okay, did that actually do it? Rinse and repeat, again and again and again. This is exactly what we did with our feature from May 3rd all the way up until July, I believe. We literally shipped pretty much every single day. We would isolate a very specific problem, and we'd be constantly looking at these SLOs and this trace data.

So that was a bit abstract, and there was a lot of stuff in there, so I wanna give you a walkthrough based on this feature and how it's actually doing live as of last week. The screenshots I'm gonna show are live from what's happening right now with our Query Assistant feature. What you're looking at here is our SLO page. There's a bit going on here, so I wanna make sure you have some time to absorb it. It's called "natural language query generation availability": the proportion of requests to generate queries from natural language that complete without an error. We're basically identifying the success criteria, saying, okay, that entire 48-step process was able to succeed. And if it failed, it could be for any reason at all; it doesn't matter what the error is. It could be something completely unrelated to the large language model. It could be that we expected a JSON object and didn't get back JSON. It could be that we had a bug unrelated to any of this stuff that was somehow affecting this feature. Doesn't matter. We wanna capture any of the reasons why something could possibly fail.
And so you can see that we have about 95.6%, roughly 96%, historical compliance with this SLO, meaning that about 96 out of 100 times somebody hits get query, they actually get a query. And then we have a budget burn-down here: you can see we have about 78% of our allocated error budget left. That's calculated automatically from us saying that we want 80% of the requests we make, 80% of the people hitting that get query button, to succeed over a period of seven days. So this is just an overview page, but it's too much to fit on one screen, which is why I split it into two screenshots.

The second one is where it gets a little more interesting. This is Honeycomb's SLO tool, but there are other service level objective tools out there that can show you similar kinds of things. This is where we start getting into very specific reasons why certain things fail. What I'm highlighting here, at the bottom left of the screenshot, is a particular error called "LLM response does not contain valid JSON." What that means is that when we got a response back from OpenAI, we expected a JSON object, but we didn't actually get one. We kind of can't create a query when we don't have the object that represents what the query should be. Oops. Okay, well, let's dig into that a little bit. What we can do is go directly from one of those failures into a querying interface, in this case Honeycomb's querying interface, where we scope everything we're looking at down to exactly those times when that error occurred, when we did not get a JSON object back. We've also grouped by the user input and the response, so we can see, in these three cases over this time span, this was the user input and this was the output that resulted in the error. And indeed we can see cases where somebody entered something and there was no response given back by OpenAI at all. Well, that's not a JSON object, so the error sounds about right. But that last one is really bothering me, because somebody asked for a number of requests where http.route contains this thing and this thing and this thing. That seems like a pretty valid input; I feel like it should result in a query. And it looks like there is something there, and the response kind of looks like it's a JSON object, so why did this fail? Okay, well, that's something I might want to investigate.
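As an aside, the only reason that error is something you can group and drill into at all is that the failure reason was attached to the span when parsing failed. A hedged sketch of what that can look like, with hypothetical attribute names:

```python
import json

from opentelemetry import trace

tracer = trace.get_tracer("query-assistant")

def parse_llm_response(user_input: str, llm_response: str | None) -> dict:
    with tracer.start_as_current_span("parse_llm_response") as span:
        # Capturing both sides is what lets us group failures by input and output later.
        span.set_attribute("app.user_input", user_input)
        span.set_attribute("app.llm_response", llm_response or "")
        try:
            return json.loads(llm_response or "")
        except json.JSONDecodeError:
            span.set_attribute("app.error", "LLM response does not contain valid JSON")
            raise
```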
So what comes next is digging into a specific request, going into each of those 48 steps and understanding exactly what happened along the way. This is part of that. Again, because it's 48 pieces, it's a little hard to fit on a screen, and in fact I've collapsed over 20 of these spans, but you can see there are a lot of operations we're performing. They're named in very particular ways, things like find all suggested queries for dataset, most relevant columns, fetching some embeddings from an embedding model that we have, create chat prompt, all that kind of stuff. All of these things are relevant here. In Honeycomb's UI, when you select one of these things, one of the spans in that trace, you can see that spans contain a bunch of key-value pairs with rich information captured at each point. In particular, I highlighted the response itself, the actual raw data, because I noticed it didn't fit in my table view.

And indeed, that does look like a JSON object, but it doesn't look like it's finished. There's "order.external_" and then it's done. Well, shit. That sounds like a bug. We should probably fix that one. I'm the owner of this feature, so when I say we, I mean it probably should be me. But what I want to highlight is that I didn't start out knowing there was this case where it sometimes produces part of a JSON object. I started at a very, very high level: these requests should succeed over a period of time. Then I started to narrow down: okay, where are the ones that actually fail? All right, let's dig into that. Let's look at inputs and outputs and see what's going on. Oh crap, we saw something. Let's dig into that trace and walk through each of those spans; and yes, I actually did walk through each of these manually just to make sure I understood what was going on. It took me a few minutes. And then I found one of these and said, oh, all right, there we go. It's an incomplete JSON object; we can't parse that. Oops, that's a bug. That kind of did the job there. Now I can take this into prompt engineering and say, yes, we have a very specific problem that we can go and fix. This wasn't nonsense that somebody input, and in fact the model seemed like it was trying to do something useful. It pulls up a bunch of columns here that look like they're relevant; maybe it's pulling up too many columns, and that might be the reason why. But now I can start forming hypotheses about why the model is doing what it's doing. And critically, I have the direct user input, and I have every single piece of context that was assembled, as a result of that tracing. In fact, one of the fields on here is something we call full prompt: after we've gathered all the context and parameterized everything into the prompt, that full string itself is available right there. What that allows me to do is quite literally copy-paste that field, go into prompt engineering, and just replay that request. Okay, does this reproduce? What is actually happening here? Do it enough times that I have my head in the right space.

And the insight that I gleaned from here, and when I say I need to fix this problem, I mean I literally do need to fix it because I've not submitted a bug fix for this yet, is that, for a variety of reasons, we limit the size of the response that we get back from OpenAI. It would be a whole other talk to describe why you wanna do this kind of stuff, why things like rate limiting matter, why token limits matter, and the various ways you can guard against prompt injection attacks and that kind of thing. Limiting response size is one of your mitigations against a variety of problems that can go wrong when you're building with large language models. In particular, we set 150 tokens as the max; we assumed that would probably be enough for most responses we get back. And that's mostly true, but now we're at the point where I actually do wanna fix this thing, and 150 is not always enough. Furthermore, we've been in production long enough that I can probably be confident this is not going to be the attack vector that finally takes us down. So maybe I can think about bumping that limit up, or maybe I can instead ask, okay, what if that output should not have been more than 150 tokens in the first place? What if it should have been less? All right, well, that's another path that I can explore.
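To show why a capped response can surface as a half-finished JSON object, here's a hedged sketch using the OpenAI client; the model name is a placeholder and the error handling is just one option. The useful detail is that the API reports when it stopped because of the token cap, via finish_reason.

```python
from openai import OpenAI

client = OpenAI()

def call_model(full_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": full_prompt}],
        max_tokens=150,       # the kind of response cap described above
        temperature=0,
    )
    choice = response.choices[0]
    # If the model hit the cap mid-object, finish_reason is "length", which is
    # exactly what a JSON object that just stops partway through looks like.
    if choice.finish_reason == "length":
        raise ValueError("response was truncated by the token limit")
    return choice.message.content or ""
```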
I really do wanna highlight that this is based on real user data and real interactions that we're getting in our product. This is not something I had to divine from somewhere and hope that it actually represented what users were doing.

So the last thing I wanna highlight is that there's really nothing magical about what I just showed you, and nothing terribly special about it either. You can do all of this today. All of this is powered by OpenTelemetry. If you're not familiar with OTel, I highly recommend checking it out. It's quite a bit, so set aside some time to really understand what it gives you, but it's an extremely powerful instrumentation framework that is also an open standard and works with pretty much any tool. So you don't have to use Honeycomb; you can use pretty much any observability tool out there, they all support OpenTelemetry, and you can do this today. Whatever language is actually running your application, you can just pick the SDK that supports it; right now OTel has stable support for 11 different languages, so chances are what you're using is already supported. You can add what's called auto-instrumentation to your whole system, and you start with that first. The idea is that it picks up things like incoming requests and outgoing responses across your entire system, so that, especially if you're doing something like retrieval, where maybe you hit a database or call another service, you're actually connecting all of the pieces in the pathway that a user's request goes through. That gets instrumented for you automatically, so you can focus on the context of your own application: okay, when we're doing retrieval, this is the operation we're performing, we're gonna create a span, we're gonna capture a bunch of data at that point, and that's just gonna be part of the trace whose skeleton has been created for us automatically by OpenTelemetry. And then you can have your dev team, in the course of maybe a day, maybe a little longer depending on the OTel setup, get OpenTelemetry set up, identify the relevant operations you wanna instrument, and instrument them. I've seen this many times: a senior engineer on a team somewhere could probably get this stuff wired up within a day and start to give you even just the most basic insights about what your users are actually doing.
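To ground the "you can do this today" part, here's roughly what a minimal Python setup can look like: wire up the SDK with an OTLP exporter, turn on one auto-instrumentation package, and add one manual span for an application-specific step like retrieval. The packages and classes are the standard OpenTelemetry ones as I understand them; the retrieval helper is hypothetical, and any OTLP-compatible backend works.

```python
# pip install opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-instrumentation-requests
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Export spans over OTLP to whatever observability backend you use.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

# Auto-instrumentation: outgoing HTTP calls (including ones to an LLM provider) get spans for free.
RequestsInstrumentor().instrument()

tracer = trace.get_tracer("my-llm-feature")

def retrieve_context(question: str) -> list[str]:
    # A manual span for the application-specific operation you actually care about.
    with tracer.start_as_current_span("retrieve_context") as span:
        span.set_attribute("app.question", question)
        documents = vector_store_lookup(question)  # hypothetical retrieval helper
        span.set_attribute("app.num_documents", len(documents))
        return documents
```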
Lastly, I think you can expect the way you accomplish observability with LLMs to get a lot better over time. Right now there's a lot of really rich automatic instrumentation for all kinds of back-end technologies. Those back-end technologies have been around a lot longer than the constellation of technologies emerging around building things with large language models. That's gonna change. We are en route, you could say, to building auto-instrumentation packages for things like requests to LLM providers, frameworks like LangChain and LlamaIndex, and the whole sea of different vector databases you might be using, so that automatic instrumentation capturing these are the requests, these are the inputs, and these are the outputs that came back can get wired up for you.

Similarly, there are standards being developed right now on the OpenTelemetry side of things. Come talk to me, I'm the person who's proposing them. We're basically identifying naming conventions, so that, for example, if you wanna capture a prompt, the name of that attribute should probably be llm.prompt or something like that. That provides a specification that additional libraries can be built against, so that eventually we can have a pretty rich ecosystem of automatic instrumentation at every level of an application that uses large language models. And finally, observability vendors are recognizing this as a legit use case for observability tools, period. Honeycomb is one, I know Datadog is another, and there are other vendors out there starting to build out better support for this use case within their own products once you have that data.

So that's all I got. We've got just a few more minutes, but it's also near dinner time. Feel free to ask me any questions you have; I'll stay here as long as you want me to. But if you leave to go grab food, I will not be offended. Thank you.