All right, welcome to another RAG talk. This is going to be RAGs at Scale. We hope you can learn something new. Excited to be here. So go ahead.

Awesome. Thanks, Ricardo. Excited to be here, too. So to get excited, I asked Anthropic's Claude why I should be excited for AI.dev 2023. And if we look at this second bullet, it said I should be excited for researchers from groups like OpenAI and DeepMind. But I thought this was an open source conference; they're not going to be here. This is a hallucination. And this brings me to the hot take that we should consider these models to be hallucinatory until we prove otherwise. The RAG pattern is the way a lot of us have started to solve this problem. I'm not going to spend too much time on this, but keep it in mind throughout the talk: these RAGs can hallucinate, too.

So here we built a RAG on our company website using Pinecone. Shout out. And we asked, who is Shayak? A lot of this response is pretty good, but we also got this last sentence about being a member of the Bank of England's AI public-private forum, and that's about a different member of our team named Shameek. What happened here was a bad retrieval: we got context about Shayak and context about Shameek, passed them both to the LLM, and then got this kind of malformed, hallucinated answer. And this is one of the core problems we built TruLens for.

TruLens is an open source project to track and evaluate LLM experiments. Once you've built your LLM app, you can connect it with TruLens to start logging and tracing the inputs, the outputs, and the metadata. You can add feedback functions to evaluate these traces: to understand the quality of the inputs, how well the application is responding, and the quality of the contexts being retrieved. And then, importantly, you can explore the records, understand the evaluation results, iterate, and select the best chain or application version that you want to roll to production.

So how do you go about testing these RAGs for hallucination? We use what we call the RAG triad, and you can think about evaluations on each edge of the architecture. Between the query and the context, we can use context relevance to measure the quality of each retrieved chunk. Going from the context to the response, we can use groundedness to test whether the response has evidence to back it up. And we can use answer relevance to see if we're actually answering the original user query. There's a code sketch of these three checks after the failure-mode examples below.

To make these more concrete, we can look at a few failure-mode examples. Here we have an insurance-based RAG. We asked, how much in losses does fraud account for in P&C insurance? And you get this answer: the provided context does not mention the specific amount of losses. In this TruLens screenshot, at the bottom, you can see both context chunks that were retrieved and their automated scores, along with an LLM-generated reason why each context chunk is not relevant.

We can also have issues with groundedness. Here we have an LLM response, and we can break it down into the different claims the LLM is making and then look for evidence to support each one. You can see the top and bottom sentences are scored 0 in groundedness, while the middle two claims are scored 10. So this gives us an overall groundedness score of 0.5, with hallucination mixed in. And that's not something we'd want.
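Here's that minimal sketch of the RAG triad as TruLens feedback functions. It follows the 2023-era trulens_eval quickstart, so exact method and selector names may differ across versions, and the `retrieve` selector assumes your app exposes an instrumented retrieval step:

```python
import numpy as np
from trulens_eval import Feedback, Select
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.openai import OpenAI

provider = OpenAI()  # the LLM used to score the evals

# Context relevance: score each retrieved chunk against the user query.
# `Select.RecordCalls.retrieve.rets` assumes an instrumented `retrieve` method.
f_context_relevance = (
    Feedback(provider.qs_relevance, name="Context Relevance")
    .on_input()
    .on(Select.RecordCalls.retrieve.rets)
    .aggregate(np.mean)  # average over the retrieved chunks
)

# Groundedness: break the response into claims and look for supporting
# evidence in the retrieved context.
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(Select.RecordCalls.retrieve.rets.collect())
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

# Answer relevance: is the response answering the original question?
f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()
```

Each of these produces the per-record scores and LLM-generated reasons shown in the screenshots above.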
And then last, sometimes LLMs just answer the wrong question. So here, we asked, which year was Hawaii's state song written? And it just gave us the name of the song instead of the year. But we're not limited to the triad: we can evaluate for honest, harmless, and helpful, all with TruLens out of the box.

So as we think about some of these things, how do we build them into requirements for an observability platform? In the evaluation stage, we want broad app support, to evaluate not only RAGs but also agents and more complex styles of application. We want reliable, comprehensive, and extensible evals, to get a nuanced view of how our application is performing. We want experiment tracking, so that when we change that prompt or change our retrieval strategy, we understand how it affects the downstream performance. And when we go to production, we want a highly scalable, cost-effective, and low-latency way to measure these things over time. Last, we need a feedback loop connecting our production monitoring all the way back to our evaluation, so we can iterate, find the root cause of issues, and improve our application. So now I'll pass off to Ricardo to talk a little bit about why scale is important for generative AI observability.

Yes, why is scaling observability for machine learning models important? You can look at the MLOps aspects, at all of these stages at a large scale. Some examples: data ingestion, where you're looking at large amounts of data and need to scale it. Pipeline orchestration and automation need to be in place to have repeatable processes. You want a repository of large language models that supports these large artifacts. You also want a way of serving these machine learning models so they can answer hundreds or thousands of requests in a timely fashion. And finally, you want monitoring and evaluation at large scale too, so your data scientists, your teams, the different members of your organization can see all the different evaluations of these LLMs.

Elaborating a little more, you want to follow all the SRE aspects. Reliability, for example: you have an uptime requirement of 99.x% that you need to meet, and your evaluations need to respond in a timely fashion. You need repeatability and automation, with all the different processes automated end to end. User experience is another important aspect: data scientists and people looking at your evaluations need consistent results and a friendly user interface to see what's happening. Another aspect is that your organization might be growing, so you need to account for that: more data scientists, more people using your large language models, seeing those results across the whole organization. Another interesting aspect is multimodal models — you've probably heard a lot about them at this AI.dev conference — for example, models that do text-to-image or text-to-video. Those can be very compute intensive, so you want a way to evaluate those models at scale too. Carbon emissions are another important aspect. There are a lot of conversations about getting to net zero, and a lot of organizations want to get there, so that's something to keep in mind with your evaluations as well. And finally, cost: it's important to make sure your evaluations are consistent in terms of the money you're spending and the ROI that you're getting.
Yeah, so as you start to think about the sorts of evals you might want, from prototyping your first LLM application all the way up to deploying it in production: a lot of teams start with ground truth evals. These are the core set of questions and answers that you really want your application to get right. But these don't really scale; it's hard to extend them to all the different ways your LLM application is going to be used. As you start to get your application into the hands of users, you can start to collect human evals. This is your thumbs up, your thumbs down. But again, it doesn't really scale, because a pretty small percentage of your users will actually click that thumbs up or thumbs down, and as a result you have a lot of variance. From there, a lot of teams turn to traditional NLP evals like BLEU and ROUGE. But these are really syntactic; they rely heavily on word overlap and fail to capture the nuance of some of the evaluations you'd probably want to do. And then from there, we have medium language model (MLM) and large language model based evaluations. Medium language models like BERT can be fine-tuned to provide the right feedback — you can train smaller models on domain-specific evaluation metrics like groundedness or relevance — and this can be a really cost-effective way to get that metric. And then you can also have large language model evaluations. Those are nice because you can evaluate anything you can prompt: I can ask how toxic a response is and get an answer, which is a nice automated way to get that evaluation. And now I'll pass back to Ricardo to show you how you can run an evaluation framework like TruLens on Kubernetes.

Yeah, so the first step you want to follow to run this at scale is to create a Dockerfile — to containerize the TruLens application. Here's a sample Dockerfile, starting from Miniconda, and then you have the Python environment and the Python file that actually runs. A data scientist would run this as a Python notebook, but you want to run it at scale, so you have to dockerize it. (A reconstructed sketch of this Dockerfile appears after the demo below.)

So here's a demo of dockerizing the TruLens application. First we're pulling the GitHub repository, and here we're building the container. We started with the Miniconda base container image, added the TruLens project, and did a pip install of that TruLens project. And once this is complete, we're going to go ahead and run it as just a Docker container. That's the basic unit when it comes to cloud native environments: you run something in a container. I think a lot of people who are familiar with cloud native environments know about this, but some of the data scientist community may not be as familiar. Here you're seeing that when you run it, it pulls the MiniLM model it uses to do the comparisons. And yeah, this is how it would run in a simple Docker container.

If you want to take it a step further, you would create a YAML definition of a Kubernetes manifest with something like a Deployment and a Kubernetes Service, which gives you an endpoint where you can access this TruLens application. An example of that manifest is sketched below as well. And here's a short demo of applying it in a minikube environment. You do a kubectl apply of that TruLens YAML file, creating the Deployment and the TruLens Service. And as you can see, the Deployment is created and up and running, and the service endpoint is an IP address. Now you can see the pod is also up and running, and this could be multiple pods as you scale up. To get the URL of TruLens, we use minikube service with the service name, and then we just hit that endpoint and we can see that TruLens is running. So that's the endpoint that we're hitting.
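The Dockerfile itself isn't reproduced in the transcript, but based on the description — a Miniconda base image, the TruLens project added, and a pip install — it would look roughly like this sketch. The paths, port, and entrypoint are assumptions:

```dockerfile
# Reconstructed sketch; paths, port, and entrypoint are illustrative.
FROM continuumio/miniconda3

WORKDIR /app

# Add the TruLens project and install it.
COPY . /app
RUN pip install .

# The TruLens dashboard is served with Streamlit, which defaults to port 8501.
EXPOSE 8501

# Hypothetical entrypoint: the notebook-turned-script that builds the app
# and starts logging evaluations.
CMD ["python", "run_evals.py"]
```

Building and running it then follows the demo flow: `docker build -t trulens-demo .` and `docker run -p 8501:8501 trulens-demo`.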
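And here's the kind of manifest the demo applies: a Deployment plus a Service fronting the TruLens pods. The names, image tag, replica count, and ports are illustrative:

```yaml
# Illustrative Deployment + Service for the containerized TruLens app.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trulens
spec:
  replicas: 2                        # scale up to multiple pods as needed
  selector:
    matchLabels:
      app: trulens
  template:
    metadata:
      labels:
        app: trulens
    spec:
      containers:
        - name: trulens
          image: trulens-demo:latest # the image built above (assumed tag)
          ports:
            - containerPort: 8501
---
apiVersion: v1
kind: Service
metadata:
  name: trulens
spec:
  selector:
    app: trulens
  ports:
    - port: 80
      targetPort: 8501
```

`kubectl apply -f trulens.yaml` creates both resources, and `minikube service trulens --url` prints the endpoint, matching the demo.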
And this is what it would look like in the TruLens user interface. You have an application leaderboard here, with two sample evaluations where you can see things like context relevance and groundedness, and you can see cosine distance and answer relevance — all these different parameters. And you can see more details about the individual evaluations related to our feedback functions. We have different feedback functions that we support, like context relevance, answer relevance, and groundedness, so you can see the specifics of each one.

And here's what it would look like running at a larger scale. Essentially you have this running on a Kubernetes cluster: a user coming in through an ingress controller, through a Service, communicating with multiple pods, and hopefully you have something like an HPA — a Horizontal Pod Autoscaler — that allows you to scale up and down depending on your needs. In the back end, you have a vector database that stores your context. This could be something like Pinecone, or something like RDS, or any vector database that is out there in the community. You also want to be able to talk to your model. Notice that here I have a custom model, because of cost: if you want to scale up, you're potentially better off with a custom model, because if you talk to something like OpenAI it might be cost prohibitive — with hundreds and hundreds of API calls, it can cost a lot of money. To store the results of these evaluations, you need a highly scalable database, something like MySQL or Amazon RDS. And to configure the parameters for TruLens, you would have something like ConfigMaps, which are available as a resource in Kubernetes.

Cool, so as we mentioned before, you can use LLM providers to run evaluations. We have connectors with all of the ones you see on the screen, and a whole bunch more, with TruLens. But doing so means living with their limitations — whether that's high cost or really variable latency, there are all kinds of issues with running evaluations through the big LLM providers. So what can you do to scale evaluations? What considerations might we want to take into account?

One thing we can do is run evaluations out of band. We can log and serialize the records when the application runs, and then at a later time — maybe at night — deserialize each record and run the evaluation on it. We can also connect to a dedicated, scalable logging database, and doing so is as easy as pointing at the database URL when you initialize your Tru object in TruLens. And then last, we can train and deploy custom models for evaluation. Here you can see a pretty quick code snippet showing how easy it is to connect your model to TruLens: you just define a custom class and then define your predict function, and all of that gets connected with all the nice tracing and logging you get with TruLens.
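To make the out-of-band and logging-database points concrete, here's a minimal sketch using TruLens's deferred feedback mode and an external database. The Postgres URL is a placeholder, `rag_chain` is assumed to be your own LangChain app, the feedback functions are the triad defined earlier, and the names again follow 2023-era trulens_eval:

```python
from trulens_eval import FeedbackMode, Tru, TruChain

# Point TruLens at a dedicated, scalable logging database
# instead of the default local SQLite file (placeholder URL).
tru = Tru(database_url="postgresql://user:pass@db-host:5432/trulens")

# DEFERRED mode serializes records at request time and skips evaluation,
# so the expensive feedback calls can run later, out of band.
tru_recorder = TruChain(
    rag_chain,  # your instrumented LangChain app (assumed to exist)
    app_id="rag_v1",
    feedbacks=[f_context_relevance, f_groundedness, f_answer_relevance],
    feedback_mode=FeedbackMode.DEFERRED,
)

# Run this in a separate process (e.g., overnight) to deserialize the
# logged records and evaluate them.
tru.start_evaluator()
```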
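And here's a sketch of the custom-class-plus-predict-function pattern being described, following TruLens's custom feedback provider mechanism. The fine-tuned model call is a hypothetical stand-in for whatever you've deployed:

```python
from trulens_eval import Feedback, Provider, Select

def my_finetuned_model(text: str) -> float:
    # Stand-in for a deployed evaluation model (e.g., a fine-tuned BERT scorer).
    return 1.0

class CustomModelProvider(Provider):
    """Wraps a custom fine-tuned evaluation model as a TruLens provider."""

    def custom_score(self, response: str) -> float:
        # TruLens only needs this method to return a score in [0, 1].
        return my_finetuned_model(response)

custom = CustomModelProvider()

# The custom method plugs into Feedback like any built-in provider method,
# so its results get the same tracing, logging, and dashboard views.
f_custom = Feedback(custom.custom_score).on(response=Select.RecordOutput)
```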
So TruLens can run evaluations on any LLM app stack — whether that's LangChain or LlamaIndex, whether your app is using vector databases like Pinecone or Weaviate, and whether it's a RAG or an agentic framework with different tools, we can evaluate and track it all. So give it a try. It's open source; check us out on GitHub, smash the star button, we always love that. And thanks for enjoying the talk.

One last slide: I'm working with the CNCF, where we're creating this Cloud Native Artificial Intelligence group. So join the conversation — we're trying to bridge the gap between data science and cloud native: how do you scale machine learning and AI better, and how can the communities work together to solve common problems? Thank you, and happy to take any questions. We have socks. Yeah, socks for questions. Interesting questions. Yeah, thanks guys. Thank you.

Yeah. Oh, nice connection. Yeah, yeah, that's interesting. So the groundedness check we're doing is specifically related to the context that's being retrieved and passed to the LLM. As part of your RAG, you're doing a retrieval against your vector database, you get context chunks back, and then the LLM is forming a response from those context chunks. So we're just checking to make sure the response is backed up by those context chunks. But you could definitely do more of an external check that asks: is it backed up by the context chunks, and is it also backed up by some external context? Yeah, that'd be really cool. Love to chat more about that later.

Yeah. Let's see if I can... oh, not quite far enough. Maybe I'll come off the stage. Yeah, so LLMs definitely aren't perfect at this. There's some pretty good research I can point you to — a paper, I think it's called Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Basically what they found is that LLMs have more or less an 80% agreement rate with humans, and that's also about the same as the human-to-human agreement rate. So I agree with you that LLMs aren't perfect, but neither are humans, and LLMs scale. But yeah, good question. Yeah, I think it's a trial-and-error type of thing too; you have to do experimentation, right? There's no absolute answer. But yeah, yeah, exactly. Yeah, if you want to pay the intern, for sure.

Yeah, yeah, any other questions? Yeah, I think some really well-defined tasks, like language match, for example, are really good fits for a classification model hosted on Hugging Face. And groundedness is maybe another example where you could use either. I think we did some testing on the SummEval dataset, and the mean absolute error for GPT-4 was about 0.08, GPT-3.5 was about 0.12, and the natural language inference model was about 0.24. So not terrible, and way cheaper.

Yeah, is your question about scale, or is it more about... yeah, you generally want to use smaller, more targeted models for the evaluation, right? Because I think OpenAI is pretty expensive. Yeah, I think the short answer is: beyond scale and cost, use a smaller model if it's a really well-defined eval and a model already exists for it. And also if you can train your own custom model on your domain — say you want to evaluate summarization quality on something pretty niche that's kind of outside what GPT-4 was trained on — then that can be a pretty good thing to do too. Good question, though. Cool, any other questions? Awesome, well, thank you guys, yeah. Thank you. Smash the star button.